Helmet detection error analysis in football videos using Amazon SageMaker

The National Football League (NFL) is America’s most popular sports league. Founded in 1920, the NFL developed the model for the successful modern sports league and is committed to advancing progress in the diagnosis, prevention, and treatment of sports-related injuries. Health and safety efforts include support for independent medical research and engineering advancements in addition to a commitment to better protect players and make the game safer. This includes enhancements to medical protocols and improvements to how our game is taught and played. For more information about the NFL’s health and safety efforts, see NFL Player Health and Safety.

We have partnered with AWS to develop the Digital Athlete program, where we use AWS machine learning (ML) services to identify potential risks coming from helmet-to-helmet, helmet-to-shoulder and other body parts, and helmet-to-ground collisions. As of this writing, there is no automated way to identify these collisions. An expert needs to review hours of game footage to visually identify impacts and compare that to the actual collisions reported during the game. Our team, in collaboration with AWS Professional Services and BioCore, is developing computer vision algorithms to analyze All-22 videos using Amazon SageMaker to help shape the future of American football and its players.

We planned to accomplish this objective in three steps: detect helmets, track detected helmets, and identify impacts to tracked helmets on the field. The tracking and impact detection workflows are beyond the scope of this post. This discussion focuses on helmet detection even under challenging conditions such as when players are obscured by other players for several frames and when video quality and video zoom effects change as the cameras track the action.

In this post, we discuss how state-of-the-art object detection model metrics don’t provide the full picture of where detection goes wrong, and how that motivated us to create a custom visualization for the entire play that shows the full story of helmet detection performance as a function of time within the play. This visualization has significantly improved our understanding of when and how our helmet detection algorithms fail.

Detection challenge

The challenges of a helmet detector model with respect to team play are three-fold:

  • Helmet size is small compared to the image size in a typical clip of sideline or end zone view
  • Precise detection is important to subsequently track the same helmet in future clips to correctly identify an impact, if any
  • State-of-the-art object detection metrics collected from models don’t provide the full picture in the context of game plays

To address the first two challenges, we considered object detection algorithms that work well on relatively small objects and emphasize accuracy over speed.

To address the third challenge, we introduced a custom visualization technique that focused on some of the shortcomings of the conventional model metrics, specifically the following:

  • A frame-wise error analysis that captures missed and false detections
  • A visual summary of stacked true positives, false positives, and false negatives per frame over time to assess model performance for the entire play

Dataset and modeling

We recently announced a Kaggle competition (NFL 1st and Future – Impact Detection) for ML experts around the world to contribute towards NFL research addressing the need for a computer vision system to detect on-field helmet impacts as part of the Digital Athlete platform. In this post, we use static images from the competition data as an example to build a helmet detection model. We used Amazon SageMaker Ground Truth to create the computer vision dataset that is as accurate as possible to build a solid platform.

We used the Kaggle API to download the data within the SageMaker notebook instance. For instructions on creating a notebook instance, see Create a Notebook Instance. We used an ml.p3.2xlarge instance with one GPU and a 50 GB EBS volume for better data manipulation and training. For more information about instance types, see Available Instance Types.

We started with some basic EDA to explore the static images and corresponding annotations. The labeled image dataset consists of 9,947 labeled images (with 4,958 sideline and 4,989 end zone) and a CSV file named image_labels.csv that contains the labeled bounding boxes for all images. The labeled file contains 193,736 helmets (114,986 sideline and 78,750 end zone) with 9,825 unique plays.

There are five different helmet labels: Helmet, Helmet-Blurred, Helmet-Sideline, Helmet-Partial, and Helmet-Difficult. The following table summarizes each label’s percentage of occurrence.

Helmet label type    Percentage of occurrence
Helmet               66.98%
Helmet-Blurred       17.31%
Helmet-Sideline       7.76%
Helmet-Partial        4.55%
Helmet-Difficult      3.39%
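
The percentages in the table can be recomputed from the label file. The following is a minimal, dependency-free sketch rather than the code we ran: it assumes the file has a label column (the column name matches the dataset class shown later in this post) and uses a tiny made-up sample in place of image_labels.csv.

```python
import csv
import io
from collections import Counter

# Tiny made-up stand-in for image_labels.csv (one row per labeled helmet).
SAMPLE_CSV = """image,label,left,width,top,height
a.png,Helmet,10,20,30,20
a.png,Helmet,50,20,30,20
a.png,Helmet-Blurred,90,18,40,18
b.png,Helmet,15,22,35,22
b.png,Helmet-Partial,70,10,60,12
"""


def label_percentages(csv_file):
    """Return {label: percentage of rows}, ordered from most to least frequent."""
    counts = Counter(row["label"] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 2) for label, n in counts.most_common()}


percentages = label_percentages(io.StringIO(SAMPLE_CSV))
for label, pct in percentages.items():
    print(f"{label:<16}{pct:>6.2f}%")
```

To run it against the real data, pass an open file handle for image_labels.csv instead of the in-memory sample.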

We considered all Helmet types to be the same for simplicity and did an 80/20 split to train and test in the modeling phase.

Next, we used FasterRCNN with ResNet50 FPN as our helmet detection model and used a pretrained model based on COCO data within a PyTorch framework. For more information about object detection in TorchVision, see TorchVision Object Detection Finetuning Tutorial. The network seemed like an ideal choice because it detects objects of relatively smaller size and has performed very well in multiple standard object detection competitions. The goal was not to build an award-winning helmet detection model, but to identify errors in specific images within an entire play with a relatively high-performing model. 

Model performance metrics

We trained the model using the default PyTorch Conda environment pytorch_p36 within a SageMaker notebook instance. At the end of 10 epochs, the Average Precision (AP) @[IoU=0.50:0.95] on the test set was 0.498 and the Average Recall @[IoU=0.50:0.95] was 0.56, which we deemed excellent for an object detector.

We took the saved model and evaluated it frame by frame on an entire play (for example, 57583_000082_Endzone), using the annotation labels for the entire play as ground truth. The following graph is a plot of precision vs. recall for all the frames, with an mAP of 93.12% computed using an object detection metrics package.

As evident from the plot, this is an excellent model that only fails if the helmet is either blurred or too difficult to detect even for expert eyes.

Next, we calculated the number of true positives, false positives, and false negatives for each frame of the 57583_000082_Endzone play. To match the predicted detection with ground truth annotations, we only considered predictions with scores higher than 0.9 and 0.25 IoU threshold between ground truth and the predicted bounding boxes. The conflicts between multiple detections for the same ground truth bounding boxes were resolved using a confidence score. Essentially, we only considered the highest confidence detections for multiple detections.
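
The matching rule described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the evaluation code used for the results in this post: the function names, the (x1, y1, x2, y2) box layout, and the greedy highest-score-first matching are assumptions chosen to mirror the description (score above 0.9, IoU of at least 0.25, conflicts resolved by confidence).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(gt_boxes, detections, conf_thres=0.9, iou_thres=0.25):
    """Count (TP, FP, FN) for one frame; detections is a list of (score, box)."""
    # Keep confident detections and visit them from highest to lowest score,
    # so the most confident detection claims a contested ground truth box.
    kept = sorted((d for d in detections if d[0] > conf_thres),
                  key=lambda d: d[0], reverse=True)
    matched = set()
    tp = 0
    for _, box in kept:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(gt_boxes):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_thres:
            matched.add(best_idx)
            tp += 1
    return tp, len(kept) - tp, len(gt_boxes) - tp


# Made-up frame: three helmets, five raw detections.
gt = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
dets = [
    (0.99, (1, 1, 10, 10)),    # matches the first helmet
    (0.95, (0, 0, 9, 9)),      # duplicate of the first helmet, counted as FP
    (0.97, (21, 20, 31, 30)),  # matches the second helmet
    (0.50, (50, 50, 60, 60)),  # below the score threshold, so the third helmet is an FN
    (0.92, (80, 80, 90, 90)),  # no helmet here, counted as FP
]
result = count_tp_fp_fn(gt, dets)
print(result)
```

In this example the lower-scoring duplicate on the first helmet becomes a false positive, which is the conflict-resolution behavior described in the text.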

The number of ground truth helmets in each frame can vary between 18–22 for 57583_000082_Endzone, whereas our model predicted anywhere between 15–23 helmets. Therefore, even though our model is an excellent one, it did miss some helmets and made wrong predictions. Because false negatives or missed detections are more important for proper tracking of the players, we looked into the frames where we got too many false negatives.

The following image shows an example where the model predicted every helmet correctly (depicted by the cyan boxes).

This next image shows where the model missed a few helmets (depicted by red boxes) and made wrong predictions (depicted by blue boxes).

To identify where and why a model is underperforming, it’s imperative to calculate the precision, recall, and F1-score for each frame and for the overall play. We got a precision of 0.97, recall of 0.93, and F1-score of 0.95 for the overall play, which doesn’t provide the full picture of errors in a team play context. The following plot shows the number of false positives and false negatives on the right y-axis and precision and recall on the left y-axis, plotted against the individual frame number. Our model did an excellent job overall except in the frames between approximately 100–300, where tackling typically happens in football plays. Unfortunately, most impacts or collisions happen in this frame range, and therefore we dug deeper into the error cases.
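
The per-frame and overall metrics follow directly from the true positive, false positive, and false negative counts. The sketch below uses made-up counts, and it assumes the overall play metrics are computed by pooling counts across frames; the post doesn't state whether counts were pooled or per-frame metrics were averaged.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); each is 0.0 when its denominator is 0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical per-frame counts: (frame_id, tp, fp, fn)
frames = [(1, 22, 0, 0), (2, 21, 1, 1), (3, 18, 2, 4)]

# Metrics for each individual frame
per_frame = {fid: precision_recall_f1(tp, fp, fn) for fid, tp, fp, fn in frames}

# Metrics for the whole play: pool the counts, then compute once (assumption)
total_tp = sum(f[1] for f in frames)
total_fp = sum(f[2] for f in frames)
total_fn = sum(f[3] for f in frames)
overall = precision_recall_f1(total_tp, total_fp, total_fn)

for fid, (p, r, f1) in per_frame.items():
    print(f"frame {fid}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print("overall: precision={:.3f} recall={:.3f} f1={:.3f}".format(*overall))
```

Even in this small example, the overall recall hides how much worse the third frame is, which is the effect the per-frame view is meant to expose.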
The following plot is a stacked bar representation of true positives (green area), false negatives (red area), and false positives (blue area) against individual frame numbers. The black bold line represents the total number of ground truth helmets in each frame. The dotted vertical black line represents the snap frame. An ideal helmet detector should detect each and every helmet in each frame, thereby covering the entire area with green. However, as you can see in the visualization, our model had limitations, which are clearly depicted both qualitatively and quantitatively in the visualization.
Therefore, this novel visualization gives us a tool to distinguish between an excellent helmet detector and a perfect helmet detector. It also provides a quick visual summary that allows us to compare the performance of the detector in different plays and quickly identify the temporal location and type of error the models are propagating. This can further be leveraged to assess improved helmet detector models after retraining.

To improve the helmet detector model, we could retrain it with additional hard-to-detect frames added to the training set, train for more epochs, apply hyperparameter tuning, implement additional augmentation techniques, or incorporate other modeling strategies. At every step, we can use this stacked bar plot to assess model quality from a team game perspective because it provides a visual summary of where and how models fall short of a perfect benchmark.

Prerequisites

To reproduce this analysis in your own environment, you must complete the following prerequisites:

  1. Create an AWS account.
  2. Create a SageMaker instance.

It’s recommended to use an instance with GPU support, for example ml.p3.2xlarge. The EBS volume size should be around 50 GB in order to store all necessary data.

  3. Download the data from Kaggle using the Kaggle API.

Refer to the Kaggle API credentials instructions to retrieve the kaggle.json file and save it on SageMaker within /home/ec2-user/.kaggle. For security reasons, restrict the file’s permissions so that other users on the system can’t read it. See the following code:

pip install kaggle
mkdir /home/ec2-user/.kaggle
mv kaggle.json /home/ec2-user/.kaggle
chmod 600 ~/.kaggle/kaggle.json
kaggle competitions download -c nfl-impact-detection

Building the helmet detection model

The following code snippet shows the custom dataset class for helmets:

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset


class DatasetHelmet(Dataset):

    def __init__(self, marking, image_ids, transforms=None, test=False):
        super().__init__()
        self.image_ids = image_ids
        self.marking = marking
        self.transforms = transforms
        self.test = test

    def __getitem__(self, index: int):
        image_id = self.image_ids[index]
        image, boxes, labels = self.load_image_and_boxes(index)
        num_boxes = len(boxes)
        if num_boxes > 0:
            target = {}
            new_boxes = torch.as_tensor(boxes, dtype=torch.float32)
            # there is only one class
            labels = torch.ones((num_boxes,), dtype=torch.int64)
            area = (new_boxes[:, 3] - new_boxes[:, 1]) * (new_boxes[:, 2] - new_boxes[:, 0])
            # suppose all instances are not crowd 
            iscrowd = torch.zeros((num_boxes,), dtype=torch.int64)

            target['boxes'] = new_boxes
            target['labels'] = labels
            target['image_id'] = torch.tensor([index])
            target["area"] = area
            target["iscrowd"] = iscrowd
        else:
            target = {}

        if self.transforms is not None:
            image, target = self.transforms(image, target)
        return image, target

    def __len__(self) -> int:
        return self.image_ids.shape[0]

    def load_image_and_boxes(self, index):
        image_id = self.image_ids[index]
        TRAIN_ROOT_PATH = args.train + "images"
        image = cv2.imread(f'{TRAIN_ROOT_PATH}/{image_id}', cv2.IMREAD_COLOR).copy().astype(np.float32)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
        image /= 255.0
        records = self.marking[self.marking['image'] == image_id]
        boxes = records[['left', 'top', 'width', 'height']].values
        boxes[:, 2] = boxes[:, 0] + boxes[:, 2]
        boxes[:, 3] = boxes[:, 1] + boxes[:, 3]
        labels = records['label'].values
        return image, boxes, labels

The following code shows the main training function:

def main(args):
    # Read images label csv file
    image_labels = pd.read_csv('/home/ec2-user/SageMaker/helmet_detection/input/image_labels.csv')
    # Split annotations into train and validation
    np.random.seed(0)
    image_names = np.random.permutation(image_labels.image.unique())
    valid_image_len = int(len(image_names)*0.2)
    images_valid = image_names[:valid_image_len]
    images_train = image_names[valid_image_len:]    
    logging.info(f"n images_valid {len(images_valid)}, n images_train {len(images_train)}")
    # Define train and validation datasets and data loaders
    TRAIN_ROOT_PATH = args.train 

    train_dataset = DatasetHelmet(
        image_ids=images_train,
        marking=image_labels,
        transforms=get_transform(train=True),
        test=False,
    )
    validation_dataset = DatasetHelmet(
        image_ids=images_valid,
        marking=image_labels,
        transforms=get_transform(train=False),
        test=True,
    )    
    data_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=True, num_workers=1,
        collate_fn=utils_torchvision.collate_fn
    )
    data_loader_valid = torch.utils.data.DataLoader(
        validation_dataset, batch_size=args.batch_size, shuffle=False, num_workers=1,
        collate_fn=utils_torchvision.collate_fn
    )
    print(f"We have {len(train_dataset)} images for training and {len(validation_dataset)} for validation")
    
    # Set up model
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    ## Our dataset has two classes only - helmet and not helmet
    num_classes = 2
    ## Get the model using our helper function
    model = get_model(num_classes)
    print(f"Loaded model")

    # Set up training
    start_epoch = 0
    end_epoch = args.epochs
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005,
                                momentum=0.9, weight_decay=0.0005)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                   step_size=3,
                                                   gamma=0.1)
    print(f"Loaded model parameters")

    ## if retraining from a checkpoint file
    if args.retrain:
        
        checkpoint = torch.load(os.path.join(args.model_dir, "model_checkpoint.pt"))
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        start_epoch = checkpoint['epoch'] + 1
        end_epoch = start_epoch + args.epochs
        print('\nLoaded checkpoint from epoch %d.\n' % start_epoch)
       
    print(start_epoch, end_epoch)

    # Train model
    loss_epoch = []
    
    for epoch in range(start_epoch, end_epoch):
        # train for one epoch, printing every 1 iterations
        print(f"Training epoch {epoch}")
        train_one_epoch(model, optimizer, data_loader, data_loader_valid, device, epoch, loss_epoch, print_freq=1)

        # update the learning rate
        lr_scheduler.step()

        # evaluate on the test dataset
        evaluate(model, data_loader_valid, device=device, print_freq=1)
        # save checkpoint model after each epoch
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict()
            }, os.path.join(args.model_dir, "model_checkpoint.pt"))
        

    # Save final model
    torch.save(model.state_dict(), os.path.join(args.model_dir, "model_helmet_frcnn.pt"))
    loss_df = pd.DataFrame(loss_epoch, columns=["train_loss", "val_loss"])
    loss_df.reset_index(inplace=True)
    loss_df = loss_df.rename(columns = {'index':'Epoch'})
    print(loss_df)
    loss_df.to_csv (os.path.join(args.model_dir, "loss_epoch.csv"), index = False, header=True)

Evaluating helmet detection model

Use the saved model to run predictions on an entire play. The following code is an example function to run evaluations:

def run_detection_eval_video(video_in, gtfile_name, model_path, full_video=True, subset_video=60, conf_thres=0.9, iou_threshold = 0.5):
    """ Run detection on video

    Args:
        video_in: Input video path
        gtfile_name: Ground Truth annotation json file name
        model_path: Location of the pretrained model.pt 
        full_video: Bool to indicate whether to run the whole video, default = True
        subset_video: Number of frames to run detection on
        conf_thres: Only consider detections with score higher than conf_thres, default = 0.9
        iou_threshold: Match detection with ground truth if iou is higher than iou_threshold, default = 0.5
    Returns:
        Predicted detection for all the frames in a video, evaluation for detection, a dataframe with bounding boxes for false negatives and false positives
        df_predictions (pandas.DataFrame): prediction of detected object for all frames 
          with columns ['frame_id', 'class_id', 'score', 'x1', 'y1', 'x2', 'y2']
        eval_results (pandas.DataFrame): Count of total number of objects in gt and det, and tp, fn, fp for all frames
          with columns ['frame_id', 'num_object_gt', 'num_object_det', 'tp', 'fn', 'fp']
        fns (pandas.DataFrame): False negative records in a Pandas Dataframe for all frames
          with columns ['frame_id','class_id','x1','y1','x2','y2'], 
          return empty if no false negatives 
        fps (pandas.DataFrame): False positive records in a Pandas Dataframe for all frames
          with columns ['frame_id','class_id', 'score', 'x1','y1','x2','y2'], 
          return empty if no false positives 

    """
    # Capture the input video
    vid = cv2.VideoCapture(video_in)

    # Get video title
    vid_title = os.path.splitext(os.path.basename(video_in))[0]

    # Get total number of frames
    num_frames = vid.get(cv2.CAP_PROP_FRAME_COUNT)

    # load model 
    num_classes = 2
    model = ObjectDetector.load_custom_model(model_path=model_path, num_classes=num_classes)
    print("Pretrained model loaded")

    # Get GT annotations
    gt_labels = pd.read_csv('/home/ec2-user/SageMaker/helmet_detection/input/train_labels.csv')
    video = os.path.basename(video_in)
    print("Processing video: ",video)
    labels = gt_labels[gt_labels['video']==video]

    # if running for the whole video, then change the size of subset_video with total number of frames 
    if full_video:
        subset_video = int(num_frames)   

    df_predictions = [] # predictions for whole video
    eval_results = [] # detection evaluations for the whole video 
    fns = [] # false negative detections for the whole video 
    fps = [] # false positive detections for the whole video 

    for i in range(subset_video): 

        ret, frame = vid.read()
        print("Processing frame#: {} running detection and evaluation for videos".format(i+1))

        # Get detection for this frame
        list_frame = [frame]
        dataset_frame = FramesDataset(list_frame)
        prediction = ObjectDetector.run_detection(dataset_frame, model)
        df_prediction = ObjectDetector.to_dataframe_highconf(prediction, conf_thres, i)
        df_predictions.append(df_prediction)

        # Get label for this frame
        cur_label = labels[labels['frame']==i+1] # get this frame's record
        cur_boxes = cur_label[['left','width','top','height']].values
        gt = ObjectDetector.get_gt_frame(i+1, cur_boxes)

        # Evaluate detection for this frame
        eval_result, fn, fp = ObjectDetector.evaluate_detections_iou(gt, df_prediction, iou_threshold)
        eval_results.append(eval_result)
        if fn is not None:
            fns.append(fn)
        if fp is not None:
            fps.append(fp)

    # Concatenate predictions, evaluation results, fns and fps for all frames of the video
    df_predictions = pd.concat(df_predictions)
    eval_results = pd.concat(eval_results)
    
    # Concatenate fns if not empty, otherwise create an empty dataframe
    if not fns:
        fns = pd.DataFrame()
    else:
        fns = pd.concat(fns)
        
    # Concatenate fps if not empty, otherwise create an empty dataframe
    if not fps:
        fps = pd.DataFrame()
    else:
        fps = pd.concat(fps)

    return df_predictions, eval_results, fns, fps

After we have evaluation results saved in a Pandas DataFrame, we can use the following code snippet to plot the stacked bar figure we described earlier:

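import matplotlib.pyplot as plt

# eval_det: the per-frame evaluation DataFrame (eval_results from run_detection_eval_video)
# snap_time: frame number of the snap, drawn as the dashed vertical line
pal = ["g","r","b"]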
plt.figure(figsize=(12,8))
plt.stackplot(eval_det['frame_id'], eval_det['tp'], eval_det['fn'], eval_det['fp'], 
              labels=['TP','FN','FP'], colors=pal)
plt.plot(eval_det['frame_id'], eval_det['num_object_gt'], color='k', linewidth=6, label='Total Helmets')
plt.legend(loc='best', fontsize=12)
plt.xlabel('Frame ID', fontsize=12)
plt.ylabel(' # of TPs, FNs, FPs', fontsize=12)
plt.axvline(x=snap_time, color='k', linestyle='--')
plt.savefig('/home/ec2-user/SageMaker/helmet_detection/output/stacked.png')

Conclusion

In this post, we showed how we used Amazon SageMaker to build a helmet detector model, ran error analysis in a team play context, and improved the detector model with better precision in the frames where it matters the most. With the visualization tool that we created, we could qualitatively and quantitatively assess the model accuracy in the entire play context. Furthermore, we could introduce additional training images and improve the model accuracy as depicted by both traditional state-of-the-art object detector metrics and our custom visualization.

With a near-perfect helmet detector model, our team is ready for the next step, which is tracking the players on the ground and detecting impacts using computer vision techniques. This will be discussed in a future post.

Readers are welcome to check out the Kaggle competition website and should be able to reproduce the results presented here with the code included in the post.


About the Authors

Sam Huddleston is a Sr. Data Scientist at Biocore LLC, who serves as the Technology Lead for the NFL’s Digital Athlete program. Biocore is a team of world-class engineers based in Charlottesville, Virginia, that provides research, testing, biomechanics expertise, modeling and other engineering services to clients dedicated to the understanding and reduction of injury.

Jayeeta Ghosh is a Data Scientist who works on AI/ML projects for AWS customers and helps solve customer business problems across industries using deep learning and cloud expertise.


Explaining Bundesliga Match Facts xGoals using Amazon SageMaker Clarify

One of the most exciting AWS re:Invent 2020 announcements was a new Amazon SageMaker feature, purpose built to help detect bias in machine learning (ML) models and explain model predictions: Amazon SageMaker Clarify. In today’s world where predictions are made by ML algorithms at scale, it’s increasingly important for large tech organizations to be able to explain to their customers why they made a certain decision based on an ML model’s prediction. Crucially, this can be seen as a direct move away from the underlying models being closed boxes for which we can observe the inputs and outputs, but not the internal workings. This not only opens up avenues of further analysis, so as to iterate and further improve on model configurations, but also provides previously unseen levels of model prediction analysis to customers.

One particularly interesting use case for Clarify is from the Deutsche Fußball Liga (DFL) on Bundesliga Match Facts powered by AWS, with the goal of uncovering interesting insights into the xGoals model predictions. Bundesliga Match Facts powered by AWS provides a more engaging fan experience during soccer matches for Bundesliga fans around the world. It gives viewers information on the difficulty of a shot, the performance of their favorite players, and can illustrate the offensive and defensive trends of their team.

With Clarify, the DFL can now interactively explain what some of the key underlying features are in determining what led the ML model to predict a certain xGoals value. An xGoal (short for Expected Goals) is the calculated probability of a player scoring a goal when shooting from any position on the pitch. Knowing respective feature attributions and explaining outcomes helps in model debugging, which in turn results in higher-quality predictions. Perhaps most importantly, this additional level of transparency helps build confidence and trust in your ML models, opening up countless opportunities for cooperation and innovation moving forward. Better interpretability leads to better adoption. Without further ado, let’s dive in!

Bundesliga Match Facts

Bundesliga Match Facts powered by AWS provides advanced real-time statistics and in-depth insights, generated live from official match data, for Bundesliga matches. These statistics are delivered to viewers via national and international broadcasters, as well as DFL’s platforms, channels, and apps. Through this, over 500 million Bundesliga fans around the world gain more advanced insights into players, teams, and the league, and are delivered a more personalized experience and the next generation of statistics.

With the Bundesliga Match Fact xGoals, the DFL can assess the probability of a player scoring a goal when shooting from any position on the field. The goal probability is calculated in real time for every shot to give viewers insight into the difficulty of a shot and the likelihood of a goal. The higher the xGoals value (with all values lying between 0–1), the greater the likelihood of a goal. In this post, we take a closer look at this xGoals metric, diving into the inner workings of the underlying ML model in order to determine why it makes certain predictions, both for individual shots and across entire football seasons’ worth of data.

Preparing and examining the training data

The Bundesliga xGoals ML model goes beyond previous xGoals models in that it combines shot-at-goal event data with high-precision data obtained from advanced tracking technology with a 25-Hz frame rate. With real-time ball and player positions, a bespoke model can determine an array of additional features such as the angle to the goal, the distance of a player to the goal, a player’s speed, the number of defenders in the line of shot, and goalkeeper coverage, to name just a few. We used the area under the ROC curve (AUC) as the objective metric for our training job, and trained the xGoals model on over 40,000 historical shots at goals in the Bundesliga since 2017, using the Amazon SageMaker XGBoost algorithm. For more information on the xGoals training process with the Amazon SageMaker Python SDK and XGBoost hyperparameter optimization, see The tech behind the Bundesliga Match Facts xGoals: How machine learning is driving data-driven insights in soccer.
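
As a reminder of what the AUC objective measures, the following is a small, dependency-free sketch with made-up shots. It computes AUC as the probability that a randomly chosen goal receives a higher predicted xGoals value than a randomly chosen miss. It illustrates the metric only; it is not how the training job evaluates it, and the function name and data are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative); ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up shots: 1 = goal, 0 = no goal, with hypothetical predicted xGoals values.
goals = [1, 0, 0, 1, 0]
xgoals = [0.80, 0.10, 0.35, 0.30, 0.05]

auc = roc_auc(goals, xgoals)
print(f"AUC = {auc:.3f}")
```

Here five of the six goal/miss pairs are ranked correctly, so the AUC is 5/6. The pairwise loop is quadratic and only suitable for small illustrations.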

When we look at a few of the rows of the original training dataset, we get an idea of the types of features we’re dealing with; a mix of binary, categorical, and continuous values across a large dataset of attempted shots at goal. The following screenshot shows 8 of the 17 features used for both model training and explainability processing.

SageMaker Clarify

SageMaker has been instrumental in allowing novice data scientists and seasoned ML academics alike to prepare datasets, build and train custom models, and later deploy them into production across a wide array of industry verticals, including healthcare, media and entertainment, and finance.

Like most ML tools, it was missing a way of diving deeper and explaining the results of said models, or investigating training datasets for potential bias. That has all changed with the announcement of Clarify, which offers you the ability to detect bias and implement model explainability in a repeatable and scalable manner.

Lack of explainability can often create a barrier for organizations to adopt ML. Theoretical approaches for overcoming this lack of model explainability have undeniably matured in recent years, with one standout framework becoming a crucial tool in the world of explainable AI: SHAP (SHapley Additive Explanations). Although a full explanation of this method is beyond the scope of this post, at its core SHAP builds out model explanations by posing the following question: “How does a prediction change when a certain feature is removed from our model?” The SHAP values are the answer to this question: they quantify the contribution of each feature to a prediction in terms of both magnitude and direction. With its roots in coalition game theory, SHAP treats the feature values of a data instance as players in a coalition, and subsequently tells us how to fairly distribute the payout (the prediction) among the various features. An elegant feature of the SHAP framework is that it’s both model agnostic and highly scalable, working on both simple linear models and deep, complex neural networks with hundreds of layers.
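
To make the coalition idea concrete, the following sketch computes exact Shapley values for a toy three-feature model by enumerating every coalition, where a "removed" feature takes its baseline value. The toy model, feature names, and numbers are all hypothetical, and exact enumeration is only feasible for a handful of features; it is shown here to illustrate the definition, not how Clarify computes its values.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of predict at instance, relative to baseline."""
    n = len(instance)
    features = range(n)

    def value(coalition):
        # Features outside the coalition are "removed" by using the baseline value.
        x = [instance[i] if i in coalition else baseline[i] for i in features]
        return predict(x)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(x):
    """Hypothetical scoring function with an angle/defenders interaction."""
    distance, angle, defenders = x
    return 0.5 - 0.01 * distance + 0.004 * angle - 0.05 * defenders * (angle / 90.0)


instance = [10.0, 45.0, 1.0]   # the shot being explained
baseline = [20.0, 30.0, 2.0]   # reference values for "removed" features

phi = shapley_values(toy_model, instance, baseline)
for name, contribution in zip(["distance", "angle", "defenders"], phi):
    print(f"{name:<10}{contribution:+.4f}")
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the "fair distribution of the payout" property mentioned above.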

Explaining Bundesliga xGoals model behavior with Clarify

Now that we’ve introduced our dataset and ML explainability, we can start to initialize our Clarify processor, which computes our desired SHAP values. All the arguments in this processor are generic and are related only to your current production environment and the AWS resources at your disposal.

First, let’s define the Clarify processing job, along with the SageMaker session, AWS Identity and Access Management (IAM) execution role, and Amazon Simple Storage Service (Amazon S3) bucket with the following code:

import os

import sagemaker
from sagemaker import clarify

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
region = session.boto_region_name

prefix = 'sagemaker/dfl-tracking-data-xgb'

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type='ml.c5.xlarge',
    sagemaker_session=session,
    max_runtime_in_seconds=1200*30,
    volume_size_in_gb=100*10,
)

We can save the CSV training file to Amazon S3, and then specify the training data and results path for the Clarify job as follows:

DATA_LAKE_OBSERVED_BUCKET = 'sts-openmatchdatalake-dev'
DATA_PREFIX = 'sagemaker_input'
MODEL_TYPE = 'observed'
TRAIN_TARGET_FINAL = 'train-clarify-dfl-job.csv'

csv_train_data_s3_path = os.path.join(
    "s3://",
    DATA_LAKE_OBSERVED_BUCKET,
    DATA_PREFIX,
    MODEL_TYPE,
    TRAIN_TARGET_FINAL,
)

RESULT_FILE_NAME = 'dfl-clarify-explainability-results'

analysis_result_path = 's3://{}/{}/{}'.format(bucket, prefix, RESULT_FILE_NAME)

Now that we have instantiated the Clarify processor and defined our explainability training dataset, we can start to specify our problem-specific experimental configuration:

BASELINE = [-1, 61.91, 25.88, 16.80, 15.52, 3.41, 2.63, 1,
     -1, 1, 2.0, 3.0, 2.0, 0.0, 12.50, 0.46, 0.68]

COLUMN_HEADERS = ["target", "1", "2", "3", "4", "5", "6", "7", "8",
                  "9", "10", "11", "12", "13", "14", "15", "16", "17"]

NBR_SAMPLES = 1000
AGG_METHOD = "mean_abs"
TARGET_NAME = 'target'
MODEL_NAME = 'sagemaker-xgboost-201014-0704-0010-a28f221a'

The following are important input parameters to note, as seen in the preceding relevant code snippet:

  • BASELINE – These baselines are crucial for calculating our model explanations. There is a baseline value for each feature. For our experiments, we use the average for continuous numerical features and the mode for categorical features. For more information, see SHAP Baselines for Explainability.
  • NBR_SAMPLES – The number of samples to be used in the SHAP algorithm.
  • AGG_METHOD – The aggregation method used to compute global SHAP values, which in our case is the mean of absolute SHAP values for all instances.
  • TARGET_NAME – The name of the target feature that the underlying XGBoost model is trying to predict.
  • MODEL_NAME – The name of the (previously) trained SageMaker XGBoost model.
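As a rough illustration of the baseline rule described above (average for continuous features, mode for categorical features), the following sketch builds a baseline row from a small in-memory sample. The build_baseline helper, the three example shots, and the column layout are all hypothetical; the actual BASELINE values in this post were computed from the full training set.

```python
from statistics import mean, mode


def build_baseline(rows, categorical_columns):
    """Return one baseline value per feature column.

    rows: list of equal-length feature lists (target column already removed).
    categorical_columns: set of column indices summarized with the mode;
    every other column is summarized with the mean, rounded to 2 decimals.
    """
    baseline = []
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        if col in categorical_columns:
            baseline.append(mode(values))
        else:
            baseline.append(round(mean(values), 2))
    return baseline


# Hypothetical sample of three shots: [angle, distance, is_header]
shots = [
    [60.0, 10.0, 0],
    [30.0, 20.0, 1],
    [45.0, 15.0, 0],
]
baseline = build_baseline(shots, categorical_columns={2})
print(baseline)
```

Because SHAPConfig expects a list of baseline rows, a single row built this way would be wrapped in an outer list, as in baseline=[BASELINE].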

We directly pass the important parameters into our clarify.ModelConfig, clarify.SHAPConfig, and clarify.DataConfig instances. Running the following code sets the processing job in motion:

model_config = clarify.ModelConfig(
    model_name=MODEL_NAME,
    instance_type='ml.c5.xlarge',
    instance_count=1,
    accept_type='text/csv')

shap_config = clarify.SHAPConfig(
    baseline=[BASELINE],
    num_samples=NBR_SAMPLES,
    agg_method=AGG_METHOD,
    use_logit=False,
    save_local_shap_values=True)

explainability_data_config = clarify.DataConfig(
    s3_data_input_path=csv_train_data_s3_path,
    s3_output_path=analysis_result_path,
    label=TARGET_NAME,
    headers=COLUMN_HEADERS,
    dataset_type='text/csv')

clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config)

Global explanations

After we run our Clarify explainability analysis over the entirety of our xGoals training set, we can quickly and easily view the global SHAP values and their distribution for each feature, thereby allowing us to map how either positive or negative changes in the value of a given feature affect the final prediction. We use the open-source SHAP library to plot the SHAP values that are computed inside our processing job.

The following plot is an example of a global explanation, which allows us to understand the model and its feature combinations in aggregate over multiple data points. The features AngleToGoal, DistanceToGoal, and DistanceToGoalClosest play the most important roles in predicting our target variable, namely whether a goal is scored or not.
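The mean_abs aggregation behind a global ranking like this one can be sketched in a few lines: for each feature, average the absolute local SHAP values across all instances, then sort. The global_shap_mean_abs helper and the three rows of local SHAP values are hypothetical and only illustrate the arithmetic; they are not output from the xGoals job.

```python
def global_shap_mean_abs(local_shap_values, feature_names):
    """Aggregate per-instance SHAP values into a global importance ranking.

    local_shap_values: one list of SHAP values per instance, in the same
    order as feature_names. Returns a dict ordered from most to least
    important feature.
    """
    n = len(local_shap_values)
    totals = [0.0] * len(feature_names)
    for row in local_shap_values:
        for j, shap_value in enumerate(row):
            totals[j] += abs(shap_value)
    importance = {name: total / n for name, total in zip(feature_names, totals)}
    return dict(sorted(importance.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical local SHAP values for three shots and two features.
local = [
    [0.10, -0.02],
    [-0.30, 0.04],
    [0.20, -0.06],
]
ranking = global_shap_mean_abs(local, ["AngleToGoal", "PressureSum"])
print(ranking)
```

Taking absolute values first means that a feature with large positive and large negative contributions still ranks as important, rather than averaging out to zero.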

This type of plot can go even further, providing us with more context than the bar chart, a greater level of insight into the SHAP value distribution for each feature (allowing you to map how changes in the value of a given feature affect the final prediction), and the positive and negative relationships of the predictors with the target variable. Every data point in the following plots represents a single attempt at a goal.

As suggested by the vertical axis on the right side of the plot, a red data point indicates a higher value of the feature, and a blue data point indicates a lower value. The positive and negative impact on the goal prediction value is shown on the x-axis, derived from our SHAP values. From this you can logically infer, for example, that an increase in the angle to goal leads to higher log odds for prediction (which is associated with True predictions for a goal being scored or not).

It’s worth noting that for regions that have an increased vertical dispersion of results, we simply have a higher concentration of data points that are overlapping, which gives us a sense of the distribution of the Shapley values per feature.

The features are ordered according to their importance, from top to bottom. When we compare this plot across the three seasons (2017–2018, 2018–2019, and 2019–2020), we see little to no change in both the feature importance and their associated SHAP value distribution. The same is true across all the individual clubs in the Bundesliga competition, with only a handful of clubs deviating from the norm.

Although none of our match events were penalties (all having a feature value = 1), the penalty feature must still be included in the Clarify processing job because it was also included in the original XGBoost model training. The feature sets used for model training and Clarify processing must be consistent.

xGoals feature dependence

We can dive even deeper and look at the SHAP feature dependence plots, arguably the simplest global interpretation. We simply select a feature and then plot the feature value on the x-axis and the corresponding SHAP value on the y-axis. The following plot shows that relationship for our most important features:

  • AngleToGoal – Small angles (< 25) decrease the likelihood of there being a goal, whereas larger angles increase it.
  • DistanceToGoal – There is a steep drop (mimicking a logarithmically decreasing function) in the likelihood of a goal occurring as you move farther away from the goal center. Beyond a certain distance, it has no impact on the SHAP value; all other things being equal, a shot from 20 meters is just as likely to go in as it is from 40 meters. This observation could perhaps be explained by the fact that players within this range only take a shot for some special reason that increases their chances of scoring, be it the keeper being off their line or there being no defenders nearby to close the player down and block the shot.
  • DistanceToGoalClosest – Unsurprisingly, a large correlation exists here with DistanceToGoal, but with far more of a linear relationship: the SHAP value decreases monotonically as the distance to the closest point of the goal increases.
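A dependence plot pairs each feature value with its SHAP value. When a plotting library isn’t at hand, the same relationship can be summarized numerically by averaging the SHAP values within bins of the feature value. The sketch below does this for a handful of invented AngleToGoal values; the dependence_summary helper, the sample numbers, and the bin edges are assumptions chosen only to mirror the threshold of 25 mentioned above.

```python
def dependence_summary(feature_values, shap_values, bin_edges):
    """Average SHAP value per feature-value bin [edge_k, edge_{k+1}).

    Returns one average per bin, or None for a bin with no data points.
    """
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, shap_value in zip(feature_values, shap_values):
        for k in range(n_bins):
            if bin_edges[k] <= x < bin_edges[k + 1]:
                sums[k] += shap_value
                counts[k] += 1
                break
    return [sums[k] / counts[k] if counts[k] else None for k in range(n_bins)]


# Hypothetical AngleToGoal values and their SHAP values for five shots.
angles = [10, 20, 30, 50, 70]
shap_vals = [-0.04, -0.02, 0.01, 0.03, 0.05]
summary = dependence_summary(angles, shap_vals, bin_edges=[0, 25, 90])
print(summary)
```

In this made-up sample, the bin below 25 has a negative average SHAP value and the bin above it a positive one, which is the pattern the AngleToGoal bullet describes.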

When we take a closer look at two of our (less influential) categorical variables, we see that, all other things being equal, a header invariably decreases the likelihood of a goal, whereas a free kick increases it. Given the vertical dispersion around the 0 SHAP value for FootShot=Yes and FreeKick=No, there is nothing to conclude about their effects on goal predictions.

xGoals feature interaction

We can improve the dependence plots by highlighting the interaction between different features: the additional effect that remains after we take into account the individual feature effects. We use the Shapley interaction index from game theory to compute the SHAP interaction values for all features to acquire one matrix per instance with dimensions F × F, where F is the number of features. With this interaction index, we can then color the SHAP feature dependence plot with the strongest interaction.
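For intuition, one off-diagonal entry of that F × F matrix can be computed by brute force for a tiny model. The sketch below follows the pairwise Shapley interaction index as it is commonly stated in the SHAP literature, where each of the two symmetric entries receives half of the pairwise effect. The toy_model function, the instance, and the zero baseline are hypothetical, and this enumeration is an illustration rather than the method used to produce the plots in this post.

```python
from itertools import combinations
from math import factorial


def shap_interaction(predict, instance, baseline, i, j):
    """Shapley interaction value between features i and j (i != j).

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[k] if k in coalition else baseline[k] for k in range(n)]
        return predict(x)

    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(n - size - 2) / (2 * factorial(n - 1))
        for subset in combinations(others, size):
            s = set(subset)
            # Effect of adding i and j together, minus their separate effects.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_model(x):
    # Hypothetical model: features 0 and 1 interact, feature 2 is additive.
    return x[0] * x[1] + x[2]


instance = [2.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0]
pair_01 = shap_interaction(toy_model, instance, baseline, 0, 1)
pair_02 = shap_interaction(toy_model, instance, baseline, 0, 2)
print(pair_01, pair_02)
```

The purely additive feature has no interaction with the others, whereas the product term of 6.0 is split into 3.0 for each of the two symmetric matrix entries.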

For example, suppose we want to know how the variables DistanceToGoal and PressureSum interact, and the effect they have on the SHAP value for the DistanceToGoal. PressureSum is calculated by simply summing all the individual pressures of opposing players on the shooter. We can see a negative relationship between the DistanceToGoal and the target variable, with the likelihood of a goal increasing as we get closer to the goal. Unsurprisingly, a strong inverse relationship exists between DistanceToGoal and PressureSum for those match events with a high goal prediction; as the former decreases, the latter rises.

Nearly all goals that are scored close to the goal are hit with an angle greater than 45 degrees. As you move further away from the goal, the angle reduces. This makes sense; how often is it that you see someone score a goal from the sideline when 40 meters out?

Keeping in mind that, based on the preceding results, a high angle to goal increases the likelihood of scoring a goal, we can look at the SHAP value of the number of defenders and determine that this is only the case when only one or two defenders are near the attacker.

Looking back closely at our initial global summary plot, we can see some uncertainty (represented by the dense clustering around the zero SHAP value mark) for the features PressureSum and PressureMax. We can use interaction plots to deep dive into these values and try to unpack and identify what is causing this.

Upon inspection we see that even the two most important features have a very minimal effect on the SHAP value of PressureSum. The key takeaway here is that when little to no pressure is on a player, a low DistanceToGoal increases the likelihood of a goal, while the inverse is true when there is a lot of pressure close to the goal: the player is less likely to score. These effects are again reversed for the AngleToGoal: as the pressure increases, we see an increased AngleToGoal decreasing the SHAP value of PressureSum. It’s reassuring to have our feature interaction plots confirm our preconceived ideas of the game, as well as quantify the various powers at play.

Unsurprisingly, few headers were scored with an angle less than 25. More interestingly, however, when comparing the effects that a header or FootShot has on the likelihood of a goal being scored, we see that for any given angle in the range 25–75, a header reduces it. This can be simplified as follows: if your favorite player has the ball at their feet while at a wide angle to the goal, they’re more likely to score than if the ball is soaring through the air!

Conversely, for angles greater than 25, a player moving at a slow speed towards the goal reduces the likelihood of a goal compared to a player moving at a greater speed. As we can see from both plots, a noticeable divide exists between the impact that AngleToGoal < 25 and AngleToGoal > 25 have on the goal prediction. We can start to see the value in using SHAP values to analyze seasons’ worth of data, because we have quickly identified a universal trend in the data.

Local explanations

Our analysis so far has focused solely on explainability results for the entire dataset—global explanations—so we now explore some particularly interesting matches and their goal events, looking at what is referred to as a local explanation.

When we look back at one of the most interesting games of the 2019–2020 season, where Bayer 04 Leverkusen beat Borussia Dortmund in a 4–3 thriller on February 8, 2020, we can look at the varying effects each feature has on the xGoals values (the model output value we see on the horizontal axis). We see how, starting from the bottom and working our way up, the features start to have an ever-increasing impact on the final prediction, with some extreme cases showcasing how AngleToGoal, DistanceToGoalClosest, and DistanceToGoal really have the final say in our XGBoost model’s probability prediction. The dashed lines are those match events in which a goal occurred.

When we look at the sixth goal of the game, scored by Leon Bailey, which the model predicted with relative ease, we can see that many of the (key) feature values are exceeding their average, and contributing toward increasing the likelihood of a goal, as reflected in the relatively high xGoals value of 0.36 in the following force plot.

The base value that we see, the average xGoals value across every attempted shot in the Bundesliga in the past three seasons, sits at 0.0871. The XGBoost model starts its prediction at this baseline, with positive and negative forces that either increase or decrease the prediction. In the plot, a feature’s SHAP value serves as an arrow that pushes to increase (positive value) or decrease (negative value) the prediction value. In the preceding case, none of the features are capable of counteracting the high AngleToGoal (56.37), low AmountOfDefenders (1.0), and low DistanceToGoal (6.63) for this shot at goal. All qualitative descriptions (such as small, low, and large) are in relation to the average values across the dataset for each respective feature.
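The arithmetic behind a force plot is simple addition: the prediction equals the base value plus the sum of the SHAP values for that instance. The sketch below illustrates this with the base value (0.0871) and the xGoals value (0.36) quoted in this post. The individual contributions are invented so that they add up correctly and are not the real SHAP values for this shot; the explain_prediction helper is likewise hypothetical.

```python
def explain_prediction(base_value, contributions):
    """Rebuild a prediction from its base value and per-feature SHAP values.

    Returns the prediction and the contributions ranked by absolute size,
    which is the order in which a force plot draws attention to them.
    """
    prediction = base_value + sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return prediction, ranked


base_value = 0.0871  # average xGoals value quoted in this post

# Hypothetical SHAP values, chosen only so that the total reaches 0.36.
contributions = {
    "AngleToGoal": 0.12,
    "DistanceToGoal": 0.09,
    "AmountOfDefenders": 0.07,
    "PressureSum": -0.0071,
}

prediction, ranked = explain_prediction(base_value, contributions)
print(round(prediction, 4), ranked[0])
```

Positive entries correspond to arrows that push the prediction above the base value, and negative entries to arrows that push it back down.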

At the other extreme, there are certain goals that our XGBoost model can’t predict and the SHAP values can’t explain. Voted the best goal of the 2019–2020 season by 22% of Bundesliga viewers, Emre Can’s jaw-dropping strike was given a near-zero (3%) chance of going in and, given his great distance from the goal (approximately 30 meters) and such a flat angle (11.55 degrees), we can see why. The only factors working to increase his chances of scoring were that he had very little pressure on him at the time, with only two players in the local vicinity capable of closing him down. But this was clearly not enough to stop Can. As has always been the case in football, every aspect of a shot can be so perfect that no human, let alone an advanced ML model, can predict its outcome.

Let’s take a look at Can’s goal in action, brought to life in 2D animation simply by using the positional tracking data of the players at the time of the goal.

 

Conclusion: Implications for Bundesliga Match Facts

The primary implications for Bundesliga Match Facts powered by AWS going forward are twofold. The experimental results in this post demonstrate that we have:

  • Begun automating the process of exploring and analyzing goal prediction data at scale, in novel ways
  • Offered a model explainability and bias platform that can be improved on for the further capture of interesting and significant shot patterns

In real-world scenarios as complex as a football game, conventional or logic-specific rule-based systems start to break down upon application, failing to offer any sort of match event prediction let alone an in-depth explanation of how it was made. When we apply Clarify, we can both enhance goal prediction models and contextualize football match events on a per-play basis.

As technology for capturing football data has advanced dramatically in recent years, so too have the models that we can use to analyze this growing mountain of data. As the complexity, depth, and richness of the Bundesliga Match Facts dataset continue to grow, the team is continuously exploring new and exciting ideas for additional match facts and how to tweak our best in-production models in light of insightful explainability results. This, in tandem with inevitable and ongoing Clarify updates and improvements, opens up a wealth of exciting avenues going forward for both xGoals and Bundesliga Match Facts.

“Amazon SageMaker Clarify brings the power of state-of-the-art explainable AI algorithms to the fingertips of our developers in a matter of minutes and seamlessly integrates with the rest of the Bundesliga Match Facts digital platform—a key part of our long-term strategy of standardizing our ML workflows on Amazon SageMaker,” reports Gabriel Anzer, Data Scientist at Sportec Solutions (STS), a key partner organization of Bundesliga Match Facts powered by AWS.

Whether this solution allows fantasy football players an edge in their local league, provides managers with an objective assessment of a player’s current (and predicted) future performance, or serves as a conversation starter for notable football pundits in identifying offensive and defensive trends for particular players and teams, you can already appreciate the tangible value created across all areas of the football ecosystem by applying Clarify to Bundesliga Match Facts.


About the Authors

Nick McCarthy is a Data Scientist in the AWS Professional Services team. He has worked with AWS customers across various industries including healthcare, finance, and sports & media to accelerate their business outcomes through the use of AI/ML. Outside of work he loves to spend time travelling, trying new cuisines and reading about science and technology. Nick’s background is in Astrophysics and Machine Learning and, despite occasionally following the Bundesliga, he has been a Manchester United fan from an early age!

 

Luuk Figdor is a data scientist in the AWS Professional Services team. He works with clients across industries to help them tell stories with data using machine learning. In his spare time he likes to learn all about the mind and the intersection between psychology, economics and AI.

 

 

 

Gabriel Anzer is the lead data scientist at Sportec Solutions AG, a subsidiary of the DFL. He works on extracting interesting insights from football data using AI/ML for both fans and clubs. Gabriel’s background is in Mathematics and Machine Learning, but he is additionally pursuing his PhD in Sports Analytics at the University of Tübingen and working on his football coaching license.

AI for AgriTech: Classifying Kiwifruits using Amazon Rekognition Custom Labels

Computer vision is a field of artificial intelligence (AI) that is gaining in popularity and interest largely due to increased access to affordable cloud-based training compute, more performant algorithms, and optimizations for scalable model deployment and inference. However, despite these advances in individual AI and machine learning (ML) domains, simplifying ML pipelines into coherent and observable workflows so they’re more accessible to smaller business units has remained an elusive goal. This is especially true in the agricultural technology space, where computer vision has strong potential for improving production yields through automation, but also in the area of health and safety, where dangerous jobs may be performed by AI rather than human agtech workers. Agricultural applications by AWS customers like sorting produce based on its grade and defects (IntelloLabs, Clarifruit, and Hectre) and proactively targeting pest control measures as early and as efficiently as possible (Bayer Crop Science), are some areas where computer vision shows strong promise.

Although compelling, these applications of machine vision are generally only accessible to larger agricultural enterprises due to the complexity of the train-compile-deploy-infer sequence for specific edge hardware architectures, which introduces a degree of separation between technology and the practitioners that could most benefit from it. In many cases, this disconnect is grounded in a perceived complexity of AI/ML, and the lack of a clear path for its end-to-end application in primary sectors like agriculture, forestry, and horticulture. In most cases, the prospect of hiring a qualified and experienced data scientist to explore opportunities, without the ability for managers and operators to experiment and innovate directly, is both financially and organizationally impractical. At a recent agtech presentation in New Zealand, an executive participant highlighted the lack of an end-to-end AWS computer vision solution as a limiting factor for experimentation, which would be required in order to justify organizational buy-in for more robust technology evaluation.

This post seeks to demystify how AWS AI/ML services work together, and specifically show how you can generate labeled imagery, train machine vision models against that imagery, and deploy custom image recognition models using Amazon Rekognition Custom Labels. You should be able to get up and running with a custom computer vision model within about an hour by following this tutorial, and make more informed judgments for further investment in AI/ML innovation based on data that is relevant to your specific needs.

Training image storage

As shown in the following pipeline, the first step to generating a custom computer vision model is to generate labeled images that we use to train our model. To do so, we first load our unlabeled training images into an Amazon Simple Storage Service (Amazon S3) bucket within our account, with each class being stored in its own folder under our bucket. For this example, our prediction classes are two types of kiwifruit (Golden and Monty), with images of known types. After you collect your images of each training class, simply upload them to the respective folder within your Amazon S3 bucket either through the Amazon S3 API or the AWS Management Console.
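If you prefer the Amazon S3 API to the console for this upload, the folder-per-class layout might be scripted roughly as follows. This is a sketch under assumptions: the s3_key_for and upload_class_folder helpers, the training-images prefix, and the .jpg extension are all hypothetical choices, and the bucket name is whatever you created for this tutorial. The only AWS call used is the boto3 S3 client’s upload_file method.

```python
from pathlib import Path


def s3_key_for(image_path, class_label, prefix="training-images"):
    """Build the object key so that each class lands in its own folder."""
    return f"{prefix}/{class_label}/{Path(image_path).name}"


def upload_class_folder(bucket, local_dir, class_label, prefix="training-images"):
    """Upload every .jpg file in local_dir under the folder for class_label."""
    # Imported here so that the key helper can be used without AWS access.
    import boto3

    s3 = boto3.client("s3")
    uploaded_keys = []
    for image_path in sorted(Path(local_dir).glob("*.jpg")):
        key = s3_key_for(image_path, class_label, prefix)
        s3.upload_file(str(image_path), bucket, key)
        uploaded_keys.append(key)
    return uploaded_keys


# Example key for a hypothetical local file; no upload is performed here.
example_key = s3_key_for("photos/golden/kiwi_001.jpg", "Golden")
print(example_key)
```

Calling upload_class_folder once per class (for example, once for Golden and once for Monty) would produce the folder structure that the automatic labeling option relies on later in this walkthrough.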

Setting up Amazon Rekognition

To start using Amazon Rekognition, complete the following steps:

  1. On the Amazon Rekognition console, choose Use Custom Labels.
  2. Choose Get started to create a new project.

Projects are used to store your models and training configurations.

  3. Enter a name for your project (for example, Kiwifruit-classifier-project).
  4. Choose Create.
  5. On the Datasets page, choose Create new dataset.
  6. Enter a name for the dataset (for example, kiwifruit classifier).
  7. For Image location, select Import images from Amazon S3 bucket.
  8. For S3 folder location, enter the location where your images are stored.
  9. For Automatic labeling, select Automatically attach a label to my images based on the folder they’re stored in.

This means that the labels of the folders are applied to each image as the class of that image.

  10. For Policy, apply the provided JSON policy to your Amazon S3 bucket to ensure that Amazon Rekognition can access that data to train the model.
  11. Choose Submit.

Training the model

Now that we have successfully generated our labeled images using the folder names in which those images are stored, we can train our model.

  1. Choose Train model to create a project in which our models are stored after training.
  2. For Choose project, enter the ARN for the project that you created.
  3. For Choose a training dataset, choose the dataset you created.
  4. For Create test set, select Split training dataset.

This automatically reserves part of your labeled data for use in evaluating the performance of our trained model.

  5. Choose Train to start your training job.

Training may take some time (depending on the number of labeled images you provided), and you can monitor progress on the Projects page.

  6. When training is finished, choose the model under your project to see its performance for each class.
  7. Under Use your model, choose API Code.

This allows you to get code samples to start and stop your model and conduct inference using the AWS Command Line Interface (AWS CLI).

It can take a few minutes to deploy the inference endpoint after starting the model.

Using your newly trained model

Now that you have a trained model that you’re happy with, using it is as simple as referencing an image from an Amazon S3 bucket using the sample API code provided in order to generate an inference. The following code is an example of Python code using the boto3 library to analyze an image:

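import boto3

# The function name is illustrative; the arguments are the values used in the call.
def detect_kiwifruit_labels(modelProject, bucket, filepath, access_key_id, access_key):
    client = boto3.client('rekognition',
        region_name='us-east-1',
        aws_access_key_id=access_key_id,
        aws_secret_access_key=access_key
        )

    # Run inference on an image stored under images/ in the S3 bucket
    api_output = client.detect_custom_labels(
        ProjectVersionArn=modelProject,
        Image={
            'S3Object': {
                'Bucket': bucket,
                'Name': 'images/' + filepath
            }
        }
    )
    return api_output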

Simply parse the JSON response in order to access the Name and Confidence fields of the payload for the image inference.
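As a minimal sketch of that parsing step, the helper below picks the highest-confidence label from a response. The top_label function and the sample_output dictionary are hypothetical; the sample only mimics the CustomLabels, Name, and Confidence fields of a detect_custom_labels response and is not real model output.

```python
def top_label(api_output, min_confidence=0.0):
    """Return (Name, Confidence) of the best label, or None if there is none.

    Confidence is reported as a percentage between 0 and 100.
    """
    labels = [
        label
        for label in api_output.get("CustomLabels", [])
        if label["Confidence"] >= min_confidence
    ]
    if not labels:
        return None
    best = max(labels, key=lambda label: label["Confidence"])
    return best["Name"], best["Confidence"]


# Hypothetical response payload for one kiwifruit image.
sample_output = {
    "CustomLabels": [
        {"Name": "Golden", "Confidence": 97.2},
        {"Name": "Monty", "Confidence": 2.8},
    ]
}

print(top_label(sample_output))
```

Passing a min_confidence threshold lets the caller treat low-confidence predictions as "no decision" instead of forcing a class.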

Summary

In this post, we learned how to use Amazon Rekognition Custom Labels with its Amazon S3 folder labeling functionality to train an image classification model, deploy that model, and use it to conduct inference. Next steps might be to repeat the process for a multi-class classifier, or use Amazon SageMaker Ground Truth to generate data with bounding box annotations in addition to class labels. For more information and ideas for other ways to use computer vision in agriculture, check out the AWS Machine Learning Blog and the AWS for Industries: Agriculture Blog.


About the Author

Steffen Merten is a startup-aligned Principal Solutions Architect based in New Zealand. Prior to AWS, Steffen was Chief Data Officer for Marsello, following five years as an embedded analyst at Palantir. Steffen’s roots are in complex systems analysis with over ten years spent studying both ecological and social systems in the U.S. national security industry throughout the Middle East, South, and Central Asia.

Perform interactive data processing using Spark in Amazon SageMaker Studio Notebooks

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). With a single click, data scientists and developers can quickly spin up Studio notebooks to explore datasets and build models. You can now use Studio notebooks to securely connect to Amazon EMR clusters and prepare vast amounts of data for analysis and reporting, model training, or inference.

You can apply this new capability in several ways. For example, data analysts may want to answer a business question by exploring and querying their data in Amazon EMR, viewing the results, and then either alter the initial query or drill deeper into the results. You can complete this interactive query process directly in a Studio notebook and run the Spark code remotely. The results are then presented in the notebook interface.

Data engineers and data scientists can also use Apache Spark for preprocessing data and use Amazon SageMaker for model training and hosting. SageMaker provides an Apache Spark library that you can use to easily train models in SageMaker using org.apache.spark.sql.DataFrame DataFrames in your EMR Spark clusters. After model training, you can also host the model using SageMaker hosting services.

This post walks you through securely connecting Studio to an EMR cluster configured with Kerberos authentication. After we authenticate and connect to the EMR cluster, we query a Hive table and use the data to train and build an ML model.

Solution walkthrough

We use an AWS CloudFormation template to set up a VPC with a private subnet to securely host the EMR cluster. Then we create a Kerberized EMR cluster and configure it to allow secure connectivity from Studio. We then create a Studio domain and a new Studio user. Finally, we use the new PySpark (SparkMagic) kernel to authenticate and connect a Studio notebook to the EMR cluster.

The PySpark (SparkMagic) kernel allows you to define specific Spark configurations and environment variables, and connect to an EMR cluster to query, analyze, and process large amounts of data. Studio comes with a SageMaker SparkMagic image that contains a PySpark kernel. The SparkMagic image also contains an AWS Command Line Interface (AWS CLI) utility, sm-sparkmagic, that you can use to create the configuration files required for the PySpark kernel to connect to the EMR cluster. For added security, you can specify that the connection to the EMR cluster uses Kerberos authentication.

Studio runs on an environment managed by AWS. In this solution, the network access for the new Studio domain is configured as VPC Only. For more details on different connectivity methods, see Securing Amazon SageMaker Studio connectivity using a private VPC. The Elastic Network Interface (ENI) created in the private subnet connects to required AWS services through VPC endpoints.

The following diagram represents the different components used in this solution.

The CloudFormation template creates a Kerberized EMR cluster and configures it with a bootstrap action to create a Linux user and install Python libraries (Pandas, requests, and Matplotlib).

You can set up Kerberos authentication in a few different ways (for more information, see Kerberos Architecture Options):

  • Cluster-dedicated Key Distribution Center (KDC)
  • Cluster-dedicated KDC with Active Directory cross-realm trust
  • External KDC
  • External KDC integrated with Active Directory

The KDC can have its own user database or it can use cross-realm trust with an Active Directory that holds the identity store. For this post, we use a cluster-dedicated KDC that holds its own user database.

First, the EMR cluster has security configuration enabled to support Kerberos and is launched with a bootstrap action to create Linux users on all nodes and install the necessary libraries. The CloudFormation template launches the bash step after the cluster is ready. This step creates HDFS directories for the Linux users with default credentials. The user must change the password the first time they log in to the EMR cluster. The template also creates and populates a Hive table with a movie reviews dataset. We use this dataset in the Explore and query data section of this post.

The CloudFormation template also creates a Studio domain and a user named defaultuser. You can access the SparkMagic image from the Studio environment.

Deploy the resources with CloudFormation

You can use the provided CloudFormation template to set up the solution’s building blocks, including the VPC, subnet, EMR cluster, Studio domain, and other required resources.

This template deploys a new Studio domain. Ensure the Region used to deploy the CloudFormation stack has no existing Studio domain.

Complete the following steps to deploy the environment:

  1. Sign in to the AWS Management Console as an AWS Identity and Access Management (IAM) user, preferably an admin user.
  2. Choose Launch Stack to launch the CloudFormation template.
  3. Choose Next.
  4. For Stack name, enter a name for the stack (for example, blog).
  5. Leave the other values as default.
  6. Continue to choose Next and leave other parameters at their default.
  7. On the review page, select the check box to confirm that AWS CloudFormation might create resources.
  8. Choose Create stack.

Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE. The process usually takes 10–15 minutes.
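If you would rather script this wait than watch the console, a simple polling loop along the following lines could work. It is a sketch under assumptions: the wait_for_stack helper is hypothetical, the client argument is expected to behave like a boto3 CloudFormation client (only describe_stacks is called), and the default polling interval and timeout are arbitrary.

```python
import time


def wait_for_stack(cfn_client, stack_name, poll_seconds=30, timeout_seconds=1800):
    """Poll the stack until it reaches CREATE_COMPLETE.

    Raises RuntimeError if the stack leaves CREATE_IN_PROGRESS for any
    other status, and TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        stack = cfn_client.describe_stacks(StackName=stack_name)["Stacks"][0]
        status = stack["StackStatus"]
        if status == "CREATE_COMPLETE":
            return status
        if status != "CREATE_IN_PROGRESS":
            raise RuntimeError(f"Stack {stack_name} ended in status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Stack {stack_name} is still {status}")
        time.sleep(poll_seconds)


# Usage, assuming AWS credentials are configured and the stack is named blog:
#   import boto3
#   wait_for_stack(boto3.client("cloudformation"), "blog")
```

boto3 also ships a built-in stack_create_complete waiter that covers the same need; the explicit loop is shown here only to make the status transitions visible.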

Connect a Studio Notebook to an EMR cluster

After we deploy our stack, we create a connection between our Studio notebook and the EMR cluster. Establishing this connection allows us to connect code to our data hosted on Amazon EMR.

Complete the following steps to set up and connect your notebook to the EMR cluster:

  1. On the SageMaker console, choose Amazon SageMaker Studio.

The first time launching a Studio session may take a few minutes to start.

  1. Choose the Open Studio link for defaultuser.

The Studio IDE opens. Next, we download the code for this walkthrough from Amazon Simple Storage Service (Amazon S3).

  1. Choose File, then choose New and Terminal.
  2. In the terminal, run the following commands:
    aws s3 cp s3://aws-ml-blog/artifacts/ml-1954/smstudio-pyspark-hive-sentiment-analysis.ipynb .
    aws s3 cp s3://aws-ml-blog/artifacts/ml-1954/preprocessing.py .
    

  3. Open the smstudio-pyspark-hive-sentiment-analysis.ipynb notebook.
  4. For Select Kernel, choose PySpark (SparkMagic).

  1. Run each cell in the notebook and explore the capabilities of Sparkmagic using the PySpark kernel.

Before you can run the code in the notebook, you need to provide the cluster ID of the EMR cluster that was created as part of the solution deployment. You can find this information on the EMR console, on the Clusters page.

  1. Substitute the placeholder value with the ID of the EMR cluster.

  1. Connect to the EMR cluster from the notebook using the open-source Studio Sparkmagic library.

The SparkMagic library is available as open source on GitHub.

  1. In the notebook toolbar, choose the Launch terminal icon to open a terminal in the same SparkMagic image as the notebook.
  2. Run kinit user1 to get the Kerberos ticket.
  3. Enter your password when prompted.

This ticket is valid for 24 hours by default. If you’re connecting to the EMR cluster for the first time, you must change the password.

  1. Choose the notebook tab and restart the kernel using the Restart kernel icon from the toolbar.

This is required so that SparkMagic can pick up the generated configuration.

  1. To verify that the connection was set up correctly, run the %%info command.

This command displays the current session information.

Now that we have set up the connectivity, let’s explore and query the data.

Explore and query data

After you configure the notebook, run the code of the cells shown in the following screenshots. This connects to the EMR cluster in order to query data.

Sparkmagic allows you to run Spark code against the remote EMR cluster through Livy. Livy is an open-source REST server for Spark. For more information, see EMR Livy documentation.

Sparkmagic also creates an automatic SparkContext and HiveContext. You can use the HiveContext to query data in the Hive table and make it available in a Spark DataFrame.

You can use the DataFrame to look at the shape of the dataset and size of each class (positive and negative) and visualize it using Matplotlib. The following screenshots show that we have a balanced dataset.

You can use the pyspark.sql.functions module as shown in the following screenshot to inspect the length of the reviews.

You can run Spark SQL queries from the notebook using %%sql and save the results to a local DataFrame, which allows for quick data exploration. By default, a maximum of 2,500 rows is returned. You can change this limit with the -n argument.

As we continue through the notebook, we query the movie reviews table in Hive and store the results in a DataFrame. The Sparkmagic environment allows you to send local data to the remote cluster using %%send_to_spark. We send the S3 location (bucket and key) variables to the remote cluster, then convert the Spark DataFrame to a pandas DataFrame. Then we upload it to Amazon S3 and use this data as an input to the preprocessing step that creates training and validation data. This data trains a sentiment analysis model using the SageMaker BlazingText algorithm.

Preprocess data and feature engineering

We perform data preprocessing and feature engineering on the data using SageMaker Processing. With SageMaker Processing, you can leverage a simplified, managed experience to run data preprocessing, data postprocessing, and model evaluation workloads on the SageMaker platform. A processing job downloads input from Amazon S3, then uploads output to Amazon S3 during or after the processing job. The preprocessing.py script does the required text preprocessing with the movie reviews dataset and splits the dataset into train data and validation data for the model training.
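The contents of preprocessing.py are not shown in this post, so the following is only a minimal sketch of the kind of work such a script might do. It assumes each review arrives as a (label, text) pair, formats it as a BlazingText supervised-mode line (a `__label__` prefix followed by space-separated tokens), and makes a reproducible train/validation split. The function names, the sample reviews, and the 80/20 split are assumptions for illustration.

```python
import random


def to_blazingtext_line(label: str, text: str) -> str:
    """Format one review as a BlazingText supervised-mode line."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    return f"__label__{label} " + " ".join(tokens)


def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical sample data standing in for the movie reviews dataset.
reviews = [
    ("positive", "A wonderful, moving film"),
    ("negative", "Dull plot and flat acting"),
    ("positive", "Great cast and a sharp script"),
    ("negative", "I walked out halfway through"),
    ("positive", "Funny and heartfelt"),
]

lines = [to_blazingtext_line(label, text) for label, text in reviews]
train, validation = train_validation_split(lines)
```

In a processing job, the two lists would typically be written to the output directories that SageMaker Processing uploads to Amazon S3.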

The notebook uses the Scikit-learn processor within a Docker image to perform the processing job.

For this post, we use the SageMaker instance type ml.m5.xlarge for processing, training, and model hosting. If you don’t have access to this instance type and get a ResourceLimitExceeded error, use another instance type that you have access to. You can also request a service limit increase using AWS Support Center.

Train a SageMaker model

Amazon SageMaker Experiments allows us to organize, track, and review ML experiments with Studio notebooks. We can log metrics and information as we progress through the training process and evaluate results as we run the models.

We create a SageMaker experiment and trial, a SageMaker estimator, and set the hyperparameters. We then start a training job by calling the fit method on the estimator. We use Spot Instances to reduce the training cost.

Deploy the model and get predictions

When the training is complete, we host the model for real-time inference. The deploy method of the SageMaker estimator allows you to easily deploy the model and create an endpoint.

After the model is deployed, we test the deployed endpoint with test data and get predictions.

Clean up resources

Clean up the resources when you’re done, such as the SageMaker endpoint and the S3 bucket created in the notebook.

The %%cleanup -f command deletes all Livy sessions created by the notebook.

Conclusion

We have walked you through connecting a notebook backed by the Sparkmagic image to a kerberized EMR cluster. We then explored and queried the sample dataset from a Hive table. We used that dataset to train a sentiment analysis model with SageMaker. Finally, we deployed the model for inference.

For more information and other SageMaker resources, see the SageMaker Spark GitHub repo and Securing data analytics with an Amazon SageMaker notebook instance and Kerberized Amazon EMR cluster.


About the Authors

Graham Zulauf is a Senior Solutions Architect. Graham is focused on helping AWS’ strategic customers solve important problems at scale.

 

 

 

Huong Nguyen is a Sr. Product Manager at AWS. She is leading the user experience for SageMaker Studio. She has 13 years’ experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. In her spare time, she enjoys reading, being in nature, and spending time with her family.

 

 

James Sun is a Senior Solutions Architect with Amazon Web Services. James has over 15 years of experience in information technology. Prior to AWS, he held several senior technical positions at MapR, HP, NetApp, Yahoo, and EMC. He holds a PhD from Stanford University.

 

Naresh Kumar Kolloju is part of the Amazon SageMaker launch team. He is focused on building secure machine learning platforms for customers. In his spare time, he enjoys hiking and spending time with family.

 

 

Timothy Kwong is a Solutions Architect based out of California. During his free time, he enjoys playing music and doing digital art.

 

 

 

Praveen Veerath is a Senior AI Solutions Architect for AWS.

 

 

 


From forecasting demand to ordering – An automated machine learning approach with Amazon Forecast to decrease stockouts, excess inventory, and costs

This post is a guest joint collaboration by Supratim Banerjee of More Retail Limited and Shivaprasad KT and Gaurav H Kankaria of Ganit Inc.

More Retail Ltd. (MRL) is one of India’s top four grocery retailers, with revenue on the order of several billion dollars. It has a store network of 22 hypermarkets and 624 supermarkets across India, supported by a supply chain of 13 distribution centers, 7 fruits and vegetables collection centers, and 6 staples processing centers.

With such a large network, it’s critical for MRL to deliver the right product quality at the right economic value, while meeting customer demand and keeping operational costs to a minimum. MRL collaborated with Ganit as its AI analytics partner to forecast demand with greater accuracy and build an automated ordering system to overcome the bottlenecks and deficiencies of manual judgment by store managers. MRL used Amazon Forecast to increase their forecasting accuracy from 24% to 76%, leading to a reduction in wastage by up to 30% in the fresh produce category, improving in-stock rates from 80% to 90%, and increasing gross profit by 25%.

We were successful in achieving these business results and building an automated ordering system because of two primary reasons:

  • Ability to experiment – Forecast provides a flexible and modular platform through which we ran more than 200 experiments using different regressors and types of models, which included both traditional and ML models. The team followed a Kaizen approach, learning from previously unsuccessful models, and deploying models only when they were successful. Experimentation continued on the side while winning models were deployed.
  • Change management – We asked category owners who were used to placing orders using business judgment to trust the ML-based ordering system. A systemic adoption plan ensured that the tool’s results were stored, and the tool was operated with a disciplined cadence, so that in filled and current stock were identified and recorded on time.

Complexity in forecasting the fresh produce category

Forecasting demand for the fresh produce category is challenging because fresh products have a short shelf life. With over-forecasting, stores end up selling stale or over-ripe products, or throwing away most of their inventory (termed shrinkage). With under-forecasting, products may be out of stock, which affects customer experience. Customers may abandon their cart if they can’t find key items in their shopping list, because they don’t want to wait in checkout lines for just a handful of products. To add to this complexity, MRL has many SKUs across its more than 600 supermarkets, leading to more than 6,000 store-SKU combinations.

By the end of 2019, MRL was using traditional statistical methods to create forecasting models for each store-SKU combination, which resulted in an accuracy as low as 40%. The forecasts were maintained through multiple individual models, which made the process computationally and operationally expensive.

Demand forecasting to order placement

In early 2020, MRL and Ganit started working together to further improve the accuracy for forecasting the fresh category, known as Fruits and Vegetables (F&V), and reduce shrinkage.

Ganit advised MRL to break their problem into two parts:

  • Forecast demand for each store-SKU combination
  • Calculate order quantity (indents)

We go into more detail of each aspect in the following sections.

Forecast demand

In this section, we discuss the steps of forecasting demand for each store-SKU combination.

Understand drivers of demand

Ganit’s team started by understanding the factors that drove demand within stores. This included multiple on-site store visits, discussions with category managers, and cadence meetings with the supermarket’s CEO, coupled with Ganit’s own in-house forecasting expertise on several other aspects like seasonality, stock-out, socio-economic, and macro-economic factors.

After the store visits, approximately 80 hypotheses on multiple factors were formulated to study their impact on F&V demand. The team performed comprehensive hypothesis testing using techniques like correlation, bivariate and univariate analysis, and statistical significance tests (Student’s t-test, Z-tests) to establish the relationship between demand and relevant factors such as festival dates, weather, promotions, and many more.

Data segmentation

The team emphasized developing a granular model that could accurately forecast a store-SKU combination for each day. A combination of the sales contribution and ease of prediction was built as an ABC-XYZ framework, with ABC indicating the sales contribution (A being the highest) and XYZ indicating the ease of prediction (Z being the lowest). For model building, the first line of focus was on store-SKU combinations that had a high contribution to sales and were the most difficult to predict. This was done to ensure that improving forecasting accuracy has the maximum business impact.
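To make the ABC-XYZ idea concrete, the following is a minimal sketch, not the team’s actual implementation. It assumes ABC is assigned from the cumulative share of total sales (cut-offs of 80% and 95%) and XYZ from the coefficient of variation of daily demand (cut-offs of 0.5 and 1.0). These thresholds, the function names, and the sample data are all assumptions.

```python
from statistics import mean, pstdev


def abc_labels(total_sales, a_cut=0.80, b_cut=0.95):
    """Label each key A/B/C by the cumulative sales share ranked above it."""
    grand_total = sum(total_sales.values())
    labels, running = {}, 0.0
    ranked = sorted(total_sales.items(), key=lambda kv: kv[1], reverse=True)
    for key, sales in ranked:
        share_before = running / grand_total
        if share_before < a_cut:
            labels[key] = "A"
        elif share_before < b_cut:
            labels[key] = "B"
        else:
            labels[key] = "C"
        running += sales
    return labels


def xyz_label(daily_demand, x_cut=0.5, y_cut=1.0):
    """Label a series X/Y/Z by coefficient of variation (Z = hardest to predict)."""
    avg = mean(daily_demand)
    if avg == 0:
        return "Z"
    cv = pstdev(daily_demand) / avg
    return "X" if cv <= x_cut else "Y" if cv <= y_cut else "Z"


# Hypothetical daily demand per (store, SKU) combination.
daily_demand = {
    ("S1", "banana"): [100, 105, 95, 100],
    ("S1", "mango"): [0, 120, 5, 75],
    ("S2", "onion"): [2, 3, 2, 3],
}

abc = abc_labels({key: sum(series) for key, series in daily_demand.items()})
segments = {key: abc[key] + xyz_label(series) for key, series in daily_demand.items()}

# Highest sales contribution first, and within that, hardest to predict first.
priority = sorted(segments, key=lambda k: (segments[k][0], -"XYZ".index(segments[k][1])))
```

With this ordering, combinations in the AZ segment surface first, which matches the focus described above.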

Data treatment

MRL’s transaction data was structured like conventional point of sale data, with fields like mobile number, bill number, item code, store code, date, bill quantity, realized value, and discount value. The team used daily transactional data for the last 2 years for model building. Analyzing historical data helped identify two challenges:

  • The presence of numerous missing values
  • Some days had extremely high or low sales at bill levels, which indicated the presence of outliers in the data

Missing value treatment

A deep dive into the missing values identified reasons such as no stock available in the store (no supply or not in season) and stores being closed due to planned holiday or external constraints (such as a regional or national shutdown, or construction work). The missing values were replaced with 0, and appropriate regressors or flags were added to the model so the model could learn from this for any such future events.
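The following minimal sketch illustrates this treatment for a single store-SKU series. It assumes demand is held in a dictionary keyed by date, and that store closures and out-of-stock days are known as sets of dates. The field names and sample values are hypothetical.

```python
from datetime import date, timedelta


def fill_missing_days(sales, start, end, closed_days=frozenset(), out_of_stock_days=frozenset()):
    """Return one row per calendar day; missing demand becomes 0 with reason flags."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date": day,
            "demand": sales.get(day, 0),          # missing value replaced with 0
            "store_closed": int(day in closed_days),
            "out_of_stock": int(day in out_of_stock_days),
        })
        day += timedelta(days=1)
    return rows


# Hypothetical series with no record for January 3, when the store was closed.
sales = {date(2020, 1, 1): 12, date(2020, 1, 2): 9, date(2020, 1, 4): 11}
closed = {date(2020, 1, 3)}
rows = fill_missing_days(sales, date(2020, 1, 1), date(2020, 1, 4), closed_days=closed)
```

The flag columns can then be supplied to the model as regressors so that a zero caused by a closure is not mistaken for a genuine drop in demand.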

Outlier treatment

The team treated the outliers at the most granular bill level, which ensured that factors like liquidation, bulk buying (B2B), and bad quality were considered. For example, bill-level treatment may include observing a KPI for each store-SKU combination at a day level, as in the following graph.

We can then flag dates on which abnormally high quantities are sold as outliers, and dive deeper into those identified outliers. Further analysis shows that these outliers are pre-planned institutional purchases.

These bill-level outliers are then capped with the maximum sales quantity for that date. The following graphs show the difference in bill-level demand.
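As an illustration of bill-level capping, the following sketch flags bills whose quantity lies above an interquartile-range fence and caps them at the largest non-outlier bill quantity for that store-SKU-day. The post does not state the detection rule the team used, so the IQR fence, the multiplier of 1.5, and the sample quantities are assumptions.

```python
from statistics import quantiles


def cap_bill_outliers(bill_quantities, k=1.5):
    """Cap bills above the IQR upper fence at the largest non-outlier quantity."""
    if len(bill_quantities) < 4:
        return list(bill_quantities)  # too few bills to estimate quartiles
    q1, _, q3 = quantiles(bill_quantities, n=4)
    upper_fence = q3 + k * (q3 - q1)
    cap = max(q for q in bill_quantities if q <= upper_fence)
    return [min(q, cap) for q in bill_quantities]


# Hypothetical bill quantities for one store-SKU on one day;
# the bill of 40 units stands in for a bulk institutional purchase.
bills = [1, 2, 2, 3, 2, 1, 40]
treated = cap_bill_outliers(bills)

raw_daily_demand = sum(bills)
treated_daily_demand = sum(treated)
```

Summing the treated bills gives the daily demand that is passed to the forecasting model.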

Forecasting process

The team tested multiple forecasting techniques, such as time series models, regression-based models, and deep learning models, before choosing Forecast. The primary reason for choosing Forecast was the difference in performance when comparing forecast accuracies in the XY bucket against the Z bucket, which was the most difficult to predict. Although most conventional techniques provided higher accuracies in the XY bucket, only the ML algorithms in Forecast provided a 10% incremental accuracy compared to other models. This was primarily due to Forecast’s ability to learn patterns from other SKUs (XY) and apply those learnings to the highly volatile items in the Z bucket. Through AutoML, the Forecast DeepAR+ algorithm emerged as the winner and was chosen as the forecast model.

Iterating to further improve forecasting accuracy

After the team identified DeepAR+ as the winning algorithm, they ran several experiments with additional features to further improve accuracy. They performed multiple iterations on a smaller sample set with different combinations like pure target time series data (with and without outlier treatment), regressors like festivals or store closures, and store-item metadata (store-item hierarchy) to understand the best combination for improving forecast accuracy. The combination of outlier-treated target time series along with store-item metadata and regressors returned the highest accuracy. This combination was then scaled back up to the original set of 6,230 store-SKU combinations to get the final forecast.

Order quantity calculation

After the team developed the forecasting model, the immediate next step was to use this to decide how much inventory to buy and place orders. Order generation is influenced by forecasted demand, current stock on hand, and other relevant in-store factors.

The following formula served as the basis for designing the order construct.

The team also considered other indent adjustment parameters for the automatic ordering system, such as minimum order quantity, service unit factor, minimum closing stock, minimum display stock (based on planogram), and fill rate adjustment, thereby bridging the gap between machine and human intelligence.
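The following sketch shows one way such adjustment parameters might combine into an order quantity. It is not MRL’s formula. The rules used here (net requirement from forecast, stock on hand, and in-transit stock; scaling by an expected fill rate; a minimum order quantity; rounding up to a service unit) are assumptions chosen only to illustrate the idea, and all parameter names are hypothetical.

```python
import math


def order_quantity(forecast, stock_on_hand, in_transit=0.0, min_closing_stock=0.0,
                   min_order_qty=0.0, service_unit=1, fill_rate=1.0):
    """Illustrative indent calculation under assumed adjustment rules."""
    # Net requirement: expected demand plus safety stock, minus what is available.
    need = forecast + min_closing_stock - stock_on_hand - in_transit
    if need <= 0:
        return 0
    need = need / fill_rate            # order more if vendors usually under-deliver
    need = max(need, min_order_qty)    # respect the vendor's minimum order quantity
    # Round up to a whole number of service units (for example, crates).
    return math.ceil(need / service_unit) * service_unit


# Hypothetical inputs for one store-SKU combination.
qty = order_quantity(forecast=42, stock_on_hand=10, in_transit=5, min_closing_stock=3,
                     min_order_qty=5, service_unit=5, fill_rate=0.9)
no_order = order_quantity(forecast=8, stock_on_hand=20)
```

Keeping each adjustment as a separate step makes it easier for category owners to see why the system proposed a given quantity.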

Balance under-forecast and over-forecast scenarios

To balance the cost of shrinkage against the cost of stockouts and lost sales, the team used the quantiles feature of Forecast to shift the forecast output of the model up or down.

In the model design, three forecasts were generated at p40, p50, and p60 quantiles, with p50 being the base quantile. The selection of quantiles was programmed to be based on stockouts and wastage in stores in the recent past. For example, higher quantiles were automatically chosen if a particular store-SKU combination faced continuous stockouts in the last 3 days, and lower quantiles were automatically chosen if the store-SKU had witnessed high wastage. The quantum of increasing and decreasing quantiles was based on the magnitude of stockout or shrinkage within the store.
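A minimal sketch of this selection logic follows. The post specifies the three quantiles and the 3-day stockout window. The wastage threshold of 10%, the precedence of stockouts over wastage, and the function and variable names are assumptions.

```python
def choose_quantile(stockout_days_last_3, wastage_ratio, high_wastage=0.10):
    """Pick p40, p50, or p60 from recent stockouts and wastage (assumed rule)."""
    if stockout_days_last_3 >= 3:      # continuous stockouts over the last 3 days
        return "p60"
    if wastage_ratio >= high_wastage:  # high recent wastage
        return "p40"
    return "p50"                       # base quantile


# Hypothetical quantile forecasts for one store-SKU combination.
forecasts = {"p40": 36.0, "p50": 40.0, "p60": 45.0}

after_stockouts = forecasts[choose_quantile(stockout_days_last_3=3, wastage_ratio=0.0)]
after_wastage = forecasts[choose_quantile(stockout_days_last_3=0, wastage_ratio=0.25)]
baseline = forecasts[choose_quantile(stockout_days_last_3=1, wastage_ratio=0.02)]
```

A finer-grained version could step through more quantiles in proportion to the magnitude of the stockout or shrinkage, as the text describes.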

Automated order placement through Oracle ERP

MRL deployed Forecast and the indent ordering systems in production by integrating them with Oracle’s ERP system, which MRL uses for order placements. The following diagram illustrates the final architecture.

To deploy the ordering system into production, all MRL data was migrated into AWS. The team set up ETL jobs to move live tables to Amazon Redshift (data warehouse for business intelligence work), so Amazon Redshift became the single source of input for all future data processing.

The entire data architecture was divided into two parts:

  • Forecasting engine:
    • Used historical demand data (1-day demand lag) present in Amazon Redshift
    • Other regressor inputs like last bill time, price, and festivals were maintained in Amazon Redshift
    • An Amazon Elastic Compute Cloud (Amazon EC2) instance was set up with customized Python scripts to wrangle transaction, regressors, and other metadata
    • Post-data wrangling, the data was moved to an Amazon Simple Storage Service (Amazon S3) bucket to generate forecasts (T+2 forecasts for all store-SKU combinations)
    • The final forecast output was stored in a separate folder in an S3 bucket
  • Order (indent) engine:
    • All data required to convert forecasts into orders (such as stock on hand, received to store quantity, last 2 days of orders placed to receive, service unit factor, and planogram-based minimum opening and closing stock) was stored and maintained in Amazon Redshift
    • Order quantity was calculated through Python scripts run on EC2 instances
    • Orders were then moved to Oracle’s ERP system, which placed an order to vendors

The entire ordering system was decoupled into multiple key segments. The team set up Apache Airflow’s scheduler email notifications for each process to notify respective stakeholders upon successful completion or failure, so that they could take immediate action. The orders placed through the ERP system were then moved to Amazon Redshift tables for calculating the next day’s orders. The ease of integration between AWS and ERP systems led to a complete end-to-end automated ordering system with zero human intervention.

Conclusion

An ML-based approach unlocked the true power of data for MRL. With Forecast, we created two national models for different store formats, as opposed to over 1,000 traditional models that we had been using.

Forecast also learns across time series. ML algorithms within Forecast enable cross-learning between store-SKU combinations, which helps improve forecast accuracies.

Additionally, Forecast allows you to add related time series and item metadata, such as customers who send demand signals based on the mix of items in their basket. Forecast considers all the incoming demand information and arrives at a single model. Unlike conventional models, where the addition of variables leads to overfitting, Forecast enriches the model, providing accurate forecasts based on business context. MRL gained the ability to categorize products based on factors like shelf life, promotions, price, type of stores, affluent cluster, competitive store, and store throughput.

We recommend that you try Amazon Forecast to improve your supply chain operations. You can learn more about Amazon Forecast here. To learn more about Ganit and our solutions, reach out at info@ganitinc.com.

 

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

 Supratim Banerjee is the Chief Transformational Officer at More Retail Limited. He is an experienced professional with a demonstrated history of working in the venture capital and private equity industries. He was a consultant with KPMG and worked with organizations like A.T. Kearney and India Equity Partners. He holds an MBA focused on Finance, General from Indian School of Business, Hyderabad.

 

Shivaprasad KT is the Co-Founder & CEO at Ganit Inc. He has 17+ years of experience in delivering top-line and bottom-line impact using data science in the US, Australia, Asia, and India. He has advised CXOs at companies like Walmart, Sam’s Club, Pfizer, Staples, Coles, Lenovo, and Citibank. He holds an MBA from SP Jain, Mumbai, and a bachelor’s degree in Engineering from NITK Surathkal.

 

Gaurav H Kankaria is the Senior Data Scientist at Ganit Inc. He has over 6 years of experience in designing and implementing solutions to help organizations in retail, CPG, and BFSI domains make data-driven decisions. He holds a bachelor’s degree from VIT University, Vellore.
