New built-in Amazon SageMaker algorithms for tabular data modeling: LightGBM, CatBoost, AutoGluon-Tabular, and TabTransformer

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

Starting today, SageMaker provides four new built-in tabular data modeling algorithms: LightGBM, CatBoost, AutoGluon-Tabular, and TabTransformer. You can use these popular, state-of-the-art algorithms for both tabular classification and regression tasks. They’re available through the built-in algorithms on the SageMaker console as well as through the Amazon SageMaker JumpStart UI inside Amazon SageMaker Studio.

The following is the list of the four new built-in algorithms, with links to their documentation, example notebooks, and source.

Documentation               | Example Notebooks          | Source
LightGBM Algorithm          | Regression, Classification | LightGBM
CatBoost Algorithm          | Regression, Classification | CatBoost
AutoGluon-Tabular Algorithm | Regression, Classification | AutoGluon-Tabular
TabTransformer Algorithm    | Regression, Classification | TabTransformer

In the following sections, we provide a brief technical description of each algorithm, and examples of how to train a model via the SageMaker SDK or SageMaker JumpStart.

LightGBM

LightGBM is a popular and efficient open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. LightGBM uses additional techniques to significantly improve the efficiency and scalability of conventional GBDT.
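To make the idea of combining many weak models concrete, the following is a minimal sketch of gradient boosting for squared-error regression, using depth-1 trees (stumps) on a single feature. It is a toy illustration written for this post, not LightGBM's implementation; the function names and the sample data are hypothetical.

```python
# Toy gradient boosting with decision stumps (illustrative only).

def fit_stump(xs, residuals):
    """Pick the threshold on one feature that minimizes squared error."""
    best = None
    # Exclude the largest value so the right-hand side is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbdt(xs, ys, num_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(num_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, stumps, learning_rate


def predict_gbdt(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Made-up sample data
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]

model = fit_gbdt(xs, ys)
mse = sum((predict_gbdt(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline_mse = sum((sum(ys) / len(ys) - y) ** 2 for y in ys) / len(ys)
print(f"boosted MSE: {mse:.4f}, mean-only MSE: {baseline_mse:.4f}")
```

Each stump on its own is a poor predictor, but adding them up with a small learning rate steadily reduces the training error relative to predicting the mean.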

CatBoost

CatBoost is a popular and high-performance open-source implementation of the GBDT algorithm. Two critical algorithmic advances are introduced in CatBoost: the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms.

AutoGluon-Tabular

AutoGluon-Tabular is an open-source AutoML project developed and maintained by Amazon that performs advanced data processing, deep learning, and multi-layer stack ensembling. It automatically recognizes the data type in each column for robust data preprocessing, including special handling of text fields. AutoGluon fits various models ranging from off-the-shelf boosted trees to customized neural network models. These models are ensembled in a novel way: models are stacked in multiple layers and trained in a layer-wise manner that guarantees raw data can be translated into high-quality predictions within a given time constraint. Over-fitting is mitigated throughout this process by splitting the data in various ways with careful tracking of out-of-fold examples. AutoGluon is optimized for performance, and its out-of-the-box usage has achieved several top-3 and top-10 positions in data science competitions.

TabTransformer

TabTransformer is a novel deep tabular data modelling architecture for supervised learning. The TabTransformer is built upon self-attention based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability. This model is the product of recent Amazon Science research (paper and official blog post here) and has been widely adopted by the ML community, with various third-party implementations (Keras, AutoGluon) and social media features such as tweets, Towards Data Science, Medium, and Kaggle.

Benefits of SageMaker built-in algorithms

When selecting an algorithm for your particular type of problem and data, using a SageMaker built-in algorithm is the easiest option, because doing so comes with the following major benefits:

  • The built-in algorithms require no coding to start running experiments. The only inputs you need to provide are the data, hyperparameters, and compute resources. This allows you to run experiments more quickly, with less overhead for tracking results and code changes.
  • The built-in algorithms come with parallelization across multiple compute instances and GPU support right out of the box for all applicable algorithms (some algorithms may not be included due to inherent limitations). If you have a lot of data with which to train your model, most built-in algorithms can easily scale to meet the demand. Even if you already have a pre-trained model, it may still be easier to use its counterpart in SageMaker and input the hyperparameters you already know rather than port it over and write a training script yourself.
  • You are the owner of the resulting model artifacts. You can take that model and deploy it on SageMaker for several different inference patterns (check out all the available deployment types) and easy endpoint scaling and management, or you can deploy it wherever else you need it.

Let’s now see how to train one of these built-in algorithms.

Train a built-in algorithm using the SageMaker SDK

To train a selected model, we need to get that model’s URI, as well as that of the training script and the container image used for training. Thankfully, these three inputs depend solely on the model name, version (for a list of the available models, see JumpStart Available Model Table), and the type of instance you want to train on. This is demonstrated in the following code snippet:

from sagemaker import image_uris, model_uris, script_uris

train_model_id, train_model_version, train_scope = "lightgbm-classification-model", "*", "training"
training_instance_type = "ml.m5.xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type
)
# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
)
# Retrieve the model artifact; in the tabular case, the model is not pre-trained 
train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

The train_model_id changes to lightgbm-regression-model if we’re dealing with a regression problem. The IDs for all the other models introduced in this post are listed in the following table.

Model             | Problem Type   | Model ID
LightGBM          | Classification | lightgbm-classification-model
LightGBM          | Regression     | lightgbm-regression-model
CatBoost          | Classification | catboost-classification-model
CatBoost          | Regression     | catboost-regression-model
AutoGluon-Tabular | Classification | autogluon-classification-ensemble
AutoGluon-Tabular | Regression     | autogluon-regression-ensemble
TabTransformer    | Classification | pytorch-tabtransformerclassification-model
TabTransformer    | Regression     | pytorch-tabtransformerregression-model
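If you switch between algorithms often, a small lookup can keep the model IDs in one place. The following sketch simply encodes the table above; the `MODEL_IDS` dictionary, the short algorithm keys, and the `get_model_id` helper are hypothetical names introduced here and are not part of the SageMaker SDK.

```python
# Model IDs copied from the table above; keys are hypothetical short names.
MODEL_IDS = {
    ("lightgbm", "classification"): "lightgbm-classification-model",
    ("lightgbm", "regression"): "lightgbm-regression-model",
    ("catboost", "classification"): "catboost-classification-model",
    ("catboost", "regression"): "catboost-regression-model",
    ("autogluon-tabular", "classification"): "autogluon-classification-ensemble",
    ("autogluon-tabular", "regression"): "autogluon-regression-ensemble",
    ("tabtransformer", "classification"): "pytorch-tabtransformerclassification-model",
    ("tabtransformer", "regression"): "pytorch-tabtransformerregression-model",
}


def get_model_id(algorithm, problem_type):
    """Return the model ID for an algorithm and problem type (case-insensitive)."""
    key = (algorithm.lower(), problem_type.lower())
    if key not in MODEL_IDS:
        raise ValueError(f"No model ID known for {algorithm!r} / {problem_type!r}")
    return MODEL_IDS[key]


train_model_id = get_model_id("CatBoost", "Regression")
print(train_model_id)
```

The returned value can then be used as train_model_id in the retrieval calls shown earlier.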

We then define where our input is on Amazon Simple Storage Service (Amazon S3). We’re using a public sample dataset for this example. We also define where we want our output to go, and retrieve the default list of hyperparameters needed to train the selected model. You can change their value to your liking.

import sagemaker
from sagemaker import hyperparameters

sess = sagemaker.Session()
region = sess.boto_session.region_name

# URI of sample training dataset
training_dataset_s3_path = f"s3://jumpstart-cache-prod-{region}/training-datasets/tabular_multiclass/"

# URI for output artifacts 
output_bucket = sess.default_bucket()
s3_output_location = f"s3://{output_bucket}/jumpstart-example-tabular-training/output"

# Retrieve the default hyper-parameters for training
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters[
    "num_boost_round"
] = "500"  # The same hyperparameter is named as "iterations" for CatBoost

Finally, we instantiate a SageMaker Estimator with all the retrieved inputs and launch the training job with .fit, passing it our training dataset URI. The entry_point script provided is named transfer_learning.py (the same for other tasks and algorithms), and the input data channel passed to .fit must be named training.

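import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

# IAM role that SageMaker assumes to run the training job
aws_role = sagemaker.get_execution_role()

# Unique training job name
training_job_name = name_from_base(f"built-in-example-{train_model_id}")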

# Create SageMaker Estimator instance
tc_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a SageMaker Training job by passing s3 path of the training data
tc_estimator.fit({"training": training_dataset_s3_path}, logs=True)

Note that you can train built-in algorithms with SageMaker automatic model tuning to select the optimal hyperparameters and further improve model performance.

Train a built-in algorithm using SageMaker JumpStart

You can also train any of these built-in algorithms with a few clicks via the SageMaker JumpStart UI. JumpStart is a SageMaker feature that allows you to train and deploy built-in algorithms and pre-trained models from various ML frameworks and model hubs through a graphical interface. It also allows you to deploy fully fledged ML solutions that string together ML models and various other AWS services to solve a targeted use case.

For more information, refer to Run text classification with Amazon SageMaker JumpStart using TensorFlow Hub and Hugging Face models.

Conclusion

In this post, we announced the launch of four powerful new built-in algorithms for ML on tabular datasets now available on SageMaker. We provided a technical description of what these algorithms are, as well as an example training job for LightGBM using the SageMaker SDK.

Bring your own dataset and try these new algorithms on SageMaker, and check out the sample notebooks to use built-in algorithms available on GitHub.


About the Authors

Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He is an active researcher in machine learning and statistical inference and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is mostly focused on NLP use-cases and helping customers optimize Deep Learning model training and deployment. He is also an active proponent of low-code ML solutions and ML-specialized hardware.


Semantic segmentation data labeling and model training using Amazon SageMaker

In computer vision, semantic segmentation is the task of classifying every pixel in an image with a class from a known set of labels such that pixels with the same label share certain characteristics. It generates a segmentation mask of the input images. For example, the following images show a segmentation mask of the cat label.

In November 2018, Amazon SageMaker announced the launch of the SageMaker semantic segmentation algorithm. With this algorithm, you can train your models with a public dataset or your own dataset. Popular image segmentation datasets include the Common Objects in Context (COCO) dataset and PASCAL Visual Object Classes (PASCAL VOC), but the classes of their labels are limited and you may want to train a model on target objects that aren’t included in the public datasets. In this case, you can use Amazon SageMaker Ground Truth to label your own dataset.

In this post, I demonstrate the following solutions:

  • Using Ground Truth to label a semantic segmentation dataset
  • Transforming the results from Ground Truth to the required input format for the SageMaker built-in semantic segmentation algorithm
  • Using the semantic segmentation algorithm to train a model and perform inference

Semantic segmentation data labeling

To build a machine learning model for semantic segmentation, we need to label a dataset at the pixel level. Ground Truth gives you the option to use human annotators through Amazon Mechanical Turk, third-party vendors, or your own private workforce. To learn more about workforces, refer to Create and Manage Workforces. If you don’t want to manage the labeling workforce on your own, Amazon SageMaker Ground Truth Plus is another great option as a new turnkey data labeling service that enables you to create high-quality training datasets quickly and reduces costs by up to 40%. For this post, I show you how to manually label the dataset with the Ground Truth auto-segment feature and crowdsource labeling with a Mechanical Turk workforce.

Manual labeling with Ground Truth

In December 2019, Ground Truth added an auto-segment feature to the semantic segmentation labeling user interface to increase labeling throughput and improve accuracy. For more information, refer to Auto-segmenting objects when performing semantic segmentation labeling with Amazon SageMaker Ground Truth. With this new feature, you can accelerate your labeling process on segmentation tasks. Instead of drawing a tightly fitting polygon or using the brush tool to capture an object in an image, you only draw four points: at the top-most, bottom-most, left-most, and right-most points of the object. Ground Truth takes these four points as input and uses the Deep Extreme Cut (DEXTR) algorithm to produce a tightly fitting mask around the object. For a tutorial using Ground Truth for image semantic segmentation labeling, refer to Image Semantic Segmentation. The following is an example of how the auto-segmentation tool generates a segmentation mask automatically after you choose the four extreme points of an object.
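To make the four extreme points concrete, the following sketch computes them from an existing binary mask. This is not the DEXTR algorithm (which goes the other way, from points to mask); it only illustrates which pixels an annotator would click. The `extreme_points` function and the toy mask are hypothetical.

```python
def extreme_points(mask):
    """Return the (row, col) of the top-most, bottom-most, left-most,
    and right-most object pixels of a binary mask (list of lists)."""
    pixels = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    if not pixels:
        raise ValueError("mask contains no object pixels")
    top = min(pixels, key=lambda p: p[0])
    bottom = max(pixels, key=lambda p: p[0])
    left = min(pixels, key=lambda p: p[1])
    right = max(pixels, key=lambda p: p[1])
    return top, bottom, left, right


# Toy 5x5 mask: 1 marks the object, 0 the background
mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
]
points = extreme_points(mask)
print(points)
```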

Crowdsourcing labeling with a Mechanical Turk workforce

If you have a large dataset and you don’t want to manually label hundreds or thousands of images yourself, you can use Mechanical Turk, which provides an on-demand, scalable, human workforce to complete jobs that humans can do better than computers. Mechanical Turk software formalizes job offers to the thousands of workers willing to do piecemeal work at their convenience. The software also retrieves the work performed and compiles it for you, the requester, who pays the workers for satisfactory work (only). To get started with Mechanical Turk, refer to Introduction to Amazon Mechanical Turk.

Create a labeling job

The following is an example of a Mechanical Turk labeling job for a sea turtle dataset. The sea turtle dataset is from the Kaggle competition Sea Turtle Face Detection, and I selected 300 images of the dataset for demonstration purposes. Sea turtle isn’t a common class in public datasets so it can represent a situation that requires labeling a massive dataset.

  1. On the SageMaker console, choose Labeling jobs in the navigation pane.
  2. Choose Create labeling job.
  3. Enter a name for your job.
  4. For Input data setup, select Automated data setup.
    This generates a manifest of input data.
  5. For S3 location for input datasets, enter the path for the dataset.
  6. For Task category, choose Image.
  7. For Task selection, select Semantic segmentation.
  8. For Worker types, select Amazon Mechanical Turk.
  9. Configure your settings for task timeout, task expiration time, and price per task.
  10. Add a label (for this post, sea turtle), and provide labeling instructions.
  11. Choose Create.

After you set up the labeling job, you can check the labeling progress on the SageMaker console. When it’s marked as complete, you can choose the job to check the results and use them for the next steps.

Dataset transformation

After you get the output from Ground Truth, you can use SageMaker built-in algorithms to train a model on this dataset. First, you need to prepare the labeled dataset in the input format that the SageMaker semantic segmentation algorithm requires.

Requested input data channels

SageMaker semantic segmentation expects your training dataset to be stored on Amazon Simple Storage Service (Amazon S3). The dataset in Amazon S3 is expected to be presented in two channels, one for train and one for validation, using four directories, two for images and two for annotations. Annotations are expected to be uncompressed PNG images. The dataset might also have a label map that describes how the annotation mappings are established. If not, the algorithm uses a default. For inference, an endpoint accepts images with an image/jpeg content type. The following is the required structure of the data channels:

s3://bucket_name
    |- train
                 | - image1.jpg
                 | - image2.jpg
    |- validation
                 | - image3.jpg
                 | - image4.jpg
    |- train_annotation
                 | - image1.png
                 | - image2.png
    |- validation_annotation
                 | - image3.png
                 | - image4.png
    |- label_map
                 | - train_label_map.json
                 | - validation_label_map.json

Every JPG image in the train and validation directories has a corresponding PNG label image with the same name in the train_annotation and validation_annotation directories. This naming convention helps the algorithm associate a label with its corresponding image during training. The train, train_annotation, validation, and validation_annotation channels are mandatory. The annotations are single-channel PNG images. The format works as long as the metadata (modes) in the image helps the algorithm read the annotation images into a single-channel 8-bit unsigned integer.
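Because the algorithm pairs images and annotations by file name, it can help to check the pairing locally before uploading. The following is a minimal sketch that assumes the four directories exist on local disk with the names shown above; the `find_unpaired` helper is a hypothetical name, and the usage example runs on a throwaway temporary directory.

```python
import os
import tempfile


def find_unpaired(image_dir, annotation_dir):
    """Return (images without a PNG mask, masks without a JPG image),
    comparing file names without their extensions."""
    images = {
        os.path.splitext(f)[0]
        for f in os.listdir(image_dir)
        if f.lower().endswith(".jpg")
    }
    masks = {
        os.path.splitext(f)[0]
        for f in os.listdir(annotation_dir)
        if f.lower().endswith(".png")
    }
    return sorted(images - masks), sorted(masks - images)


# Usage example: image2.jpg deliberately has no annotation
with tempfile.TemporaryDirectory() as root:
    for name in ("train", "train_annotation"):
        os.mkdir(os.path.join(root, name))
    for name in ("image1.jpg", "image2.jpg"):
        open(os.path.join(root, "train", name), "w").close()
    open(os.path.join(root, "train_annotation", "image1.png"), "w").close()

    missing_masks, missing_images = find_unpaired(
        os.path.join(root, "train"), os.path.join(root, "train_annotation")
    )

print("images without masks:", missing_masks)
print("masks without images:", missing_images)
```

Running the same check on the validation and validation_annotation directories covers the other two mandatory channels.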

Output from the Ground Truth labeling job

The outputs generated from the Ground Truth labeling job have the following folder structure:

s3://turtle2022/labelturtles/
    |- activelearning
    |- annotation-tool
    |- annotations
                 | - consolidated-annotation
                                   | - consolidation-request                               
                                   | - consolidation-response
                                   | - output
                                              | - 0_2022-02-10T17:40:03.294994.png
                                              | - 0_2022-02-10T17:41:04.530266.png
                 | - intermediate
                 | - worker-response
    |- intermediate
    |- manifests
                 | - output
                                | - output.manifest

The segmentation masks are saved in s3://turtle2022/labelturtles/annotations/consolidated-annotation/output. Each annotation image is a .png file named after the index of the source image and the time when this image labeling was completed. For example, the following are the source image (Image_1.jpg) and its segmentation mask generated by the Mechanical Turk workforce (0_2022-02-10T17:41:04.724225.png). Notice that the index of the mask is different than the number in the source image name.

The output manifest from the labeling job is in the /manifests/output/output.manifest file. It’s a JSON file, and each line records a mapping between the source image and its label and other metadata. The following JSON line records a mapping between the shown source image and its annotation:

{"source-ref":"s3://turtle2022/Image_1.jpg","labelturtles-ref":"s3://turtle2022/labelturtles/annotations/consolidated-annotation/output/0_2022-02-10T17:41:04.724225.png","labelturtles-ref-metadata":{"internal-color-map":{"0":{"class-name":"BACKGROUND","hex-color":"#ffffff","confidence":0.25988},"1":{"class-name":"Turtle","hex-color":"#2ca02c","confidence":0.25988}},"type":"groundtruth/semantic-segmentation","human-annotated":"yes","creation-date":"2022-02-10T17:41:04.801793","job-name":"labeling-job/labelturtles"}}

The source image is called Image_1.jpg, and the annotation’s name is 0_2022-02-10T17:41:04.724225.png. To prepare the data in the data channel format required by the SageMaker semantic segmentation algorithm, we need to rename each annotation so that it has the same name as its source JPG image. We also need to split the dataset into train and validation directories for the source images and the annotations.
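As a small illustration of how one manifest record can be read, the following sketch parses a single line with the standard json module and pulls out the image name, the mask name, and the class names. The sample line is an abridged copy of the record shown above; the `parse_manifest_line` helper is a hypothetical name, and the sketch assumes the label attribute name matches the labeling job name (labelturtles), as it does in this example.

```python
import json
import posixpath


def parse_manifest_line(line, label_attribute):
    """Extract the image name, mask name, and class names from one record."""
    record = json.loads(line)
    metadata = record[label_attribute + "-ref-metadata"]
    color_map = metadata["internal-color-map"]
    # Order the classes by their integer index in the color map
    class_names = [
        color_map[key]["class-name"] for key in sorted(color_map, key=int)
    ]
    return {
        "image": posixpath.basename(record["source-ref"]),
        "mask": posixpath.basename(record[label_attribute + "-ref"]),
        "classes": class_names,
    }


# Abridged copy of the manifest line shown above
line = (
    '{"source-ref":"s3://turtle2022/Image_1.jpg",'
    '"labelturtles-ref":"s3://turtle2022/labelturtles/annotations/'
    'consolidated-annotation/output/0_2022-02-10T17:41:04.724225.png",'
    '"labelturtles-ref-metadata":{"internal-color-map":{'
    '"0":{"class-name":"BACKGROUND","hex-color":"#ffffff"},'
    '"1":{"class-name":"Turtle","hex-color":"#2ca02c"}},'
    '"type":"groundtruth/semantic-segmentation"}}'
)

parsed = parse_manifest_line(line, "labelturtles")
print(parsed)
```

The number of entries in the class list (two here) is also the value you later need for the num_classes hyperparameter.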

Transform the output from a Ground Truth labeling job to the requested input format

To transform the output, complete the following steps:

  1. Download all the files from the labeling job from Amazon S3 to a local directory:
    !aws s3 cp s3://turtle2022/ Seaturtles --recursive

  2. Read the manifest file and change the names of the annotation to the same names as the source images:
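    import json
    import os
    
    # Local directory used in the download step (assumed to be Seaturtles);
    # an absolute path keeps later steps working after os.chdir
    dir_name = os.path.abspath('Seaturtles')
    label_job = 'labelturtles'
    manifest_path = os.path.join(dir_name, label_job, 'manifests/output/output.manifest')
    output_path = os.path.join(dir_name, label_job, 'annotations/consolidated-annotation/output')
    
    im_list = []
    with open(manifest_path, "r") as file:
        for line in file:
            if not line.strip():
                continue
            record = json.loads(line)
            # Skip records that have no annotation (for example, failed tasks)
            if label_job + '-ref' not in record:
                continue
            # Source image name without its extension, for example Image_1
            im_name = os.path.splitext(os.path.basename(record['source-ref']))[0]
            annotation_name = os.path.basename(record[label_job + '-ref'])
            print(im_name)
            # Rename the mask in place so that it matches its source image
            os.rename(os.path.join(output_path, annotation_name),
                      os.path.join(output_path, im_name + '.png'))
            im_list.append(im_name)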

  3. Split the train and validation datasets:
    from random import sample
      
    # Randomly assign 80% of the images to training and the rest to validation
    train_num = int(len(im_list) * 0.8)
    train_name = sample(im_list, train_num)
    test_name = list(set(im_list) - set(train_name))

  4. Make a directory in the required format for the semantic segmentation algorithm data channels:
    os.chdir('./semantic_segmentation_pascalvoc_2022-01-11')
    os.mkdir('train')
    os.mkdir('validation')
    os.mkdir('train_annotation')
    os.mkdir('validation_annotation')

  5. Move the train and validation images and their annotations to the created directories.
    1. For the training set, use the following code:
      import shutil
      
      for name in train_name:
          # Move the training image
          train_im = name + '.jpg'
          train_im_path = dir_name + '/' + train_im
          train_new_path = 'train/' + train_im
          shutil.move(train_im_path, train_new_path)
          
          # Move the matching annotation
          train_annotation = name + '.png'
          train_annotation_path = dir_name + '/labelturtles/annotations/consolidated-annotation/output/' + train_annotation
          train_annotation_new_path = 'train_annotation/' + train_annotation
          shutil.move(train_annotation_path, train_annotation_new_path)

    2. For the validation set, use the following code:
      import shutil
      
      for name in test_name:
          # Move the validation image
          val_im = name + '.jpg'
          val_im_path = dir_name + '/' + val_im
          val_new_path = 'validation/' + val_im
          shutil.move(val_im_path, val_new_path)
          
          # Move the matching annotation
          val_annotation = name + '.png'
          val_annotation_path = dir_name + '/labelturtles/annotations/consolidated-annotation/output/' + val_annotation
          val_annotation_new_path = 'validation_annotation/' + val_annotation
          shutil.move(val_annotation_path, val_annotation_new_path)

  6. Upload the train and validation datasets and their annotation datasets to Amazon S3:
    !aws s3 cp train s3://turtle2022/train/ --recursive
    !aws s3 cp train_annotation s3://turtle2022/train_annotation/ --recursive
    !aws s3 cp validation s3://turtle2022/validation/ --recursive
    !aws s3 cp validation_annotation s3://turtle2022/validation_annotation/ --recursive

SageMaker semantic segmentation model training

In this section, we walk through the steps to train your semantic segmentation model.

Follow the sample notebook and set up data channels

You can follow the instructions in Semantic Segmentation algorithm is now available in Amazon SageMaker to apply the semantic segmentation algorithm to your labeled dataset. This sample notebook shows an end-to-end example introducing the algorithm. In the notebook, you learn how to train and host a semantic segmentation model using the fully convolutional network (FCN) algorithm with the Pascal VOC dataset for training. Because I don’t plan to train a model from the Pascal VOC dataset, I skipped Step 3 (data preparation) in this notebook. Instead, I directly created train_channel, train_annotation_channel, validation_channel, and validation_annotation_channel using the S3 locations where I stored my images and annotations:

train_channel = 's3://turtle2022/train'
train_annotation_channel = 's3://turtle2022/train_annotation'
validation_channel = 's3://turtle2022/validation'
validation_annotation_channel = 's3://turtle2022/validation_annotation'

Adjust hyperparameters for your own dataset in SageMaker estimator

I followed the notebook and created a SageMaker estimator object (ss_estimator) to train my segmentation algorithm. One thing we need to customize for the new dataset is in ss_estimator.set_hyperparameters: we need to change num_classes=21 to num_classes=2 (turtle and background), and I also changed epochs=10 to epochs=30 because 10 is only for demo purposes. Then I used the p3.2xlarge instance for model training by setting instance_type="ml.p3.2xlarge". The training completed in 8 minutes. The best MIoU (Mean Intersection over Union) of 0.846 is achieved at epoch 11 with a pix_acc (the percent of pixels in your image that are classified correctly) of 0.925, which is a pretty good result for this small dataset.
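To clarify what the two metrics measure, the following sketch computes pixel accuracy and mean IoU for one small predicted mask against its ground truth. It shows the standard definitions only; how the SageMaker algorithm aggregates these metrics over the validation set is not covered here, and the function names and toy masks are hypothetical.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted class equals the true class."""
    total = correct = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            correct += int(p == t)
    return correct / total


def mean_iou(pred, truth, num_classes):
    """Average over classes of intersection / union of predicted and true pixels."""
    ious = []
    for cls in range(num_classes):
        intersection = union = 0
        for pred_row, truth_row in zip(pred, truth):
            for p, t in zip(pred_row, truth_row):
                in_pred, in_truth = p == cls, t == cls
                intersection += int(in_pred and in_truth)
                union += int(in_pred or in_truth)
        if union:  # skip classes absent from both masks
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Toy masks: 0 = background, 1 = turtle
truth = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]
pred = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

acc = pixel_accuracy(pred, truth)
miou = mean_iou(pred, truth, num_classes=2)
print(f"pix_acc={acc:.3f}, mIoU={miou:.3f}")
```

In this toy case, 10 of 12 pixels match, while each class has an intersection of 5 pixels over a union of 7, so the mean IoU is lower than the pixel accuracy. The same pattern appears in the training result above (0.846 versus 0.925).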

Model inference results

I hosted the model on a low-cost ml.c5.xlarge instance:

import sagemaker

# Attach to the completed training job and deploy its model to an endpoint
training_job_name = 'ss-notebook-demo-2022-02-12-03-37-27-151'
ss_estimator = sagemaker.estimator.Estimator.attach(training_job_name)
ss_predictor = ss_estimator.deploy(initial_instance_count=1, instance_type="ml.c5.xlarge")

Finally, I prepared a test set of 10 turtle images to see the inference result of the trained segmentation model:

import os

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np

path = "testturtle/"

# Collect the test images
img_path_list = [
    os.path.join(path, file)
    for file in sorted(os.listdir(path))
    if file.lower().endswith(('.jpg', '.jpeg', '.png'))
]

colnum = 5
fig, axs = plt.subplots(2, colnum, figsize=(20, 10))

# Assumes ss_predictor is configured as in the sample notebook, so that
# predict() returns the class mask as a 2-D array
for i, img_path in enumerate(img_path_list):
    print(img_path)
    img = mpimg.imread(img_path)
    with open(img_path, "rb") as imfile:
        imbytes = imfile.read()
    cls_mask = ss_predictor.predict(imbytes)
    ax = axs[i // colnum, i % colnum]
    ax.imshow(img, cmap='gray')
    # Overlay only the non-background pixels of the predicted mask
    ax.imshow(np.ma.masked_equal(cls_mask, 0), cmap='jet', alpha=0.8)

plt.show()

The following images show the results.

The segmentation masks of the sea turtles look accurate and I’m happy with this result trained on a 300-image dataset labeled by Mechanical Turk workers. You can also explore other available networks such as pyramid-scene-parsing network (PSP) or DeepLab-V3 in the sample notebook with your dataset.

Clean up

Delete the endpoint when you’re finished with it to avoid incurring continued costs:

ss_predictor.delete_endpoint()

Conclusion

In this post, I showed how to customize semantic segmentation data labeling and model training using SageMaker. First, you can set up a labeling job with the auto-segmentation tool or use a Mechanical Turk workforce (as well as other options). If you have more than 5,000 objects, you can also use automated data labeling. Then you transform the outputs from your Ground Truth labeling job to the required input formats for SageMaker built-in semantic segmentation training. After that, you can use an accelerated computing instance (such as p2 or p3) to train a semantic segmentation model with the sample notebook and deploy the model to a more cost-effective instance (such as ml.c5.xlarge). Lastly, you can review the inference results on your test dataset with a few lines of code.

Get started with SageMaker semantic segmentation data labeling and model training with your favorite dataset!


About the Author

Kara Yang is a Data Scientist in AWS Professional Services. She is passionate about helping customers achieve their business goals with AWS cloud services. She has helped organizations build ML solutions across multiple industries such as manufacturing, automotive, environmental sustainability and aerospace.


Deep demand forecasting with Amazon SageMaker

Every business needs the ability to predict the future accurately in order to make better decisions and give the company a competitive advantage. With historical data, businesses can understand trends, make predictions of what might happen and when, and incorporate that information into their future plans, from product demand to inventory planning and staffing. If a forecast is too high, companies may over-invest in products and staff, which results in wasted investment. If the forecast is too low, companies may under-invest, which leads to a shortfall in raw materials and inventory, creating a poor customer experience.

Time series forecasting is a technique that predicts future time series data based on historical data. Time series forecasting is useful in multiple fields, including retail, finance, logistics, and healthcare. Demand forecasting uses historical time series data to estimate future customer demand over a specific period and streamline the supply-demand decision-making process across businesses. Demand forecasting use cases include predicting ticket sales in the transportation industry, stock prices, number of hospital visits, number of customer representatives to hire for multiple locations in the next month, product sales across multiple regions in the next quarter, cloud server usage for the next day for a video streaming service, electricity consumption for multiple regions over the next week, readings from IoT devices and sensors (such as energy consumption), and more.

Time series data is categorized as univariate or multi-variate. For example, the total electricity consumption for a single household over a period of time is a univariate time series. When multiple univariate time series are stacked on each other, the result is called a multi-variate time series. For example, the total electricity consumption of 10 different (but correlated) households in a single neighborhood makes up a multi-variate time series dataset.

Traditional approaches to time series forecasting include autoregressive integrated moving average (ARIMA) for univariate time series data and vector autoregression (VAR) for multi-variate time series data. These methods often require tedious data preprocessing and feature generation prior to model training. Deep learning (DL) methods address these challenges by automating the feature generation step (for example, data normalization, lags, multiple time scales, categorical covariates, and missing-value handling), while offering better prediction power and fast GPU-enabled training and deployment.
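As an illustration of the manual feature generation that classical pipelines often involve, the following sketch builds lag features from a univariate series by hand. It is a generic example, not part of the solution described in this post; the `make_lag_features` helper and the sample values are hypothetical.

```python
def make_lag_features(series, lags):
    """Build (features, target) rows, where the features for time t
    are the series values at t - lag for each requested lag."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        features = [series[t - lag] for lag in lags]
        rows.append((features, series[t]))
    return rows


# Made-up hourly consumption values for one household
hourly_load = [10, 12, 15, 14, 13, 16, 18]

rows = make_lag_features(hourly_load, lags=[1, 2, 3])
for features, target in rows:
    print(features, "->", target)
```

The first three observations cannot be used as targets because they lack a full set of lags, so seven observations yield four training rows. Deep learning forecasters typically take care of this kind of bookkeeping internally.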

In this post, we show you how to deploy a demand forecasting solution using Amazon SageMaker JumpStart. We walk you through an end-to-end solution for a demand forecasting task using three state-of-the-art time series algorithms: LSTNet, Prophet, and SageMaker DeepAR, which are available in GluonTS and Amazon SageMaker. The input data is a multi-variate time series that includes hourly electricity consumption of 321 users from 2012–2014. Next, each algorithm takes the historical multi-variate and correlated time series data to train and produce accurate predictions (multi-variate values) over a prediction interval. For each of the time series algorithms, we have two outputs: a trained model on the hourly electricity consumption data and a SageMaker endpoint that can predict the future (multi-variate) values given a prediction interval.

Alternatively, if you are looking for a fully managed service to deliver highly accurate forecasts, without writing code, we recommend checking out Amazon Forecast. Amazon Forecast is a time-series forecasting service based on machine learning (ML) and built for business metrics analysis. Based on the same technology used at Amazon.com, Amazon Forecast uses machine learning to combine time series data with additional variables to build forecasts.

Solution overview

The following diagram shows the architecture for the end-to-end training and deployment process.

The solution workflow is as follows:

  1. The input data for training is located in an Amazon Simple Storage Service (Amazon S3) bucket.
  2. The provided SageMaker notebook gets the input data and launches the following steps.
  3. For each of the LSTNet, Prophet, and SageMaker DeepAR algorithms, train a model and evaluate its results using SageMaker.
  4. Deploy the trained model and create a SageMaker endpoint, which is an HTTPS endpoint that is capable of producing predictions.
  5. Monitor the model training and deployment via Amazon CloudWatch.
  6. The input data for inferencing is located in an S3 bucket. From the SageMaker notebook, send the requests to the SageMaker endpoint and make predictions.

Prerequisites

To try out the solution in your own account, make sure that you have the following in place:

When the Studio instance is ready, you can launch Studio and access JumpStart. JumpStart features aren’t available in SageMaker notebook instances, and you can’t access them through SageMaker APIs or the AWS Command Line Interface (AWS CLI).

Launch the solution

To launch the solution, complete the following steps:

  1. Open JumpStart by using the JumpStart launcher in the Get Started section or by choosing the JumpStart icon in the left sidebar.
  2. In the Solutions section, choose Demand Forecasting to open the solution in another Studio tab.
  3. On the Demand Forecasting tab, choose Launch to deploy the solution resources.
  4. Another tab opens showing the deploy status and the generated artifacts. When the deployment is finished, an Open Notebook button appears. Choose Open Notebook to open the solution notebook in Studio.

In the following sections, we walk you through the steps of the deep demand forecasting solution.

Data preparation and visualization

The dataset we use here is the multi-variate time series electricity consumption data taken from Dua, D. and Graff, C. (2019). UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science. We use a cleaned version of the data containing 321 time series with 1-hour frequency, starting from January 1, 2012, with 26,304 time steps. We have also provided the exchange rate dataset in case you want to try other datasets as well.

We have provided utilities for creating the dataframe from train and test data. The training data includes hourly electricity consumption values (for the 321 households) from 2012-01-01 00:00:00 to 2014-05-26 19:00:00, and the test data contains values from 2012-01-01 00:00:00 to 2014-06-02 19:00:00 (7 more days of hourly data compared to the training data). To train a time series forecasting model, the CONTEXT_LENGTH defines the length of each input time series, and PREDICTION_LENGTH defines the length of each output time series.

Because the CONTEXT_LENGTH and PREDICTION_LENGTH are set to 168 (7 days) and 24 (next 1 day), we plot the last 7 days of the training data and its subsequent 1 day of the testing data for demonstration purposes. The plotted training data and testing data are from 2014-05-19 20:00:00 to 2014-05-26 19:00:00, and from 2014-05-26 20:00:00 to 2014-05-27 02:00:00, respectively. We plot only 11 of the 321 time series, as shown in the following figure.
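The role of the two lengths can be sketched in a few lines of Python. The timestamps come from the text above; the `split_window` helper and the synthetic integer series are assumptions made only for this illustration and do not reproduce the solution's own utilities.

```python
from datetime import datetime, timedelta

CONTEXT_LENGTH = 168     # 7 days of hourly values fed to the model
PREDICTION_LENGTH = 24   # 1 day of hourly values to predict

train_end = datetime(2014, 5, 26, 19)  # last training timestamp
test_end = datetime(2014, 6, 2, 19)    # last test timestamp

# First timestamp of the last context window in the training data.
context_start = train_end - timedelta(hours=CONTEXT_LENGTH - 1)
# Number of extra hourly values the test data has beyond the training data.
extra_test_hours = int((test_end - train_end) / timedelta(hours=1))


def split_window(train, test, context_length, prediction_length):
    """Return the last context window of train and the values that follow it.

    Assumes test starts at the same timestamp as train and extends past it.
    """
    context = train[-context_length:]
    target = test[len(train):len(train) + prediction_length]
    return context, target


# Synthetic stand-in for one hourly series (values are just time indices).
train_series = list(range(500))
test_series = list(range(500 + extra_test_hours))
context, target = split_window(
    train_series, test_series, CONTEXT_LENGTH, PREDICTION_LENGTH
)
print(context_start, len(context), len(target))
```

The computed `context_start` matches the start of the plotted training window, and the test data extends the training data by 168 hourly values, that is, 7 days.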

Train the models

This section demonstrates training an LSTNet model using GluonTS, a Prophet model using GluonTS, and a SageMaker DeepAR model with and without hyperparameter optimization (HPO). For each of these, we first train the model without HPO, then we train the model with HPO. We demonstrate how model performance changes with HPO by showing the comparison metrics, namely RRSE (Root Relative Squared Error), MAPE (Mean Absolute Percentage Error), and sMAPE (symmetric Mean Absolute Percentage Error). For HPO, we use the RRSE as the evaluation metric for all three algorithms.
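For reference, the three metrics can be written down in a few lines of Python. This sketch uses common textbook definitions for a single series; the solution notebook may aggregate across series or handle zero values differently, so treat the formulas as an assumption rather than the exact implementation.

```python
import math
from typing import Sequence


def rrse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Relative Squared Error: error relative to predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    squared_deviation = sum((a - mean_actual) ** 2 for a in actual)
    return math.sqrt(squared_error / squared_deviation)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error (as a fraction, undefined for zeros)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def smape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Symmetric MAPE using the 2|a - p| / (|a| + |p|) form."""
    return sum(
        2 * abs(a - p) / (abs(a) + abs(p)) for a, p in zip(actual, predicted)
    ) / len(actual)


# Tiny made-up example.
actual = [2.0, 4.0]
predicted = [1.0, 5.0]
print(rrse(actual, predicted), mape(actual, predicted), smape(actual, predicted))
```

For all three metrics, a value of 0 means a perfect forecast, and an RRSE of 1 means the forecast is no better than always predicting the mean of the actual values.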

Train an optimal LSTNet model using GluonTS

LSTNet is a deep learning model that incorporates traditional auto-regressive linear models in parallel to the non-linear neural network part, which makes the non-linear deep learning model more robust for time series that violate scale changes. For information on the mathematics behind LSTNet, see Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks.

We first train an LSTNet model without HPO. With the hyperparameters defined, we can run the training job. We use GluonTS with MXNet as the backend deep learning framework to define and train our LSTNet model. SageMaker makes it easy to do this with the framework estimators, which have the deep learning frameworks already set up. Here, we create a SageMaker MXNet estimator and pass in our model training script, hyperparameters, as well as the number and type of training instances we want.

Next, we train an optimal LSTNet model with HPO and further improve the model performance with SageMaker automatic model tuning. SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. The best model and its corresponding hyperparameters are selected on the validation data from 2014-05-26 20:00:00 to 2014-06-01 19:00:00 (corresponding to 6 days). Next, we deploy the best model in an endpoint that we can query for prediction. Finally, the best model is evaluated on the holdout test data from 2014-06-01 20:00:00 to 2014-06-02 19:00:00 (corresponding to the next 1 day). The following table compares model performance.

| Metrics | LSTNet without HPO | LSTNet with HPO |
| --- | --- | --- |
| RRSE | 0.555 | 0.506 |
| MAPE | 0.318 | 0.301 |
| sMAPE | 0.337 | 0.323 |
| Training Time (minutes) | 10.780 | 57.242 |
| Inference Time (seconds) | 5.202 | 5.340 |

For RRSE, MAPE, and sMAPE, smaller values indicate better predictive performance. The model trained with HPO scores better than the one trained without HPO on all three metrics, at the cost of a longer training time.
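Conceptually, automatic model tuning runs many training jobs over ranges of hyperparameters and keeps the configuration with the best validation metric. The following toy sketch illustrates that idea with a plain random search. The hyperparameter names, their ranges, and the `validation_rrse` stand-in objective are all hypothetical; the sketch does not call SageMaker and is not the tuning strategy used by the solution.

```python
import math
import random
from typing import Dict, List, Tuple

Trial = Tuple[float, Dict[str, float]]


def validation_rrse(learning_rate: float, num_layers: int) -> float:
    """Stand-in for 'train a model and measure RRSE on validation data'.

    This made-up function is lowest near learning_rate=0.01 and num_layers=3.
    """
    return 0.5 + 0.05 * abs(math.log10(learning_rate) + 2) + 0.02 * abs(num_layers - 3)


def random_search(num_trials: int, seed: int = 0) -> Tuple[Trial, List[Trial]]:
    """Sample hyperparameters at random and keep the lowest-scoring trial."""
    rng = random.Random(seed)
    trials: List[Trial] = []
    for _ in range(num_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform range
            "num_layers": rng.randint(1, 6),
        }
        trials.append((validation_rrse(**params), params))
    best = min(trials, key=lambda trial: trial[0])
    return best, trials


best, trials = random_search(num_trials=20)
print(f"best validation RRSE {best[0]:.3f} with {best[1]}")
```

In the real workflow, each trial is a full training job, which is why the training time with HPO in the table above is several times longer than without it.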

Train an optimal Prophet model using GluonTS with HPO

Prophet is an algorithm for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. To implement the Prophet algorithm, we use the GluonTS version, which is a thin wrapper for calling the fbprophet package. First, we train a Prophet model without HPO using the SageMaker Estimator. Next, we train an optimal Prophet model with SageMaker automatic model tuning (HPO) to further improve the model performance.

| Metrics | Prophet without HPO | Prophet with HPO |
| --- | --- | --- |
| RRSE | 0.183 | 0.147 |
| MAPE | 0.288 | 0.278 |
| sMAPE | 0.278 | 0.289 |
| Training Time (minutes) | | 45.633 |
| Inference Time (seconds) | 44.813 | 45.327 |

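The RRSE and MAPE values with HPO tuning are smaller than those without HPO tuning on the same test data, whereas the sMAPE value is slightly larger. This indicates that HPO tuning improves the model performance on RRSE, the metric used for tuning, but not uniformly across all metrics.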

Train an optimal SageMaker DeepAR model with HPO

The SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS), fit a single model to each individual time series. They then use that model to extrapolate the time series into the future.

In many applications, however, you have many similar time series across a set of cross-sectional units. For example, you might have time series groupings for demand for different products, server loads, and requests for webpages. For this type of application, you can benefit from training a single model jointly over all of the time series. DeepAR takes this approach. When your dataset contains hundreds of related time series, DeepAR outperforms the standard ARIMA and ETS methods. You can also use the trained model to generate forecasts for new time series that are similar to the ones it has been trained on. For information on the mathematics behind DeepAR, see DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks.

Similar to the settings in the previous models, we first train a DeepAR model without HPO. Next, we train an optimal DeepAR model with HPO. Then we deploy the best model in an endpoint that we can query for prediction. The following table compares model performance.

| Metrics | DeepAR without HPO | DeepAR with HPO |
| --- | --- | --- |
| RRSE | 0.136 | 0.098 |
| MAPE | 0.087 | 0.099 |
| sMAPE | 0.104 | 0.116 |
| Training Time (minutes) | 24.048 | 210.530 |
| Inference Time (seconds) | 68.411 | 72.829 |

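The RRSE value with HPO tuning is smaller than the one without HPO tuning on the same test data, whereas the MAPE and sMAPE values are slightly larger. This indicates that HPO tuning improves the model performance on RRSE, the metric used for tuning, but not on the other two metrics.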

Evaluate model performance of all three algorithms on the same holdout test data

In this section, we compare the performance of the three models trained with HPO. The comparison could vary for different input datasets. The following table compares the three algorithms for the sample electricity input data used in this post.

| Metrics | LSTNet with HPO | Prophet with HPO | DeepAR with HPO |
| --- | --- | --- | --- |
| RRSE | 0.506 | 0.147 | 0.098 |
| MAPE | 0.302 | 0.278 | 0.099 |
| sMAPE | 0.323 | 0.289 | 0.116 |
| Training Time (minutes) | 57.242 | 45.633 | 210.530 |
| Inference Time (seconds) | 5.340 | 45.327 | 72.829 |
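As a small worked example, the comparison can be read programmatically by picking the algorithm with the lowest value for each row. The numbers below are copied from the preceding table; the dictionary layout and the `best_by` helper are just one illustrative way to organize them.

```python
# Results copied from the comparison table (lower is better for every row).
results = {
    "LSTNet with HPO": {
        "RRSE": 0.506, "MAPE": 0.302, "sMAPE": 0.323,
        "training_minutes": 57.242, "inference_seconds": 5.340,
    },
    "Prophet with HPO": {
        "RRSE": 0.147, "MAPE": 0.278, "sMAPE": 0.289,
        "training_minutes": 45.633, "inference_seconds": 45.327,
    },
    "DeepAR with HPO": {
        "RRSE": 0.098, "MAPE": 0.099, "sMAPE": 0.116,
        "training_minutes": 210.530, "inference_seconds": 72.829,
    },
}


def best_by(metric: str) -> str:
    """Return the name of the algorithm with the lowest value for a metric."""
    return min(results, key=lambda name: results[name][metric])


for metric in ("RRSE", "MAPE", "sMAPE", "training_minutes", "inference_seconds"):
    print(f"{metric}: {best_by(metric)}")
```

On this dataset, the most accurate model is also the slowest to train and to query, so the right choice depends on how you weigh accuracy against time and cost.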

The following figures visualize these results.

The following figure is another way to visualize the results.

The training and test data (ground truth) are shown as the black solid line (separated by the red vertical line) in the plot. The predictions from different forecasting algorithms are shown as dashed lines. The closer a dashed line comes to the black solid line, the more accurate the predictions are.

Clean up

When you’re finished with this solution, make sure that you delete all unwanted AWS resources to avoid incurring unintended charges. The solution notebook provides cleanup code. On the solution tab, you can also choose Delete all resources in the Delete solution section.

Conclusion

In this post, we introduced an end-to-end solution for a demand forecasting task using three state-of-the-art time series algorithms: LSTNet, Prophet, and SageMaker DeepAR, which are available in GluonTS and SageMaker. We discussed three training approaches: training an optimal LSTNet model using GluonTS, training an optimal Prophet model using GluonTS, and training an optimal SageMaker DeepAR model with HPO. For each of these, we first trained the model without HPO, and then trained the model with HPO. We demonstrated how the model performance increases with HPO by comparing metrics, namely RRSE, MAPE, and sMAPE.

In this post, we used the electricity data as our input dataset. However, you can change the input and bring your own data to an S3 bucket. You can use that data to train the models and get different performance results and choose the best algorithm accordingly.

On the SageMaker console, open Studio and launch the solution in JumpStart to get started, or you can check out the solution’s GitHub repository to review the code and more information.


About the Authors

Alak Eswaradass is a Senior Solutions Architect at AWS based in Chicago, Illinois. She is passionate about helping customers design cloud architectures utilizing AWS services to solve business challenges. She has a Master’s degree in computer science engineering. Before joining AWS, she worked for different healthcare organizations, and she has in-depth experience architecting complex systems, technology innovation, and research. She hangs out with her daughters and explores the outdoors in her free time.

Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering.


Inspect your data labels with a visual, no code tool to create high-quality training datasets with Amazon SageMaker Ground Truth Plus

Launched at AWS re:Invent 2021, Amazon SageMaker Ground Truth Plus helps you create high-quality training datasets by removing the undifferentiated heavy lifting associated with building data labeling applications and managing the labeling workforce. All you do is share data along with labeling requirements, and Ground Truth Plus sets up and manages your data labeling workflow based on these requirements. From there, an expert workforce that is trained on a variety of machine learning (ML) tasks performs data labeling. You don’t even need deep ML expertise or knowledge of workflow design and quality management to use Ground Truth Plus.

Building a high-quality training dataset for your ML algorithm is an iterative process. ML practitioners often build custom systems to inspect data labels because accurately labeled data is critical to ML model quality. To ensure you get high-quality training data, Ground Truth Plus provides you with a built-in user interface (Review UI) to inspect the quality of data labels and provide feedback on data labels until you’re satisfied that the labels accurately represent the ground truth, or what is directly observable in the real world.

This post walks you through steps to create a project team and use several new built-in features of the Review UI tool to efficiently complete your inspection of a labeled dataset. The walkthrough assumes that you have an active Ground Truth Plus labeling project. For more information, see Amazon SageMaker Ground Truth Plus – Create Training Datasets Without Code or In-house Resources.

Set up a project team

A project team provides access to the members from your organization to inspect data labels using the Review UI tool. To set up a project team, complete the following steps:

  1. On the Ground Truth Plus console, choose Create project team.
  2. Select Create a new Amazon Cognito user group. If you already have an existing Amazon Cognito user group, select the Import members option.
  3. For Amazon Cognito user group name, enter a name. This name can’t be changed.
  4. For Email addresses, enter the email addresses of up to 50 team members, separated by commas.
  5. Choose Create project team.

Your team members will receive an email inviting them to join the Ground Truth Plus project team. From there, they can log in to the Ground Truth Plus project portal to review the data labels.

Inspect labeled dataset quality

Now let’s dive into a video object tracking example using the CBCL StreetScenes dataset.

After the data in your batch has been labeled, the batch is marked as Ready for review.

Select the batch and choose Review batch. You’re redirected to the Review UI. You have the flexibility to choose a different sampling rate for each batch you review. For instance, in our example batch, we have a total of five videos. You can specify if you want to review only a subset of these five videos or all of them.

Now let’s look at the different functionalities within the Review UI that will help you in inspecting the quality of the labeled dataset at a faster pace, and providing feedback on the quality:

  • Filter the labels based on label category – Within the Review UI, in the right-hand pane, you can filter the labels based on their label category. This feature comes in handy when there are multiple label categories (for example, Vehicles, Pedestrians, and Poles) in a dense dataset object, and you want to view labels for one label category at a time. For example, let’s focus on the Car label category. Enter the Car label category in the right pane to filter for all annotations of only type Car. The following screenshots show the Review UI view before and after applying the filter.
  • Overlay associated annotated attribute values – Each label can be assigned attributes to be annotated. For example, for the label category Car, say you want to ask the workers to also annotate the Color and Occlusion attributes for each label instance. When you load the Review UI, you will see the corresponding attributes under each label instance on the right pane. But what if you want to see these attribute annotations directly on the image instead? You select the label Car:1, and to overlay the attribute annotations for Car:1, you press Ctrl+A.
    Now you will see the annotation Dark Blue for the Color attribute and annotation None for the Occlusion attribute directly displayed on the image next to the Car:1 bounding box. Now you can easily verify that Car:1 was marked as Dark Blue, with no occlusion just from looking at the image instead of having to locate Car:1 on the right pane to see the attribute annotations.
  • Leave feedback at the label level – For each label, you can leave feedback at the label level in that label’s Label feedback free string attribute. For example, in this image, Car:1 looks more black than dark blue. You can relay this discrepancy as feedback for Car:1 using the Label feedback field to track the comment to that label on that frame. Our internal quality control team will review this feedback and introduce changes to the annotation process and label policies, and train the annotators as required.
  • Leave feedback at the frame level – Similarly, for each frame, you can leave feedback at the frame level under that frame’s Frame feedback free string attribute. In this case, the annotations for Car and Pedestrian classes look correct and well implemented in this frame. You can relay this positive feedback using the Provide feedback field, and your comment is linked to this frame.
  • Copy the annotation feedback to other frames – You can copy both label-level and frame-level feedback to other frames if you right-click that attribute. This feature is useful when you want to duplicate the same feedback across frames for that label, or apply the same frame-level feedback to several frames. This feature allows you to quickly complete the inspection of data labels.
  • Approve or reject each dataset object – For each dataset object you review, you have the option to either choose Approve if you’re satisfied with the annotations or choose Reject if you’re not satisfied and want those annotations reworked. When you choose Submit, you’re presented with the option to approve or reject the video you just reviewed. In either case, you can provide additional commentary:

    • If you choose Approve, the commentary is optional.
    • If you choose Reject, commentary is required and we suggest providing detailed feedback. Your feedback will be reviewed by a dedicated Ground Truth Plus quality control team, who will take corrective actions to avoid similar mistakes in subsequent videos.

After you submit the video with your feedback, you’re redirected back to the project detail page in the project portal. There, for each batch in your project, you can view the number of rejected objects under the Rejected objects column and the acceptance rate, which is calculated as the number of accepted objects out of reviewed objects, under the Acceptance rate column. For example, for batch 1 in the following screenshot, the acceptance rate is 80% because four objects were accepted out of the five reviewed objects.
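The arithmetic behind the batch 1 example can be sketched as follows. The `acceptance_rate` function is a hypothetical helper written for this illustration and is not part of Ground Truth Plus.

```python
def acceptance_rate(accepted: int, reviewed: int) -> float:
    """Return the percentage of reviewed objects that were accepted."""
    if reviewed <= 0:
        raise ValueError("at least one object must be reviewed")
    if not 0 <= accepted <= reviewed:
        raise ValueError("accepted must be between 0 and reviewed")
    return 100.0 * accepted / reviewed


reviewed_objects = 5
accepted_objects = 4
rejected_objects = reviewed_objects - accepted_objects

rate = acceptance_rate(accepted_objects, reviewed_objects)
print(f"rejected: {rejected_objects}, acceptance rate: {rate:.0f}%")
```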

Conclusion

A high-quality training dataset is critical for achieving your ML initiatives. With Ground Truth Plus, you now have an enhanced built-in Review UI tool that removes the undifferentiated heavy lifting associated with building custom tools to review the quality of the labeled dataset. This post walked you through how to set up a project team and use the new built-in features of the Review UI tool. Visit the Ground Truth Plus console to get started.

As always, AWS welcomes feedback. Please submit any comments or questions.


About the Authors

Manish Goel is the Product Manager for Amazon SageMaker Ground Truth Plus. He is focused on building products that make it easier for customers to adopt machine learning. In his spare time, he enjoys road trips and reading books.

Revekka Kostoeva is a Software Developer Engineer at AWS, where she works on customer-facing and internal solutions to expand the breadth and scalability of SageMaker Ground Truth services. As a researcher, she is driven to improve the tools of the trade to drive innovation forward.


Choose specific timeseries to forecast with Amazon Forecast

Today, we’re excited to announce that Amazon Forecast offers the ability to generate forecasts on a selected subset of items. This helps you leverage the full value of your data and apply it selectively to your choice of items, reducing the time and effort needed to get forecasted results.

Previously, generating a forecast on all items of the dataset gave you no fine-grained control over the specific items that you wanted to forecast. This meant increased cost for low-priority or no-priority forecasted items and additional overhead. You would spend a lot of time generating multiple predictions on all of the items in your data, which was time consuming and operationally heavy to manage. Moreover, this approach didn’t fully leverage the value of machine learning (ML): applying inferences across desired items. With the capability to choose a subset of items, you can now train the model with all of your data, but apply the learnings to a select few high-yield items. This contributes to overall optimization of forecast planning by increasing productivity (fewer items to manage) and reducing cost (reduction in price per forecasted item). This also makes explainability easier to manage.

With today’s launch, you can not only run all of the steps, but also select a subset of items to forecast by uploading a CSV file during the ‘Create a Forecast’ step. You don’t need to onboard the entire target or related timeseries and item metadata, which saves you considerable effort. This also reduces the overall infrastructure footprint for forecasted items, resulting in cost savings and higher productivity. You can do this step using the ‘CreateForecast’ API, or by following the console steps in this post.

Forecast on select subset of items

Now we walk through how to use the Forecast console to forecast selected items from the input dataset.

Step 1: Import Training Data

To import time-series data into Forecast, create a dataset group, choose a domain for your dataset group, specify the details of your data, and point Forecast to the Amazon Simple Storage Service (Amazon S3) location of your data. In this example, let’s assume that your dataset has 1000 items.

Note: This exercise assumes that you haven’t created any dataset groups. If you previously created a dataset group, then what you see will vary slightly from the following screenshots and instructions.

To import time-series data for forecasting

  1. Open the Forecast console here.
  2. On the Forecast home page, choose Create dataset group.
  3. On the Create dataset group page, add the details for your input dataset.
  4. Choose Next.
  5. The Dataset details panel should look similar to the following:
  6. After you’ve entered all of the necessary details on the dataset import page, the Dataset import details panel should look similar to the following:
  7. Choose Start.

Wait for Forecast to finish importing your time series data. The process can take several minutes or longer. When your dataset has been imported, the status transitions to Active and the banner at the top of the dashboard notifies you that you have successfully imported your data.

Now that your target time series dataset has been imported, you can create a predictor.

Step 2: Create a predictor

Next, you create a predictor, which you use to generate forecasts based on your time series data. Forecast applies the optimal combination of algorithms to each time series in your datasets.

To create a predictor with the Forecast console, you specify a predictor name, a forecast frequency, and define a forecast horizon. For more information about the additional fields that you can configure, see Training Predictors.

To create a predictor

  1. After your target time series dataset has finished importing, your dataset group’s Dashboard should look similar to the following:

    Under Train a predictor, choose Start. The Train predictor page is displayed.
  2. On the Train predictor page, for Predictor settings, provide the following information:
    • Predictor name
    • Forecast frequency
    • Forecast horizon
    • Forecast dimensions and Forecast quantiles (optional)

Now that your predictor has been trained on 1000 items, you can head to the next step of generating a Forecast.

Step 3: Create a Forecast

  1. Select Create Forecast.
  2. Enter the forecast name.
  3. Select a predictor.
  4. Select quantiles – Enter up to five quantiles.
  5. If you want to generate the forecast for all 1000 items, select “All Items”.
  6. Otherwise, select “Selected Items”, which lets you choose specific items out of the 1000 items to forecast.
  7. Provide the location of the S3 file that contains the selected timeseries. Timeseries must include all item and dimension columns specified in the target time series.
  8. You must also define your schema for the input file containing the selected timeseries. The order of columns defined in the schema should match the order of columns in the input file.
  9. Choose Generate Forecast.
  10. Perform an export. The exported .csv file shows only the items that you selected.
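The input file and schema from steps 7 and 8 can be prepared with a short script. The following is a sketch under several assumptions: the item IDs, bucket path, and role ARN are placeholders, the file is assumed to have a single `item_id` column with no header row, and the `time_series_selector` dictionary reflects our understanding of the `TimeSeriesSelector` parameter of the CreateForecast API, which you should verify against the Forecast API reference. The snippet only builds the data locally and doesn't call AWS.

```python
import csv
import io
from typing import Iterable


def build_selected_items_csv(item_ids: Iterable[str]) -> str:
    """Return CSV text with one selected item ID per row (no header row)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for item_id in item_ids:
        writer.writerow([item_id])
    return buffer.getvalue()


# Hypothetical subset of the 1000 items to forecast.
selected_items = ["item_001", "item_042"]
csv_text = build_selected_items_csv(selected_items)

# Assumed request fragment for CreateForecast; all values are placeholders.
time_series_selector = {
    "TimeSeriesIdentifiers": {
        "DataSource": {
            "S3Config": {
                "Path": "s3://amzn-s3-demo-bucket/selected_items.csv",
                "RoleArn": "arn:aws:iam::111122223333:role/ExampleForecastRole",
            }
        },
        # Column order in the schema must match the column order in the file.
        "Schema": {
            "Attributes": [
                {"AttributeName": "item_id", "AttributeType": "string"}
            ]
        },
        "Format": "CSV",
    }
}

print(csv_text)
```

If your target time series uses additional dimensions, the file and the schema would need one column per dimension, in the same order.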

Conclusion

Forecast now provides you with the ability to select a subset of items from the input dataset. With this feature, you can train your model with all of the data available and then apply the learnings to select items that you want to forecast. This helps in saving time and focusing efforts on high priority items. You can achieve cost reduction and better align efforts to business outcomes. “Forecast select items” is available in all Regions where Forecast is publicly available.

To learn more about the forecasting of “selected items”, visit this notebook or read more on the Forecast developer guide.


About the Authors

Meetish Dave is a Sr Product Manager in the Amazon Forecast team. He is interested in all things data and application of those to generate new revenue streams. Outside work, he likes to cook Indian food and watch interesting shows.

Ridhim Rastogi is a Software Development Engineer in the Amazon Forecast team. He is passionate about building scalable distributed systems with a focus on solving real world problems through AI/ML. In his spare time, he likes to solve puzzles, read fiction and explore.


Improve ML developer productivity with Weights & Biases: A computer vision example on Amazon SageMaker

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

As more organizations use deep learning techniques such as computer vision and natural language processing, the machine learning (ML) developer persona needs scalable tooling around experiment tracking, lineage, and collaboration. Experiment tracking includes metadata such as operating system, infrastructure used, library, and input and output datasets—often tracked on a spreadsheet manually. Lineage involves tracking the datasets, transformations, and algorithms used to create an ML model. Collaboration includes ML developers working on a single project and also ML developers sharing their results across teams and to business stakeholders—a process commonly done via email, screenshots, and PowerPoint presentations.

In this post, we train a model to identify objects for an autonomous vehicle use case using Weights & Biases (W&B) and Amazon SageMaker. We showcase how the joint solution reduces manual work for the ML developer, creates more transparency in the model development process, and enables teams to collaborate on projects.

We run this example on Amazon SageMaker Studio for you to try out for yourself.

Overview of Weights & Biases

Weights & Biases helps ML teams build better models faster. With just a few lines of code in your SageMaker notebook, you can instantly debug, compare, and reproduce your models—architecture, hyperparameters, git commits, model weights, GPU usage, datasets, and predictions—all while collaborating with your teammates.

W&B is trusted by more than 200,000 ML practitioners from some of the most innovative companies and research organizations in the world. To try it for free, sign up at Weights & Biases, or visit the W&B AWS Marketplace listing.

Getting started with SageMaker Studio

SageMaker Studio is the first fully integrated development environment (IDE) for ML. Studio provides a single web-based interface where ML practitioners and data scientists can build, train, and deploy models with a few clicks, all in one place.

To get started with Studio, you need an AWS account and an AWS Identity and Access Management (IAM) user or role with permissions to create a Studio domain. Refer to Onboard to Amazon SageMaker Domain to create a domain, and the Studio documentation for an overview of using the Studio visual interface and notebooks.

Set up the environment

For this post, we’re interested in running our own code, so let’s import some notebooks from GitHub. We use the following GitHub repo as an example, so let’s load this notebook.

You can clone a repository either through the terminal or the Studio UI. To clone a repository through the terminal, open a system terminal (on the File menu, choose New and Terminal) and enter the following command:

git clone https://github.com/wandb/SageMakerStudioLab

To clone a repository from the Studio UI, see Clone a Git Repository in SageMaker Studio.

To get started, choose the 01_data_processing.ipynb notebook. A kernel switcher prompt appears. This example uses PyTorch, so we can choose the pre-built PyTorch 1.10 Python 3.8 GPU optimized image to start our notebook. You can see the app starting, and when the kernel is ready, it shows the instance type and kernel on the top right of your notebook.

Our notebook needs some additional dependencies. This repository provides a requirements.txt with the additional dependencies. Run the first cell to install the required dependencies:

%pip install -r requirements.txt

You can also create a lifecycle configuration to automatically install the packages every time you start the PyTorch app. See Customize Amazon SageMaker Studio using Lifecycle Configurations for instructions and a sample implementation.

Use Weights & Biases in SageMaker Studio

Weights & Biases (wandb) is a standard Python library. After it’s installed, you only need to add a few lines of code to your training script to start logging experiments. We have already installed it through our requirements.txt file. You can also install it manually with the following code:

! pip install wandb
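As a rough illustration of what those few lines can look like, the following sketch logs one loss value per step from a stand-in training loop. The fake_training_loss function and the "demo" job type are placeholders invented for this example and are not part of the notebooks. Passing mode="offline" keeps the run on local disk, so no API key is needed to try it.

```python
import math


def fake_training_loss(step: int) -> float:
    """Hypothetical stand-in for a real training step: an exponentially decaying loss."""
    return math.exp(-0.1 * step)


def train_and_log(num_steps: int = 10, mode: str = "offline") -> None:
    """Log one loss value per step to a W&B run."""
    import wandb  # imported lazily so the helper above works without wandb installed

    config = {"num_steps": num_steps}
    with wandb.init(
        project="sagemaker_camvid_demo",
        job_type="demo",  # assumed label, chosen for this sketch
        config=config,
        mode=mode,
    ) as run:
        for step in range(num_steps):
            run.log({"loss": fake_training_loss(step)}, step=step)
```

Calling train_and_log() from a notebook cell writes the run locally. Switching to mode="online" sends the same metrics to your W&B workspace.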

Case study: Autonomous vehicle semantic segmentation

Dataset

We use the Cambridge-driving Labeled Video Database (CamVid) for this example. It contains a collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes. We can version our dataset as a wandb.Artifact so that we can reference it later. See the following code:

import wandb

# class_labels and path (the local dataset directory) are assumed to be
# defined earlier in the 01_data_processing.ipynb notebook.
with wandb.init(project="sagemaker_camvid_demo", job_type="upload"):
    artifact = wandb.Artifact(
        name='camvid-dataset',
        type='dataset',
        metadata={
            "url": 'https://s3.amazonaws.com/fast-ai-imagelocal/camvid.tgz',
            "class_labels": class_labels
        },
        description="The Cambridge-driving Labeled Video Database (CamVid) is the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes."
    )
    # Add every file under the dataset directory, then log the artifact to the run
    artifact.add_dir(path)
    wandb.log_artifact(artifact)

You can follow along in the 01_data_processing.ipynb notebook.
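Referencing the versioned dataset later might look like the following sketch. It assumes the artifact was logged under the name camvid-dataset as shown earlier. The "train" job type and the helper names are illustrative choices, not taken from the notebooks.

```python
def artifact_reference(name: str, alias: str = "latest") -> str:
    """Build the "name:alias" string that W&B uses to address an artifact version."""
    return f"{name}:{alias}"


def download_camvid(project: str = "sagemaker_camvid_demo") -> str:
    """Declare the dataset as an input of a new run and download it locally."""
    import wandb  # imported lazily so artifact_reference works without wandb installed

    with wandb.init(project=project, job_type="train") as run:
        artifact = run.use_artifact(
            artifact_reference("camvid-dataset"), type="dataset"
        )
        # download() returns the path of the local directory holding the files
        return artifact.download()
```

Because the run calls use_artifact, W&B records the dataset version as an input of that run, which is what later makes the lineage view possible.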

We also log a table of the dataset. Tables are rich and powerful DataFrame-like entities that enable you to query and analyze tabular data. You can understand your datasets, visualize model predictions, and share insights in a central dashboard.

Weights & Biases tables support many rich media formats, like image, audio, and waveforms. For a full list of media formats, refer to Data Types.
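A minimal sketch of logging such a table follows. The column names, the "data_viz" job type, the camvid_samples key, and the classes_present helper are assumptions made for this example. Images and masks are expected to be NumPy arrays, and class_labels is expected to map integer class IDs to names.

```python
def classes_present(mask, class_labels):
    """Return the sorted names of the classes that appear in a 2D mask of class IDs."""
    ids = {int(value) for row in mask for value in row}
    return sorted(class_labels[i] for i in ids if i in class_labels)


def log_dataset_table(samples, class_labels, project="sagemaker_camvid_demo"):
    """Log (file_name, image, mask) samples as a wandb.Table with mask overlays."""
    import wandb  # imported lazily so classes_present works without wandb installed

    with wandb.init(project=project, job_type="data_viz") as run:
        table = wandb.Table(columns=["file_name", "image", "classes"])
        for file_name, image, mask in samples:
            # Attach the ground truth mask as an overlay on the raw image
            overlay = wandb.Image(
                image,
                masks={
                    "ground_truth": {
                        "mask_data": mask,
                        "class_labels": class_labels,
                    }
                },
            )
            names = ", ".join(classes_present(mask, class_labels))
            table.add_data(file_name, overlay, names)
        run.log({"camvid_samples": table})
```

Storing the class names as a plain text column alongside the overlay makes it possible to filter rows by class in the W&B UI.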

The following screenshot shows a table with raw images with the ground truth segmentations. You can also view an interactive version of this table.

Train a model

We can now create a model and train it on our dataset. We use PyTorch and fastai to quickly prototype a baseline and then use wandb.Sweeps to optimize our hyperparameters. Follow along in the 02_semantic_segmentation.ipynb notebook. When prompted for a kernel on opening the notebook, choose the same kernel from our first notebook, PyTorch 1.10 Python 3.8 GPU optimized. Your packages are already installed because you’re using the same app.

The model is supposed to learn a per-pixel annotation of a scene captured from the point of view of the autonomous agent. The model needs to categorize or segment each pixel of a given scene into 32 relevant categories, such as road, pedestrian, sidewalk, or cars. You can choose any of the segmented images in the table to open an interactive interface for exploring the segmentation results and categories.

Because the fastai library has integration with wandb, you can simply pass the WandbCallback to the Learner:

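from fastai.callback.wandb import WandbCallback
from fastai.learner import Learner
from fastai.losses import FocalLossFlat

# SegmentationModel, backbone, hidden_dim, num_classes, data_loader, metrics,
# TRAIN_EPOCHS, and LEARNING_RATE are assumed to be defined earlier in the
# 02_semantic_segmentation.ipynb notebook.
# WandbCallback expects an active run, so call wandb.init() before training.
loss_func = FocalLossFlat(axis=1)
model = SegmentationModel(backbone, hidden_dim, num_classes=num_classes)
wandb_callback = WandbCallback(log_preds=True)
learner = Learner(
    data_loader,
    model,
    loss_func=loss_func,
    metrics=metrics,
    cbs=[wandb_callback],
)

learner.fit_one_cycle(TRAIN_EPOCHS, LEARNING_RATE)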

For the baseline experiments, we decided to use a simple architecture inspired by the UNet paper with different backbones from timm. We trained our models with Focal Loss as the criterion. With Weights & Biases, you can easily create dashboards with summaries of your experiments to quickly analyze training results, as shown in the following screenshot. You can also view this dashboard interactively.

Hyperparameter search with sweeps

To improve the performance of the baseline model, we need to select the best model and the best set of hyperparameters to train. W&B makes this easy for us using sweeps.

We perform a Bayesian hyperparameter search with the goal of maximizing the foreground accuracy of the model on the validation dataset. To perform the sweep, we define the configuration file sweep.yaml. Inside this file, we specify the search method (bayes) and the parameters to search, along with their candidate values. In our case, we try different backbones, batch sizes, and loss functions. We also explore different optimization parameters like learning rate and weight decay. Because these are continuous values, we sample them from a distribution. There are multiple configuration options available for sweeps.

program: train.py
project: sagemaker_camvid_demo
method: bayes
metric:
    name: foreground_acc
    goal: maximize
early_terminate:
    type: hyperband
    min_iter: 5
parameters:
    backbone:
        values: ["mobilenetv2_100","mobilenetv3_small_050","mobilenetv3_large_100","resnet18","resnet34","resnet50","vgg19"]
    batch_size: 
        values: [8, 16]
    image_resize_factor: 
        value: 4
    loss_function: 
        values: ["categorical_cross_entropy", "focal", "dice"]
    learning_rate: 
        distribution: uniform 
        min: 1e-5
        max: 1e-2
    weight_decay: 
        distribution: uniform
        min: 0.0 
        max: 0.05

Afterwards, in a terminal, you launch the sweep using the wandb command line:

$ wandb sweep sweep.yaml --project="sagemaker_camvid_demo"

Then launch a sweep agent on this machine with the following command:

$ wandb agent <sweep_id>
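The same sweep can also be created and run from Python instead of the CLI. The following sketch mirrors sweep.yaml as a dictionary and uses wandb.sweep and wandb.agent. The train function here is a hypothetical placeholder that logs a dummy metric; in the post, the real training logic lives in train.py.

```python
# Python equivalent of sweep.yaml (the program key is not needed when an
# agent is given a function to call)
sweep_config = {
    "method": "bayes",
    "metric": {"name": "foreground_acc", "goal": "maximize"},
    "early_terminate": {"type": "hyperband", "min_iter": 5},
    "parameters": {
        "backbone": {
            "values": [
                "mobilenetv2_100", "mobilenetv3_small_050",
                "mobilenetv3_large_100", "resnet18", "resnet34",
                "resnet50", "vgg19",
            ]
        },
        "batch_size": {"values": [8, 16]},
        "image_resize_factor": {"value": 4},
        "loss_function": {
            "values": ["categorical_cross_entropy", "focal", "dice"]
        },
        "learning_rate": {"distribution": "uniform", "min": 1e-5, "max": 1e-2},
        "weight_decay": {"distribution": "uniform", "min": 0.0, "max": 0.05},
    },
}


def train():
    """Hypothetical entry point: read the sampled hyperparameters, log the metric."""
    import wandb  # imported lazily so sweep_config can be inspected without wandb

    with wandb.init() as run:
        config = run.config  # values chosen by the sweep controller
        # Build and train the model here using config.backbone,
        # config.batch_size, config.learning_rate, and so on.
        foreground_acc = 0.0  # placeholder, replace with the validation metric
        run.log({"foreground_acc": foreground_acc})


def run_sweep(count: int = 5) -> str:
    """Create the sweep and run `count` trials with a local agent."""
    import wandb

    sweep_id = wandb.sweep(sweep_config, project="sagemaker_camvid_demo")
    wandb.agent(sweep_id, function=train, count=count)
    return sweep_id
```

The metric name logged inside train must match metric.name in the configuration, otherwise the Bayesian search has nothing to optimize.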

When the sweep has finished, we can use a parallel coordinates plot to explore the performance of the models with various backbones and different sets of hyperparameters. Based on that, we can see which model performs the best.

The following screenshot shows the results of the sweeps, including a parallel coordinates chart and parameter correlation charts. You can also view this sweeps dashboard interactively.

We can derive the following key insights from the sweep:

  • Lower learning rates and lower weight decay result in better foreground accuracy and Dice scores.
  • Batch size has a strong positive correlation with the metrics.
  • The VGG-based backbones might not be a good option for training our final model because they’re prone to vanishing gradients. (They’re filtered out because the loss diverged.)
  • The ResNet backbones result in the best overall performance with respect to the metrics.
  • The ResNet34 or ResNet50 backbone should be chosen for the final model because of their strong performance on the metrics.

Data and model lineage

W&B artifacts were designed to make it effortless to version your datasets and models, regardless of whether you want to store your files with W&B or whether you already have a bucket you want W&B to track. After you track your datasets or model files, W&B automatically logs each modification, giving you a complete and auditable history of changes to your files.

In our case, the dataset, models, and different tables generated during training are logged to the workspace. You can quickly view and visualize this lineage by going to the Artifacts page.
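For data that already lives in a bucket, an artifact can point at the files by reference instead of uploading them. The following is a sketch only: the bucket name, prefix, artifact name, and helper names are hypothetical, and tracking S3 references assumes boto3 is installed and AWS credentials are available in the environment.

```python
def s3_uri(bucket: str, prefix: str = "") -> str:
    """Build an s3:// URI from a bucket name and an optional key prefix."""
    prefix = prefix.strip("/")
    return f"s3://{bucket}/{prefix}" if prefix else f"s3://{bucket}"


def track_bucket_dataset(bucket: str, prefix: str,
                         project: str = "sagemaker_camvid_demo") -> None:
    """Log an artifact that references objects in S3 rather than copying them."""
    import wandb  # imported lazily so s3_uri works without wandb installed

    with wandb.init(project=project, job_type="upload") as run:
        # Hypothetical artifact name, chosen for this sketch
        artifact = wandb.Artifact(name="camvid-dataset-ref", type="dataset")
        # W&B stores checksums and metadata; the files stay in the bucket
        artifact.add_reference(s3_uri(bucket, prefix))
        run.log_artifact(artifact)
```

A call such as track_bucket_dataset("my-example-bucket", "camvid/") would then create a new artifact version whenever the referenced objects change.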

Interpret model predictions

Weights & Biases is especially useful for assessing model performance: with wandb.Tables, we can visualize where our model is doing badly. In this case, we’re particularly interested in correctly detecting vulnerable road users like bicycles and pedestrians.

We logged the predicted masks along with the per-class Dice score coefficient into a table. We then filtered for rows containing the desired classes and sorted them in ascending order by Dice score.

In the following table, we first filter for rows where the Dice score is positive (pedestrians are present in the image). Then we sort in ascending order to identify our worst-detected pedestrians. Keep in mind that a Dice score of 1 means the pedestrian class is segmented perfectly. You can also view this table interactively.
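To make the metric concrete, the Dice score for one class is 2|P ∩ G| / (|P| + |G|), where P is the set of pixels predicted as that class and G is the set of ground truth pixels for it. The following sketch computes it on small masks of class IDs and reproduces the filter-then-sort step in plain Python. The function names and row layout are illustrative, and returning 0 when the class is absent from both masks is a convention assumed here, not one confirmed by the notebooks.

```python
def dice_score(pred, target, class_id):
    """Dice coefficient 2*|P & G| / (|P| + |G|) for one class over 2D masks of class IDs."""
    pred_hits = [value == class_id for row in pred for value in row]
    target_hits = [value == class_id for row in target for value in row]
    intersection = sum(p and t for p, t in zip(pred_hits, target_hits))
    total = sum(pred_hits) + sum(target_hits)
    # Assumed convention: a class absent from both masks scores 0.0
    return 2.0 * intersection / total if total else 0.0


def worst_first(rows, key="dice"):
    """Keep rows with a positive score and sort them in ascending order."""
    return sorted((row for row in rows if row[key] > 0), key=lambda row: row[key])
```

For example, with a prediction [[1, 1], [0, 0]] and a ground truth [[1, 0], [0, 0]], class 1 has one overlapping pixel out of 2 predicted and 1 labeled, so its Dice score is 2 × 1 / (2 + 1) = 2/3.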

We can repeat this analysis with other vulnerable classes, such as bicycles or traffic lights.

This feature is a very good way of identifying images that aren’t labeled correctly and tagging them to re-annotate.

Conclusion

This post showcased the Weights & Biases MLOps platform, how to set up W&B in SageMaker Studio, and how to run an introductory notebook on the joint solution. We then ran through an autonomous vehicle semantic segmentation use case and demonstrated tracking training runs with W&B experiments, hyperparameter optimization using W&B sweeps, and interpreting results with W&B tables.

If you’re interested in learning more, you can access the live W&B report. To try Weights & Biases for free, sign up at Weights & Biases, or visit the W&B AWS Marketplace listing.


About the Authors

Thomas Capelle is a Machine Learning Engineer at Weights & Biases. He is responsible for keeping the www.github.com/wandb/examples repository live and up to date. He also builds content on MLOps, applications of W&B to industries, and fun deep learning in general. Previously he was using deep learning to solve short-term forecasting for solar energy. He has a background in Urban Planning, Combinatorial Optimization, Transportation Economics, and Applied Math.

Durga Sury is an ML Solutions Architect in the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. In her 3 years at AWS, she has helped set up AI/ML platforms for enterprise customers. When she isn’t working, she loves motorcycle rides, mystery novels, and hikes with her four-year-old husky.

Karthik Bharathy is the product leader for Amazon SageMaker with over a decade of product management, product strategy, execution and launch experience.
