Robust Graph Neural Networks

Graph Neural Networks (GNNs) are powerful tools for leveraging graph-structured data in machine learning. Graphs are flexible data structures that can model many different kinds of relationships and have been used in diverse applications like traffic prediction, rumor and fake news detection, modeling disease spread, and understanding why molecules smell.

Graphs can model the relationships between many different types of data, including web pages (left), social connections (center), or molecules (right).

As is standard in machine learning (ML), GNNs assume that training samples are selected uniformly at random (i.e., are an independent and identically distributed or “IID” sample). This is easy to do with standard academic datasets, which are specifically created for research analysis and therefore have every node already labeled. However, in many real world scenarios, data comes without labels, and labeling data can be an onerous process involving skilled human raters, which makes it difficult to label all nodes. In addition, biased training data is a common issue because the act of selecting nodes for labeling is usually not IID. For example, sometimes fixed heuristics are used to select a subset of data (which shares some characteristics) for labeling, and other times, human analysts individually choose data items for labeling using complex domain knowledge.

Localized training data is a typical non-IID bias exhibited in graph-structured data. This is shown in the figure on the left, where the training set is created by taking an orange node and expanding to those around it. In contrast, an IID training sample of nodes for labeling would be uniformly distributed, as illustrated by the sampling process on the right.

To quantify the amount of bias present in a training set, one can use methods that measure how large the shift is between two different probability distributions, where the size of the shift can be thought of as the amount of bias. As the shift grows in size, machine learning models have more difficulty generalizing from the biased training set. This situation can meaningfully hurt generalizability — on academic datasets, we’ve observed domain shifts causing a performance drop of 15-20% (as measured by the F1 score).

In “Shift-Robust GNNs: Overcoming the Limitations of Localized Graph Training Data”, presented at NeurIPS 2021, we introduce a solution for using GNNs on biased data. Called Shift-Robust GNN (SR-GNN), this approach is designed to account for distributional differences between biased training data and a graph’s true inference distribution. SR-GNN adapts GNN models to the presence of distributional shift between the nodes labeled for training and the rest of the dataset. We illustrate the effectiveness of SR-GNN in a variety of experiments with biased training datasets on common GNN benchmark datasets for semi-supervised learning and show that SR-GNN outperforms other GNN baselines in accuracy, reducing the negative effects of biased training data by 30–40%.

The Impact of Distribution Shifts on Performance
To demonstrate how distribution shift affects GNN performance, we first generate a number of biased training sets for known academic datasets. Then in order to understand the effect, we plot the generalization (test accuracy) versus a measure of distribution shift (the Central Moment Discrepancy1, CMD). For example, consider the well known PubMed citation dataset, which can be thought of as a graph where the nodes are medical research papers and the edges represent citations between them. When we generate biased training data for PubMed, the plot looks like this:

The effect of distribution shift on the PubMed dataset. Performance (F1) is shown on the y-axis vs. the distribution shift, Central Moment Discrepancy (CMD), on the x-axis, for 100 biased training set samples. As the distribution shift increases, the model’s accuracy falls.

Here one can observe a strong negative correlation between the distribution shift in the dataset and the classification accuracy: as CMD increases, the performance (F1) decreases. That is, GNNs can have difficulty generalizing as their training data looks less like the test dataset.

To address this, we propose a shift-robust regularizer (similar in idea to domain-invariant learning) to minimize the distribution shift between training data and an IID sample from unlabeled data. To do this, we measure the domain shift (e.g., via CMD) in real time as the model is training and apply a direct penalty based on this that forces the model to ignore as much of the training bias as possible. This forces the feature encoders that the model learns for the training data to also work effectively for any unlabeled data, which might come from a different distribution.

The figure below shows what this looks like when compared to a traditional GNN model. We still have the same inputs (the node features X and the adjacency matrix A) and the same number of layers. However, the final embedding Zk from layer (k) of the GNN is compared against embeddings from unlabeled data points to verify that the model is encoding them correctly.

SR-GNN adds two kinds of regularizations to deep GNN models. First, a domain shift regularization (λ term) minimizes the distance between hidden representations of the labeled (Zk) and unlabeled (ZIID) data. Second, the instance weight (β) of the examples can be changed to further approximate the true distribution.

We write this regularization as an additional term in the formula for the model’s loss based on the distance between the training data’s representations and the true data’s distribution (full formulas available in the paper).
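To make this concrete, below is a minimal PyTorch-style sketch of such a penalty added to a standard node-classification loss. The simplified central_moment_discrepancy helper, the choice of five moments, and the gnn interface referenced in the comments are illustrative assumptions rather than the exact formulation in the paper.

```python
import torch

def central_moment_discrepancy(z_train, z_iid, k=5):
    """Simplified CMD between two batches of hidden representations.

    z_train: embeddings Z_k of the labeled (possibly biased) nodes, shape [n, d]
    z_iid:   embeddings Z_IID of an IID sample of unlabeled nodes, shape [m, d]
    """
    mean_train, mean_iid = z_train.mean(dim=0), z_iid.mean(dim=0)
    cmd = torch.norm(mean_train - mean_iid)
    for order in range(2, k + 1):
        c_train = ((z_train - mean_train) ** order).mean(dim=0)
        c_iid = ((z_iid - mean_iid) ** order).mean(dim=0)
        cmd = cmd + torch.norm(c_train - c_iid)
    return cmd

# Inside a training step (gnn, cross_entropy, and the index sets are assumed to exist):
#   z = gnn.embed(x, adj)                       # hidden representations for all nodes
#   loss = cross_entropy(gnn.classify(z[train_idx]), y_train) \
#          + lambda_shift * central_moment_discrepancy(z[train_idx], z[iid_idx])
#   loss.backward()
```

The penalty weight (the λ term in the figure above) trades off fitting the biased labels against keeping the labeled and unlabeled representations distributionally close.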

In our experiments, we compare our method against a number of standard graph neural network models to measure their performance on node classification tasks. We demonstrate that adding the SR-GNN regularization gives a 30–40% improvement on classification tasks with biased training data labels.

A comparison of SR-GNN using node classification with biased training data on the PubMed dataset. SR-GNN outperforms seven baselines, including DGI, GCN, GAT, SGC and APPNP.

Shift-Robust Regularization for Linear GNNs via Instance Re-weighting
Moreover, it’s worth noting that there is another class of GNN models (e.g., APPNP, SimpleGCN, etc.) that are based on linear operations to speed up their graph convolutions. We also examined how to make these models more reliable in the presence of biased training data. While the same regularization mechanism cannot be directly applied due to their different architecture, we can “correct” the training bias by re-weighting the training instances according to their distance from an approximated true distribution. This allows correcting the distribution of the biased training data without passing gradients through the model.
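As a rough illustration of the re-weighting idea (not the exact estimator used in the paper), one can train a simple classifier to distinguish the biased training sample from an IID unlabeled sample and use its probability ratio as the instance weight:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def shift_correction_weights(z_train, z_iid, clip=5.0):
    """Estimate per-instance weights that approximate p_iid(z) / p_train(z).

    z_train: representations of the labeled (biased) nodes, shape [n, d]
    z_iid:   representations of an IID sample of unlabeled nodes, shape [m, d]
    """
    features = np.vstack([z_train, z_iid])
    # Label 0 = biased training sample, 1 = IID unlabeled sample.
    domain = np.concatenate([np.zeros(len(z_train)), np.ones(len(z_iid))])
    clf = LogisticRegression(max_iter=1000).fit(features, domain)

    p_iid = clf.predict_proba(z_train)[:, 1]
    weights = p_iid / np.clip(1.0 - p_iid, 1e-6, None)
    weights = np.clip(weights, 0.0, clip)      # keep individual weights bounded
    return weights / weights.mean()            # normalize to mean 1

# Each weight (beta in the figure above) then multiplies its instance's loss term.
```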

Finally, the two regularizations — for both deep and linear GNNs — can be combined into a generalized regularization for the loss, which combines both domain regularization and instance reweighting (details, including the loss formulas, available in the paper).

Conclusion
Biased training data is common in real world scenarios and can arise due to a variety of reasons, including difficulties of labeling a large amount of data, the various heuristics or inconsistent techniques that are used to choose nodes for labeling, delayed label assignment, and others. We presented a general framework (SR-GNN) that can reduce the influence of biased training data and can be applied to various types of GNNs, including both deeper GNNs and more recent linearized (shallow) versions of these models.

Acknowledgements
Qi Zhu is a PhD Student at UIUC. Thanks to our collaborators Natalia Ponomareva (Google Research) and Jiawei Han (UIUC). Thanks to Tom Small and Anton Tsitsulin for visualizations.


1We note that many measures of distribution shift have been proposed in the literature. Here we use CMD (as it is quick to calculate and generally shows good performance in the domain adaptation literature), but the concept generalizes to any measure of distribution distances/domain shift. 


How The Barcode Registry detects counterfeit products using object detection and Amazon SageMaker

This is a guest post authored by Andrew Masek, Software Engineer at The Barcode Registry and Erik Quisling, CEO of The Barcode Registry.

Product counterfeiting is the single largest criminal enterprise in the world. Growing over 10,000% in the last two decades, sales of counterfeit goods now total $1.7 trillion per year worldwide, which is more than drugs and human trafficking. Although traditional methods of counterfeit prevention like unique barcodes and product verification can be very effective, new machine learning (ML) technologies such as object detection seem very promising. With object detection, you can now snap a picture of a product and know almost instantly if that product is likely to be legitimate or fraudulent.

The Barcode Registry (in conjunction with its partner Buyabarcode.com) is a full-service solution that helps customers prevent product fraud and counterfeiting. It does this by selling unique GS1-registered barcodes, verifying product ownership, and registering users’ products and barcodes in a comprehensive database. Their latest offering, which we discuss in this post, uses Amazon SageMaker to create object detection models to help instantly recognize counterfeit products.

Overview of solution

To use these object detection models, you first need to collect data to train them. Companies upload annotated pictures of their products to The Barcode Registry website. After this data is uploaded to Amazon Simple Storage Service (Amazon S3) and processed by AWS Lambda functions, you can use it to train a SageMaker object detection model. This model is hosted on a SageMaker endpoint, which the website connects to the end user.

There are three key steps The Barcode Registry uses to create a custom object detection model with SageMaker:

  1. Create a training script for SageMaker to run.
  2. Build a Docker container from the training script and upload it to Amazon ECR.
  3. Use the SageMaker console to train a model with the custom algorithm.

Product data

As a prerequisite, to train an object detection model you need an AWS account and training images, consisting of at least 100 high-quality pictures of your object (high-resolution and in multiple lighting conditions). As with any ML model, high-quality data is paramount. To train an object detection model, we need images containing the relevant products as well as bounding boxes describing where the products are in the images, as shown in the following example.

To train an effective model, pictures of each of a brand’s products with different backgrounds and lighting conditions are needed—approximately 30–100 unique annotated images for each product.

After the images are uploaded to the web server, they’re uploaded to Amazon S3 using the AWS SDK for PHP. A Lambda event is triggered each time an image is uploaded. The function removes the Exif metadata from the images, which can sometimes cause them to appear rotated when they’re opened by the ML libraries later used to train the model. The associated bounding box data is stored in JSON files and uploaded to Amazon S3 to accompany the images.
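The post doesn’t show the Lambda code itself, but a minimal sketch of such a handler might look like the following, using Pillow to rewrite each uploaded image without its Exif block (the bucket layout and the processed/ output prefix are assumptions):

```python
import io
import boto3
from PIL import Image

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by an S3 upload; rewrites the image without Exif metadata."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    image = Image.open(io.BytesIO(body))

    # Copy pixel data into a fresh image so no Exif/orientation tags survive.
    clean = Image.new(image.mode, image.size)
    clean.putdata(list(image.getdata()))

    buffer = io.BytesIO()
    clean.save(buffer, format=image.format or "JPEG")
    # Write under a separate prefix so this upload trigger does not fire again.
    s3.put_object(Bucket=bucket, Key="processed/" + key, Body=buffer.getvalue())
```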

SageMaker for object detection models

SageMaker is a managed ML service that includes a variety of tools for building, training, and hosting models in the cloud. In particular, The Barcode Registry uses SageMaker for its object detection service because of SageMaker’s reliable and scalable ML model training and hosting services. This means that many brands can have their own object detection models trained and hosted, and even if usage spikes unpredictably, there won’t be any downtime.

The Barcode Registry uses custom Docker containers uploaded to Amazon Elastic Container Registry (Amazon ECR) in order to have more fine-grained control of the object detection algorithm employed for training and inference, as well as support for Multi Model Server (MMS). MMS is very important for the counterfeit detection use case because it allows multiple brands’ models to be cost-effectively hosted on the same server. Alternatively, you can use the built-in object detection algorithm to quickly deploy standard models developed by AWS.

Train a custom object detection model with SageMaker

First, you need to add your object detection algorithm. In this case, upload a Docker container featuring scripts to train a Yolov5 object detection model to Amazon ECR:

  1. On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  2. Choose Create notebook instance.
  3. Enter a name for the notebook instance and under Permissions and encryption choose an AWS Identity and Access Management (IAM) role with the necessary permissions.
  4. Open the Git repositories menu.
  5. Select Clone a public Git repository to this notebook instance only and paste the following Git repository URL: https://github.com/portoaj/SageMakerObjectDetection
  6. Click Create notebook instance and wait about five minutes for the instance’s status to update from Pending to InService in the Notebook instance menu.
  7. Once the notebook is InService, select it and click Actions and Open Jupyter to launch the notebook instance in a new tab.
  8. Select the SageMakerObjectDetection directory and then click on sagemakerobjectdetection.ipynb to launch the Jupyter notebook.
  9. Select the conda_python3 kernel and click Set Kernel.
  10. Select the code cell and set the aws_account_id variable to your AWS Account ID.
  11. Click Run to begin the process of building a Docker container and uploading it to Amazon ECR. This process may take about 20 minutes to complete.
  12. Once the Docker container has been uploaded, return to the Notebook instances menu, select your instance, and click Actions and Stop to shut your notebook instance down.

After the algorithm is built and pushed to Amazon ECR, you can use it to train a model via the SageMaker console.

  1. On the SageMaker console, under Training in the navigation pane, choose Training jobs.
  2. Choose Create training job.
  3. Enter a name for the job and choose the AWS Identity and Access Management (IAM) role with the necessary permissions.
  4. For Algorithm source, select Your own algorithm container in ECR.
  5. For Container, enter the registry path.
  6. Setting a single ml.p2.xlarge instance under the resource configuration should be sufficient for training a Yolov5 model.
  7. Specify Amazon S3 locations for both your input data and output path and any other settings such as configuring a VPC via Amazon Virtual Private Cloud (Amazon VPC) or enabling Managed Spot Training.
  8. Choose Create training job.

You can track the model’s training progress on the SageMaker console.
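Teams that prefer to script this step can launch an equivalent job with the SageMaker Python SDK instead of the console; the sketch below uses placeholder values for the ECR image URI, IAM role, and S3 paths:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/sagemaker-yolov5:latest",
    role="<your-sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.p2.xlarge",        # matches the resource suggested above
    output_path="s3://<your-bucket>/models/",
    sagemaker_session=session,
)

# Point the job at the annotated images and bounding-box JSON uploaded earlier.
estimator.fit({"training": "s3://<your-bucket>/training-data/"})
```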

Automated model training

The following diagram illustrates the automated model training workflow:

To make SageMaker start training the object detection model as soon as a user finishes uploading their data, the web server uses Amazon API Gateway to notify a Lambda function that the brand has finished and to begin a training job.

When a brand’s model is successfully trained, Amazon EventBridge calls a Lambda function that moves the trained model into the live endpoint’s S3 bucket, where it’s finally ready for inference. A newer alternative to using Amazon EventBridge to move models through the MLOps lifecycle that you should consider is SageMaker Pipelines.
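A sketch of what that EventBridge-triggered function might look like is below; the live-model bucket name, the key naming scheme, and the reliance on the training-job completion event fields are illustrative assumptions:

```python
import boto3

s3 = boto3.client("s3")
LIVE_MODEL_BUCKET = "barcode-registry-live-models"   # hypothetical endpoint bucket

def handler(event, context):
    """Fires on the SageMaker training-job state-change event; promotes the artifact."""
    detail = event["detail"]
    if detail.get("TrainingJobStatus") != "Completed":
        return  # ignore failed or stopped jobs

    artifact = detail["ModelArtifacts"]["S3ModelArtifacts"]  # s3://bucket/prefix/model.tar.gz
    src_bucket, src_key = artifact.replace("s3://", "").split("/", 1)

    # Name the copy after the training job so MMS can address the model by name.
    s3.copy_object(
        Bucket=LIVE_MODEL_BUCKET,
        Key=detail["TrainingJobName"] + ".tar.gz",
        CopySource={"Bucket": src_bucket, "Key": src_key},
    )
```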

Host the model for inference

The following diagram illustrates the inference workflow:

To use the trained models, SageMaker requires an inference model to be hosted by an endpoint. The endpoint is the server or array of servers that are used to actually host the inference model. Similar to the training container that we created, a Docker container for inference is hosted in Amazon ECR. The inference model uses that Docker container and takes the input image the user took with their phone, runs it through the trained object detection model, and outputs the result.

Again, The Barcode Registry uses custom Docker containers for the inference model to enable the use of Multi Model Server, but if only one model is needed that can be easily hosted through the built-in object detection algorithm.
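For reference, a client-side call to a multi-model endpoint typically looks something like the sketch below; the endpoint name, content type, and per-brand model artifact name are placeholders rather than values from this solution:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

def detect_products(image_bytes, brand_model="example-brand.tar.gz"):
    """Send a phone photo to the multi-model endpoint and return raw detections."""
    response = runtime.invoke_endpoint(
        EndpointName="barcode-registry-detector",  # hypothetical endpoint name
        ContentType="application/x-image",         # whatever the inference container expects
        TargetModel=brand_model,                   # selects the brand's model under MMS
        Body=image_bytes,
    )
    return response["Body"].read()
```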

Conclusion

The Barcode Registry (in conjunction with its partner Buyabarcode.com) uses AWS for its entire object detection pipeline. The web server reliably stores data in Amazon S3 and uses API Gateway and Lambda functions to connect the web server to the cloud. SageMaker readily trains and hosts ML models, which means a user can take a picture of a product on their phone and see if the product is a counterfeit. This post shows how to create and host an object detection model using SageMaker, as well as how to automate the process.

In testing, the model was able to achieve over 90% accuracy on a training set of 62 images and a testing set of 32 images, which is pretty impressive for a model trained without any human intervention. To get started training object detection models yourself, check out the official documentation or learn how to deploy an object detection model to the edge using AWS IoT Greengrass.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Andrew Masek, Software Engineer at The Barcode Registry.

Erik Quisling, CEO of The Barcode Registry.



µTransfer: A technique for hyperparameter tuning of enormous neural networks


Great scientific achievements cannot be made by trial and error alone. Every launch in the space program is underpinned by centuries of fundamental research in aerodynamics, propulsion, and celestial bodies. In the same way, when it comes to building large-scale AI systems, fundamental research forms the theoretical insights that drastically reduce the amount of trial and error necessary and can prove very cost-effective. 

In this post, we relay how our fundamental research enabled us, for the first time, to tune enormous neural networks that are too expensive to train more than once. We achieved this by showing that a particular parameterization preserves optimal hyperparameters across different model sizes. This is the µ-Parametrization (or µP, pronounced “myu-P”) that we introduced in a previous paper, where we showed that it uniquely enables maximal feature learning in the infinite-width limit. In collaboration with researchers at OpenAI, we verified its practical advantage on a range of realistic scenarios, which we describe in our new paper, “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer.”

By greatly reducing the need to guess which training hyperparameters to use, this technique can accelerate research on enormous neural networks, such as GPT-3 and potentially larger successors in the future. We also released a PyTorch package that facilitates the integration of our technique in existing models, available on the project GitHub page or by simply running pip install mup.

“µP provides an impressive step toward removing some of the black magic from scaling up neural networks. It also provides a theoretically backed explanation of some tricks used by past work, like the T5 model. I believe both practitioners and researchers alike will find this work valuable.”

— Colin Raffel, Assistant Professor of Computer Science, University of North Carolina at Chapel Hill and co-creator of T5

Scaling the initialization is easy, but scaling training is hard

Large neural networks are hard to train partly because we don’t understand how their behavior changes as their size increases. Early work on deep learning, such as by Glorot & Bengio and He et al., generated useful heuristics that deep learning practitioners widely use today. In general, these heuristics try to keep the activation scales consistent at initialization. However, as training starts, this consistency breaks at different model widths, as illustrated on the left in Figure 1.

Unlike at random initialization, behavior during training is much harder to mathematically analyze. Our goal is to obtain a similar consistency so that as model width increases, the change in activation scales during training stays consistent and similar to initialization, to avoid numerical overflow and underflow. Our solution, µP, achieves this goal, as seen on the right in Figure 1, which shows the stability of network activation scales for the first few steps of training across increasing model width.

Two line-plots showing the change in activation scale between PyTorch default and the µ-Parametrization. Under PyTorch default, the activation scale grows as the network width increases for a particular time step. Under µ-Parametrization, the activation scale is stable across widths for a particular time step.
Figure 1: In the default parameterization in PyTorch, the graph on the left, the activation scales diverge in width after one step of training. But in µP, the graph on the right, the activation scales change by a consistent amount regardless of width for any training step. The y-axis shows the change of network activation scales on a fixed input after t=0, 1, 2, 3, and 4 steps of training as the width of the model varies, which is shown along the x-axis. 

Our parameterization, which maintains this consistency during training, follows two pieces of crucial insight. First, gradient updates behave differently from random weights when the width is large. This is because gradient updates are derived from data and contain correlations, whereas random initializations do not. Therefore, they need to be scaled differently. Second, parameters of different shapes also behave differently when the width is large. While we typically divide parameters into weights and biases, with the former being matrices and the latter vectors, some weights behave like vectors in the large-width setting. For example, the embedding matrix in a language model is of size vocabsize x width. As the width tends to infinity, vocabsize stays constant and finite. During matrix multiplication, summing along a finite dimension behaves very differently from summing along an infinite one.

These insights, which we discuss in detail in a previous blog post, motivated us to develop µP. In fact, beyond just keeping the activation scale consistent throughout training, µP ensures that neural networks of different and sufficiently large widths behave similarly during training such that they converge to a desirable limit, which we call the feature learning limit.

A theory-guided approach to scaling width

Our theory of scaling enables a procedure to transfer training hyperparameters across model sizes. If, as discussed above, µP networks of different widths share similar training dynamics, they likely also share similar optimal hyperparameters. Consequently, we can simply apply the optimal hyperparameters of a small model directly onto a scaled-up version. We call this practical procedure µTransfer. If our hypothesis is correct, the training loss-hyperparameter curves for µP models of different widths would share a similar minimum.

Conversely, our reasoning suggests that no scaling rule of initialization and learning rate other than µP can achieve the same result. This is supported by the animation below. Here, we vary the parameterization by interpolating the initialization scaling and the learning rate scaling between PyTorch default and µP. As shown, µP is the only parameterization that preserves the optimal learning rate across width, achieves the best performance for the model with width 2^13 = 8192, and is the one where wider models always do better for a given learning rate—that is, graphically, the curves don’t intersect.

An animated line-plot showing the stability of optimal learning rate as we change the neural network’s parametrization. The parametrization is varied by interpolating between µ-Parametrization and PyTorch default in terms of the scaling for the learning rate and the initialization scale. The animation shows that µ-Parametrization is the only parametrization that preserves the optimality of learning rate across model widths; it also achieves the best absolute performance across all parametrizations.
Figure 2: On the left, we train multilayer perceptrons (MLPs) of different widths (which correspond to the curves of different colors and patterns) with different learning rates (shown along the x-axis) on CIFAR10 and plot the training loss along the y-axis. On the right, the 2D plane of parameterizations is formed by interpolation of 1) the initialization scaling between PyTorch default and µP (x-axis), and 2) the learning rate scaling between PyTorch default and µP (y-axis). On this plane, PyTorch default is represented by (0, 0) and µP by (1, 1). The width-256 (log2(width) = 8) model is the same across all frames (except for random seed), but we widen models according to the parameterization represented by the dot on the right. 

Building on the theoretical foundation of Tensor Programs, µTransfer works automatically for advanced architectures, such as Transformer and ResNet. It can also simultaneously transfer a wide range of hyperparameters. Using Transformer as an example, we demonstrate in Figure 3 how the optima of key hyperparameters are stable across widths. 

Four line-plots showing the stability of optima of various hyperparameters across widths. From left-to-right and top-to-bottom, we see that the optima for learning rate, cross-entropy temperature, initialization standard deviation, and learning rate schedule are all roughly stable across widths, from 128 to 4,096.
Figure 3: Transformers of different widths parameterized in µP and trained on WikiText-2. As we increase model width, the optimal learning rate, cross-entropy temperature, initialization scale, and learning rate schedule remain stable. We can meaningfully predict the optimal hyperparameters of a wider network by looking at those of a narrow one. In the plot on the lower right, we tried the following learning rate schedules: (a) linear decay, (b) StepLR @ [5k, 8k] with a decay factor of 0.1, (c) StepLR @ [4k, 7k] with a decay factor of 0.3, (d) cosine annealing, (e) constant, and (f) inverse square-root decay.

“I am excited about µP advancing our understanding of large models. µP’s principled way of parameterizing the model and selecting the learning rate make it easier for anybody to scale the training of deep neural networks. Such an elegant combination of beautiful theory and practical impact.”

— Johannes Gehrke, Technical Fellow, Lab Director of Research at Redmond, and CTO and Head of Machine Learning for the Intelligent Communications and Conversations Cloud (IC3)

Beyond width: Empirical scaling of model depth and more

Modern neural network scaling involves many more dimensions than just width. In our work, we also explore how µP can be applied to realistic training scenarios by combining it with simple heuristics for nonwidth dimensions. In Figure 4, we use the same transformer setup to show how the optimal learning rate remains stable within reasonable ranges of nonwidth dimensions. For hyperparameters other than learning rate, see Figure 19 in our paper. 

Four line-plots showing the stability of the optimal learning rate across width, depth, batch size, and sequence length. The width is varied from 128 to 4,096, the depth from 2 to 32, the batch size from 20 to 512, and the sequence length from 32 to 512.
Figure 4: Transformers of different sizes parameterized in µP and trained on Wikitext-2. Not only does the optimal learning rate transfer across width, as shown in Figure 3, it also empirically transfers across other scale dimensions—such as depth, batch size, and sequence length—across the ranges we tested here. This means we can combine our theoretically motivated transfer across width with the empirically verified one across other scale dimensions to obtain the practical procedure, µTransfer, to tune hyperparameters indirectly on a small model and transfer to a large one. 

Testing µTransfer

Now that we have verified the transfer of individual hyperparameters, it is time to combine them in a more realistic scenario. In Figure 5, we compare µTransfer, which transfers tuned hyperparameters from a small proxy model, with directly tuning the large target model. In both cases, the tuning is done via random search. Figure 5 illustrates a Pareto frontier of the relative tuning compute budget compared with the tuned model quality (BLEU score) on IWSLT14 De-En, a machine translation dataset. Across all compute budget levels, µTransfer is about an order of magnitude (in base 10) more compute-efficient for tuning. We expect this efficiency gap to dramatically grow as we move to larger target model sizes. 

A line-plot showing the Pareto front corresponding to model performance measured in BLEU score and the compute budget for hyperparameter tuning. The curve representing our method, µTransfer, dominates that of conventional tuning with a margin of roughly 10 times in compute budget. Our method also yields the best absolute performance, at almost 35.4 in BLEU score, whereas the conventional method tops out at 35.2.
Figure 5: Across different tuning budgets, µTransfer dominates the baseline method of directly tuning the target model. As we train larger target models with billions of parameters, we expect the performance gap to widen, since the proxy model can remain small while still meaningfully predicting the optimal hyperparameters, as shown in Figures 3 and 4. 

A glimpse of the future: µP + GPT-3

Before this work, the larger a model was, the less well-tuned we expected it to be due to the high cost of tuning. Therefore, we expected that the largest models could benefit the most from µTransfer, which is why we partnered with OpenAI to evaluate it on GPT-3. 

After parameterizing a version of GPT-3 with relative attention in µP, we tuned a small proxy model with 40 million parameters before copying the best hyperparameter combination to the 6.7-billion parameter variant of GPT-3, as prescribed by µTransfer. The total compute used during this tuning stage was only 7 percent of the compute used in the pretraining of the final 6.7-billion model. This µTransferred model outperformed the model of the same size (with absolute attention) in the original GPT-3 paper. In fact, it performs similarly to the model (with absolute attention) with double the parameter count from the same paper, as shown in Figure 6. 

Two bar-plots showing the relative performance of GPT-3 6.7B compared to GPT-3 6.7B tuned with µTransfer. On language modeling tasks, including PTB, Wikitext 103, and LM1B, the run with µTransfer achieves lower perplexities. On NLU tasks, including HellaSwag, LAMBADA, and SQuADv2, the run with µTransfer achieves higher accuracies, comparable to those achieved by GPT-3 6.7B or GPT-3 13B tuned without µTransfer.
Figure 6: We applied µTransfer to GPT-3 6.7-billion parameter model with relative attention and obtained better results than the baseline with absolute attention used in the original GPT-3 paper, all while only spending 7 percent of the pretraining compute budget on tuning. The performance of this µTransfer 6.7-billion model is comparable to that of the 13-billion model (with absolute attention) in the original GPT-3 paper.

Implications for deep learning theory

As shown previously, µP gives a scaling rule which uniquely preserves the optimal hyperparameter combination across models of different widths in terms of training loss. Conversely, other scaling rules, like the default in PyTorch or the NTK parameterization studied in the theoretical literature, are looking at regions in the hyperparameter space farther and farther from the optimum as the network gets wider. In that regard, we believe that the feature learning limit of µP, rather than the NTK limit, is the most natural limit to study if our goal is to derive insights that are applicable to feature learning neural networks used in practice. As a result, more advanced theories on overparameterized neural networks should reproduce the feature learning limit of µP in the large width setting. 

Theory of Tensor Programs

The advances described above are made possible by the theory of Tensor Programs (TPs) developed over the last several years. Just as autograd helps practitioners compute the gradient of any general computation graph, TP theory enables researchers to compute the limit of any general computation graph when its matrix dimensions become large. Applied to the underlying graphs for neural network initialization, training, and inference, the TP technique yields fundamental theoretical results, such as the architectural universality of the Neural Network-Gaussian Process correspondence and the Dynamical Dichotomy theorem, in addition to deriving µP and the feature learning limit that led to µTransfer. Looking ahead, we believe extensions of TP theory to depth, batch size, and other scale dimensions hold the key to the reliable scaling of large models beyond width. 

Applying µTransfer to your own models

Even though the math can be intuitive, we found that implementing µP (which enables µTransfer) from scratch can be error prone. This is similar to how autograd is tricky to implement from scratch even though the chain rule for taking derivatives is very straightforward. For this reason, we created the mup package to enable practitioners to easily implement µP in their own PyTorch models, just as how frameworks like PyTorch, TensorFlow, and JAX have enabled us to take autograd for granted. Please note that µTransfer works for models of any size, not just those with billions of parameters. 
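As a rough sketch of how the package is typically wired into a PyTorch model (adapted from the package’s documentation; the exact calls, widths, and parameter re-initialization details here are illustrative and may differ between versions):

```python
import torch.nn as nn
from mup import MuAdam, MuReadout, set_base_shapes

def make_model(width):
    return nn.Sequential(
        nn.Linear(784, width),
        nn.ReLU(),
        nn.Linear(width, width),
        nn.ReLU(),
        MuReadout(width, 10),   # the output layer must be a MuReadout for muP scaling
    )

# The base and delta models only define how shapes scale with width;
# the large target model is the one actually trained.
base_model = make_model(width=64)
delta_model = make_model(width=128)
model = make_model(width=8192)
set_base_shapes(model, base_model, delta=delta_model)

# A muP-aware optimizer applies the width-dependent learning-rate scaling.
optimizer = MuAdam(model.parameters(), lr=1e-3)
```

With this setup, hyperparameters tuned on a narrow version of make_model can be reused directly when the width is scaled up.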

The journey has just begun

While our theory explains why models of different widths behave differently, more investigation is needed to build a theoretical understanding of the scaling of network depth and other scale dimensions. Many works have addressed the latter, such as the research on batch size by Shallue et al., Smith et al., and McCandlish et al., as well as research on neural language models in general by Rosenfeld et al. and Kaplan et al. We believe µP can remove a confounding variable for such investigations. Furthermore, recent large-scale architectures often involve scale dimensions beyond those we have talked about in our work, such as the number of experts in a mixture-of-experts system. Another high-impact domain to which µP and µTransfer have not been applied is fine-tuning a pretrained model. While feature learning is crucial in that domain, the need for regularization and the finite-width effect prove to be interesting challenges.

We firmly believe in fundamental research as a cost-effective complement to trial and error and plan to continue our work to derive more principled approaches to large-scale machine learning. To learn about our other deep learning projects or opportunities to work with us and even help us expand µP, please go to our Deep Learning Group page.



Learning from Weakly-Labeled Videos via Sub-Concepts

Video recognition is a core task in computer vision with applications from video content analysis to action recognition. However, training models for video recognition often requires untrimmed videos to be manually annotated, which can be prohibitively time consuming. In order to reduce the effort of collecting videos with annotations, learning visual knowledge from videos with weak labels, i.e., where the annotation is auto-generated without manual intervention, has attracted growing research interest, thanks to the large volume of easily accessible video data. Untrimmed videos, for example, are often acquired by querying with keywords for classes that the video recognition model aims to classify. A keyword, which we refer to as a weak label, is then assigned to each untrimmed video obtained.

Although large-scale videos with weak labels are easier to collect, training with unverified weak labels poses another challenge in developing robust models. Recent studies have demonstrated that, in addition to the label noise (e.g., incorrect action labels on untrimmed videos), there is temporal noise due to the lack of accurate temporal action localization — i.e., an untrimmed video may include other non-targeted content or may only show the target action in a small proportion of the video.

Reducing noise effects for large-scale weakly-supervised pre-training is critical but particularly challenging in practice. Recent work indicates that querying short videos (e.g., ~1 minute in length) to obtain more accurate temporal localization of target actions or applying a teacher model to do filtering can yield improved results. However, such data pre-processing methods prevent models from fully utilizing available video data, especially longer videos with richer content.

In “Learning from Weakly-Labeled Web Videos via Exploring Sub-Concepts“, we propose a solution to these issues that uses a simple learning framework to conduct effective pre-training on untrimmed videos. Instead of simply filtering the potential temporal noise, this approach converts such “noisy” data to useful supervision by creating a new set of meaningful “middle ground” pseudo-labels that expand the original weak label space, a novel concept we call Sub-Pseudo Label (SPL). The model is pre-trained on this more “fine-grained” space and then fine-tuned on a target dataset. Our experiments demonstrate that the learned representations are much better than previous approaches. Moreover, SPL has been shown to be effective in improving the action recognition model quality for Google Cloud Video AI, which enables content producers to easily search through massive libraries of their video assets to quickly source content of interest.

Sampled training clips may represent a different visual action (whisking eggs) from the query label of the whole untrimmed video (baking cookies). SPL converts the potential label noise to useful supervision signals by creating a new set of “middle ground” pseudo-classes (i.e., sub-concepts) via extrapolating two related action classes. Enriched supervision is provided for effective model pre-training.

Sub-Pseudo Label (SPL)
SPL is a simple technique that advances the teacher-student training framework, which is known to be effective for self-training and to improve semi-supervised learning. In the teacher-student framework, a teacher model is trained on high-quality labeled data and then assigns pseudo-labels to unlabeled data. The student model trains on both high-quality labeled data and the unlabeled data that has the teacher-predicted labels. While previous methods have proposed a number of ways to improve the pseudo-label quality, SPL takes a novel approach that combines knowledge from both weak labels (i.e., query text used to acquire data) and teacher-predicted labels, which results in better pseudo-labels overall. This method focuses on video recognition where temporal noise is challenging, but it can be extended easily to other domains, like image classification.

The overall pre-training framework for learning from weakly labeled videos via SPLs. Each trimmed video clip is re-labeled using SPL given the teacher-predicted labels and the weak labels used to query the corresponding untrimmed video.

The SPL method is motivated by the observation that within an untrimmed video “noisy” video clips have semantic relations with the target action (i.e., the weak label class), but may also include essential visual components of other actions, such as the teacher model–predicted class. Our approach uses the extrapolated SPLs from weak labels together with the distilled labels to capture the enriched supervision signals, encouraging learning better representations during pre-training that can be used for downstream fine-tuning tasks.

It is straightforward to determine the SPL class for each video clip. We first perform inference on each video clip using the teacher model trained from a target dataset to get a teacher prediction class. Each clip is also labeled by the class (i.e., query text) of the untrimmed source video. A 2-dimensional confusion matrix is used to summarize the alignments between the teacher model inferences and the original weak annotations. Based on this confusion matrix, we conduct label extrapolation between teacher model predictions and weak labels to obtain the raw SPL label space.
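A minimal sketch of this re-labeling step is shown below, assuming K original weak-label classes and using the full cross-product of (weak label, teacher prediction) pairs as the raw SPL space; the exact extrapolation in the paper may differ, and the array inputs are hypothetical.

```python
import numpy as np

def assign_spl_labels(weak_labels, teacher_preds, num_classes):
    """Map each trimmed clip to a raw Sub-Pseudo Label (SPL) class.

    weak_labels:   class of the untrimmed source video (its query text), shape [n]
    teacher_preds: class the teacher model predicts for the clip, shape [n]
    Every (weak label, teacher prediction) pair becomes its own SPL class,
    giving num_classes * num_classes raw SPL classes (16 in the 4-class example).
    """
    weak_labels = np.asarray(weak_labels)
    teacher_preds = np.asarray(teacher_preds)
    return weak_labels * num_classes + teacher_preds

def assign_spl_b_labels(weak_labels, teacher_preds):
    """SPL-B: keep only an 'agree' and a 'disagree' class per weak-label class."""
    weak_labels = np.asarray(weak_labels)
    agrees = (weak_labels == np.asarray(teacher_preds)).astype(int)
    return weak_labels * 2 + agrees   # 2 classes per original class (8 in the example)
```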

Left: The confusion matrix, which is the basis of the raw SPL label space. Middle: The resulting SPL label spaces (16 classes in this example). Right: SPL-B, another SPL version, that reduces the label space by collating agreed and disagreed entries of each row as independent SPL classes, which in this example results in only 8 classes.

Effectiveness of SPL
We evaluate the effectiveness of SPL in comparison to different pre-training methods applied to a 3D ResNet50 model that is fine-tuned on Kinetics-200 (K200). One pre-training approach simply initializes the model using ImageNet. The other pre-training methods use 670k video clips sampled from an internal dataset of 147k videos, collected following standard processes similar to those described for Kinetics-200, that cover a broad range of actions. Weak label training and teacher prediction training use either the weak labels or teacher-predicted labels on the videos, respectively. Agreement filtering uses only the training data for which the weak labels and teacher-predicted labels match. We find that SPL outperforms each of these methods. Though the dataset used to illustrate the SPL approach was constructed for this work, in principle the method we describe applies to any dataset that has weak labels.

Pre-training Method          Top-1    Top-5
ImageNet Initialized          80.6     94.7
Weak Label Train              82.8     95.6
Teacher Prediction Train      81.9     95.0
Agreement Filtering Train     82.9     95.4
SPL                           84.3     95.7

We also demonstrate that sampling more video clips from a given number of untrimmed videos can help improve the model performance. With a sufficient number of video clips available, SPL consistently outperforms weak label pre-training by providing enriched supervision.

As more clips are sampled from 147K videos, the label noise is increased gradually. SPL becomes more and more effective at utilizing the weakly-labeled clips to achieve better pre-training.

We visualize the concepts learned from SPL by applying Grad-CAM attention visualization to the trained model. It is interesting to observe some meaningful “middle ground” concepts that can be learned by SPL.

Examples of attention visualization for SPL classes. Some meaningful “middle ground” concepts can be learned by SPL, such as mixing up the eggs and flour (left) and using the abseiling equipment (right).

Conclusion
We demonstrate that SPLs can provide enriched supervision for pre-training. SPL does not increase training complexity and can be treated as an off-the-shelf technique to integrate with teacher-student–based training frameworks. We believe this is a promising direction for discovering meaningful visual concepts by bridging weak labels and the knowledge distilled from teacher models. SPL has also demonstrated promising generalization to the image recognition domain and we expect future extensions that apply to tasks that have noise in labels. We have successfully applied SPL for Google Cloud Video AI where it has improved the accuracy of the action recognition models, helping users to better understand, search, and monetize their video content library.

Acknowledgements
We gratefully acknowledge the contributions of other co-authors, including Kunpeng Li, Xuehan Xiong, Chen-Yu Lee, Zhichao Lu, Yun Fu, Tomas Pfister. We also thank Debidatta Dwibedi, David A Ross, Chen Sun, Jonathan C. Stroud, and Wei Hua for their valuable comments and help on this work, and Tom Small for figure creation.


Storage Specialist Excelero Joins NVIDIA

Excelero, a Tel Aviv-based provider of high-performance software-defined storage, is now a part of NVIDIA.

The company’s team of engineers — including its seasoned co-founders with decades of experience in HPC, storage and networking — bring deep expertise in the block storage that large businesses use in storage-area networks.

Now their mission is to help expand support for block storage in our enterprise software stack, such as in clusters for high-performance computing. Block storage also has an important role to play inside the DOCA software framework that runs on our DPUs.

“The Excelero team is joining NVIDIA as demand is surging for high-performance computing and AI,” said Yaniv Romem, CEO and co-founder of Excelero. “We’ll be working with NVIDIA to ensure our existing customers are supported, and going forward we’re thrilled to apply our expertise in block storage to NVIDIA’s world-class AI and HPC platforms,” he added.

Founded in 2014, Excelero developed NVMesh, software that manages and secures virtual arrays of NVMe flash drives as block storage available across public and private clouds.

Excelero’s software has won praise from users for its high throughput, low latency and support for Kubernetes containers. It’s also attracted collaborations with major cloud service providers.

The company has been an NVIDIA partner since its early days, attracting the former Mellanox, now part of NVIDIA, as an investor. We collaborated on accelerating storage with RDMA, a key technology at the heart of both InfiniBand and RoCE (Ethernet) networks.

NVIDIA will continue to support Excelero’s customers by honoring its contracts. Looking ahead, Excelero’s technology will be integrated into NVIDIA’s enterprise software stack.

We welcome this world-class engineering team to NVIDIA.



Assessing Generalization of SGD via Disagreement

Imagine training a deep network twice with two different random seeds on the same data, and then measuring the rate at which they disagree on unlabeled test points. Naively, they can disagree with one another with probability anywhere between zero and twice the error rate. But surprisingly, in practice, we observe that the disagreement and test error of deep neural networks are remarkably close to each other. In the figure, the variable \(y\) refers to the average generalization error of the two models and the variable \(x\) refers to the disagreement of the two models.

Estimating the generalization error of a model — how well the model performs on unseen data — is a fundamental component in any machine learning system. Generalization performance is traditionally estimated in a supervised manner, by dividing the labeled data into a training set and test set. However, high-quality labels are usually costly and, ideally, we would like to use all of them to train the model. On the other hand, in many real-world settings, a large amount of unlabeled data is readily available. How can we tap into the rich information in these unlabeled data and leverage them to assess a model’s performance without labels?

In this work (full paper), we demonstrate that a simple procedure can accurately estimate the generalization error with only unlabeled data. This result reveals a surprising fact about how neural networks make mistakes and their connection to calibration through an identity we call Generalization Disagreement Equality (GDE).

A surprising observation

Stochastic gradient descent (SGD) is perhaps the most popular optimization algorithm for deep neural networks. Due to the non-convex nature of the deep neural network’s optimization landscape, different runs of SGD will find different solutions. As a result, if the solutions are not perfect, they will disagree with each other on some of the unseen data. This disagreement can be harnessed to estimate generalization error without labels:

  1. Given a model, run SGD with the same hyperparameters but different random seeds on the training data to get two different solutions.
  2. Measure how often the networks’ predictions disagree on a new unlabeled test dataset.

We find that the disagreement rate is approximately equal to the average test error over the two models. Our observation builds on the phenomenon reported by Nakkiran and Bansal (2020) [1]: given two networks of the same architecture trained to zero training error on two independently drawn datasets of the same size, the disagreement rate of the pair on the test dataset is nearly equal to the average test error. Our observation generalizes prior work by showing that the same phenomenon holds for small changes in hyperparameters and, more importantly, the same dataset, which makes the procedure relevant for practical applications.
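In code, the procedure is only a few lines; the sketch below assumes you already have a train_with_seed training routine and a batch of unlabeled inputs (both hypothetical names standing in for whatever pipeline you use):

```python
import numpy as np

def estimate_test_error(train_with_seed, unlabeled_x, seeds=(0, 1)):
    """Estimate generalization error from the disagreement of two SGD runs.

    train_with_seed(seed) is assumed to return a trained model exposing
    .predict(x); unlabeled_x is a batch of unlabeled test inputs.
    """
    model_a = train_with_seed(seeds[0])   # same data, same hyperparameters,
    model_b = train_with_seed(seeds[1])   # only the random seed differs

    preds_a = np.asarray(model_a.predict(unlabeled_x))
    preds_b = np.asarray(model_b.predict(unlabeled_x))

    disagreement = np.mean(preds_a != preds_b)
    return disagreement   # empirically close to the average test error of the pair
```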

This procedure estimates the test error with unlabeled data, but the mechanisms behind it are not immediately obvious. Let \(h(x)\) be the prediction of classifier \(h\) and \(y\) be the true label. By the triangle inequality, the disagreement rate can be anywhere between 0 and 2 times the test error: \(0 \leq \mathbb{E}[h(x) \ne h'(x)] \leq \mathbb{E}[h(x) \ne y] + \mathbb{E}[h'(x) \ne y]\). Given that the models observe the same data, we may expect the models to extract similar knowledge from the data and consequently make similar errors, which would make the disagreement rate lower than the test error. Yet, we find that in practice, the disagreement rate approximates the test error without any proportionality constant. Why is this the case?

The final parameters found by SGD depend on many sources of randomness: 1) random initialization, 2) random ordering of a fixed training dataset, and/or 3) random sampling of training data. To understand this phenomenon, we analyze what kind of randomness is responsible for this peculiar property. It is possible to have other sources of randomness, such as dropout, which are not studied in this work.

We study the phenomenon by observing how the behavior of disagreement changes when the different sources of randomness are isolated. For example, when examining the effect of different initializations, we fix the dataset and the order in which the dataset is presented to the model and only change the random initialization. We observe that across different types of randomness, disagreement remains consistently close to the test error. This approach is particularly useful because we can use the same training data to train the two copies of the model. In addition, we empirically observe the phenomenon across a wide range of model architectures (ResNet, VGG, and fully-connected networks) and several popular image recognition benchmarks (MNIST, SVHN, CIFAR10, and CIFAR100).

GDE on CIFAR-10: The scatter plots of pair-wise model disagreement (x-axis) vs the test error (y-axis) of the different ResNet18 trained on CIFAR10. The dashed line is the diagonal line where disagreement equals the test error. Orange dots represent models that use data augmentation. The first two plots correspond to pairs of networks trained on independent datasets, and in the last two plots, on the same dataset.

Furthermore, estimating generalization error is especially important when the test distribution is different from the training distribution. On that note, we see some promising observations in the PACS data [2]. This dataset consists of 4 different environments (Photo, Art, Cartoon, Sketch) of different objects. For each environment, we trained two ResNet18 models on that environment with different initialization and order of the data and measured their disagreement rate on the remaining environments. We observe that similar phenomena hold across many (but not all) pairs of environments,  suggesting that the technique might be adaptable for estimating generalization error under distribution shift.

GDE under distribution shift: The scatter plots of pair-wise model disagreement (x-axis) vs the test error (y-axis) of the different ResNet50 trained on PACS. Each plot corresponds to models evaluated on the domain specified in the title. The source/training domain is indicated by different marker shapes.

Generalization-Disagreement Equality

We will now theoretically investigate the sufficient condition under which the disagreement rate equals generalization error.

In the deep learning literature, it is well-known that ensembles of SGD-trained deep networks are well-calibrated over the training distribution [3]. An ensemble is well-calibrated if the confidence \(\tilde{h}_k(x) = \mathbb{E}_{h \sim H_A}[h_k(x)]\) it outputs for class \(k\) matches the expected accuracy, i.e., \(P(y = k \mid \tilde{h}_k(x) = p) = p\). There exist various formalisms for calibration. In this paper, we show that if an ensemble satisfies class-wise calibration for a distribution \(D\) (a more general form of calibration is discussed in the paper), then \(\mathbb{E}_{h, h'}[\mathbb{E}_D[h(x) \neq h'(x)]] = \mathbb{E}_h[\mathbb{E}_D[h(x) \neq y]]\) with strict equality. We refer to this equality as the Generalization Disagreement Equality (GDE).

Proof Sketch

The proof sketch we show here focuses on a special case, where class-wise calibration holds. Consider binary classification. Say that we have a distribution over the hypotheses \(H_A\) given SGD algorithm \(A\) and we sample two hypotheses \(h, h'\) from this distribution. Now, given some test data \(D\), we partition it by the confidence output by the ensemble, i.e., \(D_q = P(X \mid \tilde{h}_0(x) = q)\). For any fixed \(x\) in the support of \(D_q\), the disagreement rate in expectation over \(h, h'\) is

$$ \mathbb{E}_{h,h'}[h(x) \ne h'(x)] \\ = P(h(x) = 1)\,P(h'(x) = 0) + P(h(x) = 0)\,P(h'(x) = 1) \\ = q(1-q) + (1-q)q = 2q(1-q).$$

Additionally, note that for any \(x \in D\), the expected error equals \(\tilde{h}_{1-y}(x)\). Since the ensemble is also class-wise calibrated, for any \(x \in D_q\) the expected error over \(h \in H_A\) is

$$ \mathbb{E}_h[h(x) \ne y] \\ = \tilde{h}_0(x)\,P(y = 1 \mid \tilde{h}_0(x) = q) + \tilde{h}_1(x)\,P(y = 0 \mid \tilde{h}_0(x) = q) \\ = q(1-q) + (1-q)q = 2q(1-q).$$

We have thus shown that GDE holds on each calibration level set \(D_q\): the expected disagreement rate and the expected error both equal \(2q(1-q)\). Taking the expectation over \(D\), we conclude that the expected error equals the expected disagreement rate, so GDE holds.
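The identity is easy to sanity-check with a toy simulation (an illustrative check, not an experiment from the paper): draw a calibration level q for each point, sample labels and two independent ensemble members consistently with class-wise calibration, and compare the two rates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Each point x gets an ensemble confidence q = P(h(x) = 0) drawn at random.
q = rng.uniform(0.0, 1.0, size=n)

# Class-wise calibration: the true label is 0 with probability q.
y = (rng.uniform(size=n) > q).astype(int)

# Two independent ensemble members each predict 0 with probability q.
h1 = (rng.uniform(size=n) > q).astype(int)
h2 = (rng.uniform(size=n) > q).astype(int)

disagreement = np.mean(h1 != h2)
average_error = 0.5 * (np.mean(h1 != y) + np.mean(h2 != y))
print(disagreement, average_error)   # both concentrate around E[2q(1-q)] = 1/3
```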


Note the theorem only shows that the test error and expected disagreement are equal in expectation over all the models learned by SGD, but does not necessarily explain why the equality seems to hold for as few as a single pair of training runs. It captures the observation’s essence, but explaining why a single pair of models is sufficient is still an open problem. Intuitively, this can be true if the distribution of disagreement has an unusually small variance. In our experiments, we do observe very small variance for disagreement, but a rigorous theoretical discussion of why the variance is small remains unsettled.

Further, in practice, no models are perfectly calibrated, but we can still characterize how the deviation from calibration affects the GDE. We show that the calibration error upper bounds the absolute difference between the expected disagreement and the expected test error. This inequality generalizes the GDE and implies that as long as the ensemble is reasonably calibrated, we can expect a good estimate of the generalization error from disagreement. Finally, we empirically validate that this assumption often holds in practice by training 100 copies of the same model to produce an ensemble that is faithful to the true ensemble and measuring their calibration error. The results show that these ensembles are indeed well-calibrated.

Conclusion

We have presented a method for estimating the generalization error of black-box deep neural networks using only unlabeled data, as well as some theoretical motivation for why the method works. Specifically, we showed that if the ensembles of models trained by SGD are well-calibrated, the expected disagreement equals the expected test error. Remarkably, in practice a single pair of models is sufficient for accurately estimating the generalization error. In the broader picture, this result marks a departure from more traditional approaches to studying generalization and points to the tantalizing possibility of leveraging unlabeled data to estimate the generalization error. We are excited about the results we presented in the paper, but we are even more excited about the questions we did not answer. We hope this work will encourage future investigation of more unconventional ways of understanding generalization in deep learning.

References

[1] Nakkiran and Bansal. Distributional Generalization: A New Kind of Generalization. 2020.
[2] Li et al. Deeper, Broader and Artier Domain Generalization. IEEE Intl. Conf. on Computer Vision (ICCV), 2017.
[3] Lakshminarayanan et al. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. Conference on Neural Information Processing Systems (NeurIPS), 2017.

Read More

Unlocking new doors to artificial intelligence

Artificial intelligence research is constantly developing new hypotheses that have the potential to benefit society and industry; however, sometimes these benefits are not fully realized due to a lack of engineering tools. To help bridge this gap, graduate students in the MIT Department of Electrical Engineering and Computer Science’s 6-A Master of Engineering (MEng) Thesis Program work with some of the most innovative companies in the world and collaborate on cutting-edge projects, while contributing to and completing their MEng thesis.

During a portion of the last year, four 6-A MEng students teamed up and completed an internship with IBM Research’s advanced prototyping team through the MIT-IBM Watson AI Lab on AI projects, often developing web applications to solve real-world issues or business use cases. Here, the students worked alongside AI engineers, user experience engineers, full-stack researchers, and generalists to accommodate project requests and receive thesis advice, says Lee Martie, IBM research staff member and 6-A manager. The students’ projects ranged from generating synthetic data to allow for privacy-sensitive data analysis to using computer vision to identify actions in video, allowing for monitoring human safety and tracking build progress on a construction site.

“I appreciated all of the expertise from the team and the feedback,” says 6-A graduate Violetta Jusiega ’21, who participated in the program. “I think that working in industry gives the lens of making sure that the project’s needs are satisfied and [provides the opportunity] to ground research and make sure that it is helpful for some use case in the future.”

Jusiega’s research intersected the fields of computer vision and design to focus on data visualization and user interfaces for the medical field. Working with IBM, she built an application programming interface (API) that let clinicians interact with a medical treatment strategy AI model, which was deployed in the cloud. Her interface provided a medical decision tree, as well as some prescribed treatment plans. After receiving feedback on her design from physicians at a local hospital, Jusiega developed iterations of the API and of how the results were displayed visually, so that it would be user-friendly and understandable for clinicians, who don’t usually code. She says, “these tools are often not acquired into the field because they lack some of these API principles which become more important in an industry where everything is already very fast paced, so there’s little time to incorporate a new technology.” But this project might eventually allow for industry deployment. “I think this application has a bunch of potential, whether it does get picked up by clinicians or whether it’s simply used in research. It’s very promising and very exciting to see how technology can help us modify, or I can improve, the health-care field to be even more custom-tailored towards patients and giving them the best care possible,” she says.

Another 6-A graduate student, Spencer Compton, was also considering aiding professionals to make more informed decisions, for use in settings including health care, but he was tackling it from a causal perspective. When given a set of related variables, Compton was investigating if there was a way to determine not just correlation, but the cause-and-effect relationship between them (the direction of the interaction) from the data alone. For this, he and his collaborators from IBM Research and Purdue University turned to a field of math called information theory. With the goal of designing an algorithm to learn complex networks of causal relationships, Compton used ideas relating to entropy, the randomness in a system, to help determine if a causal relationship is present and how variables might be interacting. “When judging an explanation, people often default to Occam’s razor,” says Compton. “We’re more inclined to believe a simpler explanation than a more complex one.” In many cases, he says, their approach seemed to perform well. For instance, they were able to consider variables such as lung cancer, pollution, and X-ray findings. He was pleased that his research allowed him to help create a framework of “entropic causal inference” that could aid in safe and smart decisions in the future. “The math is really surprisingly deep, interesting, and complex,” says Compton. “We’re basically asking, ‘when is the simplest explanation correct?’ but as a math question.”

Determining relationships within data can sometimes require large volumes of it to suss out patterns, but for data that may contain sensitive information, this may not be available. For her master’s work, Ivy Huang worked with IBM Research to generate synthetic tabular data using a natural language processing tool called a transformer model, which can learn and predict future values from past values. Trained on real data, the model can produce new data with similar patterns, properties, and relationships without the restrictions around privacy, availability, and access that might come with real data in financial transactions and electronic medical records. Further, she created an API and deployed the model in an IBM cluster, which gave users increased access to the model and the ability to query it without compromising the original data.

Working with the advanced prototyping team, MEng candidate Brandon Perez also considered how to gather and investigate data with restrictions, but in his case it was to use computer vision frameworks, centered on an action recognition model, to identify construction site happenings. The team based their work on the Moments in Time dataset, which contains over a million three-second video clips with about 300 attached classification labels, and has performed well during AI training. However, the group needed more construction-based video data. For this, they used YouTube-8M. Perez built a framework for testing and fine-tuning existing object detection models and action recognition models that could plug into an automatic spatial and temporal localization tool — how they would identify and label particular actions in a video timeline. “I was satisfied that I was able to explore what made me curious, and I was grateful for the autonomy that I was given with this project,” says Perez. “I felt like I was always supported, and my mentor was a great support to the project.”

“The kind of collaborations that we have seen between our MEng students and IBM researchers are exactly what the 6-A MEng Thesis program at MIT is all about,” says Tomas Palacios, professor of electrical engineering and faculty director of the MIT 6-A MEng Thesis program. “For more than 100 years, 6-A has been connecting MIT students with industry to solve together some of the most important problems in the world.”

Read More

Build a cold start time series forecasting engine using AutoGluon

Whether you’re allocating resources more efficiently for web traffic, forecasting patient demand for staffing needs, or anticipating sales of a company’s products, forecasting is an essential tool across many businesses. One particular use case, known as cold start forecasting, builds forecasts for a time series that has little or no existing historical data, such as a new product that just entered the market in the retail industry. Traditional time series forecasting methods such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ES) rely heavily on historical time series of each individual product, and therefore aren’t effective for cold start forecasting.

In this post, we demonstrate how to build a cold start forecasting engine using AutoGluon AutoML for time series forecasting, an open-source Python package to automate machine learning (ML) on image, text, tabular, and time series data. AutoGluon provides an end-to-end automated machine learning (AutoML) pipeline for everyone from beginners to experienced ML developers, making it an accurate and easy-to-use fully automated solution. We use the free Amazon SageMaker Studio Lab service for this demonstration.

Introduction to AutoGluon time series

AutoGluon is a leading open-source library for AutoML for text, image, and tabular data, allowing you to produce highly accurate models from raw data with just one line of code. Recently, the team has been working to extend these capabilities to time series data, and has developed an automated forecasting module that is publicly available on GitHub. The autogluon.forecasting module automatically processes raw time series data into the appropriate format, and then trains and tunes various state-of-the-art deep learning models to produce accurate forecasts. In this post, we demonstrate how to use autogluon.forecasting and apply it to cold start forecasting tasks.

Solution overview

Because AutoGluon is an open-source Python package, you can implement this solution locally on your laptop or on Amazon SageMaker Studio Lab. We walk through the following steps:

  1. Set up AutoGluon for Amazon SageMaker Studio Lab.
  2. Prepare the dataset.
  3. Define training parameters using AutoGluon.
  4. Train a cold start forecasting engine for time series forecasting.
  5. Visualize cold start forecasting predictions.

The key assumption of cold start forecasting is that items with similar characteristics should have similar time series trajectories, which is what allows cold start forecasting to make predictions on items without historical data, as illustrated in the following figure.

In our walkthrough, we use a synthetic dataset based on electricity consumption, which consists of hourly time series for 370 items, each with an item_id from 0–369. Within this synthetic dataset, each item_id is also associated with a static feature (a feature that doesn’t change over time). We train a DeepAR model using AutoGluon to learn the typical behavior of similar items, and transfer that behavior to make predictions on new items (item_id 370–373) that don’t have historical time series data. Although we’re demonstrating the cold start forecasting approach with only one static feature, in practice, having informative and high-quality static features is key to a good cold start forecast.
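To make the data layout concrete, here is a toy illustration (values invented for this post) of the two tables the walkthrough works with: a long-format target time series with one row per item_id and timestamp, and an item metadata table with one row per item_id carrying the static feature. The column names match the code later in this post.

import pandas as pd

# Toy target time series (long format): one row per (item_id, timestamp)
tdf_example = pd.DataFrame({
    "item_id": ["0", "0", "1", "1"],
    "timestamp": ["2014-01-01 00:00:00", "2014-01-01 01:00:00",
                  "2014-01-01 00:00:00", "2014-01-01 01:00:00"],
    "target_value": [13.5, 14.2, 7.1, 6.8],
})

# Toy item metadata: one row per item_id with a static feature 'type'
imdf_example = pd.DataFrame({
    "item_id": ["0", "1"],
    "type": ["A", "B"],
})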

The following diagram provides a high-level overview of our solution. The open-source code is available on the GitHub repo.

Prerequisites

For this walkthrough, you need an Amazon SageMaker Studio Lab account (or a local Anaconda installation if you prefer to run everything on your laptop).

Log in to your Amazon SageMaker Studio Lab account and set up the environment using the terminal:

cd sagemaker-studiolab-notebooks/ 
git clone https://github.com/whosivan/amazon-sagemaker-studio-lab-cold-start-forecasting-using-autogluon
conda env create -f autogluon.yml
conda activate autogluon
git clone https://github.com/yx1215/autogluon.git
cd autogluon/
git checkout --track origin/add_forecasting_predictor

These instructions should also work from your laptop if you don’t have access to Amazon SageMaker Studio Lab (we recommend installing Anaconda on your laptop first).

When you have the virtual environment fully set up, launch the notebook AutoGluon-cold-start-demo.ipynb and select the custom environment .conda-autogluon:Python kernel.

Prepare the target time series and item meta dataset

Download the following datasets to your notebook instance if they’re not included, and save them under the directory data/. You can find these datasets on our GitHub repo:

  • test.csv.gz
  • coldStartTargetData.csv
  • itemMetaData.csv

Run the following snippet to load the target time series dataset into the kernel:

import pandas as pd

# util.extract_gz is a small helper provided with the demo notebook
zipLocalFilePath = "data/test.csv.gz"
localFilePath = "data/test.csv"
util.extract_gz(zipLocalFilePath, localFilePath)

# pandas can read the gzipped CSV directly
tdf = pd.read_csv(zipLocalFilePath, dtype=object)
tdf['target_value'] = tdf['target_value'].astype('float')
tdf.head()

AutoGluon time series requires static features to be represented in numerical format. This can be achieved by applying scikit-learn’s LabelEncoder() to our static feature type, encoding A=0, B=1, C=2, D=3 (see the following code). By default, AutoGluon infers whether a static feature is ordinal or categorical. You can overwrite this by converting the static feature column to the object/string data type for categorical features, or to an integer/float data type for ordinal features.

from sklearn.preprocessing import LabelEncoder

localItemMetaDataFilePath = "data/itemMetaData.csv"
imdf = pd.read_csv(localItemMetaDataFilePath, dtype=object)

# Encode the static feature 'type' as integers (A=0, B=1, C=2, D=3)
labelencoder = LabelEncoder()
imdf['type'] = labelencoder.fit_transform(imdf['type'])

# Cast back to string so AutoGluon treats 'type' as a categorical feature
imdf['type'] = imdf['type'].astype(str)

# Items that appear in the target time series (i.e., have historical data)
imdf_without_coldstart_item = imdf[imdf.item_id.isin(tdf.item_id.tolist())]
imdf_without_coldstart_item.to_csv('data/itemMetaDatawithoutColdstart.csv', index=False)

# Cold start items: present in the metadata but without historical time series
imdf_with_coldstart_item = imdf[~imdf.item_id.isin(tdf.item_id.tolist())]
imdf_with_coldstart_item.to_csv('data/itemMetaDataOnlyColdstart.csv', index=False)

Set up and start AutoGluon model training

We need to specify save_path = ‘autogluon-coldstart-demo’ as the model artifact folder name (see the following code). We set our eval_metric to mean absolute percentage error (‘MAPE’ for short) and the prediction_length to 24 hours. If not specified, AutoGluon by default produces probabilistic forecasts and scores them via the weighted quantile loss. We only look at the DeepAR model in our demo, because the DeepAR algorithm supports cold start forecasting by design. We set one of the DeepAR hyperparameters arbitrarily and pass it to the ForecastingPredictor().fit() call, which restricts AutoGluon to training only the specified model. For a full list of tunable hyperparameters, refer to the gluonts.model.deepar package.

save_path = 'autogluon-coldstart-demo'
eval_metric = 'MAPE'
deepar_params = {
    "scaling": True
}

ag_predictor = ForecastingPredictor(path=save_path, eval_metric=eval_metric).fit(
    tdf,
    static_features=imdf_without_coldstart_item,
    prediction_length=24,  # how far into the future we wish to forecast
    index_column="item_id",
    target_column="target_value",
    time_column="timestamp",
    quantiles=[0.1, 0.5, 0.9],
    hyperparameters={"DeepAR": deepar_params})

The training takes 30–45 minutes. You can get the model summary by calling the following function:

ag_predictor.fit_summary()

Forecast on the cold start item

Now we’re ready to generate forecasts for the cold start items. We recommend having at least five rows for each item_id; for any item_id with fewer than five observations, we fill in the missing rows with NaNs. In our demo, item_id 370 and 372 have zero observations, a pure cold start problem, whereas the other two (item_id 371 and 373) have five target values each.

Load in the cold start target time series dataset with the following code:

localColdStartDataFilePath = "data/coldStartTargetData.csv"
cstdf = pd.read_csv(localColdStartDataFilePath, dtype = object)
cstdf.head(20)
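As noted earlier, each item_id should have at least five rows, with NaN targets filling any gaps. If you bring your own cold start data and some items fall short of that, a helper along the following lines (hypothetical, not part of the demo repo; the column names and hourly frequency follow this demo) is one way to pad short histories:

import numpy as np
import pandas as pd

# Hypothetical helper: ensure each listed item_id has at least `min_rows` rows
# in the long-format target dataframe, filling the padded rows' target with NaN.
def pad_cold_start_items(df, item_ids, min_rows=5, start="2014-01-01 00:00:00", freq="H"):
    frames = []
    for item_id in item_ids:
        group = df[df["item_id"] == item_id].sort_values("timestamp")
        n_missing = min_rows - len(group)
        if n_missing > 0:
            if len(group) > 0:
                # End the padding one step before the item's first real observation
                first_ts = pd.Timestamp(group["timestamp"].iloc[0])
                ts = pd.date_range(end=first_ts, periods=n_missing + 1, freq=freq)[:-1]
            else:
                # Pure cold start: no observations at all, so use an arbitrary anchor
                ts = pd.date_range(start=start, periods=n_missing, freq=freq)
            filler = pd.DataFrame({
                "item_id": item_id,
                "timestamp": ts.strftime("%Y-%m-%d %H:%M:%S"),
                "target_value": np.nan,
            })
            group = pd.concat([filler, group], ignore_index=True)
        frames.append(group)
    return pd.concat(frames, ignore_index=True)

# Example (the cold start items in this demo are item_id 370-373):
# cstdf = pad_cold_start_items(cstdf, item_ids=["370", "371", "372", "373"])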

We feed the cold start target time series into our AutoGluon model, along with the item meta dataset for the cold start item_id:

cold_start_prediction = ag_predictor.predict(cstdf, static_features=imdf_with_coldstart_item)

Visualize the predictions

We can create a plotting function to visualize the cold start forecasts, as shown in the following graph.
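As a rough sketch (not the demo's own code) of what such a plotting function might look like, assume the prediction for a single item has been converted into a DataFrame with ‘0.1’, ‘0.5’, and ‘0.9’ quantile columns indexed by forecast timestamp (adapt this to the structure that predict() actually returns):

import matplotlib.pyplot as plt

# Hypothetical plotting helper: forecast_df holds one column per quantile
# ('0.1', '0.5', '0.9'), indexed by forecast timestamp.
def plot_cold_start_forecast(forecast_df, item_id):
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(forecast_df.index, forecast_df["0.5"], label="median forecast")
    ax.fill_between(forecast_df.index, forecast_df["0.1"], forecast_df["0.9"],
                    alpha=0.3, label="10%-90% interval")
    ax.set_title(f"Cold start forecast for item_id {item_id}")
    ax.set_xlabel("timestamp")
    ax.set_ylabel("target_value")
    ax.legend()
    plt.show()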

Clean up

To optimize resource usage, consider stopping the runtime on Amazon SageMaker Studio Lab after you have fully explored the notebook.

Conclusion

In this post, we showed how to build a cold start forecasting engine using AutoGluon AutoML for time series data on Amazon SageMaker Studio Lab. For those wondering about the difference between Amazon Forecast and AutoGluon time series: Amazon Forecast is a fully managed and supported service that uses machine learning (ML) to generate highly accurate forecasts without requiring any prior ML experience, whereas AutoGluon is a community-supported open-source project that incorporates the latest research contributions. We walked through an end-to-end example to demonstrate what AutoGluon for time series is capable of, and provided a dataset and use case.

AutoGluon for time series data is an open-source Python package, and we hope that this post, together with our code example, gives you a straightforward solution to tackle challenging cold start forecasting problems. You can access the entire example on our GitHub repo. Try it out, and let us know what you think!


About the Authors

Ivan Cui is a Data Scientist with AWS Professional Services, where he helps customers build and deploy solutions using machine learning on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, and healthcare. In his free time, he enjoys reading, spending time with his family, and maximizing his stock portfolio.

Jonas Mueller is a Senior Applied Scientist in the AI Research and Education group at AWS, where he develops new algorithms to improve deep learning and develop automated machine learning. Before joining AWS to democratize ML, he completed his PhD at the MIT Computer Science and Artificial Intelligence Lab. In his free time, he enjoys exploring mountains and the outdoors.

Wenming Ye is a Research Product Manager at AWS AI. He is passionate about helping researchers and enterprise customers rapidly scale their innovations through open-source and state-of-the-art machine learning technology. Wenming has diverse R&D experience from Microsoft Research, the SQL engineering team, and successful startups.

Read More