Continuously monitor predictor accuracy with Amazon Forecast

We’re excited to announce that you can now automatically monitor the accuracy of your Amazon Forecast predictors over time. As new data is provided, Forecast automatically computes predictor accuracy metrics, providing you with more information to decide whether to keep using, retrain, or create new predictors.

Monitoring predictor quality and identifying deterioration in accuracy over time is important to achieving business goals. However, the processes required to continuously monitor predictor accuracy metrics can be time-consuming to set up and challenging to manage: forecasts have to be evaluated, updated accuracy metrics have to be computed, and the metrics have to be stored and charted to understand trends and make decisions about keeping, retraining, or recreating predictors. These processes can result in costly development and maintenance burdens, and place meaningful operational stress on data science and analyst teams. Customers who would rather avoid this overhead often retrain predictors on a fixed schedule, even when retraining isn’t needed, which wastes time and compute.

With today’s launch, Forecast now automatically tracks predictor accuracy over time as new data is imported. You can now quantify your predictor’s deviation from initial quality metrics and systematically evaluate model quality by visualizing trends, and make more informed decisions about keeping, retraining, or rebuilding your models as new data comes in. Predictor monitoring can be enabled for new predictors at inception, or turned on for existing models. You can enable this feature with one click on the AWS Management Console or using Forecast APIs.
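
For API users, the following is a minimal boto3 sketch of enabling monitoring when creating a predictor. The resource names and ARN are hypothetical placeholders; monitoring is enabled through the MonitorConfig parameter of CreateAutoPredictor.

```python
import boto3

forecast = boto3.client("forecast")

# Hypothetical names and ARN; replace with your own resources.
forecast.create_auto_predictor(
    PredictorName="my_predictor",
    ForecastHorizon=24,
    ForecastFrequency="H",
    DataConfig={
        "DatasetGroupArn": "arn:aws:forecast:us-east-1:111122223333:dataset-group/my_dsg"
    },
    # This single parameter enables predictor monitoring.
    MonitorConfig={"MonitorName": "my_predictor_monitor"},
)
```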

Predictor accuracy over time

A predictor is a machine learning model created at a point in time, using an original set of training data. After a predictor is created, it’s used on an ongoing basis over days, weeks, or months into the future to generate time series forecasts with new ground truth data generated through actual transactions. As new data is imported, the predictor generates new forecasted data points based on the latest data provided to it.

When a predictor is first created, Forecast produces accuracy metrics such as weighted quantile loss (wQL), mean absolute percentage error (MAPE), or root mean squared error (RMSE) to quantify the accuracy of the predictor. These accuracy metrics are used to determine whether a predictor will be put into production. However, the performance of a predictor fluctuates over time. External factors, such as changes in the economic environment or in consumer behavior, can alter the fundamental patterns underlying a predictor. Other factors include the introduction of new products, items, and services, or shifts in the distribution of the data.

For example, consider a predictor trained when a certain color of a product was popular. Months later, new colors may appear or become more popular and the distribution of values change. Or a shift occurs in the business environment that modifies long-standing purchasing patterns (such as from high-margin to low-margin products). All things considered, the predictor may need to be retrained, or a new predictor may need to be created to ensure highly accurate predictions continue to be made.

Automated predictor monitoring

Predictor monitoring is designed to automatically analyze your predictor’s performance as new ground truth time series data becomes available and is used to create new forecasts. This monitoring provides you with continuous model performance information, and saves you time so you don’t have to set up the process yourself.

If predictor monitoring is enabled in Forecast, each time you import new data and produce a new forecast, performance statistics are updated automatically. Until now, these performance statistics were only available when the predictor was initially trained; now these statistics are produced on a continuous basis using new ground truth data, and can be actively monitored to gauge predictor performance.

This allows you to use predictor performance statistics to decide when to retrain your predictor or create a new one. For example, as the average wQL metric deviates from the initial baseline values, you can decide whether to retrain. If you retrain your predictor or create a new one, you can begin generating new forecasted data points using the more accurate predictor.

The following graphs provide two examples of predictor monitoring. In the first chart, the average wQL metric decreases from the baseline (the initial value when the predictor was trained), dropping from 0.3 to 0.15 over the course of a few days. A lower wQL means forecast accuracy is increasing over time, so there is no need to retrain this predictor: it’s producing more accurate forecasts than when it was first trained.

In the next figure, the opposite is true: the average wQL is increasing, indicating that accuracy is decreasing over time. In this case, you should consider retraining or rebuilding the predictor with new data.

In Forecast, you have the choice of retraining the current predictor or rebuilding it from scratch. Retraining is done with one click and incorporates more up-to-date data as well as any updates and improvements in the Forecast algorithms. Rebuilding the predictor allows you to provide new inputs (such as a different forecast frequency, horizon, or new dimensions) to create a new predictor.

Enable predictor monitoring

You can enable predictor monitoring when creating a new predictor, or turn it on for existing predictors. This section demonstrates how to do so using the Forecast console. There is also a Jupyter notebook that walks through enabling predictor monitoring using the APIs and generating predictor monitoring results.

This example uses the time-sliced sample dataset available from the predictor monitoring notebook. We start with a 100,000-row dataset of New York City taxi pickups containing a timestamp, a location ID, and a target value (the number of pickups requested during that interval at that location).

Complete the following steps:

  1. On the Forecast console, choose View dataset groups in the navigation pane.
  2. Choose Create dataset group and provide your dataset group details.
    After you create the dataset group, you’re prompted to create a target time series dataset. You use this dataset to train the predictor and create forecasts.
  3. On the Create target time series dataset page, provide your data’s schema, frequency, and location.
  4. Choose Start to import your target dataset.
    Next, you build your predictor and train it using your initial dataset.
  5. In the navigation pane, choose Predictors.
  6. Choose Train new predictor.
  7. In the Predictor settings section, enter a name for your predictor, how long in the future you want to forecast and at what frequency, and the number of quantiles you want to forecast for.
  8. For Optimization metric, you can choose the accuracy metric that AutoPredictor optimizes when tuning the model. We leave this as the default for our walkthrough.
  9. To get the predictor explainability report, select Enable predictor explainability.
  10. To enable predictor monitoring, select Enable predictor monitoring.
  11. Under the input data configuration, you can add local weather information and national holidays for more accurate demand forecasts.
  12. Choose Start to start training your predictor.

    Forecast now trains the predictor with this initial dataset. With predictor monitoring enabled, every time new data is provided in this dataset group, Forecast computes updated predictor accuracy metrics.
  13. After the predictor has been trained, choose it to evaluate the initial accuracy metrics.

    The Metrics tab shows initial predictor quality metrics. Because you haven’t generated any forecasts from your predictor or imported any new ground truth data, there is nothing to show on the Monitoring tab.
    The next step is to generate a forecast using the new predictor.
  14. Choose Forecasts in the navigation pane.
  15. Choose Create forecast to create a new forecast based on the time series data you just imported and the predictor settings.
  16. Provide the forecast name, predictor name, and any additional quantile metrics you wish to compute.

After you create the forecast, you can view and export its details and results on the Forecast details page.

Predictor monitoring: Evaluating accuracy over time

Over time, new ground truth data is created by your business processes: for example, updated sales figures, staffing levels, or manufacturing output. To create new forecasts based on that new data, you can import it into the dataset you created.

  1. On the Amazon Forecast console, on the Dataset groups page, choose your dataset group.
  2. Choose your dataset.
  3. In the Dataset imports section, choose Create dataset import.
  4. Provide additional details about your updated data, including its location.
  5. Choose Start.

With predictor monitoring, Forecast compares this new data to the previous forecast generated, and computes accuracy metrics for the predictor. Updated predictor quality metrics are computed on an ongoing basis as new data is added to the dataset.

You can follow these steps to import additional data, representing additional transactions that have occurred through time.

Evaluate predictor monitoring results

To see predictor monitoring results, you must add new ground truth data after generating the initial forecasts. Forecast compares this new ground truth data to the previous forecast and produces updated model accuracy values for monitoring.

  1. On the Dataset groups page, choose the relevant dataset group and select the target time series dataset to update it with new ground truth data.
  2. Choose Create Dataset Import and add your new ground truth data.

    After you provide the additional ground truth data, you can open your predictor and view initial predictor monitoring statistics.
  3. Choose your predictor and navigate to the Monitoring tab.

You can follow these steps to run additional forecasts using this predictor and add further iterations of ground truth data. The progression of model accuracy statistics for your predictor are available on the Monitoring tab.

This example shows model accuracy statistics for a predictor that has been evaluated with four additional data updates. The predictor had a baseline MAPE of 0.55 when it was initially trained. As additional data was loaded, the MAPE dropped to 0.42 with the first additional dataset, indicating a more accurate predictor, and then fluctuated within a tight range of 0.42 to 0.48 with subsequent datasets.

You can toggle the chart to view additional metrics. In the following examples, MASE and average wQL show similar fluctuations from the baseline over time.

The Monitoring History section at the bottom of the page provides full details on all predictor accuracy metrics tracked over time.
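
You can also retrieve this history programmatically. The following is a hedged boto3 sketch using the ListMonitorEvaluations API; the monitor ARN is a hypothetical placeholder, and the exact response fields are described in the Forecast API reference.

```python
import boto3

forecast = boto3.client("forecast")

# Hypothetical ARN; use ListMonitors or DescribeMonitor to find yours.
monitor_arn = "arn:aws:forecast:us-east-1:111122223333:monitor/my_predictor_monitor"

response = forecast.list_monitor_evaluations(MonitorArn=monitor_arn, MaxResults=20)
for evaluation in response["PredictorMonitorEvaluations"]:
    # Each evaluation carries the accuracy metrics computed for one window.
    print(evaluation.get("EvaluationTime"), evaluation.get("EvaluationState"))
    for metric in evaluation.get("MetricResults", []):
        print(f'  {metric["MetricName"]}: {metric["MetricValue"]}')
```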

Set up predictor monitoring on an existing predictor

You can easily enable monitoring for existing predictors. To do so, complete the following steps:

  1. In the navigation pane, under your dataset, choose Predictors.
  2. From here there are two ways to enable monitoring:
    1. Choose Start monitoring under the Monitoring column.
    2. Choose your predictor and on the Monitoring tab, under Monitor details, choose Start monitor.
  3. In the pop-up dialog, choose Start to start monitoring for the selected predictor.

The Monitoring tab now shows that predictor monitoring has started, and results are generated as you import more data.
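
The console flow above has an API equivalent. As a sketch (the predictor ARN is a hypothetical placeholder), the CreateMonitor API attaches a monitor to an existing AutoPredictor:

```python
import boto3

forecast = boto3.client("forecast")

# Hypothetical predictor ARN; replace with your existing AutoPredictor.
response = forecast.create_monitor(
    MonitorName="existing_predictor_monitor",
    ResourceArn="arn:aws:forecast:us-east-1:111122223333:predictor/my_predictor",
)
print(response["MonitorArn"])  # use this ARN to query monitoring results later
```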

Stop and restart predictor monitoring

You can also stop and restart predictor monitoring. Consider the following:

  • Cost – Predictor monitoring consumes additional resources. For typical small datasets the cost is minimal, but it grows with the size of the dataset (the number of items in the input dataset and the forecast horizon).
  • Privacy – A copy of your forecast is stored during monitoring. If you don’t want to store this copy, you can stop monitoring.
  • Noise – If you’re experimenting with a predictor and don’t want to see noise in your predictor monitor results, you can temporarily stop predictor monitoring and start it again once your predictor is stable.

To stop predictor monitoring, complete the following steps:

  1. Navigate to the Monitoring tab for a predictor where monitoring is enabled.
  2. Choose Stop Monitor to stop the monitoring of the predictor.
  3. Verify your choice when prompted.

A message appears on the next page to indicate that predictor monitoring is stopped.

You can restart predictor monitoring by choosing Resume monitor.

Conclusion

Monitoring the quality of your predictors over time is important to achieve your demand planning and forecasting objectives, and ultimately your business goals. However, predictor monitoring can be a time-consuming exercise, and the processes required to stand up and maintain the necessary workflows can lead to higher operational costs.

Forecast can now automatically track the quality of your predictors, allowing you to reduce operational efforts, while helping you make more informed decisions about keeping, retraining, or rebuilding your predictors. To enable predictor monitoring, you can follow the steps outlined in this post, or follow our GitHub notebook.

Please note that predictor monitoring is only available with AutoPredictor. For more information, refer to New Amazon Forecast API that creates up to 40% more accurate forecasts and provides explainability and CreateAutoPredictor.

To learn more, refer to Predictor Monitoring. We also recommend reviewing the pricing for using these new features. All these new capabilities are available in all Regions where Forecast is publicly available. For more information about Region availability, see AWS Regional Services.


About the Authors

Dan Sinnreich is a Sr. Product Manager for Amazon Forecast. He is focused on democratizing low code/no code machine learning and applying it to improve business outcomes. Outside of work he can be found playing hockey, trying to improve his tennis serve, and reading science fiction.

 Adarsh Singh works as a Software Development Engineer in the Amazon Forecast team. In his current role, he focuses on engineering problems and building scalable distributed systems that provide the most value to end users. In his spare time, he enjoys watching anime and playing video games.

Shannon Killingsworth is a UX Designer for Amazon Forecast. His current work is creating console experiences that are usable by anyone, and integrating new features into the console experience. In his spare time, he is a fitness and automobile enthusiast.

Read More

Adding Quantization-aware Training and Pruning to the TensorFlow Model Garden

Posted by Jaehong Kim, Rino Lee, and Fan Yang, Software Engineers

The TensorFlow model optimization toolkit (TFMOT) provides modern optimization techniques such as quantization aware training (QAT) and pruning. Since the introduction of TFMOT, we have been continuously improving its usability and coverage. Today, we are excited to announce that we are extending the TFMOT model coverage to popular computer vision models in the TensorFlow Model Garden.

To do so, we added 8-bit QAT API support for subclassed models and custom layers, as well as Pruning API support. You can use these new features in the Model Garden and when developing your own models. With this, we have showcased applying QAT and pruning to several canonical computer vision models, while significantly accelerating the model development cycle.

In this article, we describe the technical challenges we encountered when applying QAT and pruning to subclassed models and custom layers, and we present results that show the benefits of these optimization techniques.

New support for Model Garden models

Quantization

We have resolved a few technical challenges to support subclassed models and have simplified the process of applying the QAT API. TFMOT and Model Garden take care of all the new changes, so users don’t need to know the technical details. The final user-facing API for applying QAT to a computer vision model in Model Garden is straightforward: with a few configuration changes, you can enable QAT to finetune a pre-trained model and obtain a deployable on-device model in just a few hours, with minimal to no code changes. Here we discuss those challenges and how we addressed them.

The previous QAT API assumed that the model only contained built-in layers. To support nested functional models, we apply the QAT method to different parts of the model individually. For example, consider an image classification model (M) in the Model Garden that consists of two submodules: a backbone network (B) and a classification head (C). Here B is a nested model within M, and C is a layer; both contain only built-in layers. Instead of directly quantizing the entire classification model M, we quantize B and C individually. First, we apply QAT to the backbone B only. Then we connect the quantized backbone to its corresponding classification head C to form a new classification model and annotate C to be quantized. Finally, we quantize the entire new model, which effectively applies QAT to the annotated classification head C.
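
The following is a minimal sketch of that two-stage flow using the public TFMOT Keras API. The backbone and head here are hypothetical stand-ins, and the actual Model Garden implementation handles more details:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

q = tfmot.quantization.keras

def quantize_classifier(backbone: tf.keras.Model,
                        head: tf.keras.layers.Layer,
                        input_shape=(224, 224, 3)) -> tf.keras.Model:
    # Stage 1: quantize the nested backbone B on its own (built-in layers only).
    qat_backbone = q.quantize_model(backbone)

    # Stage 2: connect the quantized backbone to the head C, annotate only C,
    # then quantize_apply transforms just the annotated part.
    inputs = tf.keras.Input(shape=input_shape)
    features = qat_backbone(inputs)
    outputs = q.quantize_annotate_layer(head)(features)
    model = tf.keras.Model(inputs, outputs)

    with q.quantize_scope():          # make QAT wrappers visible when cloning
        return q.quantize_apply(model)
```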

When the backbone network contains custom layers rather than only built-in layers, we first add quantized versions of those custom layers. For example, if the backbone network (B) or the classification head (C) contains a custom layer called MyLayer, we create its QAT counterpart, MyLayerQuantized, and wrap any built-in layers within it using a quantize wrapper API. We do this recursively if there are nested custom layers, until all built-in layers are properly wrapped.

The remaining step after applying quantization is loading the weights from the original model, because the QAT-applied model contains additional quantization parameters. Our current solution is variable name filtering: we added logic that matches the original model’s weights to the corresponding filtered weights of the QAT-applied model, which supports fine-tuning from pre-trained models.

Pruning

Along with QAT, we provide two Model Garden models with pruning, another in-training model optimization technique of TFMOT. Pruning sparsifies the given model’s weights during training (forcing a fixed portion of the elements to zero) for computation and storage efficiency.

Users can easily set pruning parameters in Model Garden configs. For better pruned-model quality, starting from a pre-trained dense model and carefully tuning the pruning schedule over training steps are well-known techniques; both are available in the Model Garden pruning configs.

This work also provides an example of nested functional layer support in pruning. The approach used here, implementing get_prunable_weights(), is also applicable to any other Keras model with custom layers, as sketched below.
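
As a sketch of both points, the hypothetical custom layer below implements the PrunableLayer interface via get_prunable_weights(), and the polynomial schedule illustrates the kind of tuning over training steps that the Model Garden configs expose (the step values are illustrative):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

sparsity = tfmot.sparsity.keras

# A custom layer opts into pruning by declaring which weights are prunable.
class MyBlock(tf.keras.layers.Layer, sparsity.PrunableLayer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(units)

    def call(self, inputs):
        return tf.nn.relu(self.dense(inputs))

    def get_prunable_weights(self):
        return [self.dense.kernel]  # prune the kernel, keep the bias dense

# Ramp sparsity from 0% to 80% over the first 10,000 steps.
schedule = sparsity.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=10_000)

model = tf.keras.Sequential([tf.keras.Input(shape=(32,)), MyBlock(16)])
pruned = sparsity.prune_low_magnitude(model, pruning_schedule=schedule)
pruned.compile(optimizer="adam", loss="mse")

# UpdatePruningStep is required during training; PruningSummaries writes
# sparsity curves that can be monitored in TensorBoard.
callbacks = [sparsity.UpdatePruningStep(),
             sparsity.PruningSummaries(log_dir="/tmp/pruning_logs")]
```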

With the two provided Model Garden pruning configs, users can quickly apply pruning to ResNet-50 and MobileNetV2 models for image classification. The configs also demonstrate practical usage of the Pruning API and how to monitor the pruning process with TensorBoard.

Examples and Results

We support two tasks: image classification and semantic segmentation. Specifically, for QAT in image classification, we support the common MobileNet family, including MobileNetV2, MobileNetV3 (large), and Multi-Hardware MobileNet (AVG), as well as ResNet (through quantization of common building blocks such as InvertedBottleneckBlockQuantized and BottleneckBlockQuantized). For QAT in semantic segmentation, we support the MobileNetV2 backbone with DeepLab V3/V3+. For pruning in image classification, we support MobileNetV2 and ResNet. Please refer to the documentation of QAT and pruning for more details.

Create QAT Models using Model Garden

Using QAT with Model Garden is simple and straightforward. First, we train a floating point model following the standard process of training models using Model Garden. After training converges, we take the best checkpoint as our starting point to apply QAT, analogous to a finetuning stage. This yields a model that is more quantization friendly, which can then be converted to a TFLite model for on-device deployment.
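
As a minimal sketch of that final conversion step (the toy model below stands in for the QAT-finetuned model):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Toy stand-in for the QAT-finetuned model from the previous step.
base = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])
qat_model = tfmot.quantization.keras.quantize_model(base)

converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # emit a quantized model
tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```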

For image classification, we evaluate the top-1 accuracy on the ImageNet validation set. As shown in Table 1, the QAT model consistently outperforms the PTQ model by a large margin while achieving comparable latency. Notably, on models where PTQ fails to produce reasonable results (MobileNetV3), QAT is still capable of generating a strong quantized model with negligible accuracy drop.

Table 1. Accuracy and latency comparison of supported models for ImageNet classification. Latency is measured on a Samsung Galaxy S21 using 1-thread CPU. FP32 refers to the unquantized floating point TFLite model. PTQ INT8 refers to full integer post-training quantization. QAT INT8 refers to the quantized QAT model.

| Model | Resolution | Top-1 accuracy | Top-1 accuracy (FP32) | Top-1 accuracy (PTQ INT8) | Top-1 accuracy (QAT INT8) | Latency (FP32, ms/img) | Latency (PTQ INT8, ms/img) | Latency (QAT INT8, ms/img) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet50 | 224×224 | 76.7 | 76.7 | 76.4 | 77.2 | 184.01 | 48.73 | 64.49 |
| MobileNet V2 | 224×224 | 72.8 | 72.8 | 72.4 | 72.8 | 16.74 | 6.85 | 6.84 |
| MobileNet V3 Large | 224×224 | 75.1 | 75.1 | 34.5* | 74.4 | 13.32 | 6.43 | 6.85 |
| MobileNet Multi-HW AVG | 224×224 | 75.3 | 75.2 | 73.5 | 75.1 | 20.97 | 7.73 | 7.73 |

* PTQ fails to quantize MobileNet V3 properly due to hard-swish activation, thus leading to low accuracy.

We have a similar observation on semantic segmentation: PTQ introduces a 1.3 mIoU drop compared to the FP32 model, while QAT minimizes the drop to just 0.7 and maintains comparable latency. On average, we expect QAT to introduce only a 0.5 top-1 accuracy drop for image classification and less than a 1 mIoU drop for semantic segmentation.

Table 2. Accuracy and latency comparison of a MobileNet v2 + DeepLab v3 on Pascal VOC segmentation. Latency is measured on a Samsung Galaxy S21 using 1-thread CPU. FP32 refers to the unquantized floating point TFLite model. PTQ INT8 refers to full integer post-training quantization. QAT INT8 refers to the quantized QAT model.

| Model | Resolution | mIoU | mIoU (FP32) | mIoU (PTQ INT8) | mIoU (QAT INT8) | Latency (FP32, ms/img) | Latency (PTQ INT8, ms/img) | Latency (QAT INT8, ms/img) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MobileNet v2 + DeepLab v3 | 512×512 | 75.27 | 75.30 | 73.95 | 74.68 | 136.60 | 60.94 | 55.53 |

Pruning Models in Model Garden

We support ResNet50 and MobileNet V2 for image classification. Pretrained dense models for each task are generated using the Model Garden training configs. The pruned model can be converted to a TFLite model; by simply setting a sparsity flag during TFLite conversion, you can reduce the model size through a sparse data format.
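
A hedged sketch of that conversion: strip_pruning removes the pruning wrappers first, and the EXPERIMENTAL_SPARSITY flag asks the converter to use a sparse data format (the toy model stands in for a trained pruned model):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

sparsity = tfmot.sparsity.keras

# Toy stand-in for a trained pruned model.
pruned_model = sparsity.prune_low_magnitude(
    tf.keras.Sequential([tf.keras.Input(shape=(32,)),
                         tf.keras.layers.Dense(16)]))

final_model = sparsity.strip_pruning(pruned_model)  # remove pruning wrappers
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]  # sparse format
tflite_model = converter.convert()
```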

For image classification, we again evaluate the top-1 accuracy on the ImageNet validation set, as well as the size of the converted TFLite models. As the sparsity level increases, the model size becomes more compact but accuracy degrades. The accuracy impact at high sparsity is more severe for parameter-efficient models like MobileNetV2.

Table 3. Accuracy and model size comparison of ResNet-50 and MobileNet v2 for ImageNet classification. Model size is measured by the disk usage of the saved TFLite models. Dense refers to the unpruned TFLite model; 50% sparsity refers to the TFLite model with 50% of the elements of all prunable layers’ weights pruned.

| Model | Resolution | Top-1 Accuracy (Dense) | Top-1 Accuracy (50% sparsity) | Top-1 Accuracy (80% sparsity) | TFLite Model size (Dense) | TFLite Model size (50% sparsity) | TFLite Model size (80% sparsity) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MobileNet V2 | 224×224 | 72.768% | 71.334% | 61.378% | 13.36 Mb | 9.74 Mb | 4.00 Mb |
| ResNet50 | 224×224 | 76.704% | 76.61% | 75.508% | 97.44 Mb | 70.34 Mb | 28.35 Mb |

Conclusions

We have presented an extension to TFMOT that offers QAT and pruning support for computer vision models in Model Garden. We highlight the ease of use and the favorable trade-offs: maintaining accuracy while keeping latency low or model size small.

While we believe this is a simple and user-friendly solution for enabling QAT and pruning, we know this is just the beginning of our work to provide even better usability.

Currently, supported tasks are limited to image classification and semantic segmentation. We will continue to add support for other tasks, such as object detection and instance segmentation, add more models, such as Transformer-based models, and improve the usability of TFMOT and Model Garden’s API. Thanks for your interest in this work.

Acknowledgements

We would like to thank everyone who contributed to this work, including Model Garden, Model Optimization, and our collaborators from Research. Special thanks to David Rim (emeritus), Ethan Kim (emeritus) from the Model Optimization team; Abdullah Rashwan, Xianzhi Du, Yeqing Li, Jaeyoun Kim, Jing Li from the Model Garden team; Yuqi Li from the on-device ML team.

Read More

Techniques for Training Large Neural Networks

Large neural networks are at the core of many recent advances in AI, but training them is a difficult engineering and research challenge which requires orchestrating a cluster of GPUs to perform a single synchronized calculation. As cluster and model sizes have grown, machine learning practitioners have developed an increasing variety of techniques to parallelize model training over many GPUs. At first glance, understanding these parallelism techniques may seem daunting, but with only a few assumptions about the structure of the computation these techniques become much more clear—at that point, you’re just shuttling around opaque bits from A to B like a network switch shuttles around packets.

An illustration of various parallelism strategies on a three-layer model. Each color refers to one layer and dashed lines separate different GPUs.

No Parallelism

Training a neural network is an iterative process. In every iteration, we do a pass forward through a model’s layers to compute an output for each training example in a batch of data. Then another pass proceeds backward through the layers, propagating how much each parameter affects the final output by computing a gradient with respect to each parameter. The average gradient for the batch, the parameters, and some per-parameter optimization state is passed to an optimization algorithm, such as Adam, which computes the next iteration’s parameters (which should have slightly better performance on your data) and new per-parameter optimization state. As the training iterates over batches of data, the model evolves to produce increasingly accurate outputs.
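
As a minimal PyTorch sketch of one such iteration (where Adam keeps the per-parameter optimization state):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # per-parameter state
loss_fn = torch.nn.MSELoss()

def train_iteration(batch_x, batch_y):
    optimizer.zero_grad()
    loss = loss_fn(model(batch_x), batch_y)  # forward pass over the batch
    loss.backward()   # backward pass: average gradient w.r.t. each parameter
    optimizer.step()  # Adam computes the next iteration's parameters
    return loss.item()

train_iteration(torch.randn(32, 10), torch.randn(32, 1))
```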

Various parallelism techniques slice this training process across different dimensions, including:

  • Data parallelism—run different subsets of the batch on different GPUs;
  • Pipeline parallelism—run different layers of the model on different GPUs;
  • Tensor parallelism—break up the math for a single operation such as a matrix multiplication to be split across GPUs;
  • Mixture-of-Experts—process each example by only a fraction of each layer.

(In this post, we’ll assume that you are using GPUs to train your neural networks, but the same ideas apply to those using any other neural network accelerator.)

Data Parallelism

Data Parallel training means copying the same parameters to multiple GPUs (often called “workers”) and assigning different examples to each to be processed simultaneously. Data parallelism alone still requires that your model fits into a single GPU’s memory, but lets you utilize the compute of many GPUs at the cost of storing many duplicate copies of your parameters. That being said, there are strategies to increase the effective RAM available to your GPU, such as temporarily offloading parameters to CPU memory between usages.

As each data parallel worker updates its copy of the parameters, they need to coordinate to ensure that each worker continues to have similar parameters. The simplest approach is to introduce blocking communication between workers: (1) independently compute the gradient on each worker; (2) average the gradients across workers; and (3) independently compute the same new parameters on each worker. Step (2) is a blocking average which requires transferring quite a lot of data (proportional to the number of workers times the size of your parameters), which can hurt your training throughput. There are various asynchronous synchronization schemes to remove this overhead, but they hurt learning efficiency; in practice, people generally stick with the synchronous approach.
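
A single-function sketch of the synchronous scheme using torch.distributed (it assumes a process group is already initialized; real systems typically use DistributedDataParallel, which also overlaps communication with the backward pass):

```python
import torch
import torch.distributed as dist

def synchronous_data_parallel_step(model, loss, optimizer):
    optimizer.zero_grad()
    loss.backward()                                  # (1) local gradients
    world_size = dist.get_world_size()
    for param in model.parameters():                 # (2) blocking average
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    optimizer.step()                                 # (3) identical update everywhere
```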

Pipeline Parallelism

With Pipeline Parallel training, we partition sequential chunks of the model across GPUs. Each GPU holds only a fraction of parameters, and thus the same model consumes proportionally less memory per GPU.

It’s straightforward to split a large model into chunks of consecutive layers. However, there’s a sequential dependency between inputs and outputs of layers, so a naive implementation can lead to a large amount of idle time while a worker waits for outputs from the previous machine to be used as its inputs. These waiting time chunks are known as “bubbles,” wasting the computation that could be done by the idling machines.


Illustration of a naive pipeline parallelism setup where the model is vertically split into 4 partitions by layer. Worker 1 hosts model parameters of the first layer of the network (closest to the input), while worker 4 hosts layer 4 (which is closest to the output). “F”, “B”, and “U” represent forward, backward and update operations, respectively. The subscripts indicate on which worker an operation runs. Data is processed by one worker at a time due to the sequential dependency, leading to large “bubbles” of idle time.

We can reuse the ideas from data parallelism to reduce the cost of the bubble by having each worker only process a subset of data elements at one time, allowing us to cleverly overlap new computation with wait time. The core idea is to split one batch into multiple microbatches; each microbatch should be proportionally faster to process and each worker begins working on the next microbatch as soon as it’s available, thus expediting the pipeline execution. With enough microbatches the workers can be utilized most of the time with a minimal bubble at the beginning and end of the step. Gradients are averaged across microbatches, and updates to the parameters happen only once all microbatches have been completed.
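
The gradient-accumulation core of this idea can be shown in a single process. This is a sketch, not a real multi-GPU pipeline: one module stands in for a pipeline stage, and each microbatch loss is scaled so the accumulated gradient equals the full-batch average.

```python
import torch

def microbatched_step(stage, batch_x, batch_y, loss_fn, optimizer, n_micro=4):
    optimizer.zero_grad()
    for mx, my in zip(batch_x.chunk(n_micro), batch_y.chunk(n_micro)):
        loss = loss_fn(stage(mx), my) / n_micro  # scale so gradients average
        loss.backward()                          # gradients accumulate in .grad
    optimizer.step()                             # update once per full batch

stage = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(stage.parameters(), lr=0.1)
microbatched_step(stage, torch.randn(32, 16), torch.randn(32, 1),
                  torch.nn.MSELoss(), opt)
```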

The number of workers that the model is split over is commonly known as pipeline depth.

During the forward pass, each worker only needs to send the output (called activations) of its chunk of layers to the next worker; during the backward pass, it sends only the gradients on those activations to the previous worker. There’s a big design space of how to schedule these passes and how to aggregate the gradients across microbatches. GPipe has each worker process forward and backward passes consecutively and then aggregates gradients from multiple microbatches synchronously at the end. PipeDream instead schedules each worker to alternately process forward and backward passes.


Comparison of GPipe and PipeDream pipelining schemes, using 4 microbatches per batch. Microbatches 1-8 correspond to two consecutive data batches. In the image, “(number)” indicates on which microbatch an operation is performed and the subscript marks the worker ID. Note that PipeDream gets more efficiency by performing some computations with stale parameters.

Tensor Parallelism

Pipeline parallelism splits a model “vertically” by layer. It’s also possible to “horizontally” split certain operations within a layer, which is usually called Tensor Parallel training. For many modern models (such as the Transformer), the computation bottleneck is multiplying an activation batch matrix with a large weight matrix. Matrix multiplication can be thought of as dot products between pairs of rows and columns; it’s possible to compute independent dot products on different GPUs, or to compute parts of each dot product on different GPUs and sum up the results. With either strategy, we can slice the weight matrix into even-sized “shards”, host each shard on a different GPU, and use that shard to compute the relevant part of the overall matrix product before later communicating to combine the results.
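
A toy sketch of the column-sharded strategy (single process; each shard stands in for a different GPU, and the concatenation stands in for the communication step):

```python
import torch

def column_parallel_matmul(x, weight_shards):
    partials = [x @ shard for shard in weight_shards]  # independent per "GPU"
    return torch.cat(partials, dim=-1)                 # combine the results

x = torch.randn(8, 512)
w = torch.randn(512, 1024)
shards = w.chunk(4, dim=1)  # four even [512, 256] shards
assert torch.allclose(column_parallel_matmul(x, shards), x @ w, atol=1e-4)
```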

One example is Megatron-LM, which parallelizes matrix multiplications within the Transformer’s self-attention and MLP layers. PTD-P uses tensor, data, and pipeline parallelism; its pipeline schedule assigns multiple non-consecutive layers to each device, reducing bubble overhead at the cost of more network communication.

Sometimes the input to the network can be parallelized across a dimension with a high degree of parallel computation relative to cross-communication. Sequence parallelism is one such idea, where an input sequence is split across time into multiple sub-examples, proportionally decreasing peak memory consumption by allowing the computation to proceed with more granularly-sized examples.

Mixture-of-Experts (MoE)

With the Mixture-of-Experts (MoE) approach, only a fraction of the network is used to compute the output for any one input. One example approach is to have many sets of weights, and the network can choose which set to use via a gating mechanism at inference time. This enables many more parameters without increased computation cost. Each set of weights is referred to as an “expert,” in the hope that the network will learn to assign specialized computation and skills to each one. Different experts can be hosted on different GPUs, providing a clear way to scale up the number of GPUs used for a model.
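
A compact sketch of top-k gating (single process; in a real system the experts would live on different GPUs and tokens would be routed between devices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):  # x: [tokens, d_model]
        weights = F.softmax(self.gate(x), dim=-1)
        top_w, top_i = weights.topk(self.k, dim=-1)   # pick k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e
                if mask.any():                        # only selected experts run
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```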


Illustration of a mixture-of-experts (MoE) layer. Only 2 out of the n experts are selected by the gating network. (Image adapted from: Shazeer et al., 2017)

GShard scales an MoE Transformer up to 600 billion parameters with a scheme where only the MoE layers are split across multiple TPU devices and other layers are fully duplicated. Switch Transformer scales model size to trillions of parameters with even higher sparsity by routing one input to a single expert.

Other Memory Saving Designs

There are many other computational strategies to make training increasingly large neural networks more tractable. For example:

  • To compute the gradient, you need to have saved the original activations, which can consume a lot of device RAM. Checkpointing (also known as activation recomputation) stores any subset of activations, and recomputes the intermediate ones just-in-time during the backward pass. This saves a lot of memory at the computational cost of at most one additional full forward pass. One can also continually trade off between compute and memory cost by selective activation recomputation, which is checkpointing subsets of the activations that are relatively more expensive to store but cheaper to compute. A minimal sketch of checkpointing follows this list.

  • Mixed Precision Training trains models using lower-precision numbers (most commonly FP16). Modern accelerators can reach much higher FLOP counts with lower-precision numbers, and you also save on device RAM. With proper care, the resulting model can lose almost no accuracy.

  • Offloading temporarily moves unused data to the CPU or among different devices, reading it back when needed. Naive implementations slow down training a lot, but sophisticated implementations pre-fetch data so that the device never needs to wait on it. One implementation of this idea is ZeRO, which splits the parameters, gradients, and optimizer states across all available hardware and materializes them as needed.

  • Memory Efficient Optimizers have been proposed to reduce the memory footprint of the running state maintained by the optimizer, such as Adafactor.

  • Compression also can be used for storing intermediate results in the network. For example, Gist compresses activations that are saved for the backward pass; DALL·E compresses the gradients before synchronizing them.
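
As promised above, a minimal sketch of activation checkpointing using PyTorch’s torch.utils.checkpoint (recent PyTorch versions accept the use_reentrant flag):

```python
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
    for _ in range(8))

def forward_with_checkpointing(x):
    for block in blocks:
        # Only block inputs are stored; activations inside each block are
        # recomputed during the backward pass, trading compute for memory.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(4, 1024, requires_grad=True)
forward_with_checkpointing(x).sum().backward()
```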


At OpenAI, we are training and improving large models, from the underlying infrastructure all the way to deploying them for real-world problems. If you’d like to put the ideas from this post into practice—especially relevant for our Scaling and Applied Research teams—we’re hiring!


Acknowledgments
Thanks to Nikolas Tezak, Sam Altman, Daniel Gackle, Ilya Sutskever, and Steve Dowling for feedback on drafts. Thanks to Justin Jay Wang, Bianca Martin, and Steve Dowling for communications and design.


OpenAI

Building a more helpful browser with machine learning

At Google we use technologies like machine learning (ML) to build more useful products — from filtering out email spam, to keeping maps up to date, to offering more relevant search results. Chrome is no exception: We use ML to make web images more accessible to people who are blind or have low vision, and we also generate real-time captions for online videos, in service of people in noisy environments, and those who are hard of hearing.

This work in Chrome continues, so we wanted to share some recent and future ML improvements that offer a safer, more accessible and more personalized browsing experience. Importantly: these updates are powered by on-device ML models, which means your data stays private, and never leaves your device.

More peace of mind, less annoying prompts

Safe Browsing in Chrome helps protect billions of devices every day by showing warnings when people try to navigate to dangerous sites or download dangerous files (see the big red example below). Starting in March of this year, we rolled out a new ML model that identifies 2.5 times as many potentially malicious sites and phishing attacks as the previous model, resulting in a safer and more secure web.

To further improve the browsing experience, we’re also evolving how people interact with web notifications. On the one hand, page notifications help deliver updates from sites you care about; on the other hand, notification permission prompts can become a nuisance. To help people browse the web with minimal interruption, Chrome predicts when permission prompts are unlikely to be granted based on how the user previously interacted with similar permission prompts, and silences these undesired prompts. In the next release of Chrome, we’re launching an ML model that makes these predictions entirely on-device.


With the next release of Chrome, this is what you will see if a phishing attempt is detected (Left) and Chrome will show permission requests quietly when the user is unlikely to grant them (Right).

Finding what’s important, always in your language

Earlier this year we launched Journeys to help people retrace their steps online. For example: You might spend weeks planning a national park visit – researching attractions, comparing flights and shopping for gear. With ML and Journeys, Chrome brings together the pages you’ve visited about a given topic, and makes it easy to pick up where you left off (vs. scr o o o l l ling through your browser history).

When you return to those hiking boots and camping guides, we’re also using ML to make those websites available in your preferred language. In particular, we’ve launched an updated language identification model to figure out the language of the page, and whether it needs to be translated to match your preferences. As a result, we’re seeing tens of millions more successful translations every day.


The Journeys feature of Chrome groups together your search history based on topic or intent.

A browser built just for you

Maybe you like to read news articles in the morning – phone in one hand, cereal spoon in the other – so you share lots of links from Chrome. Or maybe voice search is more your thing, as you sneak in a few questions during your transit ride to work. Either way, we want to make sure Chrome is meeting you where you’re at, so in the near future, we’ll be using ML to adjust the toolbar in real-time – highlighting the action that’s most useful in that moment (e.g., share link, voice search, etc.). Of course, you’ll be able to customize it manually as well.


The toolbar in Chrome on Android will adapt based on your needs.

Our goal is to build a browser that’s genuinely and continuously helpful, and we’re excited about the possibilities that ML provides. At the end of the day, though, your experience is what really matters, so please tweet @googlechrome to send us your feedback.

Read More

Unified data preparation and model training with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot

Data fuels machine learning (ML); the quality of data has a direct impact on the quality of ML models. Therefore, improving data quality and employing the right feature engineering techniques are critical to creating accurate ML models. ML practitioners often tediously iterate on feature engineering, choice of algorithms, and other aspects of ML in search of optimal models that generalize well on real-world data and deliver the desired results. Because speed in doing business disproportionately matters, this extremely tedious and iterative process may lead to project delays and lost business opportunities.

Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for ML from weeks to minutes, and Amazon SageMaker Autopilot automatically builds, trains, and tunes the best ML models based on your data. With Autopilot, you still maintain full control and visibility of your data and model. Both services are purpose-built to make ML practitioners more productive and accelerate time to value.

Data Wrangler now provides a unified experience enabling you to prepare data and seamlessly train an ML model in Autopilot. With this newly launched feature, you can now prepare your data in Data Wrangler and easily launch Autopilot experiments directly from the Data Wrangler user interface (UI). With just a few clicks, you can automatically build, train, and tune ML models, making it easier to employ state-of-the-art feature engineering techniques, train high-quality ML models, and gain insights from your data faster.

In this post, we discuss how you can use this new integrated experience in Data Wrangler to analyze datasets and easily build high-quality ML models in Autopilot.

Dataset overview

Pima Indians are an Indigenous group that live in Mexico and Arizona, US. Studies show Pima Indians as a high-risk population group for diabetes mellitus. Predicting the probability of an individual’s risk and susceptibility to a chronic illness like diabetes is an important task in improving the health and well-being of this often underrepresented minority group.

We use the Pima Indian Diabetes public dataset to predict the susceptibility of an individual to diabetes. We focus on the new integration between Data Wrangler and Autopilot to prepare data and automatically create an ML model without writing a single line of code.

The dataset contains information about Pima Indian females 21 years or older and includes several medical predictor (independent) variables and one target (dependent) variable, Outcome. The following table describes the columns in our dataset.

| Column Name | Description |
| --- | --- |
| Pregnancies | The number of times pregnant |
| Glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-hour serum insulin (mu U/ml) |
| BMI | Body mass index (weight in kg/(height in m)^2) |
| DiabetesPedigree | Diabetes pedigree function |
| Age | Age in years |
| Outcome | The target variable |

The dataset contains 768 records, with 9 total features. We store this dataset in Amazon Simple Storage Service (Amazon S3) as a CSV file and then import the CSV directly into a Data Wrangler flow from Amazon S3.

Solution overview

The following diagram summarizes what we accomplish in this post.

Data scientists, doctors, and other medical domain experts provide patient data with information on glucose levels, blood pressure, body mass index, and other features used to predict the likelihood of having diabetes. With the dataset in Amazon S3, we import the dataset into Data Wrangler to perform exploratory data analysis (EDA), data profiling, feature engineering, and splitting the dataset into train and test for model building and evaluation.

We then use Autopilot’s new feature integration to quickly build a model directly from the Data Wrangler interface. We choose Autopilot’s best model, the one with the highest F-beta score. After Autopilot finds the best model, we run a SageMaker Batch Transform job on the test (holdout) set with the model artifacts of the best model for evaluation.

Medical experts can provide new data to the validated model to obtain a prediction to see if a patient will likely have diabetes. With these insights, medical experts can start treatment early to improve the health and well-being of vulnerable populations. Medical experts can also explain a model’s prediction by referencing the model’s detail in Autopilot because they have full visibility into the model’s explainability, performance, and artifacts. This visibility in addition to validation of the model from the test set gives medical experts greater confidence in the model’s predictive ability.

We walk you through the following high-level steps.

  1. Import the dataset from Amazon S3.
  2. Perform EDA and data profiling with Data Wrangler.
  3. Perform feature engineering to handle outliers and missing values.
  4. Split data into train and test sets.
  5. Train and build a model with Autopilot.
  6. Test the model on a holdout sample with a SageMaker notebook.
  7. Analyze validation and test set performance.

Prerequisites

Complete the following prerequisite steps:

  1. Upload the dataset to an S3 bucket of your choice.
  2. Make sure you have the necessary permissions. For more information, refer to Get Started with Data Wrangler.
  3. Set up a SageMaker domain configured to use Data Wrangler. For instructions, refer to Onboard to Amazon SageMaker Domain.

Import your dataset with Data Wrangler

You can integrate a Data Wrangler data flow into your ML workflows to simplify and streamline data preprocessing and feature engineering using little to no coding. Complete the following steps:

  1. Create a new Data Wrangler flow.

If this is your first time opening Data Wrangler, you may have to wait a few minutes for it to be ready.

  2. Choose the dataset stored in Amazon S3 and import it into Data Wrangler.

After you import the dataset, you should see the beginnings of a data flow within the Data Wrangler UI. You now have a flow diagram.

  1. Choose the plus sign next to Data types and choose Edit to confirm that Data Wrangler automatically inferred the correct data types for your data columns.

If the data types aren’t correct, you can easily modify them through the UI. If multiple data sources are present, you can join or concatenate them.

We can now create an analysis and add transformations.

Perform exploratory data analysis with the data insights report

Exploratory data analysis is a critical part of the ML workflow. We can use the new data insights report from Data Wrangler to gain a better understanding of the profile and distribution of our data. The report includes summary statistics, data quality warnings, target column insights, a quick model, and information about anomalous and duplicate rows.

  1. Choose the plus sign next to Data types and choose Get data insights.

  2. For Target column, choose Outcome.
  3. For Problem type, optionally select Classification.
  4. Choose Create.

The results show a summary of the dataset statistics.

We can also view the distribution of the labeled rows with a histogram, an estimate of the expected predicted quality of the model with the quick model feature, and a feature summary table.

We don’t go into the details of analyzing the data insights report; refer to Accelerate data preparation with data quality and insights in Amazon SageMaker Data Wrangler for additional details about how you can use the data insights report to accelerate your data preparation steps.

Perform feature engineering

Now that we’ve profiled and analyzed the distribution of our input columns at a high level, the first consideration for improving the quality of our data could be to handle missing values.

For example, we know that zeros (0) in the Insulin column represent missing values. We could follow the recommendation to replace the zeros with NaN. But on closer examination, we find that the minimum value is 0 for other columns as well, such as Glucose, BloodPressure, SkinThickness, and BMI. We need a way to handle missing values while remaining sensitive to columns where zero is valid data. Let’s see how we can fix this.

In the Feature Details section, the report raises a Disguised missing value warning for the feature Insulin.

Because zeros in the Insulin column are in fact missing data, we use the Convert regex to missing transform to transform zero values to empty (missing values).

  1. Choose the plus sign next to Data types and choose Add transform.
  2. Choose Search and edit.
  3. For Transform, choose Convert regex to missing.
  4. For Input columns, choose the columns Insulin, Glucose, BloodPressure, SkinThickness, and BMI.
  5. For Pattern, enter 0.
  6. Choose Preview and Add to save this step.

The 0 entries under Insulin, Glucose, BloodPressure, SkinThickness, and BMI are now missing values.

Data Wrangler gives you a few other options to fix missing values.

  1. We handle missing values by imputing the approximate median for the Glucose column.

We also want to ensure that our features are on the same scale. We don’t want to accidentally give more weight to a certain feature just because it spans a larger numeric range. We normalize our features to do this.

  1. Add a new Process numeric transform and choose Scale values.
  2. For Scaler, choose Min-max scaler.
  3. For Input columns, choose the columns Pregnancies, BloodPressure, Glucose, SkinThickness, Insulin, BMI, and Age.
  4. Set Min to 0 and Max to 1.

This makes sure that our features are between the values 0 and 1.

Now that we’ve created some features, we split our dataset into training and testing sets before we build a model.

Split data into training and testing

In the model building phase of your ML workflow, you test the efficacy of your model by running batch predictions. You can set aside a testing or holdout dataset for evaluation to see how your model performs by comparing the predictions to the ground truth. Generally, if more of the model’s predictions match the true labels, we can determine the model is performing well.

We use Data Wrangler to split our dataset for testing. We retain 90% of our dataset for training because we have a relatively small dataset. The remaining 10% of our dataset serves as the test dataset. We use this dataset to validate the Autopilot model later in this post.

We split our data by choosing the Split data transform and choosing Randomized split as the method. We designate 0.9 as the split percentage for training and 0.1 for testing.

With the data transformation and feature engineering steps complete, we’re now ready to train a model.

Train and validate the model

We can use the new Data Wrangler integration with Autopilot to directly train a model from the Data Wrangler data flow UI.

  1. Choose the plus sign next to Dataset and choose Train model.

  2. For Amazon S3 location, specify the Amazon S3 location where SageMaker exports your data.

Autopilot uses this location to automatically train a model, saving you time from having to define the output location of the Data Wrangler flow, then having to define the input location of the Autopilot training data. This makes for a more seamless experience.

  3. Choose Export and train to initiate model building with Autopilot.

Autopilot automatically selects the training data input and output locations. You only need to specify the target column and choose Create Experiment to train your model.

Test the model on a holdout sample

When Autopilot completes the experiment, we can view the training results and explore the best model.

  1. Choose View model details for your desired model, then choose the Performance tab on the model details page.

The Performance tab displays several model measurement tests, including a confusion matrix, the area under the precision/recall curve (AUCPR), and the area under the receiver operating characteristic curve (ROC). These illustrate the overall validation performance of the model, but they don’t tell us if the model will generalize well. We still need to run evaluations on unseen test data to see how accurately the model predicts if an individual will have diabetes.

To assess whether the model generalizes well, we use the test sample we set aside for independent evaluation. We can export it from the Data Wrangler flow UI.

  1. Choose the plus sign next to Dataset, choose Export to, and choose Amazon S3.

  2. Specify an Amazon S3 path.

We refer to this path when we run batch inference for validation in the next section.

  3. Create a new SageMaker notebook to perform batch inferencing on the holdout sample and assess the test performance. Refer to the following GitHub repo for a sample notebook to run batch inference for validation.
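
The notebook’s core flow looks roughly like the following sketch with the SageMaker Python SDK. The job name, bucket, and S3 paths are hypothetical placeholders; the referenced notebook is the authoritative version.

```python
import sagemaker
from sagemaker import AutoML

session = sagemaker.Session()

# Hypothetical experiment name; use your Autopilot job's name.
automl = AutoML.attach(auto_ml_job_name="my-autopilot-job",
                       sagemaker_session=session)
best_candidate = automl.best_candidate()
model = automl.create_model(name="diabetes-best-model",
                            candidate=best_candidate)

transformer = model.transformer(instance_count=1,
                                instance_type="ml.m5.xlarge",
                                output_path="s3://my-bucket/predictions/")
# Hypothetical S3 path: the holdout set exported from Data Wrangler.
transformer.transform(data="s3://my-bucket/test/holdout.csv",
                      content_type="text/csv", split_type="Line")
transformer.wait()
```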

Analyze validation and test set performance

When the batch transform is complete, we create a confusion matrix to compare the actual and predicted outcomes of the holdout dataset.

We see 23 true positives and 33 true negatives from our results. In our case, true positives refer to the model correctly predicting an individual as having diabetes. In contrast, true negatives refer to the model correctly predicting an individual as not having diabetes.

In our case, precision and recall are important metrics. Precision asks: of all the individuals predicted to have diabetes, how many really have it? Recall asks: of all the individuals who indeed have diabetes, how many were predicted to have it? For example, you may want a model with high recall because you want to treat as many at-risk individuals as you can, especially if the first stage of treatment has no effect on individuals without diabetes (these are false positives: those labeled as having the condition when in fact they do not).

We also plot the area under the ROC curve (AUC) graph to evaluate the results. The higher the AUC, the better the model is at distinguishing between classes, which in our case is how well the model performs at distinguishing patients with and without diabetes.
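
For reference, a small scikit-learn sketch of computing these metrics, with dummy stand-in arrays; in practice y_true comes from the holdout labels and y_prob from the batch transform output:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

# Dummy stand-ins for the holdout labels and model outputs.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.8, 0.6])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print("precision:", precision_score(y_true, y_pred))  # of predicted positives, how many are real?
print("recall:", recall_score(y_true, y_pred))        # of real positives, how many were caught?
print("AUC:", roc_auc_score(y_true, y_prob))
```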

Conclusion

In this post, we demonstrated how to integrate your data processing, feature engineering, and model building using Data Wrangler and Autopilot. We highlighted how you can easily train and tune a model with Autopilot directly from the Data Wrangler user interface. With this integration feature, we can quickly build a model after completing feature engineering, without writing any code. Then we referenced Autopilot’s best model to run batch predictions using the AutoML class with the SageMaker Python SDK.

Low-code and AutoML solutions like Data Wrangler and Autopilot remove the need to have deep coding knowledge to build robust ML models. Get started using Data Wrangler today to experience how easy it is to build ML models using SageMaker Autopilot.


About the Authors

Peter Chung is a Solutions Architect for AWS, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions in both the public and private sectors. He holds all AWS certifications as well as two GCP certifications. He enjoys coffee, cooking, staying active, and spending time with his family.

Pradeep Reddy is a Senior Product Manager in the SageMaker Low/No Code ML team, which includes SageMaker Autopilot, SageMaker Automatic Model Tuner. Outside of work, Pradeep enjoys reading, running and geeking out with palm sized computers like raspberry pi, and other home automation tech.

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.

Srujan Gopu is a Senior Frontend Engineer in SageMaker Low Code/No Code ML helping customers of Autopilot and Canvas products. When not coding, Srujan enjoys going for a run with his dog Max, listening to audio books and VR game development.

Read More

Out of This World: ‘Mass Effect Legendary Edition’ and ‘It Takes Two’ Lead GFN Thursday Updates

Some may call this GFN Thursday legendary as Mass Effect Legendary Edition and It Takes Two join the GeForce NOW library.

Both games expand the number of Electronic Arts games available to stream from our GeForce cloud servers, and are part of 10 new additions this week.

Adventure Awaits In The Cloud

Relive the saga of Commander Shepard in the highly acclaimed “Mass Effect” trilogy with Mass Effect Legendary Edition (Steam and Origin). One person is all that stands between humanity and the greatest threat it’s ever faced. With each action controlling the outcome of every mission, every relationship, every battle and even the fate of the galaxy itself, you decide how the story unfolds.

Play as clashing couple Cody and May, two humans turned into dolls and trapped in a fantastical world in It Takes Two (Steam and Origin). Challenged with saving their relationship, master unique and connected abilities to help each other across an abundance of obstacles and enjoy laugh-out-loud moments. Invite a friend to join for free with Friend’s Pass and work as a team in this heartfelt and hilarious experience.

GeForce NOW gamers can experience both of these beloved games today across compatible devices. RTX 3080 members can take Mass Effect Legendary Edition to the max with 4K resolution and 60 frames per second on the PC and Mac apps. They can also bring It Takes Two on the go streaming at 120 frames per second to select mobile phones.

Plus, RTX 3080 membership gets the perks of ultra-low latency, dedicated RTX 3080 servers and eight-hour-long gaming sessions to support their play.

No Time Like Playtime

Pro Cycling Manager on GeForce NOW
Recruitment, budget, strategy: you make all the decisions in Pro Cycling Manager 2022.

GFN Thursday always means more great gaming. This week comes with 10 new games available to stream on the cloud:

Finally, as you begin your quest known as “The Weekend,” we’ve got a question for you. Let us know your response on Twitter or in the comments below.

The post Out of This World: ‘Mass Effect Legendary Edition’ and ‘It Takes Two’ Lead GFN Thursday Updates appeared first on NVIDIA Blog.

Read More
