PyTorch framework for cryptographically secure random number generation, torchcsprng, now available

One of the key components of modern cryptography is the pseudorandom number generator. Katz and Lindell stated, “The use of badly designed or inappropriate random number generators can often leave a good cryptosystem vulnerable to attack. Particular care must be taken to use a random number generator that is designed for cryptographic use, rather than a ‘general-purpose’ random number generator which may be fine for some applications but not ones that are required to be cryptographically secure.”[1] Additionally, most pseudorandom number generators scale poorly to massively parallel high-performance computation because of their sequential nature. Others don’t satisfy cryptographically secure properties.

torchcsprng is a PyTorch C++/CUDA extension that provides cryptographically secure pseudorandom number generators for PyTorch.

torchcsprng overview

Historically, PyTorch had only two pseudorandom number generator implementations: Mersenne Twister for CPU and NVIDIA’s cuRAND Philox for CUDA. Despite good performance properties, neither of them is suitable for cryptographic applications. Over the course of the past several months, the PyTorch team developed the torchcsprng extension API. Based on the PyTorch dispatch mechanism and operator registration, it allows users to extend c10::GeneratorImpl and implement their own custom pseudorandom number generators.

torchcsprng generates a random 128-bit key on the CPU using one of its generators and then runs AES128 in CTR mode, either on the CPU or on the GPU using CUDA, to generate a random 128-bit state and apply a transformation function that maps it to target tensor values. This approach is based on Parallel Random Numbers: As Easy as 1, 2, 3 (John K. Salmon, Mark A. Moraes, Ron O. Dror, and David E. Shaw, D. E. Shaw Research). It makes torchcsprng both cryptographically secure and parallel on both CPU and CUDA.

Since torchcsprng is a PyTorch extension, it is available on the platforms where PyTorch is available (support for Windows-CUDA will be available in the coming months).

Using torchcsprng

The torchcsprng API is very simple to use and is fully compatible with the PyTorch random infrastructure:

Step 1: Install via binary distribution

Anaconda:

conda install torchcsprng -c pytorch

pip:

pip install torchcsprng

Step 2: import packages as usual but add csprng

import torch
import torchcsprng as csprng

Step 3: Create a cryptographically secure pseudorandom number generator from /dev/urandom:

urandom_gen = csprng.create_random_device_generator('/dev/urandom')

and simply use it with the existing PyTorch methods:

torch.randn(10, device='cpu', generator=urandom_gen)

Step 4: Test with CUDA

One of the advantages of torchcsprng generators is that they can be used with both CPU and CUDA tensors:

torch.randn(10, device='cuda', generator=urandom_gen)

Another advantage of torchcsprng generators is that they are parallel on CPU unlike the default PyTorch CPU generator.
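
As a quick illustration, the same generator object plugs into other standard PyTorch random methods that accept a generator argument; the snippet below is a minimal sketch, and the exact list of supported methods is documented on the torchcsprng GitHub page:

torch.empty(10, dtype=torch.uint8).random_(generator=urandom_gen)  # random bytes on CPU
torch.empty(10).normal_(generator=urandom_gen)                     # Gaussian samples on CPU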

Getting Started

The easiest way to get started with torchcsprng is by visiting the GitHub page where you can find installation and build instructions, and more how-to examples.

Cheers,

The PyTorch Team

[1] Introduction to Modern Cryptography: Principles and Protocols (Chapman & Hall/CRC Cryptography and Network Security Series) by Jonathan Katz and Yehuda Lindell

PyTorch 1.6 now includes Stochastic Weight Averaging

Do you use stochastic gradient descent (SGD) or Adam? Regardless of the procedure you use to train your neural network, you can likely achieve significantly better generalization at virtually no additional cost with a simple new technique now natively supported in PyTorch 1.6, Stochastic Weight Averaging (SWA) [1]. Even if you have already trained your model, it’s easy to realize the benefits of SWA by running SWA for a small number of epochs starting with a pre-trained model. Again and again, researchers are discovering that SWA improves the performance of well-tuned models in a wide array of practical applications with little cost or effort!

SWA has a wide range of applications and features:

  • SWA significantly improves performance compared to standard training techniques in computer vision (e.g., VGG, ResNets, Wide ResNets and DenseNets on ImageNet and CIFAR benchmarks [1, 2]).
  • SWA provides state-of-the-art performance on key benchmarks in semi-supervised learning and domain adaptation [2].
  • SWA was shown to improve performance in language modeling (e.g., AWD-LSTM on WikiText-2 [4]) and policy-gradient methods in deep reinforcement learning [3].
  • SWAG, an extension of SWA, can approximate Bayesian model averaging in Bayesian deep learning and achieves state-of-the-art uncertainty calibration results in various settings. Moreover, its recent generalization MultiSWAG provides significant additional performance gains and mitigates double-descent [4, 10]. Another approach, Subspace Inference, approximates the Bayesian posterior in a small subspace of the parameter space around the SWA solution [5].
  • SWA for low precision training, SWALP, can match the performance of full-precision SGD training, even with all numbers quantized down to 8 bits, including gradient accumulators [6].
  • SWA in parallel, SWAP, was shown to greatly speed up the training of neural networks by using large batch sizes and, in particular, set a record by training a neural network to 94% accuracy on CIFAR-10 in 27 seconds [11].

Figure 1. Illustrations of SWA and SGD with a Preactivation ResNet-164 on CIFAR-100 [1]. Left: test error surface for three FGE samples and the corresponding SWA solution (averaging in weight space). Middle and Right: test error and train loss surfaces showing the weights proposed by SGD (at convergence) and SWA, starting from the same initialization of SGD after 125 training epochs. Please see [1] for details on how these figures were constructed.

In short, SWA performs an equal average of the weights traversed by SGD (or any stochastic optimizer) with a modified learning rate schedule (see the left panel of Figure 1.). SWA solutions end up in the center of a wide flat region of loss, while SGD tends to converge to the boundary of the low-loss region, making it susceptible to the shift between train and test error surfaces (see the middle and right panels of Figure 1). We emphasize that SWA can be used with any optimizer, such as Adam, and is not specific to SGD.

Previously, SWA was in PyTorch contrib. In PyTorch 1.6, we provide a new convenient implementation of SWA in torch.optim.swa_utils.

Is this just Averaged SGD?

At a high level, averaging SGD iterates dates back several decades in convex optimization [7, 8], where it is sometimes referred to as Polyak-Ruppert averaging, or averaged SGD. But the details matter. Averaged SGD is often used in conjunction with a decaying learning rate and an exponential moving average (EMA), typically for convex optimization, where the focus has been on improved rates of convergence. In deep learning, this form of averaged SGD smooths the trajectory of SGD iterates but does not perform very differently from the final iterate.

By contrast, SWA uses an equal average of SGD iterates with a modified cyclical or high constant learning rate and exploits the flatness of training objectives [8] specific to deep learning for improved generalization.

How does Stochastic Weight Averaging Work?

There are two important ingredients that make SWA work. First, SWA uses a modified learning rate schedule so that SGD (or other optimizers such as Adam) continues to bounce around the optimum and explore diverse models instead of simply converging to a single solution. For example, we can use the standard decaying learning rate strategy for the first 75% of training time and then set the learning rate to a reasonably high constant value for the remaining 25% of the time (see Figure 2 below). The second ingredient is to take an average of the weights (typically an equal average) of the networks traversed by SGD. For example, we can maintain a running average of the weights obtained at the end of every epoch within the last 25% of training time (see Figure 2). After training is complete, we then set the weights of the network to the computed SWA averages.

Figure 2. Illustration of the learning rate schedule adopted by SWA. Standard decaying schedule is used for the first 75% of the training and then a high constant value is used for the remaining 25%. The SWA averages are formed during the last 25% of training.

One important detail is batch normalization. Batch normalization layers compute running statistics of activations during training, but the SWA averages of the weights are never used to make predictions during training, so the batch normalization layers do not have these statistics computed for the SWA model at the end of training. We can compute these statistics by doing a single forward pass over the training data with the SWA model.

While we focus on SGD for simplicity in the description above, SWA can be combined with any optimizer. You can also use cyclical learning rates instead of a high constant value (see e.g., [2]).

How to use SWA in PyTorch?

In torch.optim.swa_utils we implement all the SWA ingredients to make it convenient to use SWA with any model. In particular, we implement the AveragedModel class for SWA models, the SWALR learning rate scheduler, and the update_bn utility function to update the SWA batch normalization statistics at the end of training.

In the example below, swa_model is the SWA model that accumulates the averages of the weights. We train the model for a total of 300 epochs, and we switch to the SWA learning rate schedule and start to collect SWA averages of the parameters at epoch 160.

from torch.optim.swa_utils import AveragedModel, SWALR
from torch.optim.lr_scheduler import CosineAnnealingLR

loader, optimizer, model, loss_fn = ...
swa_model = AveragedModel(model)
scheduler = CosineAnnealingLR(optimizer, T_max=300)
swa_start = 160
swa_scheduler = SWALR(optimizer, swa_lr=0.05)

for epoch in range(300):
      for input, target in loader:
          optimizer.zero_grad()
          loss_fn(model(input), target).backward()
          optimizer.step()
      if epoch > swa_start:
          swa_model.update_parameters(model)
          swa_scheduler.step()
      else:
          scheduler.step()

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
# Use swa_model to make predictions on test data 
preds = swa_model(test_input)

Next, we explain each component of torch.optim.swa_utils in detail.

AveragedModel class serves to compute the weights of the SWA model. You can create an averaged model by running swa_model = AveragedModel(model). You can then update the parameters of the averaged model by swa_model.update_parameters(model). By default, AveragedModel computes a running equal average of the parameters that you provide, but you can also use custom averaging functions with the avg_fn parameter. In the following example, ema_model computes an exponential moving average.

ema_avg = lambda averaged_model_parameter, model_parameter, num_averaged: \
    0.1 * averaged_model_parameter + 0.9 * model_parameter
ema_model = torch.optim.swa_utils.AveragedModel(model, avg_fn=ema_avg)

In practice, we find an equal average with the modified learning rate schedule in Figure 2 provides the best performance.
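
For reference, the equal running average that AveragedModel uses by default corresponds to an averaging function of the following form (a sketch shown for illustration; you do not need to pass it explicitly):

equal_avg = lambda averaged_model_parameter, model_parameter, num_averaged: \
    averaged_model_parameter + \
    (model_parameter - averaged_model_parameter) / (num_averaged + 1)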

SWALR is a learning rate scheduler that anneals the learning rate to a fixed value, and then keeps it constant. For example, the following code creates a scheduler that linearly anneals the learning rate from its initial value to 0.05 in 5 epochs within each parameter group.

swa_scheduler = torch.optim.swa_utils.SWALR(optimizer, 
anneal_strategy="linear", anneal_epochs=5, swa_lr=0.05)

We also implement cosine annealing to a fixed value (anneal_strategy="cos"). In practice, we typically switch to SWALR at epoch swa_start (e.g. after 75% of the training epochs), and simultaneously start to compute the running averages of the weights:

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
swa_start = 75
for epoch in range(100):
      # <train epoch>
      if epoch > swa_start:
          swa_model.update_parameters(model)
          swa_scheduler.step()
      else:
          scheduler.step()

Finally, update_bn is a utility function that computes the batchnorm statistics for the SWA model on a given dataloader loader:

torch.optim.swa_utils.update_bn(loader, swa_model) 

update_bn applies the swa_model to every element in the dataloader and computes the activation statistics for each batch normalization layer in the model.
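
Conceptually, this amounts to resetting the batch normalization statistics and streaming the data through the averaged model in train mode. The sketch below illustrates the idea; it is not the actual implementation, and it assumes the loader yields (input, target) pairs as in the example above:

import torch

def update_bn_sketch(loader, model):
    # Collect the batch normalization layers of the averaged model.
    bn_layers = [m for m in model.modules()
                 if isinstance(m, torch.nn.modules.batchnorm._BatchNorm)]
    if not bn_layers:
        return
    for bn in bn_layers:
        bn.reset_running_stats()   # forget statistics from the unaveraged weights
    was_training = model.training
    model.train()                  # BN updates running stats only in train mode
    with torch.no_grad():
        for input, _ in loader:
            model(input)           # forward passes recompute the running stats
    model.train(was_training)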

Once you have computed the SWA averages and updated the batch normalization layers, you can apply swa_model to make predictions on test data.

Why does it work?

The loss surfaces of deep neural networks contain large flat regions [9]. In Figure 3 below, we show a visualization of the loss surface in a subspace of the parameter space containing a path connecting two independently trained SGD solutions, such that the loss is similarly low at every point along the path. SGD converges near the boundary of these regions because there isn’t much gradient signal to move inside, as the points in the region all have similarly low values of loss. By increasing the learning rate, SWA spins around this flat region and then, by averaging the iterates, moves towards the center of the flat region.

Figure 3: visualization of mode connectivity for ResNet-20 with no skip connections on CIFAR-10 dataset. The visualization is created in collaboration with Javier Ideami (https://losslandscape.com/). For more details, see this blogpost.

We expect solutions that are centered in the flat region of the loss to generalize better than those near the boundary. Indeed, train and test error surfaces are not perfectly aligned in the weight space. Solutions that are centered in the flat region are not as susceptible to the shifts between train and test error surfaces as those near the boundary. In Figure 4 below, we show the train loss and test error surfaces along the direction connecting the SWA and SGD solutions. As you can see, while the SWA solution has a higher train loss compared to the SGD solution, it is centered in a region of low loss and has a substantially better test error.

Figure 4. Train loss and test error along the line connecting the SWA solution (circle) and SGD solution (square). The SWA solution is centered in a wide region of low train loss, while the SGD solution lies near the boundary. Because of the shift between train loss and test error surfaces, the SWA solution leads to much better generalization.

What are the results achieved with SWA?

We release a GitHub repo with examples using the PyTorch implementation of SWA for training DNNs. These examples can be used to achieve the following results on CIFAR-100:

                  VGG-16       ResNet-164   WideResNet-28×10
Regular Training  72.8 ± 0.3   78.4 ± 0.3   81.0 ± 0.3
SWA               74.4 ± 0.3   79.8 ± 0.4   82.5 ± 0.2

Semi-Supervised Learning

In a follow-up paper SWA was applied to semi-supervised learning, where it improved the best reported results in multiple settings [2]. For example, with SWA you can get 95% accuracy on CIFAR-10 if you only have the training labels for 4k training data points (the previous best reported result on this problem was 93.7%). This paper also explores averaging multiple times within epochs, which can accelerate convergence and find still flatter solutions in a given time.

Figure 5. Performance of fast-SWA on semi-supervised learning with CIFAR-10. fast-SWA achieves record results in every setting considered.

Reinforcement Learning

In another follow-up paper SWA was shown to improve the performance of policy gradient methods A2C and DDPG on several Atari games and MuJoCo environments [3]. This application is also an instance of where SWA is used with Adam. Recall that SWA is not specific to SGD and can benefit essentially any optimizer.

Environment Name A2C A2C + SWA
Breakout 522 ± 34 703 ± 60
Qbert 18777 ± 778 21272 ± 655
SpaceInvaders 7727 ± 1121 21676 ± 8897
Seaquest 1779 ± 4 1795 ± 4
BeamRider 9999 ± 402 11321 ± 1065
CrazyClimber 147030 ± 10239 139752 ± 11618

Low Precision Training

We can filter through quantization noise by combining weights that have been rounded down with weights that have been rounded up. Moreover, by averaging weights to find a flat region of the loss surface, large perturbations of the weights will not affect the quality of the solution (Figures 9 and 10). Recent work shows that by adapting SWA to the low precision setting, in a method called SWALP, one can match the performance of full-precision SGD even with all training in 8 bits [6]. This is quite a practically important result, given that (1) SGD training in 8 bits performs notably worse than full precision SGD, and (2) low precision training is significantly harder than making predictions in low precision after training (the usual setting). For example, a ResNet-164 trained on CIFAR-100 with float (16-bit) SGD achieves 22.2% error, while 8-bit SGD achieves 24.0% error. By contrast, SWALP with 8-bit training achieves 21.8% error.

Figure 9. Quantizing a solution leads to a perturbation of the weights which has a greater effect on the quality of the sharp solution (left) compared to wide solution (right).

Figure 10. The difference between standard low precision training and SWALP.

Another work, SQWA, presents an approach for quantization and fine-tuning of neural networks in low precision [12]. In particular, SQWA achieved state-of-the-art results for DNNs quantized to 2 bits on CIFAR-100 and ImageNet.

Calibration and Uncertainty Estimates

By finding a solution centered in a flat region of the loss, SWA can also improve calibration and uncertainty representation. Indeed, SWA can be viewed as an approximation to an ensemble, resembling a Bayesian model average, but with a single model [1].

SWA can be viewed as taking the first moment of SGD iterates with a modified learning rate schedule. We can directly generalize SWA by also taking the second moment of iterates to form a Gaussian approximate posterior over the weights, further characterizing the loss geometry with SGD iterates. This approach, SWA-Gaussian (SWAG), is a simple, scalable and convenient method for uncertainty estimation and calibration in Bayesian deep learning [4]. The SWAG distribution approximates the shape of the true posterior: Figure 6 below shows the SWAG distribution and the posterior log-density for ResNet-20 on CIFAR-10.

Figure 6. SWAG posterior approximation and the loss surface for a ResNet-20 without skip-connections trained on CIFAR-10 in the subspace formed by the two largest eigenvalues of the SWAG covariance matrix. The shape of SWAG distribution is aligned with the posterior: the peaks of the two distributions coincide, and both distributions are wider in one direction than in the orthogonal direction. Visualization created in collaboration with Javier Ideami.

Empirically, SWAG performs on par or better than popular alternatives including MC dropout, KFAC Laplace, and temperature scaling on uncertainty quantification, out-of-distribution detection, calibration and transfer learning in computer vision tasks. Code for SWAG is available here.

Figure 7. MultiSWAG generalizes SWAG and deep ensembles, to perform Bayesian model averaging over multiple basins of attraction, leading to significantly improved performance. By contrast, as shown here, deep ensembles select different modes, while standard variational inference (VI) marginalizes (model averages) within a single basin.

MultiSWAG [10] uses multiple independent SWAG models to form a mixture of Gaussians as an approximate posterior distribution. Different basins of attraction contain highly complementary explanations of the data. Accordingly, marginalizing over these multiple basins provides a significant boost in accuracy and uncertainty representation. MultiSWAG can be viewed as a generalization of deep ensembles, but with performance improvements.

Indeed, we see in Figure 8 that MultiSWAG entirely mitigates double descent – more flexible models have monotonically improving performance – and provides significantly improved generalization over SGD. For example, when the ResNet-18 has layers of width 20, Multi-SWAG achieves under 30% error whereas SGD achieves over 45%, more than a 15% gap!

Figure 8. SGD, SWAG, and Multi-SWAG on CIFAR-100 for a ResNet-18 with varying widths. We see Multi-SWAG in particular mitigates double descent and provides significant accuracy improvements over SGD.

Reference [10] also considers Multi-SWA, which uses multiple independently trained SWA solutions in an ensemble, providing performance improvements over deep ensembles without any additional computational cost. Code for MultiSWA and MultiSWAG is available here.

Another method, Subspace Inference, constructs a low-dimensional subspace around the SWA solution and marginalizes the weights in this subspace to approximate the Bayesian model average [5]. Subspace Inference uses the statistics from the SGD iterates to construct both the SWA solution and the subspace. The method achieves strong performance in terms of prediction accuracy and uncertainty calibration both in classification and regression problems. Code is available here.

Try it Out!

One of the greatest open questions in deep learning is why SGD manages to find good solutions, given that the training objectives are highly multimodal, and there are many settings of parameters that achieve no training loss but poor generalization. By understanding geometric features such as flatness, which relate to generalization, we can begin to resolve these questions and build optimizers that provide even better generalization, and many other useful features, such as uncertainty representation. We have presented SWA, a simple drop-in replacement for standard optimizers such as SGD and Adam, which can, in principle, benefit anyone training a deep neural network. SWA has demonstrated strong performance in several areas, including computer vision, semi-supervised learning, reinforcement learning, uncertainty representation, calibration, Bayesian model averaging, and low precision training.

We encourage you to try out SWA! SWA is now as easy as any standard training in PyTorch. And even if you have already trained your model, you can use SWA to significantly improve performance by running it for a small number of epochs from a pre-trained model.

[1] Averaging Weights Leads to Wider Optima and Better Generalization; Pavel Izmailov, Dmitry Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson; Uncertainty in Artificial Intelligence (UAI), 2018.

[2] There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average; Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, Andrew Gordon Wilson;
International Conference on Learning Representations (ICLR), 2019.

[3] Improving Stability in Deep Reinforcement Learning with Weight Averaging; Evgenii Nikishin, Pavel Izmailov, Ben Athiwaratkun, Dmitrii Podoprikhin,
Timur Garipov, Pavel Shvechikov, Dmitry Vetrov, Andrew Gordon Wilson; UAI 2018 Workshop: Uncertainty in Deep Learning, 2018.

[4] A Simple Baseline for Bayesian Uncertainty in Deep Learning
Wesley Maddox, Timur Garipov, Pavel Izmailov, Andrew Gordon Wilson; Neural Information Processing Systems (NeurIPS), 2019.

[5] Subspace Inference for Bayesian Deep Learning
Pavel Izmailov, Wesley Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson
Uncertainty in Artificial Intelligence (UAI), 2019

[6] SWALP : Stochastic Weight Averaging in Low Precision Training
Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai,
Andrew Gordon Wilson, Christopher De Sa; International Conference on Machine Learning (ICML), 2019.

[7] Efficient Estimations from a Slowly Convergent Robbins-Monro Process; David Ruppert; Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

[8] Acceleration of Stochastic Approximation by Averaging; Boris T. Polyak and Anatoli B. Juditsky; SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[9] Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov,
Andrew Gordon Wilson. Neural Information Processing Systems (NeurIPS), 2018.

[10] Bayesian Deep Learning and a Probabilistic Perspective of Generalization
Andrew Gordon Wilson, Pavel Izmailov. ArXiv preprint, 2020.

[11] Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well; Vipul Gupta, Santiago Akle Serrano, and Dennis DeCoste; International Conference on Learning Representations (ICLR), 2019.

[12] SQWA: Stochastic Quantized Weight Averaging for Improving the Generalization Capability of Low-Precision Deep Neural Networks; Sungho Shin, Yoonho Boo, and Wonyong Sung; arXiv preprint, 2020.

Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs

Data sets are growing bigger every day and GPUs are getting faster. This means there are more and larger datasets than ever for deep learning researchers and engineers to train and validate their models on.

  • Many datasets for research in still image recognition are becoming available with 10 million or more images, including OpenImages and Places.
  • 8 million YouTube videos (YouTube 8M) consume about 300 TB in 720p, used for research in object recognition, video analytics, and action recognition.
  • The Tobacco Corpus consists of about 20 million scanned HD pages, useful for OCR and text analytics research.

Although the most commonly encountered big data sets right now involve images and videos, big datasets occur in many other domains and involve many other kinds of data types: web pages, financial transactions, network traces, brain scans, etc.

However, working with such large datasets presents a number of challenges:

  • Dataset Size: datasets often exceed the capacity of node-local disk storage, requiring distributed storage systems and efficient network access.
  • Number of Files: datasets often consist of billions of files with uniformly random access patterns, something that often overwhelms both local and network file systems.
  • Data Rates: training jobs on large datasets often use many GPUs, requiring aggregate I/O bandwidths to the dataset of many GBytes/s; these can only be satisfied by massively parallel I/O systems.
  • Shuffling and Augmentation: training data needs to be shuffled and augmented prior to training.
  • Scalability: users often want to develop and test on small datasets and then rapidly scale up to large datasets.

Traditional local and network file systems, and even object storage servers, are not designed for these kinds of applications. The WebDataset I/O library for PyTorch, together with the optional AIStore server and Tensorcom RDMA libraries, provide an efficient, simple, and standards-based solution to all these problems. The library is simple enough for day-to-day use, is based on mature open source standards, and is easy to migrate to from existing file-based datasets.

Using WebDataset is simple and requires little effort, and it will let you scale up the same code from running local experiments to using hundreds of GPUs on clusters or in the cloud with linearly scalable performance. Even on small problems and on your desktop, it can speed up I/O tenfold and simplify data management and the processing of large datasets. The rest of this blog post tells you how to get started with WebDataset and how it works.

The WebDataset Library

The WebDataset library provides a simple solution to the challenges listed above. Currently, it is available as a separate library (github.com/tmbdev/webdataset), but it is on track for being incorporated into PyTorch (see RFC 38419). The WebDataset implementation is small (about 1500 LOC) and has no external dependencies.

Instead of inventing a new format, WebDataset represents large datasets as collections of POSIX tar archive files consisting of the original data files. The WebDataset library can use such tar archives directly for training, without the need for unpacking or local storage.

WebDataset scales perfectly from small, local datasets to petascale datasets and training on hundreds of GPUs and allows data to be stored on local disk, on web servers, or dedicated file servers. For container-based training, WebDataset eliminates the need for volume plugins or node-local storage. As an additional benefit, datasets need not be unpacked prior to training, simplifying the distribution and use of research data.

WebDataset implements PyTorch’s IterableDataset interface and can be used like existing DataLoader-based code. Since data is stored as files inside an archive, existing loading and data augmentation code usually requires minimal modification.

The WebDataset library is a complete solution for working with large datasets and distributed training in PyTorch (and also works with TensorFlow, Keras, and DALI via their Python APIs). Since POSIX tar archives are a standard, widely supported format, it is easy to write other tools for manipulating datasets in this format. E.g., the tarp command is written in Go and can shuffle and process training datasets.

Benefits

The use of sharded, sequentially readable formats is essential for very large datasets. In addition, it has benefits in many other environments. WebDataset provides a solution that scales well from small problems on a desktop machine to very large deep learning problems in clusters or in the cloud. The following table summarizes some of the benefits in different environments.

Environment Benefits of WebDataset
Local Cluster with AIStore AIStore can be deployed easily as K8s containers and offers linear scalability and near 100% utilization of network and I/O bandwidth. Suitable for petascale deep learning.
Cloud Computing WebDataset deep learning jobs can be trained directly against datasets stored in cloud buckets; no volume plugins required. Local and cloud jobs work identically. Suitable for petascale learning.
Local Cluster with existing distributed FS or object store WebDataset’s large sequential reads improve performance with existing distributed stores and eliminate the need for dedicated volume plugins.
Educational Environments WebDatasets can be stored on existing web servers and web caches, and can be accessed directly by students by URL
Training on Workstations from Local Drives Jobs can start training as the data still downloads. Data doesn’t need to be unpacked for training. Ten-fold improvements in I/O performance on hard drives over random access file-based datasets.
All Environments Datasets are represented in an archival format and contain metadata such as file types. Data is compressed in native formats (JPEG, MP4, etc.). Data management, ETL-style jobs, and data transformations and I/O are simplified and easily parallelized.

We will be adding more examples giving benchmarks and showing how to use WebDataset in these environments over the coming months.

High-Performance

For high-performance computation on local clusters, the companion open-source AIStore server provides full disk to GPU I/O bandwidth, subject only to hardware constraints. This Bigdata 2019 Paper contains detailed benchmarks and performance measurements. In addition to benchmarks, research projects at NVIDIA and Microsoft have used WebDataset for petascale datasets and billions of training samples.

Below is a benchmark of AIStore with WebDataset clients using 10 server nodes and 120 rotational drives each.

The left axis shows the aggregate bandwidth from the cluster, while the right scale shows the measured per drive I/O bandwidth. WebDataset and AIStore scale linearly to about 300 clients, at which point they are increasingly limited by the maximum I/O bandwidth available from the rotational drives (about 150 MBytes/s per drive). For comparison, HDFS is shown. HDFS uses a similar approach to AIStore/WebDataset and also exhibits linear scaling up to about 192 clients; at that point, it hits a performance limit of about 120 MBytes/s per drive, and it failed when using more than 1024 clients. Unlike HDFS, the WebDataset-based code just uses standard URLs and HTTP to access data and works identically with local files, with files stored on web servers, and with AIStore. For comparison, NFS in similar experiments delivers about 10-20 MBytes/s per drive.

Storing Datasets in Tar Archives

The format used for WebDataset is standard POSIX tar archives, the same archives used for backup and data distribution. In order to use the format to store training samples for deep learning, we adopt some simple naming conventions:

  • datasets are POSIX tar archives
  • each training sample consists of adjacent files with the same basename
  • shards are numbered consecutively

For example, ImageNet is stored in 1282 separate 100 Mbyte shards with names imagenet-train-000000.tar to imagenet-train-001281.tar. The contents of the first shard are:

-r--r--r-- bigdata/bigdata      3 2020-05-08 21:23 n03991062_24866.cls
-r--r--r-- bigdata/bigdata 108611 2020-05-08 21:23 n03991062_24866.jpg
-r--r--r-- bigdata/bigdata      3 2020-05-08 21:23 n07749582_9506.cls
-r--r--r-- bigdata/bigdata 129044 2020-05-08 21:23 n07749582_9506.jpg
-r--r--r-- bigdata/bigdata      3 2020-05-08 21:23 n03425413_23604.cls
-r--r--r-- bigdata/bigdata 106255 2020-05-08 21:23 n03425413_23604.jpg
-r--r--r-- bigdata/bigdata      3 2020-05-08 21:23 n02795169_27274.cls
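
For illustration, a shard following these conventions could be written with Python’s standard tarfile module. The sketch below uses hypothetical sample data; in practice you may prefer the writing helpers that ship with the webdataset library:

import io
import tarfile

def add_bytes(tar, name, data):
    # Add an in-memory byte string to the archive under the given name.
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Hypothetical samples: (basename, raw JPEG bytes, class index).
samples = [("n03991062_24866", b"<jpeg bytes>", 213),
           ("n07749582_9506", b"<jpeg bytes>", 950)]

with tarfile.open("imagenet-train-000000.tar", "w") as tar:
    for basename, jpg_bytes, cls in samples:
        add_bytes(tar, basename + ".jpg", jpg_bytes)           # image payload
        add_bytes(tar, basename + ".cls", str(cls).encode())   # class label stored as text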

WebDataset datasets can be used directly from local disk, from web servers (hence the name), from cloud storage and object stores, just by changing a URL. WebDataset datasets can be used for training without unpacking, and training can even be carried out on streaming data, with no local storage.

Shuffling during training is important for many deep learning applications, and WebDataset performs shuffling both at the shard level and at the sample level. Splitting of data across multiple workers is performed at the shard level using a user-provided shard_selection function that defaults to a function that splits based on get_worker_info. (WebDataset can be combined with the tensorcom library to offload decompression/data augmentation and provide RDMA and direct-to-GPU loading; see below.)

Code Sample

Here are some code snippets illustrating the use of WebDataset in a typical PyTorch deep learning application (you can find a full example at http://github.com/tmbdev/pytorch-imagenet-wds).

import webdataset as wds
import ...

sharedurl = "/imagenet/imagenet-train-{000000..001281}.tar"

normalize = transforms.Normalize(
  mean=[0.485, 0.456, 0.406],
  std=[0.229, 0.224, 0.225])

preproc = transforms.Compose([
  transforms.RandomResizedCrop(224),
  transforms.RandomHorizontalFlip(),
  transforms.ToTensor(),
  normalize,
])

dataset = (
  wds.Dataset(sharedurl)
  .shuffle(1000)
  .decode("pil")
  .rename(image="jpg;png", data="json")
  .map_dict(image=preproc)
  .to_tuple("image", "data")
)

loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=8)

for inputs, targets in loader:
  ...

This code is nearly identical to the file-based I/O pipeline found in the PyTorch Imagenet example: it creates a preprocessing/augmentation pipeline, instantiates a dataset using that pipeline and a data source location, and then constructs a DataLoader instance from the dataset.

WebDataset uses a fluent API for configuration that internally builds up a processing pipeline; without any added processing stages, a WebDataset instance simply yields the raw samples (see the next snippet). In this example, WebDataset is used with the PyTorch DataLoader class, which replicates the dataset instance across multiple worker processes and performs both parallel I/O and parallel data augmentation.

WebDataset instances themselves just iterate through each training sample as a dictionary:

# load from a web server using a separate client process
sharedurl = "pipe:curl -s http://server/imagenet/imagenet-train-{000000..001281}.tar"

dataset = wds.Dataset(sharedurl)

for sample in dataset:
  # sample["jpg"] contains the raw image data
  # sample["cls"] contains the class
  ...
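
For instance, the raw fields can be decoded by hand (this is what the .decode("pil") stage above does for you); the snippet below assumes Pillow is installed:

import io
from PIL import Image

for sample in dataset:
  image = Image.open(io.BytesIO(sample["jpg"]))  # raw JPEG bytes -> a PIL image
  label = int(sample["cls"])                     # class index stored as ASCII text
  ...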

For a general introduction to how we handle large scale training with WebDataset, see these YouTube videos.

Related Software

  • AIStore is an open-source object store capable of full-bandwidth disk-to-GPU data delivery (meaning that if you have 1000 rotational drives with 200 MB/s read speed, AIStore actually delivers an aggregate bandwidth of 200 GB/s to the GPUs). AIStore is fully compatible with WebDataset as a client, and in addition understands the WebDataset format, permitting it to perform shuffling, sorting, ETL, and some map-reduce operations directly in the storage system. AIStore can be thought of as a remix of a distributed object store, a network file system, a distributed database, and a GPU-accelerated map-reduce implementation.

  • tarp is a small command-line program for splitting, merging, shuffling, and processing tar archives and WebDataset datasets.

  • tensorcom is a library supporting distributed data augmentation and RDMA to GPU.

  • webdataset-examples contains an example (and soon more examples) of how to use WebDataset in practice.

  • Bigdata 2019 Paper with Benchmarks

Check out the library and provide your feedback for RFC 38419.

Introducing native PyTorch automatic mixed precision for faster training on NVIDIA GPUs

Most deep learning frameworks, including PyTorch, train with 32-bit floating point (FP32) arithmetic by default. However, this is not essential to achieve full accuracy for many deep learning models. In 2017, NVIDIA researchers developed a methodology for mixed-precision training, which combined single-precision (FP32) with half-precision (e.g. FP16) format when training a network, and achieved the same accuracy as FP32 training using the same hyperparameters, with additional performance benefits on NVIDIA GPUs:

  • Shorter training time;
  • Lower memory requirements, enabling larger batch sizes, larger models, or larger inputs.

In order to streamline the user experience of training in mixed precision for researchers and practitioners, NVIDIA developed Apex in 2018, which is a lightweight PyTorch extension with Automatic Mixed Precision (AMP) feature. This feature enables automatic conversion of certain GPU operations from FP32 precision to mixed precision, thus improving performance while maintaining accuracy.

For the PyTorch 1.6 release, developers at NVIDIA and Facebook moved mixed precision functionality into PyTorch core as the AMP package, torch.cuda.amp. torch.cuda.amp is more flexible and intuitive compared to apex.amp. Some of apex.amp’s known pain points that torch.cuda.amp has been able to fix:

  • Guaranteed PyTorch version compatibility, because it’s part of PyTorch
  • No need to build extensions
  • Windows support
  • Bitwise accurate saving/restoring of checkpoints
  • DataParallel and intra-process model parallelism (although we still recommend torch.nn.DistributedDataParallel with one GPU per process as the most performant approach)
  • Gradient penalty (double backward)
  • torch.cuda.amp.autocast() has no effect outside regions where it’s enabled, so it should serve cases that formerly struggled with multiple calls to apex.amp.initialize() (including cross-validation) without difficulty. Multiple convergence runs in the same script should each use a fresh GradScaler instance, but GradScalers are lightweight and self-contained so that’s not a problem.
  • Sparse gradient support

With AMP being added to PyTorch core, we have started the process of deprecating apex.amp. We have moved apex.amp to maintenance mode and will support customers using apex.amp. However, we highly encourage apex.amp customers to transition to using torch.cuda.amp from PyTorch Core.

Example Walkthrough

Please see official docs for usage:

Example:

import torch 
# Creates once at the beginning of training 
scaler = torch.cuda.amp.GradScaler() 
 
for data, label in data_iter: 
   optimizer.zero_grad() 
   # Casts operations to mixed precision 
   with torch.cuda.amp.autocast(): 
      loss = model(data) 
 
   # Scales the loss, and calls backward() 
   # to create scaled gradients 
   scaler.scale(loss).backward() 
 
   # Unscales gradients and calls 
   # or skips optimizer.step() 
   scaler.step(optimizer) 
 
   # Updates the scale for next iteration 
   scaler.update() 
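
Because GradScaler exposes state_dict() and load_state_dict(), its state can be checkpointed and restored alongside the model and optimizer. A small sketch follows; the file name and dictionary keys are illustrative:

# Save model, optimizer and scaler state together
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scaler": scaler.state_dict(),
}
torch.save(checkpoint, "checkpoint.pt")

# ...and restore them later to resume training
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["scaler"])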

Performance Benchmarks

In this section, we discuss the accuracy and performance of mixed precision training with AMP on the latest NVIDIA A100 GPU and the previous generation V100 GPU. The mixed precision performance is compared to FP32 performance when running deep learning workloads in the NVIDIA pytorch:20.06-py3 container from NGC.

Accuracy: AMP (FP16), FP32

The advantage of using AMP for deep learning training is that the models converge to similar final accuracy while providing improved training performance. To illustrate this point, for ResNet-50 v1.5 training we see the following accuracy results, where higher is better. Please note that the accuracy numbers below are sample numbers that are subject to run-to-run variance of up to 0.4%. Accuracy numbers for other models, including BERT, Transformer, ResNeXt-101, Mask-RCNN and DLRM, can be found at the NVIDIA Deep Learning Examples GitHub.

Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

 epochs  Mixed Precision Top 1(%)  TF32 Top1(%)
 90  76.93  76.85

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

 epochs  Mixed Precision Top 1(%)  FP32 Top1(%)
50 76.25 76.26
90 77.09 77.01
250 78.42 78.30

Speedup Performance:

FP16 on NVIDIA V100 vs. FP32 on V100

AMP with FP16 is the most performant option for DL training on the V100. In Figure 2, we can observe that for various models, AMP on V100 provides a speedup of 1.5x to 5.5x over FP32 on V100 while converging to the same final accuracy.

Figure 2. Performance of mixed precision training on NVIDIA 8xV100 vs. FP32 training on 8xV100 GPU. Bars represent the speedup factor of V100 AMP over V100 FP32. The higher the better.

FP16 on NVIDIA A100 vs. FP16 on V100

AMP with FP16 remains the most performant option for DL training on the A100. In Figure 3, we can observe that for various models, AMP on A100 provides a speedup of 1.3x to 2.5x over AMP on V100 while converging to the same final accuracy.

Figure 3. Performance of mixed precision training on NVIDIA 8xA100 vs. 8xV100 GPU. Bars represent the speedup factor of A100 over V100. The higher the better.

Call to action

AMP provides a healthy speedup for deep learning training workloads on NVIDIA Tensor Core GPUs, especially on the latest Ampere generation A100 GPUs. You can start experimenting with AMP-enabled models and model scripts for A100, V100, T4 and other GPUs available at NVIDIA deep learning examples. NVIDIA PyTorch with native AMP support is available from the PyTorch NGC container version 20.06. We highly encourage existing apex.amp customers to transition to using torch.cuda.amp from PyTorch Core, available in the latest PyTorch 1.6 release.

Microsoft becomes maintainer of the Windows version of PyTorch

Along with the PyTorch 1.6 release, we are excited to announce that Microsoft has expanded its participation in the PyTorch community and is taking ownership of the development and maintenance of the PyTorch build for Windows.

According to the latest Stack Overflow developer survey, Windows remains the primary operating system for the developer community (46% Windows vs 28% MacOS). Jiachen Pu initially made a heroic effort to add support for PyTorch on Windows, but due to limited resources, Windows support for PyTorch has lagged behind other platforms. Lack of test coverage resulted in unexpected issues popping up every now and then. Some of the core tutorials, meant for new users to learn and adopt PyTorch, would fail to run. The installation experience was also not as smooth, with the lack of official PyPI support for PyTorch on Windows. Lastly, some of the PyTorch functionality was simply not available on the Windows platform, such as the TorchAudio domain library and distributed training support. To help alleviate this pain, Microsoft is happy to bring its Windows expertise to the table and bring PyTorch on Windows to its best possible self.

In the PyTorch 1.6 release, we have improved the core quality of the Windows build by bringing test coverage up to par with Linux for core PyTorch and its domain libraries and by automating tutorial testing. Thanks to the broader PyTorch community, which contributed TorchAudio support to Windows, we were able to add test coverage to all three domain libraries: TorchVision, TorchText and TorchAudio. In subsequent releases of PyTorch, we will continue improving the Windows experience based on community feedback and requests. So far, the feedback we received from the community points to distributed training support and a better installation experience using pip as the next areas of improvement.

In addition to the native Windows experience, Microsoft released a preview adding GPU compute support to Windows Subsystem for Linux (WSL) 2 distros, with a focus on enabling AI and ML developer workflows. WSL is designed for developers that want to run any Linux based tools directly on Windows. This preview enables valuable scenarios for a variety of frameworks and Python packages that utilize NVIDIA CUDA for acceleration and only support Linux. This means WSL customers using the preview can run native Linux based PyTorch applications on Windows unmodified without the need for a traditional virtual machine or a dual boot setup.

Getting started with PyTorch on Windows

It’s easy to get started with PyTorch on Windows. To install PyTorch using Anaconda with the latest GPU support, run the command below. To install different supported configurations of PyTorch, refer to the installation instructions on pytorch.org.

conda install pytorch torchvision cudatoolkit=10.2 -c pytorch

Once you install PyTorch, learn more by visiting the PyTorch Tutorials and documentation.

Getting started with PyTorch on Windows Subsystem for Linux

The preview of NVIDIA CUDA support in WSL is now available to Windows Insiders running Build 20150 or higher. In WSL, the command to install PyTorch using Anaconda is the same as the above command for native Windows. If you prefer pip, use the command below.

pip install torch torchvision

You can use the same tutorials and documentation inside your WSL environment as on native Windows. This functionality is still in preview, so if you run into issues with WSL please share feedback via the WSL GitHub repo or, for NVIDIA CUDA support, via NVIDIA’s Community Forum for CUDA on WSL.

Feedback

If you find gaps in the PyTorch experience on Windows, please let us know on the PyTorch discussion forum or file an issue on GitHub using the #module: windows label.

PyTorch feature classification changes

Traditionally, features in PyTorch were classified as either stable or experimental, with an implicit third option of testing bleeding edge features by building master or through installing nightly builds (available via prebuilt whls). This has, in a few cases, caused some confusion around the level of readiness, commitment to the feature and backward compatibility that can be expected from a user perspective. Moving forward, we’d like to better classify the three types of features as well as define explicitly here what each means from a user perspective.

New Feature Designations

We will continue to have three designations for features but, as mentioned, with a few changes: Stable, Beta (previously Experimental) and Prototype (previously Nightlies). Below is a brief description of each and a comment on the backward compatibility expected:

Stable

Nothing changes here. A stable feature means that the user value-add is or has been proven, the API isn’t expected to change, the feature is performant and all documentation exists to support end user adoption.

Level of commitment: We expect to maintain these features long term and generally there should be no major performance limitations or gaps in documentation. We also expect to maintain backwards compatibility (although breaking changes can happen and notice will be given one release ahead of time).

Beta

We previously called these features ‘Experimental’ and we found that this created confusion amongst some of the users. In the case of a Beta level feature, the value add, similar to a Stable feature, has been proven (e.g. pruning is a commonly used technique for reducing the number of parameters in NN models, independent of the implementation details of our particular choices) and the feature generally works and is documented. This feature is tagged as Beta because the API may change based on user feedback, because the performance needs to improve or because coverage across operators is not yet complete.

Level of commitment: We are committing to seeing the feature through to the Stable classification. We are however not committing to Backwards Compatibility. Users can depend on us providing a solution for problems in this area going forward, but the APIs and performance characteristics of this feature may change.

Prototype

Previously, these were features known only to developers who paid close attention to RFCs and to features that land in master. In this case the feature is not available as part of binary distributions like PyPI or Conda (except maybe behind run-time flags), but we would like to get high bandwidth partner feedback ahead of a real release in order to gauge utility and any changes we need to make to the UX. To test these kinds of features we would, depending on the feature, recommend building from master or using the nightly whls that are made available on pytorch.org. For each prototype feature, a pointer to draft docs or other instructions will be provided.

Level of commitment: We are committing to gathering high bandwidth feedback only. Based on this feedback and potential further engagement between community members, we as a community will decide if we want to upgrade the level of commitment or to fail fast. Additionally, while some of these features might be more speculative (e.g. new Frontend APIs), others have obvious utility (e.g. model optimization) but may be in a state where gathering feedback outside of high bandwidth channels is not practical, e.g. the feature may be in an earlier state, may be moving fast (PRs are landing too quickly to catch a major release) and/or generally active development is underway.

What changes for current features?

First and foremost, you can find these designations on pytorch.org/docs. We will also be linking any early stage features here for clarity.

Additionally, the following features will be reclassified under this new rubric:

  1. High Level Autograd APIs: Beta (was Experimental)
  2. Eager Mode Quantization: Beta (was Experimental)
  3. Named Tensors: Prototype (was Experimental)
  4. TorchScript/RPC: Prototype (was Experimental)
  5. Channels Last Memory Layout: Beta (was Experimental)
  6. Custom C++ Classes: Beta (was Experimental)
  7. PyTorch Mobile: Beta (was Experimental)
  8. Java Bindings: Beta (was Experimental)
  9. Torch.Sparse: Beta (was Experimental)

Cheers,

Joe, Greg, Woo & Jessica

PyTorch 1.6 released w/ Native AMP Support, Microsoft joins as maintainers for Windows

Today, we’re announcing the availability of PyTorch 1.6, along with updated domain libraries. We are also excited to announce the team at Microsoft is now maintaining Windows builds and binaries and will also be supporting the community on GitHub as well as the PyTorch Windows discussion forums.

The PyTorch 1.6 release includes a number of new APIs, tools for performance improvement and profiling, as well as major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training.
A few of the highlights include:

  1. Automatic mixed precision (AMP) training is now natively supported and a stable feature (see here for more details), thanks to NVIDIA’s contributions;
  2. Native TensorPipe support now added for tensor-aware, point-to-point communication primitives built specifically for machine learning;
  3. Added support for complex tensors to the frontend API surface;
  4. New profiling tools providing tensor-level memory consumption information;
  5. Numerous improvements and new features for both distributed data parallel (DDP) training and the remote procedure call (RPC) packages.

Additionally, from this release onward, features will be classified as Stable, Beta and Prototype. Prototype features are not included as part of the binary distribution and are instead available through either building from source, using nightlies or via compiler flag. You can learn more about what this change means in the post here. You can also find the full release notes here.

Performance & Profiling

[Stable] Automatic Mixed Precision (AMP) Training

AMP allows users to easily enable mixed precision training, delivering higher performance and memory savings of up to 50% on Tensor Core GPUs. Using the natively supported torch.cuda.amp API, AMP provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). Some ops, like linear layers and convolutions, are much faster in float16. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype.
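
As a small illustration of this per-op dtype selection (a sketch assuming a CUDA device; the exact op lists are in the autocast documentation), matrix multiplies run in float16 under autocast while reductions are kept in float32:

import torch

a = torch.randn(8, 8, device="cuda")
b = torch.randn(8, 8, device="cuda")
with torch.cuda.amp.autocast():
    c = a @ b       # matmul-like ops autocast to float16
    s = c.sum()     # reductions autocast to float32
print(c.dtype, s.dtype)  # expected: torch.float16 torch.float32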

  • Design doc (Link)
  • Documentation (Link)
  • Usage examples (Link)

[Beta] Fork/Join Parallelism

This release adds support for a language-level construct as well as runtime support for coarse-grained parallelism in TorchScript code. This support is useful for situations such as running models in an ensemble in parallel, or running bidirectional components of recurrent nets in parallel, and unlocks the computational power of parallel architectures (e.g. many-core CPUs) for task-level parallelism.

Parallel execution of TorchScript programs is enabled through two primitives: torch.jit.fork and torch.jit.wait. In the below example, we parallelize execution of foo:

import torch
from typing import List

def foo(x):
    return torch.neg(x)

@torch.jit.script
def example(x):
    futures = [torch.jit.fork(foo, x) for _ in range(100)]
    results = [torch.jit.wait(future) for future in futures]
    return torch.sum(torch.stack(results))

print(example(torch.ones([])))

  • Documentation (Link)

[Beta] Memory Profiler

The torch.autograd.profiler API now includes a memory profiler that lets you inspect the tensor memory cost of different operators inside your CPU and GPU models.

Here is an example usage of the API:

import torch
import torchvision.models as models
import torch.autograd.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)
with profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inputs)

# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
# ---------------------------  ---------------  ---------------  ---------------
# Name                         CPU Mem          Self CPU Mem     Number of Calls
# ---------------------------  ---------------  ---------------  ---------------
# empty                        94.79 Mb         94.79 Mb         123
# resize_                      11.48 Mb         11.48 Mb         2
# addmm                        19.53 Kb         19.53 Kb         1
# empty_strided                4 b              4 b              1
# conv2d                       47.37 Mb         0 b              20
# ---------------------------  ---------------  ---------------  ---------------

Distributed Training & RPC

[Beta] TensorPipe backend for RPC

PyTorch 1.6 introduces a new backend for the RPC module which leverages the TensorPipe library, a tensor-aware point-to-point communication primitive targeted at machine learning, intended to complement the current primitives for distributed training in PyTorch (Gloo, MPI, …) which are collective and blocking. The pairwise and asynchronous nature of TensorPipe lends itself to new networking paradigms that go beyond data parallel: client-server approaches (e.g., parameter server for embeddings, actor-learner separation in Impala-style RL, …) and model and pipeline parallel training (think GPipe), gossip SGD, etc.

# One-line change needed to opt in
torch.distributed.rpc.init_rpc(
    ...
    backend=torch.distributed.rpc.BackendType.TENSORPIPE,
)

# No changes to the rest of the RPC API
torch.distributed.rpc.rpc_sync(...)
  • Design doc (Link)
  • Documentation (Link)

[Beta] DDP+RPC

PyTorch Distributed supports two powerful paradigms: DDP for full sync data parallel training of models and the RPC framework which allows for distributed model parallelism. Previously, these two features worked independently and users couldn’t mix and match these to try out hybrid parallelism paradigms.

Starting in PyTorch 1.6, we’ve enabled DDP and RPC to work together seamlessly so that users can combine these two techniques to achieve both data parallelism and model parallelism. For example, users may want to place large embedding tables on parameter servers and use the RPC framework for embedding lookups, while storing smaller dense parameters on trainers and using DDP to synchronize them. Below is a simple code snippet.

# On each trainer

remote_emb = create_emb(on="ps", ...)
ddp_model = DDP(dense_model)

for data in batch:
    with torch.distributed.autograd.context():
        res = remote_emb(data)
        loss = ddp_model(res)
        torch.distributed.autograd.backward([loss])
  • DDP+RPC Tutorial (Link)
  • Documentation (Link)
  • Usage Examples (Link)

[Beta] RPC – Asynchronous User Functions

RPC asynchronous user functions support the ability to yield and resume on the server side when executing a user-defined function. Prior to this feature, when a callee processed a request, one RPC thread waited until the user function returned. If the user function contained IO (e.g., a nested RPC) or signaling (e.g., waiting for another request to unblock), the corresponding RPC thread would sit idle waiting for these events. As a result, some applications had to use a very large number of threads and send additional RPC requests, which could potentially lead to performance degradation. To make a user function yield on such events, applications need to: 1) decorate the function with the @rpc.functions.async_execution decorator; and 2) let the function return a torch.futures.Future and install the resume logic as callbacks on the Future object. See below for an example:

@rpc.functions.async_execution
def async_add_chained(to, x, y, z):
    return rpc.rpc_async(to, torch.add, args=(x, y)).then(
        lambda fut: fut.wait() + z
    )

ret = rpc.rpc_sync(
    "worker1",
    async_add_chained,
    args=("worker2", torch.ones(2), 1, 1)
)

print(ret)  # prints tensor([3., 3.])
  • Tutorial for performant batch RPC using Asynchronous User Functions (Link)
  • Documentation (Link)
  • Usage examples (Link)

Frontend API Updates

[Beta] Complex Numbers

The PyTorch 1.6 release brings beta-level support for complex tensors, including the torch.complex64 and torch.complex128 dtypes. A complex number is a number that can be expressed in the form a + bj, where a and b are real numbers, and j is a solution of the equation x^2 = −1. Complex numbers frequently occur in mathematics and engineering, especially in signal processing, and complex neural networks are an active area of research. The beta release of complex tensors will support common PyTorch and complex tensor functionality, plus functions needed by Torchaudio, ESPnet and others. While this is an early version of this feature, and we expect it to improve over time, the overall goal is to provide a NumPy-compatible user experience that leverages PyTorch’s ability to run on accelerators and work with autograd to better support the scientific community.
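
For illustration, here is a small sketch of what the beta complex tensor support looks like (exact operator coverage may vary across releases):

import torch

# Construct a complex tensor and inspect its components (illustrative sketch)
z = torch.tensor([1 + 2j, 3 - 4j], dtype=torch.complex64)
print(z.dtype)           # torch.complex64
print(z.real, z.imag)    # real and imaginary parts as float tensors
print(torch.abs(z))      # element-wise magnitude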

Updated Domain Libraries

torchvision 0.7

torchvision 0.7 introduces two new pretrained semantic segmentation models, FCN ResNet50 and DeepLabV3 ResNet50, both trained on COCO and with smaller memory footprints than the ResNet101 backbone. We also introduced support for AMP (Automatic Mixed Precision) autocasting for torchvision models and operators, which automatically selects the floating point precision for different GPU operations to improve performance while maintaining accuracy.

  • Release notes (Link)

torchaudio 0.6

torchaudio now officially supports Windows. This release also introduces a new model module (with wav2letter included), new functionals (contrast, cvm, dcshift, overdrive, vad, phaser, flanger, biquad), datasets (GTZAN, CMU), and a new optional sox backend with support for TorchScript.

  • Release notes (Link)

Additional updates

HACKATHON

The Global PyTorch Summer Hackathon is back! This year, teams can compete in three categories virtually:

  1. PyTorch Developer Tools: Tools or libraries designed to improve productivity and efficiency of PyTorch for researchers and developers
  2. Web/Mobile Applications powered by PyTorch: Applications with web/mobile interfaces and/or embedded devices powered by PyTorch
  3. PyTorch Responsible AI Development Tools: Tools, libraries, or web/mobile apps for responsible AI development

This is a great opportunity to connect with the community and practice your machine learning skills.

LPCV Challenge

The submission deadline for the 2020 CVPR Low-Power Vision Challenge (LPCV) – Online Track for UAV video is coming up shortly. You have until July 31, 2020 to build a system that can accurately discover and recognize characters in video captured by an unmanned aerial vehicle (UAV), using PyTorch and a Raspberry Pi 3B+.

Prototype Features

To reiterate, Prototype features in PyTorch are early features that we are looking to gather feedback on, gauge the usefulness of, and improve ahead of graduating them to Beta or Stable. The following features are not part of the PyTorch 1.6 release and instead are available in nightlies with separate docs/tutorials to help facilitate early usage and feedback.

Distributed RPC/Profiler

Allow users to profile training jobs that use torch.distributed.rpc using the autograd profiler, and remotely invoke the profiler in order to collect profiling information across different nodes. The RFC can be found here and a short recipe on how to use this feature can be found here.

TorchScript Module Freezing

Module Freezing is the process of inlining module parameters and attribute values into the TorchScript internal representation. Parameter and attribute values are treated as final values and cannot be modified in the frozen module. The PR for this feature can be found here and a short tutorial on how to use this feature can be found here.
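
As a rough sketch, assuming the torch.jit.freeze entry point described in the linked tutorial (this is a prototype API and may change):

import torch
import torchvision

# Script a model in eval mode, then freeze it so parameters and attributes
# are inlined into the TorchScript graph (prototype API; subject to change)
model = torchvision.models.resnet18().eval()
scripted = torch.jit.script(model)
frozen = torch.jit.freeze(scripted)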

Graph Mode Quantization

Eager mode quantization requires users to make changes to their model, including explicitly quantizing activations, specifying module fusion, and rewriting uses of torch ops as Functional Modules; quantization of functionals is not supported. If we can trace or script the model, then quantization can be done automatically with graph mode quantization, without any of the complexities of eager mode, and it is configurable through a qconfig_dict. A tutorial on how to use this feature can be found here.

Quantization Numerical Suite

Quantization is good when it works, but it’s difficult to know what’s wrong when it doesn’t satisfy the expected accuracy. A prototype is now available for a Numerical Suite that measures comparison statistics between quantized modules and float modules. It is available for testing in eager mode and on CPU only, with more support coming. A tutorial on how to use this feature can be found here.

Cheers!

Team PyTorch

Read More

Updates & Improvements to PyTorch Tutorials

PyTorch.org provides researchers and developers with documentation, installation instructions, latest news, community projects, tutorials, and more. Today, we are introducing usability and content improvements including tutorials in additional categories, a new recipe format for quickly referencing common topics, sorting using tags, and an updated homepage.

Let’s take a look at them in detail.

TUTORIALS HOME PAGE UPDATE

The tutorials home page now provides clear actions that developers can take. For new PyTorch users, there is an easy-to-discover button to take them directly to “A 60 Minute Blitz”. Right next to it, there is a button to view all recipes which are designed to teach specific features quickly with examples.

In addition to the existing left navigation bar, tutorials can now be quickly filtered by multi-select tags. For example, to view all tutorials related to “Production” and “Quantization”, simply select the “Production” and “Quantization” filters to narrow the results.

Additional resources can also be found at the bottom of the Tutorials homepage.

PYTORCH RECIPES

Recipes are new bite-sized, actionable examples designed to teach researchers and developers how to use specific PyTorch features. Some notable new recipes include:

View the full recipes here.

LEARNING PYTORCH

This section includes tutorials designed for users new to PyTorch. Based on community feedback, we have made updates to the current Deep Learning with PyTorch: A 60 Minute Blitz tutorial, one of our most popular tutorials for beginners. Upon completion, readers will understand what PyTorch and neural networks are and be able to build and train a simple image classification network. Updates include adding explanations to clarify output meanings and linking back to where users can read more in the docs, cleaning up confusing syntax errors, and reconstructing and explaining new concepts for easier readability.

DEPLOYING MODELS IN PRODUCTION

This section includes tutorials for developers looking to take their PyTorch models to production. The tutorials include:

FRONTEND APIS

PyTorch provides a number of frontend API features that can help developers to code, debug, and validate their models more efficiently. This section includes tutorials that teach what these features are and how to use them. Some tutorials to highlight:

MODEL OPTIMIZATION

Deep learning models often consume large amounts of memory, power, and compute due to their complexity. This section provides tutorials for model optimization:

PARALLEL AND DISTRIBUTED TRAINING

PyTorch provides features that can accelerate performance in research and production such as native support for asynchronous execution of collective operations and peer-to-peer communication that is accessible from Python and C++. This section includes tutorials on parallel and distributed training:

Making these improvements is just the first step of improving PyTorch.org for the community. Please submit your suggestions here.

Cheers,

Team PyTorch

Read More

PyTorch library updates including new model serving library

Along with the PyTorch 1.5 release, we are announcing new libraries for high-performance PyTorch model serving and tight integration with TorchElastic and Kubernetes. Additionally, we are releasing updated packages for torch_xla (Google Cloud TPUs), torchaudio, torchvision, and torchtext. All of these new libraries and enhanced capabilities are available today and accompany all of the core features released in PyTorch 1.5.

TorchServe (Experimental)

TorchServe is a flexible and easy-to-use library for serving PyTorch models in production, performantly and at scale. It is cloud and environment agnostic and supports features such as multi-model serving, logging, metrics, and the creation of RESTful endpoints for application integration. TorchServe was jointly developed by engineers from Facebook and AWS with feedback and engagement from the broader PyTorch community. The experimental release of TorchServe is available today. Some of the highlights include:

  • Support for both Python-based and TorchScript-based models
  • Default handlers for common use cases (e.g., image segmentation, text classification) as well as the ability to write custom handlers for other use cases
  • Model versioning, the ability to run multiple versions of a model at the same time, and the ability to roll back to an earlier version
  • The ability to package a model, learned weights, and supporting files (e.g., class mappings, vocabularies) into a single, persistent artifact (a.k.a. the “model archive”)
  • Robust management capability, allowing full configuration of models, versions, and individual worker threads via command line, config file, or run-time API
  • Automatic batching of individual inferences across HTTP requests
  • Logging including common metrics, and the ability to incorporate custom metrics
  • Ready-made Dockerfile for easy deployment
  • HTTPS support for secure deployment

To learn more about the APIs and the design of this feature, see the links below:

  • See here for a full multi-node deployment reference architecture.
  • The full documentation can be found here.

TorchElastic integration with Kubernetes (Experimental)

TorchElastic is a proven library for training large-scale deep neural networks within companies like Facebook, where the ability to dynamically adapt to server availability and scale as new compute resources come online is critical. Kubernetes enables customers using machine learning frameworks like PyTorch to run training jobs distributed across fleets of powerful GPU instances like the Amazon EC2 P3. Distributed training jobs, however, are not fault-tolerant, and a job cannot continue if a node failure or reclamation interrupts training. Further, jobs cannot start without acquiring all required resources, or scale up and down without being restarted. This lack of resiliency and flexibility results in increased training time and costs from idle resources. TorchElastic addresses these limitations by enabling distributed training jobs to be executed in a fault-tolerant and elastic manner. Until today, Kubernetes users needed to manage the Pods and Services required for TorchElastic training jobs manually.

Through the joint collaboration of engineers at Facebook and AWS, TorchElastic, adding elasticity and fault tolerance, is now supported using vanilla Kubernetes and through the managed EKS service from AWS.

To learn more see the TorchElastic repo for the controller implementation and docs on how to use it.

torch_xla 1.5 now available

torch_xla is a Python package that uses the XLA linear algebra compiler to accelerate the PyTorch deep learning framework on Cloud TPUs and Cloud TPU Pods. torch_xla aims to give PyTorch users the ability to do everything they can do on GPUs on Cloud TPUs as well while minimizing changes to the user experience. The project began with a conversation at NeurIPS 2017 and gathered momentum in 2018 when teams from Facebook and Google came together to create a proof of concept. We announced this collaboration at PTDC 2018 and made the PyTorch/XLA integration broadly available at PTDC 2019. The project already has 28 contributors, nearly 2k commits, and a repo that has been forked more than 100 times.

This release of torch_xla is aligned and tested with PyTorch 1.5 to reduce friction for developers and to provide a stable and mature PyTorch/XLA stack for training models using Cloud TPU hardware. You can try it for free in your browser on an 8-core Cloud TPU device with Google Colab, and you can use it at a much larger scale on Google Cloud.
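
For a flavor of the API, here is a minimal sketch that allocates a tensor on a TPU core (assumes torch_xla is installed, e.g., in a Colab TPU runtime):

import torch
import torch_xla.core.xla_model as xm

# Acquire the default XLA device (a Cloud TPU core) and run an op on it
device = xm.xla_device()
t = torch.randn(2, 2, device=device)
print(t.device)  # e.g., xla:1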

See the full torch_xla release notes here. Full docs and tutorials can be found here and here.

PyTorch Domain Libraries

torchaudio, torchvision, and torchtext complement PyTorch with common datasets, models, and transforms in each domain area. We’re excited to share new releases for all three domain libraries alongside PyTorch 1.5 and the rest of the library updates. For this release, all three domain libraries are removing support for Python2 and will support Python3 only.

torchaudio 0.5

The torchaudio 0.5 release includes new transforms, functionals, and datasets. Highlights for the release include:

  • Added the Griffin-Lim functional and transform, InverseMelScale and Vol transforms, and DB_to_amplitude (a short sketch of the new Griffin-Lim transform follows this list).
  • Added support for allpass, fade, bandpass, bandreject, band, treble, deemph, and riaa filters and transformations.
  • Added new datasets, including LJSpeech and SpeechCommands.
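
Here is a short sketch of the new Griffin-Lim transform (the parameters and dummy waveform are illustrative):

import torch
import torchaudio

# Compute a spectrogram from a dummy waveform, then invert it back to audio
# with the Griffin-Lim transform
waveform = torch.randn(1, 16000)
spec = torchaudio.transforms.Spectrogram(n_fft=400)(waveform)
reconstructed = torchaudio.transforms.GriffinLim(n_fft=400)(spec)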

See the full release notes here; full docs can be found here.

torchvision 0.6

The torchvision 0.6 release includes updates to datasets, models and a significant number of bug fixes. Highlights include:

  • Faster R-CNN now supports negative samples, which allows the feeding of images without annotations at training time.
  • Added an aligned flag to RoIAlign to match Detectron2 (see the short sketch after this list).
  • Refactored abstractions for the C++ video decoder.
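
Here is a short sketch of the new aligned flag (the feature map and box values are arbitrary):

import torch
from torchvision.ops import roi_align

# Pool a single region of interest with the new aligned=True behavior
features = torch.randn(1, 3, 32, 32)
# Each box is (batch_index, x1, y1, x2, y2)
boxes = torch.tensor([[0.0, 4.0, 4.0, 20.0, 20.0]])
pooled = roi_align(features, boxes, output_size=(7, 7), spatial_scale=1.0,
                   sampling_ratio=2, aligned=True)
print(pooled.shape)  # torch.Size([1, 3, 7, 7])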

See the full release notes here; full docs can be found here.

torchtext 0.6

The torchtext 0.6 release includes a number of bug fixes and improvements to documentation. Based on user feedback, the dataset abstractions are also currently being redesigned. Highlights for the release include:

  • Fixed an issue related to the SentencePiece dependency in the conda package.
  • Added support for the experimental IMDB dataset to allow a custom vocab.
  • A number of documentation updates, including the addition of a code of conduct and deduplication of the docs on the torchtext site.

Your feedback and discussions on the experimental datasets API are welcome; you can send them to issue #664. We would also like to highlight the pull request here, where the latest dataset abstraction is applied to the text classification datasets. Feedback there will be beneficial to finalizing this abstraction.

See the full release notes here; full docs can be found here.

We’d like to thank the entire PyTorch team, the Amazon team and the community for all their contributions to this work.

Cheers!

Team PyTorch

Read More

PyTorch 1.5 released, new and updated APIs including C++ frontend API parity with Python

Today, we’re announcing the availability of PyTorch 1.5, along with new and updated libraries. This release includes several major new API additions and improvements. PyTorch now includes a significant update to the C++ frontend, a ‘channels last’ memory format for computer vision models, and a stable release of the distributed RPC framework used for model-parallel training. The release also has new autograd APIs for computing hessians and jacobians, and an API for creating Custom C++ Classes that was inspired by pybind11.

You can find the detailed release notes here.

C++ Frontend API (Stable)

The C++ frontend API is now at parity with Python, and the features overall have been moved to ‘stable’ (previously tagged as experimental). Some of the major highlights include:

  • Now with ~100% coverage and docs for the C++ torch::nn module/functional APIs, users can easily translate their models from the Python API to the C++ API, making the model authoring experience much smoother.
  • Optimizers in C++ had deviated from their Python equivalents: C++ optimizers couldn’t take parameter groups as input while the Python ones could. Additionally, step function implementations were not exactly the same. With the 1.5 release, C++ optimizers will always behave the same as their Python equivalents.
  • The lack of a tensor multi-dim indexing API in C++ is a well-known issue and had resulted in many posts in the PyTorch GitHub issue tracker and forum. The previous workaround was to use a combination of narrow / select / index_select / masked_select, which was clunky and error-prone compared to the Python API’s elegant tensor[:, 0, ..., mask] syntax. With the 1.5 release, users can use tensor.index({Slice(), 0, "...", mask}) to achieve the same purpose.

‘Channels last’ memory format for Computer Vision models (Experimental)

‘Channels last’ memory layout unlocks the ability to use performance-efficient convolution algorithms and hardware (NVIDIA’s Tensor Cores, FBGEMM, QNNPACK). Additionally, it is designed to automatically propagate through the operators, which allows easy switching between memory layouts.
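
A minimal sketch of opting a model and its input into the channels last format (illustrative only):

import torch
import torchvision

# Convert a model and its input to channels last; tensor shapes are unchanged,
# only the underlying memory layout (NHWC) differs
model = torchvision.models.resnet18().to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
out = model(x)
print(x.is_contiguous(memory_format=torch.channels_last))  # True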

Learn more here on how to write memory format aware operators.

Custom C++ Classes (Experimental)

This release adds a new API, torch::class_, for binding custom C++ classes into TorchScript and Python simultaneously. This API is almost identical in syntax to pybind11. It allows users to expose their C++ class and its methods to the TorchScript type system and runtime system such that they can instantiate and manipulate arbitrary C++ objects from TorchScript and Python. An example C++ binding:

template <class T>
struct MyStackClass : torch::CustomClassHolder {
  std::vector<T> stack_;
  MyStackClass(std::vector<T> init) : stack_(std::move(init)) {}

  void push(T x) {
    stack_.push_back(x);
  }
  T pop() {
    auto val = stack_.back();
    stack_.pop_back();
    return val;
  }
};

static auto testStack =
  torch::class_<MyStackClass<std::string>>("myclasses", "MyStackClass")
      .def(torch::init<std::vector<std::string>>())
      .def("push", &MyStackClass<std::string>::push)
      .def("pop", &MyStackClass<std::string>::pop)
      .def("size", [](const c10::intrusive_ptr<MyStackClass>& self) {
        return self->stack_.size();
      });

This exposes a class that you can use in Python and TorchScript like so:

@torch.jit.script
def do_stacks(s : torch.classes.myclasses.MyStackClass):
    s2 = torch.classes.myclasses.MyStackClass(["hi", "mom"])
    print(s2.pop()) # "mom"
    s2.push("foobar")
    return s2 # ["hi", "foobar"]

You can try it out in the tutorial here.

Distributed RPC framework APIs (Now Stable)

The Distributed RPC framework was launched as experimental in the 1.4 release, and with this release it is marked stable and no longer experimental. This work involved a lot of enhancements and bug fixes to make the distributed RPC framework more reliable and robust overall, as well as adding a couple of new features, including profiling support, using TorchScript functions in RPC, and several enhancements for ease of use. Below is an overview of the various APIs within the framework:

RPC API

The RPC API allows users to specify functions to run and objects to be instantiated on remote nodes. These functions are transparently recorded so that gradients can backpropagate through remote nodes using Distributed Autograd.

Distributed Autograd

Distributed Autograd connects the autograd graph across several nodes and allows gradients to flow through during the backwards pass. Gradients are accumulated into a context (as opposed to the .grad field as with Autograd) and users must specify their model’s forward pass under a with dist_autograd.context() manager in order to ensure that all RPC communication is recorded properly. Currently, only FAST mode is implemented (see here for the difference between FAST and SMART modes).

Distributed Optimizer

The distributed optimizer creates RRefs to optimizers on each worker with parameters that require gradients, and then uses the RPC API to run the optimizer remotely. The user must collect all remote parameters and wrap them in an RRef, as this is required input to the distributed optimizer. The user must also specify the distributed autograd context_id so that the optimizer knows in which context to look for gradients.
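
To tie the pieces together, here is a condensed sketch of the flow (it assumes init_rpc has already been called on this worker and that a peer named "worker1" exists; the optimizer step is shown as a comment because it requires remote parameter RRefs to have been collected first):

import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

# Run the forward and backward passes inside a distributed autograd context so
# that RPCs are recorded and gradients accumulate into the context, not .grad
with dist_autograd.context() as context_id:
    t1 = torch.ones(2, requires_grad=True)
    t2 = rpc.rpc_sync("worker1", torch.add, args=(t1, torch.ones(2)))
    loss = t2.sum()
    dist_autograd.backward(context_id, [loss])
    # With a DistributedOptimizer built over remote parameter RRefs, the update
    # would then be: dist_optim.step(context_id)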

Learn more about distributed RPC framework APIs here.

New High level autograd API (Experimental)

PyTorch 1.5 brings new functions including jacobian, hessian, jvp, vjp, hvp and vhp to the torch.autograd.functional submodule. This feature builds on the current API and allows the user to easily compute these quantities.
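
For example, here is a quick sketch using jacobian and hessian:

import torch
from torch.autograd.functional import jacobian, hessian

# f maps a vector to a scalar, so its jacobian is the gradient (2 * x)
# and its hessian is 2 * I
def f(x):
    return (x ** 2).sum()

x = torch.randn(3)
print(jacobian(f, x))
print(hessian(f, x))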

Detailed design discussion on GitHub can be found here.

Python 2 no longer supported

Starting with PyTorch 1.5.0, we will no longer support Python 2, specifically version 2.7. Going forward, support for Python will be limited to Python 3, specifically Python 3.5, 3.6, 3.7 and 3.8 (first enabled in PyTorch 1.4.0).

We’d like to thank the entire PyTorch team and the community for all their contributions to this work.

Cheers!

Team PyTorch

Read More