Overview of PyTorch Autograd Engine

This blog post is based on PyTorch version 1.8, although it should apply for older versions too, since most of the mechanics have remained constant.

To help understand the concepts explained here, it is recommended that you read the awesome blog post by @ezyang: PyTorch internals if you are not familiar with PyTorch architecture components such as ATen or c10d.

What is autograd?

Background

PyTorch computes the gradient of a function with respect to the inputs by using automatic differentiation. Automatic differentiation is a technique that, given a computational graph, calculates the gradients of the inputs. Automatic differentiation can be performed in two different ways: forward and reverse mode. Forward mode means that we calculate the gradients along with the result of the function, while reverse mode requires us to evaluate the function first, and then we calculate the gradients starting from the output. While both modes have their pros and cons, reverse mode is the de facto choice in deep learning since the number of outputs (typically a single scalar loss) is much smaller than the number of inputs, which allows for much more efficient computation. Check [3] to learn more about this.

Automatic differentiation relies on a classic calculus formula known as the chain-rule. The chain rule allows us to calculate very complex derivatives by splitting them and recombining them later.

Formally speaking, given a composite function $f(g(x))$, we can calculate its derivative as $\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \cdot \frac{\partial g(x)}{\partial x}$. This result is what makes automatic differentiation work.
By combining the derivatives of the simpler functions that compose a larger one, such as a neural network, it is possible to compute the exact value of the gradient at a given point rather than relying on a numerical approximation, which would require multiple perturbations of the input to obtain a value.
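
To make the contrast with a numerical approximation concrete, here is a minimal sketch (the function and evaluation point are chosen purely for illustration) that compares the exact reverse-mode gradient with a central finite-difference estimate:

import torch

def f(x):
    return torch.log(x * 3.0)

x = torch.tensor(0.5, requires_grad=True)
f(x).backward()                      # exact reverse-mode gradient: 1 / 0.5 = 2.0

eps = 1e-4                           # the numerical route needs extra function evaluations
numeric = (f(torch.tensor(0.5 + eps)) - f(torch.tensor(0.5 - eps))) / (2 * eps)

print(x.grad.item(), numeric.item())  # both are approximately 2.0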

To get the intuition of how the reverse mode works, let’s look at a simple function $f(x, y) = \log(x \cdot y)$. Figure 1 shows its computational graph, where the inputs x, y on the left flow through a series of operations to generate the output w.

Figure 1: Computational graph of f(x, y) = log(x*y)

The automatic differentiation engine will normally execute this graph. It will also extend it to calculate the derivatives of w with respect to the inputs x, y, and the intermediate result v.

The example function can be decomposed into f and g, where $w = f(v) = \log(v)$ and $v = g(x, y) = x \cdot y$. Every time the engine executes an operation in the graph, the derivative of that operation is added to the graph to be executed later in the backward pass. Note that the engine knows the derivatives of the basic functions.

In the example above, when multiplying x and y to obtain v, the engine will extend the graph to calculate the partial derivatives of the multiplication by using the multiplication derivative definition that it already knows: $\frac{\partial v}{\partial x} = y$ and $\frac{\partial v}{\partial y} = x$. The resulting extended graph is shown in Figure 2, where the MultDerivative node also calculates the product of the resulting gradients by an input gradient to apply the chain rule; this will be explicitly seen in the following operations. Note that the backward graph (green nodes) will not be executed until all the forward steps are completed.

Figure 2: Computational graph extended after executing the multiplication

Continuing, the engine now calculates the $\log(v)$ operation and extends the graph again with the log derivative, which it knows to be $\frac{\partial w}{\partial v} = \frac{1}{v}$. This is shown in Figure 3. This operation generates the result $\frac{\partial w}{\partial v}$ that, when propagated backward and multiplied by the multiplication derivative as the chain rule states, generates the derivatives $\frac{\partial w}{\partial x}$ and $\frac{\partial w}{\partial y}$.

Figure 3: Computational graph extended after executing the logarithm

The original computation graph is extended with a new dummy variable z that is equal to w. The derivative of z with respect to w is 1, as they are the same variable; this trick allows us to apply the chain rule to calculate the derivatives of the inputs. After the forward pass is complete, we start the backward pass by supplying the initial value of 1.0 for $\frac{\partial z}{\partial w}$. This is shown in Figure 4.

Figure 4: Computational graph extended for reverse auto differentiation

Then, following the green graph, we execute the LogDerivative operation that the automatic differentiation engine introduced, and multiply its result by $\frac{\partial z}{\partial w}$ to obtain the gradient $\frac{\partial z}{\partial v}$ as the chain rule states. Next, the multiplication derivative is executed in the same way, and the desired derivatives $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ are finally obtained.

Formally, what we are doing here, and PyTorch autograd engine also does, is computing a Jacobian-vector product (Jvp) to calculate the gradients of the model parameters, since the model parameters and inputs are vectors.

The Jacobian-vector product

When we calculate the gradient of a vector-valued function (a function whose inputs and outputs are vectors), we are essentially constructing a Jacobian matrix:

$$J = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$

Thanks to the chain rule, multiplying the Jacobian matrix of a function by a vector with the previously calculated gradients of a scalar function results in the gradients of the scalar output with respect to the vector-valued function inputs.

As an example, let’s look at some functions in python notation to show how the chain rule applies.

from math import log, sin

def f(x1, x2):
    a = x1 * x2
    y1 = log(a)
    y2 = sin(x2)
    return (y1, y2)


def g(y1, y2):
    return y1 * y2

Now, if we derive this by hand using the chain rule and the definitions of the derivatives, we obtain the following set of identities that we can directly plug into the Jacobian matrix of f:

$$\frac{\partial y_1}{\partial x_1} = \frac{1}{x_1} \qquad \frac{\partial y_1}{\partial x_2} = \frac{1}{x_2} \qquad \frac{\partial y_2}{\partial x_1} = 0 \qquad \frac{\partial y_2}{\partial x_2} = \cos(x_2)$$

$$J_f = \begin{bmatrix} \frac{1}{x_1} & \frac{1}{x_2} \\ 0 & \cos(x_2) \end{bmatrix}$$

Next, let’s consider the gradients of the scalar function $g(y_1, y_2) = y_1 y_2$:

$$\frac{\partial g}{\partial y_1} = y_2 \qquad \frac{\partial g}{\partial y_2} = y_1$$

If we now calculate the transpose-Jacobian vector product obeying the chain rule, we obtain the following expression:

$$J_f^\top \cdot v = \begin{bmatrix} \frac{1}{x_1} & 0 \\ \frac{1}{x_2} & \cos(x_2) \end{bmatrix} \begin{bmatrix} y_2 \\ y_1 \end{bmatrix} = \begin{bmatrix} \frac{\sin(x_2)}{x_1} \\ \frac{\sin(x_2)}{x_2} + \log(x_1 x_2)\cos(x_2) \end{bmatrix}$$

Evaluating the Jvp for $x_1 = 0.5$ and $x_2 = 0.75$ yields the result:

$$\begin{bmatrix} 1.3633 \\ 0.1912 \end{bmatrix}$$

We can execute the same expression in PyTorch and calculate the gradient of the input:

>>> import torch
>>> x = torch.tensor([0.5, 0.75], requires_grad=True)
>>> y = torch.log(x[0] * x[1]) * torch.sin(x[1])
>>> y.backward(torch.tensor(1.0))
>>> x.grad
tensor([1.3633, 0.1912])

The result is the same as our hand-calculated Jacobian-vector product!
However, PyTorch never constructed the matrix, as it could grow prohibitively large; instead, it created a graph of operations that it traversed backward while applying the Jacobian-vector products defined in tools/autograd/derivatives.yaml.
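
To make the connection explicit, here is a minimal sketch that materializes the transpose Jacobian of f at the same point by hand, using the identities derived above, and multiplies it by the gradient of g. In practice PyTorch never builds this matrix:

import math
import torch

x1, x2 = 0.5, 0.75
y1, y2 = math.log(x1 * x2), math.sin(x2)

# Transpose Jacobian of f at (x1, x2), filled in by hand from the identities above.
Jf_T = torch.tensor([[1.0 / x1, 0.0],
                     [1.0 / x2, math.cos(x2)]])

# Gradient of the scalar function g(y1, y2) = y1 * y2 with respect to (y1, y2).
v = torch.tensor([y2, y1])

print(Jf_T @ v)   # tensor([1.3633, 0.1912]) -- the same values as x.grad above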

Going through the graph

Every time PyTorch executes an operation, the autograd engine constructs the graph to be traversed backward.
Reverse mode automatic differentiation starts by adding a scalar variable z at the end so that $\frac{\partial z}{\partial w} = 1$, as we saw in the introduction. This is the initial gradient value that is supplied to the Jvp engine calculation, as we saw in the section above.

In PyTorch, the initial gradient is explicitly set by the user when calling the backward method.

Then, the Jvp calculation starts but it never constructs the matrix. Instead, when PyTorch records the computational graph, the derivatives of the executed forward operations are added (Backward Nodes). Figure 5 shows a backward graph generated by the execution of the functions f and g seen before.

Figure 5: Computational Graph extended with the backward pass

Once the forward pass is done, the results are used in the backward pass where the derivatives in the computational graph are executed. The basic derivatives are stored in the tools/autograd/derivatives.yaml file and they are not regular derivatives but the Jvp versions of them [3]. They take their primitive function inputs and outputs as parameters along with the gradient of the function outputs with respect to the final outputs. By repeatedly multiplying the resulting gradients by the next Jvp derivatives in the graph, the gradients up to the inputs will be generated following the chain rule.

Figure 6: How the chain rule is applied in backward differentiation

Figure 6 represents the process by showing the chain rule. We started with a value of 1.0, as detailed before, which is the already calculated gradient highlighted in green. Then we move to the next node in the graph. The backward function registered in derivatives.yaml calculates the derivative of the associated forward operation, highlighted in red, and multiplies it by the incoming gradient. By the chain rule this produces the gradient with respect to that node's inputs, which will be the already calculated gradient (green) when we process the next backward node in the graph.

You may also have noticed that in Figure 5 there is a gradient for one of the inputs (x2) that is generated from two different sources. When two different functions share an input, the gradients flowing back from each of them are accumulated for that input, and calculations using that gradient can’t proceed until all the paths have been aggregated together.

Let’s see an example of how the derivatives are stored in PyTorch.

Suppose that we are currently processing the backward propagation of the log function, in the LogBackward node in Figure 2. The derivative of log in derivatives.yaml is specified as grad.div(self.conj()). grad is the already calculated gradient and self.conj() is the complex conjugate of the input vector. For complex numbers PyTorch calculates a special derivative called the conjugate Wirtinger derivative [6]. This derivative takes the complex number and its conjugate and, through the manipulations described in [6], yields the direction of steepest descent when plugged into optimizers.

This code translates to $\frac{\partial z}{\partial w} \cdot \frac{1}{v}$, the corresponding green and red squares in Figure 3. Continuing, the autograd engine will execute the next operation, the backward of the multiplication. As before, the inputs are the original function’s inputs and the gradient calculated from the backward step. This step will keep repeating until we reach the gradient with respect to the inputs and the computation will be finished. The gradient of x2 is only completed once the multiplication and sin gradients are added together. As you can see, we computed the equivalent of the Jvp but without constructing the matrix.
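
As a quick sanity check of that entry, the following sketch (using the intermediate value v = 0.375 from the running example with x = 0.5 and y = 0.75) confirms that autograd's result for log matches grad divided by the input:

import torch

v = torch.tensor([0.375], requires_grad=True)     # the intermediate value x * y
w = torch.log(v)

grad_output = torch.tensor([1.0])                  # the already calculated incoming gradient
(grad_input,) = torch.autograd.grad(w, v, grad_output)

print(grad_input, grad_output / v.detach())        # both equal 1 / 0.375 = 2.6667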

In the next post we will dive inside the PyTorch code to see how this graph is constructed and where the relevant pieces are, should you want to experiment with it!

References

  1. https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
  2. https://web.stanford.edu/class/cs224n/readings/gradient-notes.pdf
  3. https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/slides/lec10.pdf
  4. https://mustafaghali11.medium.com/how-pytorch-backward-function-works-55669b3b7c62
  5. https://indico.cern.ch/event/708041/contributions/3308814/attachments/1813852/2963725/automatic_differentiation_and_deep_learning.pdf
  6. https://pytorch.org/docs/stable/notes/autograd.html#complex-autograd-doc
  7. https://www.cs.ubc.ca/~fwood/CS340/lectures/AD1.pdf (Recommended: shows why the backprop is formally expressed with the Jacobian)

Read More

Everything you need to know about TorchVision’s MobileNetV3 implementation

In TorchVision v0.9, we released a series of new mobile-friendly models that can be used for Classification, Object Detection and Semantic Segmentation. In this article, we will dig deep into the code of the models, share notable implementation details, explain how we configured and trained them, and highlight important tradeoffs we made during their tuning. Our goal is to disclose technical details that typically remain undocumented in the original papers and repos of the models.

Network Architecture

The implementation of the MobileNetV3 architecture follows closely the original paper. It is customizable and offers different configurations for building Classification, Object Detection and Semantic Segmentation backbones. It was designed to follow a similar structure to MobileNetV2 and the two share common building blocks.

Off-the-shelf, we offer the two variants described on the paper: the Large and the Small. Both are constructed using the same code with the only difference being their configuration which describes the number of blocks, their sizes, their activation functions etc.

Configuration parameters

Even though one can write a custom InvertedResidual setting and pass it to the MobileNetV3 class directly, for the majority of applications we can adapt the existing configs by passing parameters to the model building methods. Some of the key configuration parameters are the following:

  • The width_mult parameter is a multiplier that affects the number of channels of the model. The default value is 1 and by increasing or decreasing it one can change the number of filters of all convolutions, including the ones of the first and last layers. The implementation ensures that the number of filters is always a multiple of 8. This is a hardware optimization trick which allows for faster vectorization of operations. A simplified sketch of this rounding rule is shown after this list.

  • The reduced_tail parameter halves the number of channels on the last blocks of the network. This version is used by some Object Detection and Semantic Segmentation models. It’s a speed optimization which is described on the MobileNetV3 paper and reportedly leads to a 15% latency reduction without a significant negative effect on accuracy.

  • The dilated parameter affects the last 3 InvertedResidual blocks of the model and turns their normal depthwise Convolutions to Atrous Convolutions. This is used to control the output stride of these blocks and has a significant positive effect on the accuracy of Semantic Segmentation models.
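
For illustration, the rounding of channel counts to multiples of 8 mentioned above behaves roughly like the following helper (a simplified sketch of the rounding rule, not the exact torchvision code):

def round_to_multiple(channels: float, divisor: int = 8) -> int:
    # Round to the nearest multiple of `divisor`, never dropping below 90% of the request.
    rounded = max(divisor, int(channels + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * channels:
        rounded += divisor
    return rounded

print(round_to_multiple(64 * 0.75))   # 48
print(round_to_multiple(24 * 1.1))    # 24 (26.4 rounds down but stays above 90%)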

Implementation details

Below we provide additional information on some notable implementation details of the architecture.
The MobileNetV3 class is responsible for building a network out of the provided configuration. Here are some implementation details of the class:

  • The last convolution block expands the output of the last InvertedResidual block by a factor of 6. The implementation is aligned with the Large and Small configurations described on the paper and can adapt to different values of the multiplier parameter.

  • Similarly to other models such as MobileNetV2, a dropout layer is placed just before the final Linear layer of the classifier.

The InvertedResidual class is the main building block of the network. Here are some notable implementation details of the block along with its visualization which comes from Figure 4 of the paper:

  • There is no expansion step if the input channels and the expanded channels are the same. This happens on the first convolution block of the network.

  • There is always a projection step even when the expanded channels are the same as the output channels.

  • The activation method of the depthwise block is placed before the Squeeze-and-Excite layer as this improves marginally the accuracy.

Classification

In this section we provide benchmarks of the pre-trained models and details on how they were configured, trained and quantized.

Benchmarks

Here is how to initialize the pre-trained models:

large = torchvision.models.mobilenet_v3_large(pretrained=True, width_mult=1.0,  reduced_tail=False, dilated=False)
small = torchvision.models.mobilenet_v3_small(pretrained=True)
quantized = torchvision.models.quantization.mobilenet_v3_large(pretrained=True)

Below we have the detailed benchmarks between new and selected previous models. As we can see MobileNetV3-Large is a viable replacement of ResNet50 for users who are willing to sacrifice a bit of accuracy for a roughly 6x speed-up:

Model                        Acc@1   Acc@5   Inference on CPU (sec)  # Params (M)
MobileNetV3-Large            74.042  91.340  0.0411                  5.48
MobileNetV3-Small            67.668  87.402  0.0165                  2.54
Quantized MobileNetV3-Large  73.004  90.858  0.0162                  2.96
MobileNetV2                  71.880  90.290  0.0608                  3.50
ResNet50                     76.150  92.870  0.2545                  25.56
ResNet18                     69.760  89.080  0.1032                  11.69

Note that the inference times are measured on CPU. They are not absolute benchmarks, but they allow for relative comparisons between models.

Training process

All pre-trained models are configured with a width multiplier of 1, have full tails, are non-dilated, and were fitted on ImageNet. Both the Large and Small variants were trained using the same hyper-parameters and scripts which can be found in our references folder. Below we provide details on the most notable aspects of the training process.

Achieving fast and stable training

Configuring RMSProp correctly was crucial to achieve fast training with numerical stability. The authors of the paper used TensorFlow in their experiments and in their runs they reported using quite a high rmsprop_epsilon compared to the default. Typically this hyper-parameter takes small values as it’s used to avoid zero denominators, but in this specific model choosing the right value seems important to avoid numerical instabilities in the loss.

Another important detail is that though PyTorch’s and TensorFlow’s RMSProp implementations typically behave similarly, there are a few differences, with the most notable in our setup being how the epsilon hyper-parameter is handled. More specifically, PyTorch adds the epsilon outside of the square root calculation while TensorFlow adds it inside. The result of this implementation detail is that one needs to adjust the epsilon value when porting the hyper-parameters of the paper. A reasonable approximation can be taken with the formula PyTorch_eps = sqrt(TF_eps).
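
As a concrete sketch of this adjustment (the optimizer values below, including the TensorFlow-style epsilon, are illustrative placeholders rather than the exact published recipe):

import math
import torch
import torchvision

model = torchvision.models.mobilenet_v3_large()

tf_eps = 0.002                        # assumed TensorFlow-style epsilon, for illustration only
pytorch_eps = math.sqrt(tf_eps)       # adjusted because PyTorch adds eps outside the sqrt

optimizer = torch.optim.RMSprop(model.parameters(), lr=0.064, momentum=0.9,
                                weight_decay=1e-5, eps=pytorch_eps)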

Increasing our accuracy by tuning hyperparameters & improving our training recipe

After configuring the optimizer to achieve fast and stable training, we turned to optimizing the accuracy of the model. There are a few techniques that helped us achieve this. First of all, to avoid overfitting we augmented our data using the AutoAugment algorithm, followed by RandomErasing. Additionally we tuned parameters such as the weight decay using cross validation. We also found it beneficial to perform weight averaging across different epoch checkpoints after the end of the training. Finally, though not used in our published training recipe, we found that using Label Smoothing, Stochastic Depth and LR noise injection improves the overall accuracy by over 1.5 points.

The graph and table depict a simplified summary of the most important iterations for improving the accuracy of the MobileNetV3 Large variant. Note that the actual number of iterations done while training the model was significantly larger and that the progress in accuracy was not always monotonically increasing. Also note that the Y-axis of the graph starts from 70% instead of from 0% to make the difference between iterations more visible:

Iteration                                        Acc@1   Acc@5
Baseline with “MobileNetV2-style” Hyperparams    71.542  90.068
+ RMSProp with default eps                       70.684  89.38
+ RMSProp with adjusted eps & LR scheme          71.764  90.178
+ Data Augmentation & Tuned Hyperparams          73.86   91.292
+ Checkpoint Averaging                           74.028  91.382
+ Label Smoothing & Stochastic Depth & LR noise  75.536  92.368

Note that once we’ve achieved an acceptable accuracy, we verified the model performance on the hold-out test dataset which hasn’t been used before for training or hyper-parameter tuning. This process helps us detect overfitting and is always performed for all pre-trained models prior to their release.

Quantization

We currently offer quantized weights for the QNNPACK backend of the MobileNetV3-Large variant which provides a speed-up of 2.5x. To quantize the model, Quantization Aware Training (QAT) was used. The hyper-parameters and the scripts used to train the model can be found in our references folder.

Note that QAT allows us to model the effects of quantization and adjust the weights so that we can improve the model accuracy. This translates to an accuracy increase of 1.8 points compared to simple post-training quantization:

Quantization Status          Acc@1   Acc@5
Non-quantized                74.042  91.340
Quantization Aware Training  73.004  90.858
Post-training Quantization   71.160  89.834

Object Detection

In this section, we will first provide benchmarks of the released models, and then discuss how the MobileNetV3-Large backbone was used in a Feature Pyramid Network along with the FasterRCNN detector to perform Object Detection. We will also explain how the network was trained and tuned, along with any tradeoffs we had to make. We will not cover details about how it was used with SSDlite, as this will be discussed in a future article.

Benchmarks

Here is how the models are initialized:

high_res = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
low_res = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(pretrained=True)

Below are some benchmarks between new and selected previous models. As we can see, the high resolution Faster R-CNN with MobileNetV3-Large FPN backbone seems to be a viable replacement for the equivalent ResNet50 model for users who are willing to sacrifice a few accuracy points for a 5x speed-up:

Model                                             mAP   Inference on CPU (sec)  # Params (M)
Faster R-CNN MobileNetV3-Large FPN (High-Res)     32.8  0.8409                  19.39
Faster R-CNN MobileNetV3-Large 320 FPN (Low-Res)  22.8  0.1679                  19.39
Faster R-CNN ResNet-50 FPN                        37.0  4.1514                  41.76
RetinaNet ResNet-50 FPN                           36.4  4.8825                  34.01

Implementation details

The Detector uses a FPN-style backbone which extracts features from different convolutions of the MobileNetV3 model. By default the pre-trained model uses the output of the 13th InvertedResidual block and the output of the Convolution prior to the pooling layer but the implementation supports using the outputs of more stages.

All feature maps extracted from the network have their output projected down to 256 channels by the FPN block as this greatly improves the speed of the network. These feature maps provided by the FPN backbone are used by the FasterRCNN detector to provide box and class predictions at different scales.

Training & Tuning process

We currently offer two pre-trained models capable of doing object detection at different resolutions. Both models were trained on the COCO dataset using the same hyper-parameters and scripts which can be found in our references folder.

The High Resolution detector was trained with images of 800-1333px, while the mobile-friendly Low Resolution detector was trained with images of 320-640px. The reason why we provide two separate sets of pre-trained weights is that training a detector directly on the smaller images leads to a 5 mAP increase in precision compared to passing small images to the pre-trained high-res model. Both backbones were initialized with weights fitted on ImageNet and the last 3 stages of their weights were fine-tuned during the training process.

An additional speed optimization can be applied on the mobile-friendly model by tuning the RPN NMS thresholds. By sacrificing only 0.2 mAP of precision we were able to improve the CPU speed of the model by roughly 45%. The details of the optimization can be seen below:

Tuning Status  mAP   Inference on CPU (sec)
Before         23.0  0.2904
After          22.8  0.1679
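
A minimal sketch of this kind of tuning is shown below; the threshold values are illustrative and not the exact ones used to produce the released weights, and it assumes the score_thresh and nms_thresh attributes exposed by the RPN module:

import torchvision

model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(pretrained=True)

# Illustrative values: discard low-scoring proposals earlier and prune more boxes during NMS.
model.rpn.score_thresh = 0.05
model.rpn.nms_thresh = 0.6
model.eval()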

Below we provide some examples of visualizing the predictions of the Faster R-CNN MobileNetV3-Large FPN model:

Semantic Segmentation

In this section we will start by providing some benchmarks of the released pre-trained models. Then we will discuss how a MobileNetV3-Large backbone was combined with segmentation heads such as LR-ASPP, DeepLabV3 and the FCN to conduct Semantic Segmentation. We will also explain how the network was trained and propose a few optional optimization techniques for speed critical applications.

Benchmarks

This is how to initialize the pre-trained models:

lraspp = torchvision.models.segmentation.lraspp_mobilenet_v3_large(pretrained=True)
deeplabv3 = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(pretrained=True)

Below are the detailed benchmarks between new and selected existing models. As we can see, the DeepLabV3 with a MobileNetV3-Large backbone is a viable replacement for FCN with ResNet50 for the majority of applications, as it achieves similar accuracy with an 8.5x speed-up. We also observe that the LR-ASPP network supersedes the equivalent FCN in all metrics:

Model                                 mIoU  Global Pixel Acc  Inference on CPU (sec)  # Params (M)
LR-ASPP MobileNetV3-Large             57.9  91.2              0.3278                  3.22
DeepLabV3 MobileNetV3-Large           60.3  91.2              0.5869                  11.03
FCN MobileNetV3-Large (not released)  57.8  90.9              0.3702                  5.05
DeepLabV3 ResNet50                    66.4  92.4              6.3531                  39.64
FCN ResNet50                          60.5  91.4              5.0146                  32.96

Implementation details

In this section we will discuss important implementation details of tested segmentation heads. Note that all models described in this section use a dilated MobileNetV3-Large backbone.

LR-ASPP

The LR-ASPP is the Lite variant of the Reduced Atrous Spatial Pyramid Pooling model proposed by the authors of the MobileNetV3 paper. Unlike the other segmentation models in TorchVision, it does not make use of an auxiliary loss. Instead it uses low and high-level features with output strides of 8 and 16 respectively.

Unlike the paper where a 49×49 AveragePooling layer with variable strides is used, our implementation uses an AdaptiveAvgPool2d layer to process the global features. This is because the authors of the paper tailored the head to the Cityscapes dataset while our focus is to provide a general purpose implementation that can work on multiple datasets. Finally our implementation always has a bilinear interpolation before returning the output to ensure that the sizes of the input and output images match exactly.

DeepLabV3 & FCN

The combination of MobileNetV3 with DeepLabV3 and FCN closely follows that of other models, and the stage estimation for these methods is identical to LR-ASPP. The only notable difference is that instead of using high and low level features, we attach the normal loss to the feature map with output stride 16 and an auxiliary loss on the feature map with output stride 8.

Finally we should note that the FCN version of the model was not released because it was completely superseded by the LR-ASPP both in terms of speed and accuracy. The pre-trained weights are still available and can be used with minimal changes to the code.

Training & Tuning process

We currently offer two MobileNetV3 pre-trained models capable of doing semantic segmentation: the LR-ASPP and the DeepLabV3. The backbones of the models were initialized with ImageNet weights and trained end-to-end. Both architectures were trained on the COCO dataset using the same scripts with similar hyper-parameters. Their details can be found in our references folder.

Normally, during inference the images are resized to 520 pixels. An optional speed optimization is to construct a Low Res configuration of the model by using the High-Res pre-trained weights and reducing the inference resizing to 320 pixels. This will improve the CPU execution times by roughly 60% while sacrificing a couple of mIoU points. The detailed numbers of this optimization can be found on the table below:

Low-Res Configuration                 mIoU Difference  Speed Improvement  mIoU  Global Pixel Acc  Inference on CPU (sec)
LR-ASPP MobileNetV3-Large             -2.1             65.26%             55.8  90.3              0.1139
DeepLabV3 MobileNetV3-Large           -3.8             63.86%             56.5  90.3              0.2121
FCN MobileNetV3-Large (not released)  -3.0             57.57%             54.8  90.1              0.1571
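
Here is a sketch of this optional low-resolution configuration (the normalization constants are the standard ImageNet values the pre-trained weights expect, and the preprocessing pipeline is a simplified illustration):

import torchvision
from torchvision import transforms

model = torchvision.models.segmentation.lraspp_mobilenet_v3_large(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize(320),    # reduced from the default 520 for the speed optimization
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# `img` is any PIL image; the per-pixel class scores are in the "out" entry of the result.
# output = model(preprocess(img).unsqueeze(0))["out"]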

Here are some examples of visualizing the predictions of the LR-ASPP MobileNetV3-Large model:

We hope that you found this article interesting. We are looking forward to your feedback to see if this is the type of content you would like us to publish more often. If the community finds that such posts are useful, we will be happy to publish more articles that cover the implementation details of newly introduced Machine Learning models.

Read More

Announcing the PyTorch Enterprise Support Program

Today, we are excited to announce the PyTorch Enterprise Support Program, a participatory program that enables service providers to develop and offer tailored enterprise-grade support to their customers. This new offering, built in collaboration between Facebook and Microsoft, was created in direct response to feedback from PyTorch enterprise users who are developing models in production at scale for mission-critical applications.

The PyTorch Enterprise Support Program is available to any service provider. It is designed to mutually benefit all program Participants by sharing and improving PyTorch long-term support (LTS), including contributions of hotfixes and other improvements found while working closely with customers and on their systems.

To benefit the open source community, all hotfixes developed by Participants will be tested and fed back to the LTS releases of PyTorch regularly through PyTorch’s standard pull request process. To participate in the program, a service provider must apply and meet a set of program terms and certification requirements. Once accepted, the service provider becomes a program Participant and can offer a packaged PyTorch Enterprise support service with LTS, prioritized troubleshooting, useful integrations, and more.

As one of the founding members and an inaugural member of the PyTorch Enterprise Support Program, Microsoft is launching PyTorch Enterprise on Microsoft Azure to deliver a reliable production experience for PyTorch users. Microsoft will support each PyTorch release for as long as it is current. In addition, it will support selected releases for two years, enabling a stable production experience. Microsoft Premier and Unified Support customers can access prioritized troubleshooting for hotfixes, bugs, and security patches at no additional cost. Microsoft will extensively test PyTorch releases for performance regression. The latest release of PyTorch will be integrated with Azure Machine Learning and other PyTorch add-ons including ONNX Runtime for faster inference.

PyTorch Enterprise on Microsoft Azure not only benefits its customers, but also the PyTorch community users. All improvements will be tested and fed back to the future release for PyTorch so everyone in the community can use them.

As an organization or PyTorch user, the standard way of researching and deploying with different release versions of PyTorch does not change. If your organization is looking for the managed long-term support, prioritized patches, bug fixes, and additional enterprise-grade support, then you should reach out to service providers participating in the program.

To learn more and participate in the program as a service provider, visit the PyTorch Enterprise Support Program. If you want to learn more about Microsoft’s offering, visit PyTorch Enterprise on Microsoft Azure.

Thank you,

Team PyTorch

Read More

PyTorch Ecosystem Day 2021 Recap and New Contributor Resources

Thank you to our incredible community for making the first ever PyTorch Ecosystem Day a success! The day was filled with discussions on new developments, trends and challenges showcased through 71 posters, 32 breakout sessions and 6 keynote speakers.

Special thanks to our keynote speakers: Piotr Bialecki, Ritchie Ng, Miquel Farré, Joe Spisak, Geeta Chauhan, and Suraj Subramanian who shared updates from the latest release of PyTorch, exciting work being done with partners, use case example from Disney, the growth and development of the PyTorch community in Asia Pacific, and latest contributor highlights.

If you missed the opening talks, you can rewatch them here:

In addition to the talks, we had 71 posters covering various topics such as multimodal, NLP, compiler, distributed training, researcher productivity tools, AI accelerators, and more. From the event, it was clear that an underlying thread that ties all of these different projects together is the cross-collaboration of the PyTorch community. Thank you for continuing to push the state of the art with PyTorch!

To view the full catalogue of posters, please visit the PyTorch Ecosystem Day 2021 Event Page.

New Contributor Resources

Today, we are also sharing new contributor resources that we are trying out to give you the most access to up-to-date news, networking opportunities and more.

  • Contributor Newsletter – Includes curated news including RFCs, feature roadmaps, notable PRs, editorials from developers, and more to support keeping track of everything that’s happening in our community.
  • Contributors Discussion Forum – Designed for contributors to learn and collaborate on the latest development across PyTorch.
  • PyTorch Developer Podcast (Beta) – Edward Yang, PyTorch Research Scientist at Facebook AI, shares bite-sized (10 to 20 min) podcast episodes discussing all sorts of internal development topics in PyTorch.

Thank you,

Team PyTorch

Read More

An overview of the ML models introduced in TorchVision v0.9

TorchVision v0.9 has been released and it is packed with numerous new Machine Learning models and features, speed improvements and bug fixes. In this blog post, we provide a quick overview of the newly introduced ML models and discuss their key features and characteristics.

Classification

  • MobileNetV3 Large & Small: These two classification models are optimized for Mobile use-cases and are used as backbones on other Computer Vision tasks. The implementation of the new MobileNetV3 architecture supports the Large & Small variants and the depth multiplier parameter as described in the original paper. We offer pre-trained weights on ImageNet for both Large and Small networks with depth multiplier 1.0 and resolution 224×224. Our previous training recipes have been updated and can be used to easily train the models from scratch (shoutout to Ross Wightman for inspiring some of our training configuration). The Large variant offers a competitive accuracy compared to ResNet50 while being over 6x faster on CPU, meaning that it is a good candidate for applications where speed is important. For applications where speed is critical, one can sacrifice further accuracy for speed and use the Small variant which is 15x faster than ResNet50.

  • Quantized MobileNetV3 Large: The quantized version of MobileNetV3 Large reduces the number of parameters by 45% and it is roughly 2.5x faster than the non-quantized version while remaining competitive in terms of accuracy. It was fitted on ImageNet using Quantization Aware Training by iterating on the non-quantized version and it can be trained from scratch using the existing reference scripts.

Usage:

import torch
import torchvision

img = torch.rand(1, 3, 224, 224)  # placeholder input; use a preprocessed image in practice

model = torchvision.models.mobilenet_v3_large(pretrained=True)
# model = torchvision.models.mobilenet_v3_small(pretrained=True)
# model = torchvision.models.quantization.mobilenet_v3_large(pretrained=True)
model.eval()
predictions = model(img)

Object Detection

  • Faster R-CNN MobileNetV3-Large FPN: Combining the MobileNetV3 Large backbone with a Faster R-CNN detector and a Feature Pyramid Network leads to a highly accurate and fast object detector. The pre-trained weights are fitted on COCO 2017 using the provided reference scripts and the model is 5x faster on CPU than the equivalent ResNet50 detector while remaining competitive in terms of accuracy.
  • Faster R-CNN MobileNetV3-Large 320 FPN: This is an iteration of the previous model that uses reduced resolution (min_size=320 pixel) and sacrifices accuracy for speed. It is 25x faster on CPU than the equivalent ResNet50 detector and thus it is good for real mobile use-cases.

Usage:

import torch
import torchvision

img = [torch.rand(3, 320, 320)]  # detection models take a list of 3D image tensors

model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
# model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(pretrained=True)
model.eval()
predictions = model(img)

Semantic Segmentation

  • DeepLabV3 with Dilated MobileNetV3 Large Backbone: A dilated version of the MobileNetV3 Large backbone combined with DeepLabV3 helps us build a highly accurate and fast semantic segmentation model. The pre-trained weights are fitted on COCO 2017 using our standard training recipes. The final model has the same accuracy as the FCN ResNet50 but it is 8.5x faster on CPU, making it an excellent replacement for the majority of applications.
  • Lite R-ASPP with Dilated MobileNetV3 Large Backbone: We introduce the implementation of a new segmentation head called Lite R-ASPP and combine it with the dilated MobileNetV3 Large backbone to build a very fast segmentation model. The new model sacrifices some accuracy to achieve a 15x speed improvement compared to the previously most lightweight segmentation model, the FCN ResNet50.

Usage:

import torch
import torchvision

img = torch.rand(1, 3, 520, 520)  # placeholder input; use a preprocessed image in practice

model = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(pretrained=True)
# model = torchvision.models.segmentation.lraspp_mobilenet_v3_large(pretrained=True)
model.eval()
predictions = model(img)

In the near future we plan to publish an article that covers the details of how the above models were trained and discuss their tradeoffs and design choices. Until then we encourage you to try out the new models and provide your feedback.

Read More

Introducing PyTorch Profiler – the new and improved performance tool

Along with PyTorch 1.8.1 release, we are excited to announce PyTorch Profiler – the new and improved performance debugging profiler for PyTorch. Developed as part of a collaboration between Microsoft and Facebook, the PyTorch Profiler is an open-source tool that enables accurate and efficient performance analysis and troubleshooting for large-scale deep learning models.

Analyzing and improving large-scale deep learning model performance is an ongoing challenge that grows in importance as model sizes increase. For a long time, PyTorch users had a hard time solving this challenge due to the lack of available tools. There were standard performance debugging tools that provided GPU hardware-level information but missed the PyTorch-specific context of operations. In order to recover the missing information, users needed to combine multiple tools or manually add minimal correlation information to make sense of the data. There was also the autograd profiler (torch.autograd.profiler), which captures information about PyTorch operations but does not capture detailed GPU hardware-level information and does not provide visualization support.

The new PyTorch Profiler (torch.profiler) is a tool that brings both types of information together and builds an experience that realizes the full potential of that information. This new profiler collects both GPU hardware and PyTorch-related information, correlates them, performs automatic detection of bottlenecks in the model, and generates recommendations on how to resolve these bottlenecks. All of this information from the profiler is visualized for the user in TensorBoard. The new Profiler API is natively supported in PyTorch and delivers the simplest experience available to date, where users can profile their models without installing any additional packages and see results immediately in TensorBoard with the new PyTorch Profiler plugin. Below is a screenshot of PyTorch Profiler – automatic bottleneck detection.

Getting started

PyTorch Profiler is the next version of the PyTorch autograd profiler. It has a new module namespace torch.profiler but maintains compatibility with autograd profiler APIs. The Profiler uses a new GPU profiling engine, built using Nvidia CUPTI APIs, and is able to capture GPU kernel events with high fidelity. To profile your model training loop, wrap the code in the profiler context manager as shown below.

with torch.profiler.profile(
    schedule=torch.profiler.schedule(
        wait=2,
        warmup=2,
        active=6,
        repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs'),  # output directory for TensorBoard
    with_stack=True
) as profiler:
    for step, data in enumerate(trainloader, 0):
        print("step:{}".format(step))
        inputs, labels = data[0].to(device=device), data[1].to(device=device)

        outputs = model(inputs)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        profiler.step()

The schedule parameter allows you to limit the number of training steps included in the profile to reduce the amount of data collected and simplify visual analysis by focusing on what’s important. The tensorboard_trace_handler automatically saves profiling results to disk for analysis in TensorBoard.

To view results of the profiling session in TensorBoard, install PyTorch Profiler TensorBoard Plugin package.

pip install torch_tb_profiler

Visual Studio Code Integration

Microsoft Visual Studio Code is one of the most popular code editors for Python developers and data scientists. The Python extension for VS Code recently added the integration of TensorBoard into the code editor, including support for the PyTorch Profiler. Once you have VS Code and the Python extension installed, you can quickly open the TensorBoard Profiler plugin by launching the Command Palette using the keyboard shortcut CTRL + SHIFT + P (CMD + SHIFT + P on a Mac) and typing the “Launch TensorBoard” command.

This integration comes with a built-in lifecycle management feature. VS Code will install the TensorBoard package and the PyTorch Profiler plugin package (coming in mid-April) automatically if you don’t have them on your system. VS Code will also launch the TensorBoard process for you and automatically look for any TensorBoard log files within your current directory. When you’re done, just close the tab and VS Code will automatically close the process. No more Terminal windows running on your system to provide a backend for the TensorBoard UI! Below is the PyTorch Profiler Trace View running in TensorBoard.

Learn more about TensorBoard support in VS Code in this blog.

Feedback

Review PyTorch Profiler documentation, give Profiler a try and let us know about your experience. Provide your feedback on PyTorch Discussion Forum or file issues on PyTorch GitHub.

Read More

PyTorch for AMD ROCm™ Platform now available as Python package

With the PyTorch 1.8 release, we are delighted to announce a new installation option for users of PyTorch on the ROCm™ open software platform. An installable Python package is now hosted on pytorch.org, along with instructions for local installation in the same simple, selectable format as PyTorch packages for CPU-only configurations and other GPU platforms. PyTorch on ROCm includes full capability for mixed-precision and large-scale training using AMD’s MIOpen & RCCL libraries. This provides a new option for data scientists, researchers, students, and others in the community to get started with accelerated PyTorch using AMD GPUs.

The ROCm Ecosystem

ROCm is AMD’s open source software platform for GPU-accelerated high performance computing and machine learning. Since the original ROCm release in 2016, the ROCm platform has evolved to support additional libraries and tools, a wider set of Linux® distributions, and a range of new GPUs. This includes the AMD Instinct™ MI100, the first GPU based on AMD CDNA™ architecture.

The ROCm ecosystem has an established history of support for PyTorch, which was initially implemented as a fork of the PyTorch project, and more recently through ROCm support in the upstream PyTorch code. PyTorch users can install PyTorch for ROCm using AMD’s public PyTorch docker image, and can of course build PyTorch for ROCm from source. With PyTorch 1.8, these existing installation options are now complemented by the availability of an installable Python package.

The primary focus of ROCm has always been high performance computing at scale. The combined capabilities of ROCm and AMD’s Instinct family of data center GPUs are particularly suited to the challenges of HPC at data center scale. PyTorch is a natural fit for this environment, as HPC and ML workflows become more intertwined.

Getting started with PyTorch for ROCm

The scope for this build of PyTorch is AMD GPUs with ROCm support, running on Linux. The GPUs supported by ROCm include all of AMD’s Instinct family of compute-focused data center GPUs, along with some other select GPUs. A current list of supported GPUs can be found in the ROCm GitHub repository. After confirming that the target system includes supported GPUs and the current 4.0.1 release of ROCm, installation of PyTorch follows the same simple pip-based installation as any other Python package. As with PyTorch builds for other platforms, the configurator at https://pytorch.org/get-started/locally/ provides the specific command line to be run.

PyTorch for ROCm is built from the upstream PyTorch repository, and is a full featured implementation. Notably, it includes support for distributed training across multiple GPUs and supports accelerated mixed precision training.

More information

A list of ROCm supported GPUs and operating systems can be found at https://github.com/RadeonOpenCompute/ROCm
General documentation on the ROCm platform is available at https://rocmdocs.amd.com/en/latest/
The ROCm Learning Center is at https://developer.amd.com/resources/rocm-resources/rocm-learning-center/
General information on AMD’s offerings for HPC and ML can be found at https://amd.com/hpc

Feedback

An engaged user base is a tremendously important part of the PyTorch ecosystem. We would be deeply appreciative of feedback on the PyTorch for ROCm experience in the PyTorch discussion forum and, where appropriate, reporting any issues via Github.

Read More

Announcing PyTorch Ecosystem Day

We’re proud to announce our first PyTorch Ecosystem Day. The virtual, one-day event will focus completely on our Ecosystem and Industry PyTorch communities!

PyTorch is a deep learning framework of choice for academics and companies, all thanks to its rich ecosystem of tools and strong community. As with our developers, our ecosystem partners play a pivotal role in the development and growth of the community.

We will be hosting our first PyTorch Ecosystem Day, a virtual event designed for our ecosystem and industry communities to showcase their work and discover new opportunities to collaborate.

PyTorch Ecosystem Day will be held on April 21, with both a morning and evening session, to ensure we reach our global community. Join us virtually for a day filled with discussions on new developments, trends, challenges, and best practices through keynotes, breakout sessions, and a unique networking opportunity hosted through Gather.Town .

Event Details

April 21, 2021 (Pacific Time)
Fully digital experience

  • Morning Session: (EMEA)
    Opening Talks – 8:00 am-9:00 am PT
    Poster Exhibition & Breakout Sessions – 9:00 am-12:00 pm PT

  • Evening Session (APAC/US)
    Opening Talks – 3:00 pm-4:00 pm PT
    Poster Exhibition & Breakout Sessions – 3:00 pm-6:00 pm PT

  • Networking – 9:00 am-7:00 pm PT

There are two ways to participate in PyTorch Ecosystem Day:

  1. Poster Exhibition from the PyTorch ecosystem and industry communities covering a variety of topics. Posters are available for viewing throughout the duration of the event. To be part of the poster exhibition, please see below for submission details. If your poster is accepted, we highly recommend tending to your poster during the morning or evening session, or both!

  2. Breakout Sessions are 40-min sessions freely designed by the community. The breakouts can be talks, demos, tutorials or discussions. Note: you must have an accepted poster to apply for the breakout sessions.

Call for posters now open! Submit your proposal today! Please send us the title and summary of your projects, tools, and libraries that could benefit PyTorch researchers in academia and industry, application developers, and ML engineers for consideration. The focus must be on academic papers, machine learning research, or open-source projects. Please no sales pitches. Deadline for submission is March 18, 2021.

Visit pytorchecosystemday.fbreg.com for more information and we look forward to welcoming you to PyTorch Ecosystem Day on April 21st!

Read More

New PyTorch library releases including TorchVision Mobile, TorchAudio I/O, and more

Today, we are announcing updates to a number of PyTorch libraries, alongside the PyTorch 1.8 release. The updates include new releases for the domain libraries including TorchVision, TorchText and TorchAudio as well as new version of TorchCSPRNG. These releases include a number of new features and improvements and, along with the PyTorch 1.8 release, provide a broad set of updates for the PyTorch community to build on and leverage.

Some highlights include:

  • TorchVision – Added support for PyTorch Mobile including Detectron2Go (D2Go), auto-augmentation of data during training, on the fly type conversion, and AMP autocasting.
  • TorchAudio – Major improvements to I/O, including defaulting to sox_io backend and file-like object support. Added Kaldi Pitch feature and support for CMake based build allowing TorchAudio to better support no-Python environments.
  • TorchText – Updated the dataset loading API to be compatible with standard PyTorch data loading utilities.
  • TorchCSPRNG – Support for cryptographically secure pseudorandom number generators for PyTorch is now stable with new APIs for AES128 ECB/CTR and CUDA support on Windows.

Please note that, starting in PyTorch 1.6, features are classified as Stable, Beta, and Prototype. Prototype features are not included as part of the binary distribution and are instead available through either building from source, using nightlies or via compiler flag. You can see the detailed announcement here.

TorchVision 0.9.0

[Stable] TorchVision Mobile: Operators, Android Binaries, and Tutorial

We are excited to announce the first on-device support and binaries for a PyTorch domain library. We have seen significant appetite in both research and industry for on-device vision support to allow low latency, privacy friendly, and resource efficient mobile vision experiences. You can follow this new tutorial to build your own Android object detection app using TorchVision operators, D2Go, or your own custom operators and model.

[Stable] New Mobile models for Classification, Object Detection and Semantic Segmentation

We have added support for the MobileNetV3 architecture and provided pre-trained weights for Classification, Object Detection and Segmentation. It is easy to get up and running with these models, just import and load them as you would any torchvision model:

import torch
import torchvision

# Classification
x = torch.rand(1, 3, 224, 224)
m_classifier = torchvision.models.mobilenet_v3_large(pretrained=True)
m_classifier.eval()
predictions = m_classifier(x)

# Quantized Classification
x = torch.rand(1, 3, 224, 224)
m_classifier = torchvision.models.quantization.mobilenet_v3_large(pretrained=True)
m_classifier.eval()
predictions = m_classifier(x)

# Object Detection: Highly Accurate High Resolution Mobile Model
x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
m_detector = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
m_detector.eval()
predictions = m_detector(x)

# Semantic Segmentation: Highly Accurate Mobile Model
x = torch.rand(1, 3, 520, 520)
m_segmenter = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(pretrained=True)
m_segmenter.eval()
predictions = m_segmenter(x)

These models are highly competitive with TorchVision’s existing models on resource efficiency, speed, and accuracy. See our release notes for detailed performance metrics.

[Stable] AutoAugment

AutoAugment is a common Data Augmentation technique that can increase the accuracy of Scene Classification models. Though the data augmentation policies are directly linked to the dataset they were trained on, empirical studies show that ImageNet policies provide significant improvements when applied to other datasets. We’ve implemented 3 policies learned on the following datasets: ImageNet, CIFAR10 and SVHN. These can be used standalone or mixed-and-matched with existing transforms:

from torchvision import transforms

t = transforms.AutoAugment()
transformed = t(image)


transform=transforms.Compose([
   transforms.Resize(256),
   transforms.AutoAugment(),
   transforms.ToTensor()])

Other New Features for TorchVision

  • [Stable] All read and decode methods in the io.image package now support:
    • Palette, Grayscale Alpha and RGB Alpha image types during PNG decoding
    • On-the-fly conversion of image from one type to the other during read
  • [Stable] WiderFace dataset
  • [Stable] Improved FasterRCNN speed and accuracy by introducing a score threshold on RPN
  • [Stable] Modulation input for DeformConv2D
  • [Stable] Option to write audio to a video file
  • [Stable] Utility to draw bounding boxes
  • [Beta] Autocast support in all Operators

Find the full TorchVision release notes here.

TorchAudio 0.8.0

I/O Improvements

We have continued our work from the previous release to improve TorchAudio’s I/O support, including:

  • [Stable] Changing the default backend to “sox_io” (for Linux/macOS), and updating the “soundfile” backend’s interface to align with that of “sox_io”. The legacy backend and interface are still accessible, though it is strongly discouraged to use them.
  • [Stable] File-like object support in both “sox_io” backend, “soundfile” backend and sox_effects.
  • [Stable] New options to change the format, encoding, and bits_per_sample when saving.
  • [Stable] Added GSM, HTK, AMB, AMR-NB and AMR-WB format support to the “sox_io” backend.
  • [Beta] A new functional.apply_codec function which can degrade audio data by applying audio codecs supported by the “sox_io” backend in an in-memory fashion.

Here are some examples of features that landed in this release:
# Load audio over HTTP
with requests.get(URL, stream=True) as response:
    waveform, sample_rate = torchaudio.load(response.raw)
 
# Saving to a bytes buffer as 16-bit signed PCM WAV
buffer_ = io.BytesIO()
torchaudio.save(
    buffer_, waveform, sample_rate,
    format="wav", encoding="PCM_S", bits_per_sample=16)
 
# Apply effects while loading audio from S3
client = boto3.client('s3')
response = client.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
waveform, sample_rate = torchaudio.sox_effects.apply_effects_file(
    response['Body'],
    [["lowpass", "-1", "300"], ["rate", "8000"]])
 
# Apply GSM codec to Tensor
encoded = torchaudio.functional.apply_codec(
    waveform, sample_rate, format="gsm")

Check out the revamped audio preprocessing tutorial, Audio Manipulation with TorchAudio.

[Stable] Switch to CMake-based build

In the previous version of TorchAudio, CMake was used to build the third party dependencies. Starting in 0.8.0, TorchAudio uses CMake to build its C++ extension. This will open the door to integrating TorchAudio in non-Python environments (such as C++ applications and mobile). We will continue working on adding example applications and mobile integrations.

[Beta] Improved and New Audio Transforms

We have added two widely requested operators in this release: the SpectralCentroid transform and the Kaldi Pitch feature extraction (detailed in “A pitch extraction algorithm tuned for automatic speech recognition”). We’ve also exposed a normalization method to Mel transforms, and additional STFT arguments to Spectrogram. We would like to ask our community to continue to raise feature requests for core audio processing features like these!
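
Here is a short sketch of the two new operators on a synthetic signal (argument values are illustrative; see the TorchAudio 0.8 documentation for the full argument lists):

import torch
import torchaudio

sample_rate = 16000
waveform = torch.rand(1, sample_rate) * 2 - 1     # one second of synthetic audio in [-1, 1)

centroid = torchaudio.transforms.SpectralCentroid(sample_rate=sample_rate)(waveform)
pitch = torchaudio.functional.compute_kaldi_pitch(waveform, sample_rate)

print(centroid.shape, pitch.shape)                # per-frame spectral centroid and Kaldi pitch features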

Community Contributions

We had more contributions from the open source community in this release than ever before, including several completely new features. We would like to extend our sincere thanks to the community. Please check out the newly added CONTRIBUTING.md for ways to contribute code, and remember that reporting bugs and requesting features are just as valuable. We will continue posting well-scoped work items as issues labeled “help-wanted” and “contributions-welcome” for anyone who would like to contribute code, and are happy to coach new contributors through the contribution process.

Find the full TorchAudio release notes here.

TorchText 0.9.0

[Beta] Dataset API Updates

In this release, we are updating TorchText’s dataset API to be compatible with PyTorch data utilities, such as DataLoader, and are deprecating TorchText’s custom data abstractions such as Field. The updated datasets are simple string-by-string iterators over the data. For guidance about migrating from the legacy abstractions to use modern PyTorch data utilities, please refer to our migration guide.

The text datasets listed below have been updated as part of this work. For examples of how to use these datasets, please refer to our end-to-end text classification tutorial.

  • Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
  • Sequence tagging: UDPOS, CoNLL2000Chunking
  • Translation: IWSLT2016, IWSLT2017
  • Question answer: SQuAD1, SQuAD2
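
As a quick illustration of the updated API, here is a minimal sketch using one of the datasets above together with DataLoader (download location and batch handling are kept deliberately simple):

from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS

train_iter = AG_NEWS(split="train")            # a plain iterator over (label, text) pairs
label, text = next(iter(train_iter))

# Because the datasets are ordinary iterators, they compose directly with DataLoader.
dataloader = DataLoader(AG_NEWS(split="train"), batch_size=8,
                        collate_fn=lambda batch: list(zip(*batch)))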

Find the full TorchText release notes here.

[Stable] TorchCSPRNG 0.2.0

We released TorchCSPRNG in August 2020, a PyTorch C++/CUDA extension that provides cryptographically secure pseudorandom number generators for PyTorch. Today, we are releasing the 0.2.0 version and designating the library as stable. This release includes a new API for encrypt/decrypt with AES128 ECB/CTR as well as CUDA 11 and Windows CUDA support.

Find the full TorchCSPRNG release notes here.

Thanks for reading, and if you are excited about these updates and want to participate in the future of PyTorch, we encourage you to join the discussion forums and open GitHub issues.

Cheers!

Team PyTorch

Read More

PyTorch 1.8 Release, including Compiler and Distributed Training updates, and New Mobile Tutorials

We are excited to announce the availability of PyTorch 1.8. This release is composed of more than 3,000 commits since 1.7. It includes major updates and new features for compilation, code optimization, frontend APIs for scientific computing, and AMD ROCm support through binaries that are available via pytorch.org. It also provides improved features for large-scale training for pipeline and model parallelism, and gradient compression. A few of the highlights include:

  1. Support for Python-to-Python functional transformations via torch.fx;
  2. Added or stabilized APIs to support FFTs (torch.fft) and linear algebra functions (torch.linalg), added autograd support for complex tensors, and improved performance for calculating Hessians and Jacobians; and
  3. Significant updates and improvements to distributed training including: Improved NCCL reliability; Pipeline parallelism support; RPC profiling; and support for communication hooks adding gradient compression.
    See the full release notes here.

Along with 1.8, we are also releasing major updates to PyTorch libraries including TorchCSPRNG, TorchVision, TorchText and TorchAudio. For more on the library releases, see the post here. As previously noted, features in PyTorch releases are classified as Stable, Beta and Prototype. You can learn more about the definitions in the post here.

New and Updated APIs

The PyTorch 1.8 release brings a host of new and updated API surfaces, ranging from additional APIs for NumPy compatibility to new ways to improve and scale your code for performance at both inference and training time. Here is a brief summary of the major features coming in this release:

[Stable] Torch.fft support for high performance NumPy style FFTs

As part of PyTorch’s goal to support scientific computing, we have invested in improving our FFT support and with PyTorch 1.8, we are releasing the torch.fft module. This module implements the same functions as NumPy’s np.fft module, but with support for hardware acceleration and autograd.
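
For instance, a minimal sketch of the NumPy-style workflow (the signal and sample rate below are arbitrary choices for illustration):

import math
import torch

# A real signal sampled at 100 Hz with 10 Hz and 25 Hz components
t = torch.arange(0, 1, 1 / 100)
signal = torch.sin(2 * math.pi * 10 * t) + 0.5 * torch.sin(2 * math.pi * 25 * t)

# One-sided FFT of a real input, mirroring np.fft.rfft / np.fft.rfftfreq
spectrum = torch.fft.rfft(signal)
freqs = torch.fft.rfftfreq(signal.numel(), d=1 / 100)
print(freqs[spectrum.abs().argmax()])  # dominant component, ~10 Hz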

[Beta] Support for NumPy style linear algebra functions via torch.linalg

The torch.linalg module, modeled after NumPy’s np.linalg module, brings NumPy-style support for common linear algebra operations including Cholesky decompositions, determinants, eigenvalues and many others.
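
A small sketch of how these map onto the NumPy names (the matrix below is a randomly generated symmetric positive-definite example, and eigvalsh is assumed to be among the exposed functions):

import torch

# Build a symmetric positive-definite matrix to exercise the new API
a = torch.randn(4, 4)
spd = a @ a.T + 4 * torch.eye(4)

chol = torch.linalg.cholesky(spd)       # cf. np.linalg.cholesky
det = torch.linalg.det(spd)             # cf. np.linalg.det
eigvals = torch.linalg.eigvalsh(spd)    # eigenvalues of a symmetric matrix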

[Beta] Python code Transformations with FX

FX allows you to write transformations of the form transform(input_module : nn.Module) -> nn.Module, where you can feed in a Module instance and get a transformed Module instance out of it.

This kind of functionality is applicable in many scenarios. For example, the FX-based Graph Mode Quantization product is being released as a prototype contemporaneously with FX. Graph Mode Quantization automates the process of quantizing a neural net and does so by leveraging FX’s program capture, analysis and transformation facilities. We are also developing many other transformation products with FX and we are excited to share this powerful toolkit with the community.

Because FX transforms consume and produce nn.Module instances, they can be used within many existing PyTorch workflows. This includes workflows that, for example, train in Python then deploy via TorchScript.

Below is an FX transform example:

import torch
import torch.fx
import torch.nn as nn

def transform(m: nn.Module,
              tracer_class: type = torch.fx.Tracer) -> torch.nn.Module:
    # Step 1: Acquire a Graph representing the code in `m`

    # NOTE: torch.fx.symbolic_trace is a wrapper around a call to
    # fx.Tracer.trace and constructing a GraphModule. We'll
    # split that out in our transform to allow the caller to
    # customize tracing behavior.
    graph: torch.fx.Graph = tracer_class().trace(m)

    # Step 2: Modify this Graph or create a new one
    graph = ...

    # Step 3: Construct a Module to return
    return torch.fx.GraphModule(m, graph)
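
To see what the captured Graph looks like before writing a Step 2, here is a small usage sketch built around torch.fx.symbolic_trace (the toy module is made up for illustration):

import torch
import torch.fx

class MyModule(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1.0

# symbolic_trace = Tracer().trace(...) plus GraphModule construction in one call
gm = torch.fx.symbolic_trace(MyModule())
print(gm.graph)            # the captured IR you would rewrite in Step 2
print(gm.code)             # Python source regenerated from the graph
print(gm(torch.randn(3)))  # still callable like a regular nn.Module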

You can read more about FX in the official documentation. You can also find several examples of program transformations implemented using torch.fx here. We are constantly improving FX and invite you to share any feedback you have about the toolkit on the forums or issue tracker.

Distributed Training

The PyTorch 1.8 release added a number of new features as well as improvements to reliability and usability. Concretely: stable-level async error/timeout handling was added to improve NCCL reliability, and RPC-based profiling is now stable. Additionally, we have added support for pipeline parallelism as well as gradient compression through the use of communication hooks in DDP. Details are below:

[Beta] Pipeline Parallelism

As machine learning models continue to grow in size, traditional Distributed DataParallel (DDP) training no longer scales because these models do not fit on a single GPU device. The new pipeline parallelism feature provides an easy-to-use PyTorch API to leverage pipeline parallelism as part of your training loop.
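
A minimal single-host sketch, assuming a machine with at least two GPUs (the layer sizes, chunk count, and port are arbitrary):

import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe requires the RPC framework to be initialized first
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
rpc.init_rpc("worker", rank=0, world_size=1)

# Place consecutive stages on different devices; Pipe splits each input
# batch into micro-batches ("chunks") and overlaps their execution.
fc1 = nn.Linear(16, 8).cuda(0)
fc2 = nn.Linear(8, 4).cuda(1)
model = Pipe(nn.Sequential(fc1, fc2), chunks=8)

output_rref = model(torch.rand(16, 16).cuda(0))  # forward returns an RRef
output = output_rref.local_value()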

[Beta] DDP Communication Hook

The DDP communication hook is a generic interface to control how to communicate gradients across workers by overriding the vanilla allreduce in DistributedDataParallel. A few built-in communication hooks are provided including PowerSGD, and users can easily apply any of these hooks to optimize communication. Additionally, the communication hook interface can also support user-defined communication strategies for more advanced use cases.
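
As a sketch of the interface, the built-in fp16 compression hook can be registered on a DDP-wrapped model like this; it assumes the process group was already initialized (e.g. via torch.distributed.launch) and that local_rank identifies this process's GPU:

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

local_rank = 0  # normally derived from the launcher environment
model = DDP(nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])

# Compress gradients to fp16 before allreduce; PowerSGD and other built-in
# hooks are registered through the same register_comm_hook interface.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)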

Additional Prototype Features for Distributed Training

In addition to the major stable and beta distributed training features in this release, we also have a number of prototype features available in our nightlies to try out and provide feedback. We have linked the draft docs below for reference:

  • (Prototype) ZeroRedundancyOptimizer – Based on and in partnership with the Microsoft DeepSpeed team, this feature helps reduce per-process memory footprint by sharding optimizer states across all participating processes in the ProcessGroup gang. Refer to this documentation for more details.
  • (Prototype) Process Group NCCL Send/Recv – The NCCL send/recv API was introduced in v2.7 and this feature adds support for it in NCCL process groups. This feature gives users the option to implement collective operations at the Python layer instead of the C++ layer. Refer to this documentation and code examples to learn more.
  • (Prototype) CUDA-support in RPC using TensorPipe – This feature should bring considerable speed improvements for users of PyTorch RPC with multi-GPU machines, as TensorPipe will automatically leverage NVLink when available, and avoid costly copies to and from host memory when exchanging GPU tensors between processes. When not on the same machine, TensorPipe will fall back to copying the tensor to host memory and sending it as a regular CPU tensor. This will also improve the user experience as users will be able to treat GPU tensors like regular CPU tensors in their code. Refer to this documentation for more details.
  • (Prototype) Remote Module – This feature allows users to operate a module on a remote worker like using a local module, where the RPCs are transparent to the user. In the past, this functionality was implemented in an ad-hoc way and overall this feature will improve the usability of model parallelism on PyTorch. Refer to this documentation for more details.

PyTorch Mobile

Support for PyTorch Mobile is expanding with a new set of tutorials to help new users launch models on-device more quickly and give existing users tools to get more out of our framework.

Our new demo apps also include examples of image segmentation, object detection, neural machine translation, question answering, and vision transformers. They are available on both iOS and Android.

In addition to performance improvements on CPU for MobileNetV3 and other models, we have also revamped our Android GPU backend prototype for broader model coverage and faster inference.

Lastly, we are launching the PyTorch Mobile Lite Interpreter as a prototype feature in this release. The Lite Interpreter allows users to reduce the runtime binary size. Please try these out and send us your feedback on the PyTorch Forums. All our latest updates can be found on the PyTorch Mobile page.

[Prototype] PyTorch Mobile Lite Interpreter

PyTorch Lite Interpreter is a streamlined version of the PyTorch runtime that can execute PyTorch programs on resource-constrained devices with a reduced binary size footprint. Compared to the current on-device runtime, this prototype feature reduces the binary size by up to 70%.
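
As a rough sketch of the intended workflow, a scripted model would be exported into the Lite Interpreter format before being bundled into a mobile app; the _save_for_lite_interpreter entry point and the .ptl extension below are assumptions about the prototype API, so check the mobile documentation for the exact call in your build:

import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

scripted = torch.jit.script(TinyModel().eval())
# Assumed export entry point for the prototype Lite Interpreter format
scripted._save_for_lite_interpreter("tiny_model.ptl")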

Performance Optimization

In 1.8, we are releasing support for benchmark utils to enable users to better monitor performance. We are also opening up a new automated quantization API. See the details below:

[Beta] Benchmark utils

Benchmark utils allows users to take accurate performance measurements and provides composable tools to help with both benchmark formulation and post-processing. This is expected to help contributors quickly understand how their contributions impact PyTorch performance.

Example:

from torch.utils.benchmark import Timer

results = []
for num_threads in [1, 2, 4]:
    timer = Timer(
        stmt="torch.add(x, y, out=out)",
        setup="""
            n = 1024
            x = torch.ones((n, n))
            y = torch.ones((n, 1))
            out = torch.empty((n, n))
        """,
        num_threads=num_threads,
    )
    results.append(timer.blocked_autorange(min_run_time=5))
    print(
        f"{num_threads} thread{'s' if num_threads > 1 else ' ':<4}"
        f"{results[-1].median * 1e6:>4.0f} us   " +
        (f"({results[0].median / results[-1].median:.1f}x)" if num_threads > 1 else '')
    )

1 thread     376 us   
2 threads    189 us   (2.0x)
4 threads     99 us   (3.8x)

[Prototype] FX Graph Mode Quantization

FX Graph Mode Quantization is the new automated quantization API in PyTorch. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although you may need to refactor your model to make it compatible with FX Graph Mode Quantization (i.e., symbolically traceable with torch.fx).
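
A minimal post-training static quantization sketch, assuming a small CPU model and random calibration data purely for illustration:

import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

float_model = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.ReLU(),
).eval()

# Insert observers, calibrate with representative data, then convert
qconfig_dict = {"": get_default_qconfig("fbgemm")}
prepared = prepare_fx(float_model, qconfig_dict)

with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(4, 16))

quantized = convert_fx(prepared)
print(quantized(torch.randn(4, 16)).shape)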

Hardware Support

[Beta] Ability to Extend the PyTorch Dispatcher for a new backend in C++

In PyTorch 1.8, you can now create new out-of-tree devices that live outside the pytorch/pytorch repo. The tutorial linked below shows how to register your device and keep it in sync with native PyTorch devices.

[Beta] AMD GPU Binaries Now Available

Starting in PyTorch 1.8, we have added support for ROCm wheels, providing an easy onboarding path for using AMD GPUs. Simply go to the standard PyTorch installation selector, choose ROCm as the installation option, and execute the provided command.

Thanks for reading, and if you are excited about these updates and want to participate in the future of PyTorch, we encourage you to join the discussion forums and open GitHub issues.

Cheers!

Team PyTorch

Read More