Introduction to Quantization on PyTorch

Introduction to Quantization on PyTorch

It’s important to make efficient use of both server-side and on-device compute resources when developing machine learning applications. To support more efficient deployment on servers and edge devices, PyTorch added a support for model quantization using the familiar eager mode Python API.

Quantization leverages 8bit integer (int8) instructions to reduce the model size and run the inference faster (reduced latency) and can be the difference between a model achieving quality of service goals or even fitting into the resources available on a mobile device. Even when resources aren’t quite so constrained it may enable you to deploy a larger and more accurate model. Quantization is available in PyTorch starting in version 1.3 and with the release of PyTorch 1.4 we published quantized models for ResNet, ResNext, MobileNetV2, GoogleNet, InceptionV3 and ShuffleNetV2 in the PyTorch torchvision 0.5 library.

This blog post provides an overview of the quantization support on PyTorch and its incorporation with the TorchVision domain library.

What is Quantization?

Quantization refers to techniques for doing both computations and memory accesses with lower precision data, usually int8 compared to floating point implementations. This enables performance gains in several important areas:

  • 4x reduction in model size;
  • 2-4x reduction in memory bandwidth;
  • 2-4x faster inference due to savings in memory bandwidth and faster compute with int8 arithmetic (the exact speed up varies depending on the hardware, the runtime, and the model).

Quantization does not however come without additional cost. Fundamentally quantization means introducing approximations and the resulting networks have slightly less accuracy. These techniques attempt to minimize the gap between the full floating point accuracy and the quantized accuracy.

We designed quantization to fit into the PyTorch framework. The means that:

  1. PyTorch has data types corresponding to quantized tensors, which share many of the features of tensors.
  2. One can write kernels with quantized tensors, much like kernels for floating point tensors to customize their implementation. PyTorch supports quantized modules for common operations as part of the torch.nn.quantized and torch.nn.quantized.dynamic name-space.
  3. Quantization is compatible with the rest of PyTorch: quantized models are traceable and scriptable. The quantization method is virtually identical for both server and mobile backends. One can easily mix quantized and floating point operations in a model.
  4. Mapping of floating point tensors to quantized tensors is customizable with user defined observer/fake-quantization blocks. PyTorch provides default implementations that should work for most use cases.

We developed three techniques for quantizing neural networks in PyTorch as part of quantization tooling in the torch.quantization name-space.

The Three Modes of Quantization Supported in PyTorch starting version 1.3

  1. Dynamic Quantization

    The easiest method of quantization PyTorch supports is called dynamic quantization. This involves not just converting the weights to int8 – as happens in all quantization variants – but also converting the activations to int8 on the fly, just before doing the computation (hence “dynamic”). The computations will thus be performed using efficient int8 matrix multiplication and convolution implementations, resulting in faster compute. However, the activations are read and written to memory in floating point format.

    • PyTorch API: we have a simple API for dynamic quantization in PyTorch. torch.quantization.quantize_dynamic takes in a model, as well as a couple other arguments, and produces a quantized model! Our end-to-end tutorial illustrates this for a BERT model; while the tutorial is long and contains sections on loading pre-trained models and other concepts unrelated to quantization, the part the quantizes the BERT model is simply:
    import torch.quantization
    quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    
    • See the documentation for the function here an end-to-end example in our tutorials here and here.
  2. Post-Training Static Quantization

    One can further improve the performance (latency) by converting networks to use both integer arithmetic and int8 memory accesses. Static quantization performs the additional step of first feeding batches of data through the network and computing the resulting distributions of the different activations (specifically, this is done by inserting “observer” modules at different points that record these distributions). This information is used to determine how specifically the different activations should be quantized at inference time (a simple technique would be to simply divide the entire range of activations into 256 levels, but we support more sophisticated methods as well). Importantly, this additional step allows us to pass quantized values between operations instead of converting these values to floats – and then back to ints – between every operation, resulting in a significant speed-up.

    With this release, we’re supporting several features that allow users to optimize their static quantization:

    1. Observers: you can customize observer modules which specify how statistics are collected prior to quantization to try out more advanced methods to quantize your data.
    2. Operator fusion: you can fuse multiple operations into a single operation, saving on memory access while also improving the operation’s numerical accuracy.
    3. Per-channel quantization: we can independently quantize weights for each output channel in a convolution/linear layer, which can lead to higher accuracy with almost the same speed.
    • PyTorch API:

      • To fuse modules, we have torch.quantization.fuse_modules
      • Observers are inserted using torch.quantization.prepare
      • Finally, quantization itself is done using torch.quantization.convert

    We have a tutorial with an end-to-end example of quantization (this same tutorial also covers our third quantization method, quantization-aware training), but because of our simple API, the three lines that perform post-training static quantization on the pre-trained model myModel are:

    # set quantization config for server (x86)
    deploymentmyModel.qconfig = torch.quantization.get_default_config('fbgemm')
    
    # insert observers
    torch.quantization.prepare(myModel, inplace=True)
    # Calibrate the model and collect statistics
    
    # convert to quantized version
    torch.quantization.convert(myModel, inplace=True)
    
  3. Quantization Aware Training

    Quantization-aware training(QAT) is the third method, and the one that typically results in highest accuracy of these three. With QAT, all weights and activations are “fake quantized” during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. Thus, all the weight adjustments during training are made while “aware” of the fact that the model will ultimately be quantized; after quantizing, therefore, this method usually yields higher accuracy than the other two methods.

    • PyTorch API:

      • torch.quantization.prepare_qat inserts fake quantization modules to model quantization.
      • Mimicking the static quantization API, torch.quantization.convert actually quantizes the model once training is complete.

    For example, in the end-to-end example, we load in a pre-trained model as qat_model, then we simply perform quantization-aware training using:

    # specify quantization config for QAT
    qat_model.qconfig=torch.quantization.get_default_qat_qconfig('fbgemm')
    
    # prepare QAT
    torch.quantization.prepare_qat(qat_model, inplace=True)
    
    # convert to quantized version, removing dropout, to check for accuracy on each
    epochquantized_model=torch.quantization.convert(qat_model.eval(), inplace=False)
    

Device and Operator Support

Quantization support is restricted to a subset of available operators, depending on the method being used, for a list of supported operators, please see the documentation at https://pytorch.org/docs/stable/quantization.html.

The set of available operators and the quantization numerics also depend on the backend being used to run quantized models. Currently quantized operators are supported only for CPU inference in the following backends: x86 and ARM. Both the quantization configuration (how tensors should be quantized and the quantized kernels (arithmetic with quantized tensors) are backend dependent. One can specify the backend by doing:

import torchbackend='fbgemm'
# 'fbgemm' for server, 'qnnpack' for mobile
my_model.qconfig = torch.quantization.get_default_qconfig(backend)
# prepare and convert model
# Set the backend on which the quantized kernels need to be run
torch.backends.quantized.engine=backend

However, quantization aware training occurs in full floating point and can run on either GPU or CPU. Quantization aware training is typically only used in CNN models when post training static or dynamic quantization doesn’t yield sufficient accuracy. This can occur with models that are highly optimized to achieve small size (such as Mobilenet).

Integration in torchvision

We’ve also enabled quantization for some of the most popular models in torchvision: Googlenet, Inception, Resnet, ResNeXt, Mobilenet and Shufflenet. We have upstreamed these changes to torchvision in three forms:

  1. Pre-trained quantized weights so that you can use them right away.
  2. Quantization ready model definitions so that you can do post-training quantization or quantization aware training.
  3. A script for doing quantization aware training — which is available for any of these model though, as you will learn below, we only found it necessary for achieving accuracy with Mobilenet.
  4. We also have a tutorial showing how you can do transfer learning with quantization using one of the torchvision models.

Choosing an approach

The choice of which scheme to use depends on multiple factors:

  1. Model/Target requirements: Some models might be sensitive to quantization, requiring quantization aware training.
  2. Operator/Backend support: Some backends require fully quantized operators.

Currently, operator coverage is limited and may restrict the choices listed in the table below:
The table below provides a guideline.

Model Type Preferred scheme Why
LSTM/RNN Dynamic Quantization Throughput dominated by compute/memory bandwidth for weights
BERT/Transformer Dynamic Quantization Throughput dominated by compute/memory bandwidth for weights
CNN Static Quantization Throughput limited by memory bandwidth for activations
CNN Quantization Aware Training In the case where accuracy can’t be achieved with static quantization

Performance Results

Quantization provides a 4x reduction in the model size and a speedup of 2x to 3x compared to floating point implementations depending on the hardware platform and the model being benchmarked. Some sample results are:

Model Float Latency (ms) Quantized Latency (ms) Inference Performance Gain Device Notes
BERT 581 313 1.8x Xeon-D2191 (1.6GHz) Batch size = 1, Maximum sequence length= 128, Single thread, x86-64, Dynamic quantization
Resnet-50 214 103 2x Xeon-D2191 (1.6GHz) Single thread, x86-64, Static quantization
Mobilenet-v2 97 17 5.7x Samsung S9 Static quantization, Floating point numbers are based on Caffe2 run-time and are not optimized

Accuracy results

We also compared the accuracy of static quantized models with the floating point models on Imagenet. For dynamic quantization, we compared the F1 score of BERT on the GLUE benchmark for MRPC.

Computer Vision Model accuracy

Model Top-1 Accuracy (Float) Top-1 Accuracy (Quantized) Quantization scheme
Googlenet 69.8 69.7 Static post training quantization
Inception-v3 77.5 77.1 Static post training quantization
ResNet-18 69.8 69.4 Static post training quantization
Resnet-50 76.1 75.9 Static post training quantization
ResNext-101 32x8d 79.3 79 Static post training quantization
Mobilenet-v2 71.9 71.6 Quantization Aware Training
Shufflenet-v2 69.4 68.4 Static post training quantization

Speech and NLP Model accuracy

Model F1 (GLUEMRPC) Float F1 (GLUEMRPC) Quantized Quantization scheme
BERT 0.902 0.895 Dynamic quantization

Conclusion

To get started on quantizing your models in PyTorch, start with the tutorials on the PyTorch website. If you are working with sequence data start with dynamic quantization for LSTM, or BERT. If you are working with image data then we recommend starting with the transfer learning with quantization tutorial. Then you can explore static post training quantization. If you find that the accuracy drop with post training quantization is too high, then try quantization aware training.

If you run into issues you can get community help by posting in at discuss.pytorch.org, use the quantization category for quantization related issues.

This post is authored by Raghuraman Krishnamoorthi, James Reed, Min Ni, Chris Gottbrath and Seth Weidman. Special thanks to Jianyu Huang, Lingyi Liu and Haixin Liu for producing quantization metrics included in this post.

Further reading:

  1. PyTorch quantization presentation at Neurips: (https://research.fb.com/wp-content/uploads/2019/12/2.-Quantization.pptx)
  2. Quantized Tensors (https://github.com/pytorch/pytorch/wiki/
    Introducing-Quantized-Tensor)
  3. Quantization RFC on Github (https://github.com/pytorch/pytorch/
    issues/18318)

Read More

PyTorch 1.4 released, domain libraries updated

Today, we’re announcing the availability of PyTorch 1.4, along with updates to the PyTorch domain libraries. These releases build on top of the announcements from NeurIPS 2019, where we shared the availability of PyTorch Elastic, a new classification framework for image and video, and the addition of Preferred Networks to the PyTorch community. For those that attended the workshops at NeurIPS, the content can be found here.

PyTorch 1.4

The 1.4 release of PyTorch adds new capabilities, including the ability to do fine grain build level customization for PyTorch Mobile, and new experimental features including support for model parallel training and Java language bindings.

PyTorch Mobile – Build level customization

Following the open sourcing of PyTorch Mobile in the 1.3 release, PyTorch 1.4 adds additional mobile support including the ability to customize build scripts at a fine-grain level. This allows mobile developers to optimize library size by only including the operators used by their models and, in the process, reduce their on device footprint significantly. Initial results show that, for example, a customized MobileNetV2 is 40% to 50% smaller than the prebuilt PyTorch mobile library. You can learn more here about how to create your own custom builds and, as always, please engage with the community on the PyTorch forums to provide any feedback you have.

Example code snippet for selectively compiling only the operators needed for MobileNetV2:

# Dump list of operators used by MobileNetV2:
import torch, yaml
model = torch.jit.load('MobileNetV2.pt')
ops = torch.jit.export_opnames(model)
with open('MobileNetV2.yaml', 'w') as output:
    yaml.dump(ops, output)
# Build PyTorch Android library customized for MobileNetV2:
SELECTED_OP_LIST=MobileNetV2.yaml scripts/build_pytorch_android.sh arm64-v8a

# Build PyTorch iOS library customized for MobileNetV2:
SELECTED_OP_LIST=MobileNetV2.yaml BUILD_PYTORCH_MOBILE=1 IOS_ARCH=arm64 scripts/build_ios.sh

Distributed model parallel training (Experimental)

With the scale of models, such as RoBERTa, continuing to increase into the billions of parameters, model parallel training has become ever more important to help researchers push the limits. This release provides a distributed RPC framework to support distributed model parallel training. It allows for running functions remotely and referencing remote objects without copying the real data around, and provides autograd and optimizer APIs to transparently run backwards and update parameters across RPC boundaries.

To learn more about the APIs and the design of this feature, see the links below:

For the full tutorials, see the links below:

As always, you can connect with community members and discuss more on the forums.

Java bindings (Experimental)

In addition to supporting Python and C++, this release adds experimental support for Java bindings. Based on the interface developed for Android in PyTorch Mobile, the new bindings allow you to invoke TorchScript models from any Java program. Note that the Java bindings are only available for Linux for this release, and for inference only. We expect support to expand in subsequent releases. See the code snippet below for how to use PyTorch within Java:

Module mod = Module.load("demo-model.pt1");
Tensor data =
    Tensor.fromBlob(
        new int[] {1, 2, 3, 4, 5, 6}, // data
        new long[] {2, 3} // shape
        );
IValue result = mod.forward(IValue.from(data), IValue.from(3.0));
Tensor output = result.toTensor();
System.out.println("shape: " + Arrays.toString(output.shape()));
System.out.println("data: " + Arrays.toString(output.getDataAsFloatArray()));

Learn more about how to use PyTorch from Java here, and see the full Javadocs API documentation here.

For the full 1.4 release notes, see here.

Domain Libraries

PyTorch domain libraries like torchvision, torchtext, and torchaudio complement PyTorch with common datasets, models, and transforms. We’re excited to share new releases for all three domain libraries alongside the PyTorch 1.4 core release.

torchvision 0.5

The improvements to torchvision 0.5 mainly focus on adding support for production deployment including quantization, TorchScript, and ONNX. Some of the highlights include:

  • All models in torchvision are now torchscriptable making them easier to ship into non-Python production environments
  • ResNets, MobileNet, ShuffleNet, GoogleNet and InceptionV3 now have quantized counterparts with pre-trained models, and also include scripts for quantization-aware training.
  • In partnership with the Microsoft team, we’ve added ONNX support for all models including Mask R-CNN.

Learn more about torchvision 0.5 here.

torchaudio 0.4

Improvements in torchaudio 0.4 focus on enhancing the currently available transformations, datasets, and backend support. Highlights include:

  • SoX is now optional, and a new extensible backend dispatch mechanism exposes SoundFile as an alternative to SoX.
  • The interface for datasets has been unified. This enables the addition of two large datasets: LibriSpeech and Common Voice.
  • New filters such as biquad, data augmentation such as time and frequency masking, transforms such as MFCC, gain and dither, and new feature computation such as deltas, are now available.
  • Transformations now support batches and are jitable.
  • An interactive speech recognition demo with voice activity detection is available for experimentation.

Learn more about torchaudio 0.4 here.

torchtext 0.5

torchtext 0.5 focuses mainly on improvements to the dataset loader APIs, including compatibility with core PyTorch APIs, but also adds support for unsupervised text tokenization. Highlights include:

  • Added bindings for SentencePiece for unsupervised text tokenization .
  • Added a new unsupervised learning dataset – enwik9.
  • Made revisions to PennTreebank, WikiText103, WikiText2, IMDb to make them compatible with torch.utils.data. Those datasets are in an experimental folder and we welcome your feedback.

Learn more about torchtext 0.5 here.

We’d like to thank the entire PyTorch team and the community for all their contributions to this work.

Cheers!

Team PyTorch

Read More

OpenMined and PyTorch partner to launch fellowship funding for privacy-preserving ML community

OpenMined and PyTorch partner to launch fellowship funding for privacy-preserving ML community

Many applications of machine learning (ML) pose a range of security and privacy challenges. In particular, users may not be willing or allowed to share their data, which prevents them from taking full advantage of ML platforms like PyTorch. To take the field of privacy-preserving ML (PPML) forward, OpenMined and PyTorch are announcing plans to jointly develop a combined platform to accelerate PPML research as well as new funding for fellowships.

There are many techniques attempting to solve the problem of privacy in ML, each at various levels of maturity. These include (1) homomorphic encryption, (2) secure multi-party computation, (3) trusted execution environments, (4) on-device computation, (5) federated learning with secure aggregation, and (6) differential privacy. Additionally, a number of open source projects implementing these techniques were created with the goal of enabling research at the intersection of privacy, security, and ML. Among them, PySyft and CrypTen have taken an “ML-first” approach by presenting an API that is familiar to the ML community, while masking the complexities of privacy and security protocols. We are excited to announce that these two projects are now collaborating closely to build a mature PPML ecosystem around PyTorch.

Additionally, to bolster this ecosystem and take the field of privacy preserving ML forward, we are also calling for contributions and supporting research efforts on this combined platform by providing funding to support the OpenMined community and the researchers that contribute, build proofs of concepts and desire to be on the cutting edge of how privacy-preserving technology is applied. We will provide funding through the RAAIS Foundation, a non-profit organization with a mission to advance education and research in artificial intelligence for the common good. We encourage interested parties to apply to one or more of the fellowships listed below.

Tools Powering the Future of Privacy-Preserving ML

The next generation of privacy-preserving open source tools enable ML researchers to easily experiment with ML models using secure computing techniques without needing to be cryptography experts. By integrating with PyTorch, PySyft and CrypTen offer familiar environments for ML developers to research and apply these techniques as part of their work.

PySyft is a Python library for secure and private ML developed by the OpenMined community. It is a flexible, easy-to-use library that makes secure computation techniques like multi-party computation (MPC) and privacy-preserving techniques like differential privacy accessible to the ML community. It prioritizes ease of use and focuses on integrating these techniques into end-user use cases like federated learning with mobile phones and other edge devices, encrypted ML as a service, and privacy-preserving data science.

CrypTen is a framework built on PyTorch that enables private and secure ML for the PyTorch community. It is the first step along the journey towards a privacy-preserving mode in PyTorch that will make secure computing techniques accessible beyond cryptography researchers. It currently implements secure multiparty computation with the goal of offering other secure computing backends in the near future. Other benefits to ML researchers include:

  • It is ML first and presents secure computing techniques via a CrypTensor object that looks and feels exactly like a PyTorch Tensor. This allows the user to use automatic differentiation and neural network modules akin to those in PyTorch.
  • The framework focuses on scalability and performance and is built with real-world challenges in mind.

The focus areas for CrypTen and PySyft are naturally aligned and complement each other. The former focuses on building support for various secure and privacy preserving techniques on PyTorch through an encrypted tensor abstraction, while the latter focuses on end user use cases like deployment on edge devices and a user friendly data science platform.

Working together will enable PySyft to use CrypTen as a backend for encrypted tensors. This can lead to an increase in performance for PySyft and the adoption of CrypTen as a runtime by PySyft’s userbase. In addition to this, PyTorch is also adding cryptography friendly features such as support for cryptographically secure random number generation. Over the long run, this allows each library to focus exclusively on its core competencies while enjoying the benefits of the synergistic relationship.

New Funding for OpenMined Contributors

We are especially excited to announce that the PyTorch team has invested $250,000 to support OpenMined in furthering the development and proliferation of privacy-preserving ML. This gift will be facilitated via the RAAIS Foundation and will be available immediately to support paid fellowship grants for the OpenMined community.

How to get involved

Thanks to the support from the PyTorch team, OpenMined is able to offer three different opportunities for you to participate in the project’s development. Each of these fellowships furthers our shared mission to lower the barrier-to-entry for privacy-preserving ML and to create a more privacy-preserving world.

Core PySyft CrypTen Integration Fellowships

During these fellowships, we will integrate CrypTen as a supported backend for encrypted computation in PySyft. This will allow for the high-performance, secure multi-party computation capabilities of CrypTen to be used alongside other important tools in PySyft such as differential privacy and federated learning. For more information on the roadmap and how to apply for a paid fellowship, check out the project’s call for contributors.

Federated Learning on Mobile, Web, and IoT Devices

During these fellowships, we will be extending PyTorch with the ability to perform federated learning across mobile, web, and IoT devices. To this end, a PyTorch front-end will be able to coordinate across federated learning backends that run in Javascript, Kotlin, Swift, and Python. Furthermore, we will also extend PySyft with the ability to coordinate these backends using peer-to-peer connections, providing low latency and the ability to run secure aggregation as a part of the protocol. For more information on the roadmap and how to apply for a paid fellowship, check out the project’s call for contributors.

Development Challenges

Over the coming months, we will issue regular open competitions for increasing the performance and security of the PySyft and PyGrid codebases. For performance-related challenges, contestants will compete (for a cash prize) to make a specific PySyft demo (such as federated learning) as fast as possible. For security-related challenges, contestants will compete to hack into a PyGrid server. The first to demonstrate their ability will win the cash bounty! For more information on the challenges and to sign up to receive emails when each challenge is opened, sign up here.

To apply, select one of the above projects and identify a role that matches your strengths!

Cheers,

Andrew, Laurens, Joe, and Shubho

Read More

PyTorch adds new tools and libraries, welcomes Preferred Networks to its community

PyTorch continues to be used for the latest state-of-the-art research on display at the NeurIPS conference next week, making up nearly 70% of papers that cite a framework. In addition, we’re excited to welcome Preferred Networks, the maintainers of the Chainer framework, to the PyTorch community. Their teams are moving fully over to PyTorch for developing their ML capabilities and services.

This growth underpins PyTorch’s focus on building for the needs of the research community, and increasingly, supporting the full workflow from research to production deployment. To further support researchers and developers, we’re launching a number of new tools and libraries for large scale computer vision and elastic fault tolerant training. Learn more on GitHub and at our NeurIPS booth.

Preferred Networks joins the PyTorch community

Preferred Networks, Inc. (PFN) announced plans to move its deep learning framework from Chainer to PyTorch. As part of this change, PFN will collaborate with the PyTorch community and contributors, including people from Facebook, Microsoft, CMU, and NYU, to participate in the development of PyTorch.

PFN developed Chainer, a deep learning framework that introduced the concept of define-by-run (also referred to as eager execution), to support and speed up its deep learning development. Chainer has been used at PFN since 2015 to rapidly solve real-world problems with the latest, cutting-edge technology. Chainer was also one of the inspirations for PyTorch’s initial design, as outlined in the PyTorch NeurIPS paper.

PFN has driven innovative work with CuPy, ImageNet in 15 minutes, Optuna, and other projects that have pushed the boundaries of design and engineering. As part of the PyTorch community, PFN brings with them creative engineering capabilities and experience to help take the framework forward. In addition, PFN’s migration to PyTorch will allow it to efficiently incorporate the latest research results to accelerate its R&D activities, given PyTorch’s broad adoption with researchers, and to collaborate with the community to add support for PyTorch on MN-Core, a deep learning processor currently in development.

We are excited to welcome PFN to the PyTorch community, and to jointly work towards the common goal of furthering advances in deep learning technology. Learn more about the PFN’s migration to PyTorch here.

Tools for elastic training and large scale computer vision

PyTorch Elastic (Experimental)

Large scale model training is becoming commonplace with architectures like BERT and the growth of model parameter counts into the billions or even tens of billions. To achieve convergence at this scale in a reasonable amount of time, the use of distributed training is needed.

The current PyTorch Distributed Data Parallel (DDP) module enables data parallel training where each process trains the same model but on different shards of data. It enables bulk synchronous, multi-host, multi-GPU/CPU execution of ML training. However, DDP has several shortcomings; e.g. jobs cannot start without acquiring all the requested nodes; jobs cannot continue after a node fails due to error or transient issue; jobs cannot incorporate a node that joined later; and lastly; progress cannot be made with the presence of a slow/stuck node.

The focus of PyTorch Elastic, which uses Elastic Distributed Data Parallelism, is to address these issues and build a generic framework/APIs for PyTorch to enable reliable and elastic execution of these data parallel training workloads. It will provide better programmability, higher resilience to failures of all kinds, higher-efficiency and larger-scale training compared with pure DDP.

Elasticity, in this case, means both: 1) the ability for a job to continue after node failure (by running with fewer nodes and/or by incorporating a new host and transferring state to it); and 2) the ability to add/remove nodes dynamically due to resource availability changes or bottlenecks.

While this feature is still experimental, you can try it out on AWS EC2, with the instructions here. Additionally, the PyTorch distributed team is working closely with teams across AWS to support PyTorch Elastic training within services such as Amazon Sagemaker and Elastic Kubernetes Service (EKS). Look for additional updates in the near future.

New Classification Framework

Image and video classification are at the core of content understanding. To that end, you can now leverage a new end-to-end framework for large-scale training of state-of-the-art image and video classification models. It allows researchers to quickly prototype and iterate on large distributed training jobs at the scale of billions of images. Advantages include:

  • Ease of use – This framework features a modular, flexible design that allows anyone to train machine learning models on top of PyTorch using very simple abstractions. The system also has out-of-the-box integration with AWS on PyTorch Elastic, facilitating research at scale and making it simple to move between research and production.
  • High performance – Researchers can use the framework to train models such as Resnet50 on ImageNet in as little as 15 minutes.

You can learn more at the NeurIPS Expo workshop on Multi-Modal research to production or get started with the PyTorch Elastic Imagenet example here.

Come see us at NeurIPS

The PyTorch team will be hosting workshops at NeurIPS during the industry expo on 12/8. Join the sessions below to learn more, and visit the team at the PyTorch booth on the show floor and during the Poster Session. At the booth, we’ll be walking through an interactive demo of PyTorch running fast neural style transfer on a Cloud TPU – here’s a sneak peek.

We’re also publishing a paper that details the principles that drove the implementation of PyTorch and how they’re reflected in its architecture.

Multi-modal Research to Production – This workshop will dive into a number of modalities such as computer vision (large scale image classification and instance segmentation) and Translation and Speech (seq-to-seq Transformers) from the lens of taking cutting edge research to production. Lastly, we will also walk through how to use the latest APIs in PyTorch to take eager mode developed models into graph mode via Torchscript and quantize them for scale production deployment on servers or mobile devices. Libraries used include:

  • Classification Framework – a newly open sourced PyTorch framework developed by Facebook AI for research on large-scale image and video classification. It allows researchers to quickly prototype and iterate on large distributed training jobs. Models built on the framework can be seamlessly deployed to production.
  • Detectron2 – the recently released object detection library built by the Facebook AI Research computer vision team. We will articulate the improvements over the previous version including: 1) Support for latest models and new tasks; 2) Increased flexibility, to enable new computer vision research; 3) Maintainable and scalable, to support production use cases.
  • Fairseq – general purpose sequence-to-sequence library, can be used in many applications, including (unsupervised) translation, summarization, dialog and speech recognition.

Responsible and Reproducible AI – This workshop on Responsible and Reproducible AI will dive into important areas that are shaping the future of how we interpret, reproduce research, and build AI with privacy in mind. We will cover major challenges, walk through solutions, and finish each talk with a hands-on tutorial.

  • Reproducibility: As the number of research papers submitted to arXiv and conferences skyrockets, scaling reproducibility becomes difficult. We must address the following challenges: aid extensibility by standardizing code bases, democratize paper implementation by writing hardware agnostic code, facilitate results validation by documenting “tricks” authors use to make their complex systems function. To offer solutions, we will dive into tool like PyTorch Hub and PyTorch Lightning which are used by some of the top researchers in the world to reproduce the state of the art.
  • Interpretability: With the increase in model complexity and the resulting lack of transparency, model interpretability methods have become increasingly important. Model understanding is both an active area of research as well as an area of focus for practical applications across industries using machine learning. To get hands on, we will use the recently released Captum library that provides state-of-the-art algorithms to provide researchers and developers with an easy way to understand the importance of neurons/layers and the predictions made by our models.`
  • Private AI: Practical applications of ML via cloud-based or machine-learning-as-a-service platforms pose a range of security and privacy challenges. There are a number of technical approaches being studied including: homomorphic encryption, secure multi-party computation, trusted execution environments, on-device computation, and differential privacy. To provide an immersive understanding of how some of these technologies are applied, we will use the CrypTen project which provides a community based research platform to take the field of Private AI forward.

We’d like to thank the entire PyTorch team and the community for all their contributions to this work.

Cheers!

Team PyTorch

Read More

PyTorch 1.3 adds mobile, privacy, quantization, and named tensors

PyTorch 1.3 adds mobile, privacy, quantization, and named tensors

PyTorch continues to gain momentum because of its focus on meeting the needs of researchers, its streamlined workflow for production use, and most of all because of the enthusiastic support it has received from the AI community. PyTorch citations in papers on ArXiv grew 194 percent in the first half of 2019 alone, as noted by O’Reilly, and the number of contributors to the platform has grown more than 50 percent over the last year, to nearly 1,200. Facebook, Microsoft, Uber, and other organizations across industries are increasingly using it as the foundation for their most important machine learning (ML) research and production workloads.

We are now advancing the platform further with the release of PyTorch 1.3, which includes experimental support for features such as seamless model deployment to mobile devices, model quantization for better performance at inference time, and front-end improvements, like the ability to name tensors and create clearer code with less need for inline comments. We’re also launching a number of additional tools and libraries to support model interpretability and bringing multimodal research to production.

Additionally, we’ve collaborated with Google and Salesforce to add broad support for Cloud Tensor Processing Units, providing a significantly accelerated option for training large-scale deep neural networks. Alibaba Cloud also joins Amazon Web Services, Microsoft Azure, and Google Cloud as supported cloud platforms for PyTorch users. You can get started now at pytorch.org.

PyTorch 1.3

The 1.3 release of PyTorch brings significant new features, including experimental support for mobile device deployment, eager mode quantization at 8-bit integer, and the ability to name tensors. With each of these enhancements, we look forward to additional contributions and improvements from the PyTorch community.

Named tensors (experimental)

Cornell University’s Sasha Rush has argued that, despite its ubiquity in deep learning, the traditional implementation of tensors has significant shortcomings, such as exposing private dimensions, broadcasting based on absolute position, and keeping type information in documentation. He proposed named tensors as an alternative approach.

Today, we name and access dimensions by comment:

# Tensor[N, C, H, W]
 images = torch.randn(32, 3, 56, 56)
 images.sum(dim=1)
 images.select(dim=1, index=0)

But naming explicitly leads to more readable and maintainable code:

NCHW = [N, C, H, W]
   images = torch.randn(32, 3, 56, 56, names=NCHW)
   images.sum('C')
   images.select('C', index=0)

Quantization (experimental)

It’s important to make efficient use of both server-side and on-device compute resources when developing ML applications. To support more efficient deployment on servers and edge devices, PyTorch 1.3 now supports 8-bit model quantization using the familiar eager mode Python API. Quantization refers to techniques used to perform computation and storage at reduced precision, such as 8-bit integer. This currently experimental feature includes support for post-training quantization, dynamic quantization, and quantization-aware training. It leverages the FBGEMM and QNNPACK state-of-the-art quantized kernel back ends, for x86 and ARM CPUs, respectively, which are integrated with PyTorch and now share a common API.

To learn more about the design and architecture, check out the API docs here, and get started with any of the supported techniques using the tutorials available here.

PyTorch mobile (experimental)

Running ML on edge devices is growing in importance as applications continue to demand lower latency. It is also a foundational element for privacy-preserving techniques such as federated learning. To enable more efficient on-device ML, PyTorch 1.3 now supports an end-to-end workflow from Python to deployment on iOS and Android.

This is an early, experimental release, optimized for end-to-end development. Coming releases will focus on:

  • Optimization for size: Build level optimization and selective compilation depending on the operators needed for user applications (i.e., you pay binary size for only the operators you need)
  • Performance: Further improvements to performance and coverage on mobile CPUs and GPUs
  • High level API: Extend mobile native APIs to cover common preprocessing and integration tasks needed for incorporating ML in mobile applications. e.g. Computer vision and NLP

Learn more or get started on Android or iOS here.

New tools for model interpretability and privacy

Captum

As models become ever more complex, it is increasingly important to develop new methods for model interpretability. To help address this need, we’re launching Captum, a tool to help developers working in PyTorch understand why their model generates a specific output. Captum provides state-of-the-art tools to understand how the importance of specific neurons and layers and affect predictions made by the models. Captum’s algorithms include integrated gradients, conductance, SmoothGrad and VarGrad, and DeepLift.

The example below shows how to apply model interpretability algorithms on a pretrained ResNet model and then visualize the attributions for each pixel by overlaying them on the image.

noise_tunnel = NoiseTunnel(integrated_gradients)

attributions_ig_nt, delta = noise_tunnel.attribute(input, n_samples=10, nt_type='smoothgrad_sq', target=pred_label_idx)
_ = viz.visualize_image_attr_multiple(["original_image", "heat_map"],
                                      ["all", "positive"],
                                      np.transpose(attributions_ig_nt.squeeze().cpu().detach().numpy(), (1,2,0)),
                                      np.transpose(transformed_img.squeeze().cpu().detach().numpy(), (1,2,0)),
                                      cmap=default_cmap,
                                      show_colorbar=True)

Learn more about Captum at captum.ai.

CrypTen

Practical applications of ML via cloud-based or machine-learning-as-a-service (MLaaS) platforms pose a range of security and privacy challenges. In particular, users of these platforms may not want or be able to share unencrypted data, which prevents them from taking full advantage of ML tools. To address these challenges, the ML community is exploring a number of technical approaches, at various levels of maturity. These include homomorphic encryption, secure multiparty computation, trusted execution environments, on-device computation, and differential privacy.

To provide a better understanding of how some of these technologies can be applied, we are releasing CrypTen, a new community-based research platform for taking the field of privacy-preserving ML forward. Learn more about CrypTen here. It is available on GitHub here.

Tools for multimodal AI systems

Digital content is often made up of several modalities, such as text, images, audio, and video. For example, a single public post might contain an image, body text, a title, a video, and a landing page. Even one particular component may have more than one modality, such as a video that contains both visual and audio signals, or a landing page that is composed of images, text, and HTML sources.

The ecosystem of tools and libraries that work with PyTorch offer enhanced ways to address the challenges of building multimodal ML systems. Here are some of the latest libraries launching today:

Detectron2

Object detection and segmentation are used for tasks ranging from autonomous vehicles to content understanding for platform integrity. To advance this work, Facebook AI Research (FAIR) is releasing Detectron2, an object detection library now implemented in PyTorch. Detectron2 provides support for the latest models and tasks, increased flexibility to aid computer vision research, and improvements in maintainability and scalability to support production use cases.

Detectron2 is available here and you can learn more here.

Speech extensions to fairseq

Language translation and audio processing are critical components in systems and applications such as search, translation, speech, and assistants. There has been tremendous progress in these fields recently thanks to the development of new architectures like transformers, as well as large-scale pretraining methods. We’ve extended fairseq, a framework for sequence-to-sequence applications such as language translation, to include support for end-to-end learning for speech and audio recognition tasks.These extensions to fairseq enable faster exploration and prototyping of new speech research ideas while offering a clear path to production.

Get started with fairseq here.

Cloud provider and hardware ecosystem support

Cloud providers such as Amazon Web Services, Microsoft Azure, and Google Cloud provide extensive support for anyone looking to develop ML on PyTorch and deploy in production. We’re excited to share the general availability of Google Cloud TPU support and a newly launched integration with Alibaba Cloud. We’re also expanding hardware ecosystem support.

  • Google Cloud TPU support now broadly available. To accelerate the largest-scale machine learning (ML) applications deployed today and enable rapid development of the ML applications of tomorrow, Google created custom silicon chips called Tensor Processing Units (TPUs). When assembled into multi-rack ML supercomputers called Cloud TPU Pods, these TPUs can complete ML workloads in minutes or hours that previously took days or weeks on other systems. Engineers from Facebook, Google, and Salesforce worked together to enable and pilot Cloud TPU support in PyTorch, including experimental support for Cloud TPU Pods. PyTorch support for Cloud TPUs is also available in Colab. Learn more about how to get started with PyTorch on Cloud TPUs here.
  • Alibaba adds support for PyTorch in Alibaba Cloud. The initial integration involves a one-click solution for PyTorch 1.x, Data Science Workshop notebook service, distributed training with Gloo/NCCL, as well as seamless integration with Alibaba IaaS such as OSS, ODPS, and NAS. Together with the toolchain provided by Alibaba, we look forward to significantly reducing the overhead necessary for adoption, as well as helping Alibaba Cloud’s global customer base leverage PyTorch to develop new AI applications.
  • ML hardware ecosystem expands. In addition to key GPU and CPU partners, the PyTorch ecosystem has also enabled support for dedicated ML accelerators. Updates from Intel and Habana showcase how PyTorch, connected to the Glow optimizing compiler, enables developers to utilize these market-specific solutions.

Growth in the PyTorch community

As an open source, community-driven project, PyTorch benefits from wide range of contributors bringing new capabilities to the ecosystem. Here are some recent examples:

  • Mila SpeechBrain aims to provide an open source, all-in-one speech toolkit based on PyTorch. The goal is to develop a single, flexible, user-friendly toolkit that can be used to easily develop state-of-the-art systems for speech recognition (both end to end and HMM-DNN), speaker recognition, speech separation, multi-microphone signal processing (e.g., beamforming), self-supervised learning, and many others. Learn more
  • SpaCy is a new wrapping library with consistent and easy-to-use interfaces to several models, in order to extract features to power NLP pipelines. Support is provided for via spaCy’s standard training API. The library also calculates an alignment so the transformer features can be related back to actual words instead of just wordpieces. Learn more
  • HuggingFace PyTorch-Transformers (formerly known as pytorch-pretrained-bert is a library of state-of-the-art pretrained models for Natural Language Processing (NLP). The library currently contains PyTorch implementations, pretrained model weights, usage scripts, and conversion utilities for models such as BERT, GPT-2, RoBERTa, and DistilBERT. It has also grown quickly, with more than 13,000 GitHub stars and a broad set of users. Learn more
  • PyTorch Lightning is a Keras-like ML library for PyTorch. It leaves core training and validation logic to you and automates the rest. Reproducibility is a crucial requirement for many fields of research, including those based on ML techniques. As the number of research papers submitted to arXiv and conferences skyrockets into the tens of thousands, scaling reproducibility becomes difficult. Learn more.

We recently held the first online Global PyTorch Summer Hackathon, where researchers and developers around the world were invited to build innovative new projects with PyTorch. Nearly 1,500 developers participated, submitting projects ranging from livestock disease detection to AI-powered financial assistants. The winning projects were:

  • Torchmeta, which provides extensions for PyTorch to simplify the development of meta-learning algorithms in PyTorch. It features a unified interface inspired by TorchVision for both few-shot classification and regression problems, to allow easy benchmarking on multiple data sets to aid with reproducibility.
  • Open-Unmix, a system for end-to-end music demixing with PyTorch. Demixing separates the individual instruments or vocal track from any stereo recording.
  • Endless AI-Generated Tees, a store featuring AI-generated T-shirt designs that can be purchased and delivered worldwide. The system uses a state-of-the-art generative model (StyleGAN) that was built with PyTorch and then trained on modern art.

Visit pytorch.org to learn more and get started with PyTorch 1.3 and the latest libraries and ecosystem projects. We look forward to the contributions, exciting research advancements, and real-world applications that the community builds with PyTorch.

We’d like to thank the entire PyTorch team and the community for all their contributions to this work.

Read More

New Releases: PyTorch 1.2, torchtext 0.4, torchaudio 0.3, and torchvision 0.4

New Releases: PyTorch 1.2, torchtext 0.4, torchaudio 0.3, and torchvision 0.4

Since the release of PyTorch 1.0, we’ve seen the community expand to add new tools, contribute to a growing set of models available in the PyTorch Hub, and continually increase usage in both research and production.

From a core perspective, PyTorch has continued to add features to support both research and production usage, including the ability to bridge these two worlds via TorchScript. Today, we are excited to announce that we have four new releases including PyTorch 1.2, torchvision 0.4, torchaudio 0.3, and torchtext 0.4. You can get started now with any of these releases at pytorch.org.

PyTorch 1.2

With PyTorch 1.2, the open source ML framework takes a major step forward for production usage with the addition of an improved and more polished TorchScript environment. These improvements make it even easier to ship production models, expand support for exporting ONNX formatted models, and enhance module level support for Transformers. In addition to these new features, TensorBoard is now no longer experimental – you can simply type from torch.utils.tensorboard import SummaryWriter to get started.

TorchScript Improvements

Since its release in PyTorch 1.0, TorchScript has provided a path to production for eager PyTorch models. The TorchScript compiler converts PyTorch models to a statically typed graph representation, opening up opportunities for
optimization and execution in constrained environments where Python is not available. You can incrementally convert your model to TorchScript, mixing compiled code seamlessly with Python.

PyTorch 1.2 significantly expands TorchScript’s support for the subset of Python used in PyTorch models and delivers a new, easier-to-use API for compiling your models to TorchScript. See the migration guide for details. Below is an example usage of the new API:

import torch

class MyModule(torch.nn.Module):
    def __init__(self, N, M):
        super(MyModule, self).__init__()
        self.weight = torch.nn.Parameter(torch.rand(N, M))

    def forward(self, input):
        if input.sum() > 0:
          output = self.weight.mv(input)
        else:
          output = self.weight + input
        return output

# Compile the model code to a static representation
my_script_module = torch.jit.script(MyModule(3, 4))

# Save the compiled code and model data so it can be loaded elsewhere
my_script_module.save("my_script_module.pt")

To learn more, see our Introduction to TorchScript and Loading a
PyTorch Model in C++
tutorials.

Expanded ONNX Export

The ONNX community continues to grow with an open governance structure and additional steering committee members, special interest groups (SIGs), and working groups (WGs). In collaboration with Microsoft, we’ve added full support to export ONNX Opset versions 7(v1.2), 8(v1.3), 9(v1.4) and 10 (v1.5). We’ve have also enhanced the constant folding pass to support Opset 10, the latest available version of ONNX. ScriptModule has also been improved including support for multiple outputs, tensor factories, and tuples as inputs and outputs. Additionally, users are now able to register their own symbolic to export custom ops, and specify the dynamic dimensions of inputs during export. Here is a summary of the all of the major improvements:

  • Support for multiple Opsets including the ability to export dropout, slice, flip, and interpolate in Opset 10.
  • Improvements to ScriptModule including support for multiple outputs, tensor factories, and tuples as inputs and outputs.
  • More than a dozen additional PyTorch operators supported including the ability to export a custom operator.
  • Many big fixes and test infra improvements.

You can try out the latest tutorial here, contributed by @lara-hdr at Microsoft. A big thank you to the entire Microsoft team for all of their hard work to make this release happen!

nn.Transformer

In PyTorch 1.2, we now include a standard nn.Transformer module, based on the paper “Attention is All You Need”. The nn.Transformer module relies entirely on an attention mechanism to draw global dependencies between input and output. The individual components of the nn.Transformer module are designed so they can be adopted independently. For example, the nn.TransformerEncoder can be used by itself, without the larger nn.Transformer. The new APIs include:

  • nn.Transformer
  • nn.TransformerEncoder and nn.TransformerEncoderLayer
  • nn.TransformerDecoder and nn.TransformerDecoderLayer

See the Transformer Layers documentation for more information. See here for the full PyTorch 1.2 release notes.

Domain API Library Updates

PyTorch domain libraries like torchvision, torchtext, and torchaudio provide convenient access to common datasets, models, and transforms that can be used to quickly create a state-of-the-art baseline. Moreover, they also provide common abstractions to reduce boilerplate code that users might have to otherwise repeatedly write. Since research domains have distinct requirements, an ecosystem of specialized libraries called domain APIs (DAPI) has emerged around PyTorch to simplify the development of new and existing algorithms in a number of fields. We’re excited to release three updated DAPI libraries for text, audio, and vision that compliment the PyTorch 1.2 core release.

Torchaudio 0.3 with Kaldi Compatibility, New Transforms

Torchaudio specializes in machine understanding of audio waveforms. It is an ML library that provides relevant signal processing functionality (but is not a general signal processing library). It leverages PyTorch’s GPU support to provide many tools and transformations for waveforms to make data loading and standardization easier and more readable. For example, it offers data loaders for waveforms using sox, and transformations such as spectrograms, resampling, and mu-law encoding and decoding.

We are happy to announce the availability of torchaudio 0.3.0, with a focus on standardization and complex numbers, a transformation (resample) and two new functionals (phase_vocoder, ISTFT), Kaldi compatibility, and a new tutorial. Torchaudio was redesigned to be an extension of PyTorch and a part of the domain APIs (DAPI) ecosystem.

Standardization

Significant effort in solving machine learning problems goes into data preparation. In this new release, we’ve updated torchaudio’s interfaces for its transformations to standardize around the following vocabulary and conventions.

Tensors are assumed to have channel as the first dimension and time as the last dimension (when applicable). This makes it consistent with PyTorch’s dimensions. For size names, the prefix n_ is used (e.g. “a tensor of size (n_freq, n_mel)”) whereas dimension names do not have this prefix (e.g. “a tensor of dimension (channel, time)”). The input of all transforms and functions now assumes channel first. This is done to be consistent with PyTorch, which has channel followed by the number of samples. The channel parameter of all transforms and functions is now deprecated.

The output of STFT is (channel, frequency, time, 2), meaning for each channel, the columns are the Fourier transform of a certain window, so as we travel horizontally we can see each column (the Fourier transformed waveform) change over time. This matches the output of librosa so we no longer need to transpose in our test comparisons with Spectrogram, MelScale, MelSpectrogram, and MFCC. Moreover, because of these new conventions, we deprecated LC2CL and BLC2CBL which were used to transfer from one shape of signal to another.

As part of this release, we’re also introducing support for complex numbers via tensors of dimension (…, 2), and providing magphase to convert such a tensor into its magnitude and phase, and similarly complex_norm and angle.

The details of the standardization are provided in the README.

Functionals, Transformations, and Kaldi Compatibility

Prior to the standardization, we separated state and computation into torchaudio.transforms and torchaudio.functional.

As part of the transforms, we’re adding a new transformation in 0.3.0: Resample. Resample can upsample or downsample a waveform to a different frequency.

As part of the functionals, we’re introducing: phase_vocoder, a phase vocoder to change the speed of a waveform without changing its pitch, and ISTFT, the inverse STFT implemented to be compatible with STFT provided by PyTorch. This separation allows us to make functionals weak scriptable and to utilize JIT in 0.3.0. We thus have JIT and CUDA support for the following transformations: Spectrogram, AmplitudeToDB (previously named SpectrogramToDB), MelScale,
MelSpectrogram, MFCC, MuLawEncoding, MuLawDecoding (previously named MuLawExpanding).

We now also provide a compatibility interface with Kaldi to ease onboarding and reduce a user’s code dependency on Kaldi. We now have an interface for spectrogram, fbank, and resample_waveform.

New Tutorial

To showcase the new conventions and transformations, we have a new tutorial demonstrating how to preprocess waveforms using torchaudio. This tutorial walks through an example of loading a waveform and applying some of the available transformations to it.

We are excited to see an active community around torchaudio and eager to further grow and support it. We encourage you to go ahead and experiment for yourself with this tutorial and the two datasets that are available: VCTK and YESNO! They have an interface to download the datasets and preprocess them in a convenient format. You can find the details in the release notes here.

Torchtext 0.4 with supervised learning datasets

A key focus area of torchtext is to provide the fundamental elements to help accelerate NLP research. This includes easy access to commonly used datasets and basic preprocessing pipelines for working on raw text based data. The torchtext 0.4.0 release includes several popular supervised learning baselines with “one-command” data loading. A tutorial is included to show how to use the new datasets for text classification analysis. We also added and improved on a few functions such as get_tokenizer and build_vocab_from_iterator to make it easier to implement future datasets. Additional examples can be found here.

Text classification is an important task in Natural Language Processing with many applications, such as sentiment analysis. The new release includes several popular text classification datasets for supervised learning including:

  • AG_NEWS
  • SogouNews
  • DBpedia
  • YelpReviewPolarity
  • YelpReviewFull
  • YahooAnswers
  • AmazonReviewPolarity
  • AmazonReviewFull

Each dataset comes with two parts (train vs. test), and can be easily loaded with a single command. The datasets also support an ngrams feature to capture the partial information about the local word order. Take a look at the tutorial here to learn more about how to use the new datasets for supervised problems such as text classification analysis.

from torchtext.datasets.text_classification import DATASETS
train_dataset, test_dataset = DATASETS['AG_NEWS'](ngrams=2)

In addition to the domain library, PyTorch provides many tools to make data loading easy. Users now can load and preprocess the text classification datasets with some well supported tools, like torch.utils.data.DataLoader and torch.utils.data.IterableDataset. Here are a few lines to wrap the data with DataLoader. More examples can be found here.

from torch.utils.data import DataLoader
data = DataLoader(train_dataset, collate_fn=generate_batch)

Check out the release notes here to learn more and try out the tutorial here.

Torchvision 0.4 with Support for Video

Video is now a first-class citizen in torchvision, with support for data loading, datasets, pre-trained models, and transforms. The 0.4 release of torchvision includes:

  • Efficient IO primitives for reading/writing video files (including audio), with support for arbitrary encodings and formats.
  • Standard video datasets, compatible with torch.utils.data.Dataset and torch.utils.data.DataLoader.
  • Pre-trained models built on the Kinetics-400 dataset for action classification on videos (including the training scripts).
  • Reference training scripts for training your own video models.

We wanted working with video data in PyTorch to be as straightforward as possible, without compromising too much on performance.
As such, we avoid the steps that would require re-encoding the videos beforehand, as it would involve:

  • A preprocessing step which duplicates the dataset in order to re-encode it.
  • An overhead in time and space because this re-encoding is time-consuming.
  • Generally, an external script should be used to perform the re-encoding.

Additionally, we provide APIs such as the utility class, VideoClips, that simplifies the task of enumerating all possible clips of fixed size in a list of video files by creating an index of all clips in a set of videos. It also allows you to specify a fixed frame-rate for the videos. An example of the API is provided below:

from torchvision.datasets.video_utils import VideoClips

class MyVideoDataset(object):
    def __init__(self, video_paths):
        self.video_clips = VideoClips(video_paths,
                                      clip_length_in_frames=16,
                                      frames_between_clips=1,
                                      frame_rate=15)

    def __getitem__(self, idx):
        video, audio, info, video_idx = self.video_clips.get_clip(idx)
        return video, audio

    def __len__(self):
        return self.video_clips.num_clips()

Most of the user-facing API is in Python, similar to PyTorch, which makes it easily extensible. Plus, the underlying implementation is fast — torchvision decodes as little as possible from the video on-the-fly in order to return a clip from the video.

Check out the torchvision 0.4 release notes here for more details.

We look forward to continuing our collaboration with the community and hearing your feedback as we further improve and expand the PyTorch deep learning platform.

We’d like to thank the entire PyTorch team and the community for all of the contributions to this work!

Read More

Mapillary Research: Seamless Scene Segmentation and In-Place Activated BatchNorm

Mapillary Research: Seamless Scene Segmentation and In-Place Activated BatchNorm

With roads in developed countries like the US changing up to 15% annually, Mapillary addresses a growing demand for keeping maps updated by combining images from any camera into a 3D visualization of the world. Mapillary’s independent and collaborative approach enables anyone to collect, share, and use street-level images for improving maps, developing cities, and advancing the automotive industry.

Today, people and organizations all over the world have contributed more than 600 million images toward Mapillary’s mission of helping people understand the world’s places through images and making this data available, with clients and partners including the World Bank, HERE, and Toyota Research Institute.

Mapillary’s computer vision technology brings intelligence to maps in an unprecedented way, increasing our overall understanding of the world. Mapillary runs state-of-the-art semantic image analysis and image-based 3d modeling at scale and on all its images. In this post we discuss two recent works from Mapillary Research and their implementations in PyTorch – Seamless Scene Segmentation [1] and In-Place Activated BatchNorm [2] – generating Panoptic segmentation results and saving up to 50% of GPU memory during training, respectively.

Seamless Scene Segmentation

Github project page: https://github.com/mapillary/seamseg/

The objective of Seamless Scene Segmentation is to predict a “panoptic” segmentation [3] from an image, that is a complete labeling where each pixel is assigned with a class id and, where possible, an instance id. Like many modern CNNs dealing with instance detection and segmentation, we adopt the Mask R-CNN framework [4], using ResNet50 + FPN [5] as a backbone. This architecture works in two stages: first, the “Proposal Head” selects a set of candidate bounding boxes on the image (i.e. the proposals) that could contain an object; then, the “Mask Head” focuses on each proposal, predicting its class and segmentation mask. The output of this process is a “sparse” instance segmentation, covering only the parts of the image that contain countable objects (e.g. cars and pedestrians).

To complete our panoptic approach coined Seamless Scene Segmentation, we add a third stage to Mask R-CNN. Stemming from the same backbone, the “Semantic Head” predicts a dense semantic segmentation over the whole image, also accounting for the uncountable or amorphous classes (e.g. road and sky). The outputs of the Mask and Semantic heads are finally fused using a simple non-maximum suppression algorithm to generate the final panoptic prediction. All details about the actual network architecture, used losses and underlying math can be found at the project website for our CVPR 2019 paper [1].

While several versions of Mask R-CNN are publicly available, including an official implementation written in Caffe2, at Mapillary we decided to build Seamless Scene Segmentation from scratch using PyTorch, in order to have full control and understanding of the whole pipeline. While doing so we encountered a couple of main stumbling blocks, and had to come up with some creative workarounds we are going to describe next.

Dealing with variable-sized tensors

Something that sets aside panoptic segmentation networks from traditional CNNs is the prevalence of variable-sized data. In fact, many of the quantities we are dealing with cannot be easily represented with fixed sized tensors: each image contains a different number of objects, the Proposal head can produce a different number of proposals for each image, and the images themselves can have different sizes. While this is not a problem per-se – one could just process images one at a time – we would still like to exploit batch-level parallelism as much as possible. Furthermore, when performing distributed training with multiple GPUs, DistributedDataParallel expects its inputs to be batched, uniformly-sized tensors.

Our solution to these issues is to wrap each batch of variable-sized tensors in a PackedSequence. PackedSequence is little more than a glorified list class for tensors, tagging its contents as “related”, ensuring that they all share the same type, and providing useful methods like moving all the tensors to a particular device, etc. When performing light-weight operations that wouldn’t be much faster with batch-level parallelism, we simply iterate over the contents of the PackedSequence in a for loop. When performance is crucial, e.g. in the body of the network, we simply concatenate the contents of the PackedSequence, adding zero padding as required (like in RNNs with variable-length inputs), and keeping track of the original dimensions of each tensor.

PackedSequences also help us deal with the second problem highlighted above. We slightly modify DistributedDataParallel to recognize PackedSequence inputs, splitting them in equally sized chunks and distributing their contents across the GPUs.

Asymmetric computational graphs with Distributed Data Parallel

Another, perhaps more subtle, peculiarity of our network is that it can generate asymmetric computational graphs across GPUs. In fact, some of the modules that compose the network are “optional”, in the sense that they are not always computed for all images. As an example, when the Proposal head doesn’t output any proposal, the Mask head is not traversed at all. If we are training on multiple GPUs with DistributedDataParallel, this results in one of the replicas not computing gradients for the Mask head parameters.

Prior to PyTorch 1.1, this resulted in a crash, so we had to develop a workaround. Our simple but effective solution was to compute a “fake forward pass” when no actual forward is required, i.e. something like this:

def fake_forward():
    fake_input = get_correctly_shaped_fake_input()
    fake_output = mask_head(fake_input)
    fake_loss = fake_output.sum() * 0
    return fake_loss

Here, we generate a batch of bogus data, pass it through the Mask head, and return a loss that always back-progates zeros to all parameters.

Starting from PyTorch 1.1 this workaround is no longer required: by setting find_unused_parameters=True in the constructor, DistributedDataParallel is told to identify parameters whose gradients have not been computed by all replicas and correctly handle them. This leads to some substantial simplifications in our code base!

In-place Activated BatchNorm

Github project page: https://github.com/mapillary/inplace_abn/

Most researchers would probably agree that there are always constraints in terms of available GPU resources, regardless if their research lab has access to only a few or multiple thousands of GPUs. In a time where at Mapillary we still worked at rather few and mostly 12GB Titan X – style prosumer GPUs, we were searching for a solution that virtually enhances the usable memory during training, so we would be able to obtain and push state-of-the-art results on dense labeling tasks like semantic segmentation. In-place activated BatchNorm is enabling us to use up to 50% more memory (at little computational overhead) and is therefore deeply integrated in all our current projects (including Seamless Scene Segmentation described above).

When processing a BN-Activation-Convolution sequence in the forward pass, most deep learning frameworks (including PyTorch) need to store two big buffers, i.e. the input x of BN and the input z of Conv. This is necessary because the standard implementations of the backward passes of BN and Conv depend on their inputs to calculate the gradients. Using InPlace-ABN to replace the BN-Activation sequence, we can safely discard x, thus saving up to 50% GPU memory at training time. To achieve this, we rewrite the backward pass of BN in terms of its output y, which is in turn reconstructed from z by inverting the activation function.

The only limitation of InPlace-ABN is that it requires using an invertible activation function, such as leaky relu or elu. Except for this, it can be used as a direct, drop-in replacement for BN+activation modules in any network. Our native CUDA implementation offers minimal computational overhead compared to PyTorch’s standard BN, and is available for anyone to use from here: https://github.com/mapillary/inplace_abn/.

Synchronized BN with asymmetric graphs and unbalanced batches

When training networks with synchronized SGD over multiple GPUs and/or multiple nodes, it’s common practice to compute BatchNorm statistics separately on each device. However, in our experience working with semantic and panoptic segmentation networks, we found that accumulating mean and variance across all workers can bring a substantial boost in accuracy. This is particularly true when dealing with small batches, like in Seamless Scene Segmentation where we train with a single, super-high resolution image per GPU.

InPlace-ABN supports synchronized operation over multiple GPUs and multiple nodes, and, since version 1.1, this can also be achieved in the standard PyTorch library using SyncBatchNorm. Compared to SyncBatchNorm, however, we support some additional functionality which is particularly important for Seamless Scene Segmentation: unbalanced batches and asymmetric graphs.

As mentioned before, Mask R-CNN-like networks naturally give rise to variable-sized tensors. Thus, in InPlace-ABN we calculate synchronized statistics using a variant of the parallel algorithm described here, which properly takes into account the fact that each GPU can hold a different number of samples. PyTorch’s SyncBatchNorm is currently being revised to support this, and the improved functionality will be available in a future release.

Asymmetric graphs (in the sense mentioned above) are another complicating factor one has to deal with when creating a synchronized BatchNorm implementation. Luckily, PyTorch’s distributed group functionality allows us to restrict distributed communication to a subset of workers, easily excluding those that are currently inactive. The only missing piece is that, in order to create a distributed group, each process needs to know the ids of all processes that will participate in the group, and even processes that are not part of the group need to call the new_group() function. In InPlace-ABN we handle it with a function like this:

import torch
import torch.distributed as distributed

def active_group(active):
    """Initialize a distributed group where each process can independently decide whether to participate or not"""
    world_size = distributed.get_world_size()
    rank = distributed.get_rank()

    # Gather active status from all workers
    active = torch.tensor(rank if active else -1, dtype=torch.long, device=torch.cuda.current_device())
    active_workers = torch.empty(world_size, dtype=torch.long, device=torch.cuda.current_device())
    distributed.all_gather(list(active_workers.unbind(0)), active)

    # Create group
    active_workers = [int(i) for i in active_workers.tolist() if i != -1]
    group = distributed.new_group(active_workers)
    return group

First each process, including inactive ones, communicates its status to all others through an all_gather call, then it creates the distributed group with the shared information. In the actual implementation we also include a caching mechanism for groups, since new_group() is usually too expensive to call at each batch.

References

[1] Seamless Scene Segmentation; Lorenzo Porzi, Samuel Rota Bulò, Aleksander Colovic, Peter Kontschieder; Computer Vision and Pattern Recognition (CVPR), 2019

[2] In-place Activated BatchNorm for Memory-Optimized Training of DNNs; Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder; Computer Vision and Pattern Recognition (CVPR), 2018

[3] Panoptic Segmentation; Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, Piotr Dollar; Computer Vision and Pattern Recognition (CVPR), 2019

[4] Mask R-CNN; Kaiming He, Georgia Gkioxari, Piotr Dollar, Ross Girshick; International Conference on Computer Vision (ICCV), 2017

[5] Feature Pyramid Networks for Object Detection; Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie; Computer Vision and Pattern Recognition (CVPR), 2017

Read More