An overview of the ML models introduced in TorchVision v0.9

TorchVision v0.9 has been released and it is packed with numerous new Machine Learning models and features, speed improvements and bug fixes. In this blog post, we provide a quick overview of the newly introduced ML models and discuss their key features and characteristics.

Classification

  • MobileNetV3 Large & Small: These two classification models are optimized for Mobile use-cases and are used as backbones on other Computer Vision tasks. The implementation of the new MobileNetV3 architecture supports the Large & Small variants and the depth multiplier parameter as described in the original paper. We offer pre-trained weights on ImageNet for both Large and Small networks with depth multiplier 1.0 and resolution 224×224. Our previous training recipes have been updated and can be used to easily train the models from scratch (shoutout to Ross Wightman for inspiring some of our training configuration). The Large variant offers a competitive accuracy comparing to ResNet50 while being over 6x faster on CPU, meaning that it is a good candidate for applications where speed is important. For applications where speed is critical, one can sacrifice further accuracy for speed and use the Small variant which is 15x faster than ResNet50.

  • Quantized MobileNetV3 Large: The quantized version of MobilNetV3 Large reduces the number of parameters by 45% and it is roughly 2.5x faster than the non-quantized version while remaining competitive in terms of accuracy. It was fitted on ImageNet using Quantization Aware Training by iterating on the non-quantized version and it can be trained from scratch using the existing reference scripts.

Usage:

model = torchvision.models.mobilenet_v3_large(pretrained=True)
# model = torchvision.models.mobilenet_v3_small(pretrained=True)
# model = torchvision.models.quantization.mobilenet_v3_large(pretrained=True)
model.eval()
predictions = model(img)

Object Detection

  • Faster R-CNN MobileNetV3-Large FPN: Combining the MobileNetV3 Large backbone with a Faster R-CNN detector and a Feature Pyramid Network leads to a highly accurate and fast object detector. The pre-trained weights are fitted on COCO 2017 using the provided reference scripts and the model is 5x faster on CPU than the equivalent ResNet50 detector while remaining competitive in terms of accuracy.
  • Faster R-CNN MobileNetV3-Large 320 FPN: This is an iteration of the previous model that uses reduced resolution (min_size=320 pixel) and sacrifices accuracy for speed. It is 25x faster on CPU than the equivalent ResNet50 detector and thus it is good for real mobile use-cases.

Usage:

model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
# model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(pretrained=True)
model.eval()
predictions = model(img)

Semantic Segmentation

  • DeepLabV3 with Dilated MobileNetV3 Large Backbone: A dilated version of the MobileNetV3 Large backbone combined with DeepLabV3 helps us build a highly accurate and fast semantic segmentation model. The pre-trained weights are fitted on COCO 2017 using our standard training recipes. The final model has the same accuracy as the FCN ResNet50 but it is 8.5x faster on CPU and thus making it an excellent replacement for the majority of applications.
  • Lite R-ASPP with Dilated MobileNetV3 Large Backbone: We introduce the implementation of a new segmentation head called Lite R-ASPP and combine it with the dilated MobileNetV3 Large backbone to build a very fast segmentation model. The new model sacrifices some accuracy to achieve a 15x speed improvement comparing to the previously most lightweight segmentation model which was the FCN ResNet50.

Usage:

model = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(pretrained=True)
# model = torchvision.models.segmentation.lraspp_mobilenet_v3_large(pretrained=True)
model.eval()
predictions = model(img)

In the near future we plan to publish an article that covers the details of how the above models were trained and discuss their tradeoffs and design choices. Until then we encourage you to try out the new models and provide your feedback.

Read More

Introducing PyTorch Profiler – the new and improved performance tool

Along with PyTorch 1.8.1 release, we are excited to announce PyTorch Profiler – the new and improved performance debugging profiler for PyTorch. Developed as part of a collaboration between Microsoft and Facebook, the PyTorch Profiler is an open-source tool that enables accurate and efficient performance analysis and troubleshooting for large-scale deep learning models.

Analyzing and improving large-scale deep learning model performance is an ongoing challenge that grows in importance as the model sizes increase. For a long time, PyTorch users had a hard time solving this challenge due to the lack of available tools. There were standard performance debugging tools that provide GPU hardware level information but missed PyTorch-specific context of operations. In order to recover missed information, users needed to combine multiple tools together or manually add minimum correlation information to make sense of the data. There was also the autograd profiler (torch.autograd.profiler) which can capture information about PyTorch operations but does not capture detailed GPU hardware-level information and cannot provide support for visualization.

The new PyTorch Profiler (torch.profiler) is a tool that brings both types of information together and then builds experience that realizes the full potential of that information. This new profiler collects both GPU hardware and PyTorch related information, correlates them, performs automatic detection of bottlenecks in the model, and generates recommendations on how to resolve these bottlenecks. All of this information from the profiler is visualized for the user in TensorBoard. The new Profiler API is natively supported in PyTorch and delivers the simplest experience available to date where users can profile their models without installing any additional packages and see results immediately in TensorBoard with the new PyTorch Profiler plugin. Below is the screenshot of PyTorch Profiler – automatic bottleneck detection.

Getting started

PyTorch Profiler is the next version of the PyTorch autograd profiler. It has a new module namespace torch.profiler but maintains compatibility with autograd profiler APIs. The Profiler uses a new GPU profiling engine, built using Nvidia CUPTI APIs, and is able to capture GPU kernel events with high fidelity. To profile your model training loop, wrap the code in the profiler context manager as shown below.

 with torch.profiler.profile(
    schedule=torch.profiler.schedule(
        wait=2,
        warmup=2,
        active=6,
        repeat=1),
    on_trace_ready=tensorboard_trace_handler,
    with_trace=True
) as profiler:
    for step, data in enumerate(trainloader, 0):
        print("step:{}".format(step))
        inputs, labels = data[0].to(device=device), data[1].to(device=device)

        outputs = model(inputs)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        profiler.step()

The schedule parameter allows you to limit the number of training steps included in the profile to reduce the amount of data collected and simplify visual analysis by focusing on what’s important. The tensorboard_trace_handler automatically saves profiling results to disk for analysis in TensorBoard.

To view results of the profiling session in TensorBoard, install PyTorch Profiler TensorBoard Plugin package.

pip install torch_tb_profiler

Visual Studio Code Integration

Microsoft Visual Studio Code is one of the most popular code editors for Python developers and data scientists. The Python extension for VS Code recently added the integration of TensorBoard into the code editor, including support for the PyTorch Profiler. Once you have VS Code and the Python extension installed, you can quickly open the TensorBoard Profiler plugin by launching the Command Palette using the keyboard shortcut CTRL + SHIFT + P (CMD + SHIFT + P on a Mac) and typing the “Launch TensorBoard” command.

This integration comes with a built-in lifecycle management feature. VS Code will install the TensorBoard package and the PyTorch Profiler plugin package (coming in mid-April) automatically if you don’t have them on your system. VS Code will also launch TensorBoard process for you and automatically look for any TensorBoard log files within your current directory. When you’re done, just close the tab and VS Code will automatically close the process. No more Terminal windows running on your system to provide a backend for the TensorBoard UI! Below is PyTorch Profiler Trace View running in TensorBoard.

Learn more about TensorBoard support in VS Code in this blog.

Feedback

Review PyTorch Profiler documentation, give Profiler a try and let us know about your experience. Provide your feedback on PyTorch Discussion Forum or file issues on PyTorch GitHub.

Read More

PyTorch for AMD ROCm™ Platform now available as Python package

With the PyTorch 1.8 release, we are delighted to announce a new installation option for users of
PyTorch on the ROCm™ open software platform. An installable Python package is now hosted on
pytorch.org, along with instructions for local installation in the same simple, selectable format as
PyTorch packages for CPU-only configurations and other GPU platforms. PyTorch on ROCm includes full
capability for mixed-precision and large-scale training using AMD’s MIOpen & RCCL libraries. This
provides a new option for data scientists, researchers, students, and others in the community to get
started with accelerated PyTorch using AMD GPUs.

The ROCm Ecosystem

ROCm is AMD’s open source software platform for GPU-accelerated high performance computing and
machine learning. Since the original ROCm release in 2016, the ROCm platform has evolved to support
additional libraries and tools, a wider set of Linux® distributions, and a range of new GPUs. This includes
the AMD Instinct™ MI100, the first GPU based on AMD CDNA™ architecture.

The ROCm ecosystem has an established history of support for PyTorch, which was initially implemented
as a fork of the PyTorch project, and more recently through ROCm support in the upstream PyTorch
code. PyTorch users can install PyTorch for ROCm using AMD’s public PyTorch docker image, and can of
course build PyTorch for ROCm from source. With PyTorch 1.8, these existing installation options are
now complemented by the availability of an installable Python package.

The primary focus of ROCm has always been high performance computing at scale. The combined
capabilities of ROCm and AMD’s Instinct family of data center GPUs are particularly suited to the
challenges of HPC at data center scale. PyTorch is a natural fit for this environment, as HPC and ML
workflows become more intertwined.

Getting started with PyTorch for ROCm

The scope for this build of PyTorch is AMD GPUs with ROCm support, running on Linux. The GPUs
supported by ROCm include all of AMD’s Instinct family of compute-focused data center GPUs, along
with some other select GPUs. A current list of supported GPUs can be found in the ROCm Github
repository
. After confirming that the target system includes supported GPUs and the current 4.0.1
release of ROCm, installation of PyTorch follows the same simple Pip-based installation as any other
Python package. As with PyTorch builds for other platforms, the configurator at https://pytorch.org/getstarted/locally/ provides the specific command line to be run.

PyTorch for ROCm is built from the upstream PyTorch repository, and is a full featured implementation.
Notably, it includes support for distributed training across multiple GPUs and supports accelerated
mixed precision training.

More information

A list of ROCm supported GPUs and operating systems can be found at
https://github.com/RadeonOpenCompute/ROCm
General documentation on the ROCm platform is available at https://rocmdocs.amd.com/en/latest/
ROCm Learning Center at https://developer.amd.com/resources/rocm-resources/rocm-learning-center/ General information on AMD’s offerings for HPC and ML can be found at https://amd.com/hpc

Feedback

An engaged user base is a tremendously important part of the PyTorch ecosystem. We would be deeply
appreciative of feedback on the PyTorch for ROCm experience in the PyTorch discussion forum and, where appropriate, reporting any issues via Github.

Read More

Announcing PyTorch Ecosystem Day

We’re proud to announce our first PyTorch Ecosystem Day. The virtual, one-day event will focus completely on our Ecosystem and Industry PyTorch communities!

PyTorch is a deep learning framework of choice for academics and companies, all thanks to its rich ecosystem of tools and strong community. As with our developers, our ecosystem partners play a pivotal role in the development and growth of the community.

We will be hosting our first PyTorch Ecosystem Day, a virtual event designed for our ecosystem and industry communities to showcase their work and discover new opportunities to collaborate.

PyTorch Ecosystem Day will be held on April 21, with both a morning and evening session, to ensure we reach our global community. Join us virtually for a day filled with discussions on new developments, trends, challenges, and best practices through keynotes, breakout sessions, and a unique networking opportunity hosted through Gather.Town .

Event Details

April 21, 2021 (Pacific Time)
Fully digital experience

  • Morning Session: (EMEA)
    Opening Talks – 8:00 am-9:00 am PT
    Poster Exhibition & Breakout Sessions – 9:00 am-12:00 pm PT

  • Evening Session (APAC/US)
    Opening Talks – 3:00 pm-4:00 pm PT
    Poster Exhibition & Breakout Sessions – 3:00 pm-6:00 pm PT

  • Networking – 9:00 am-7:00 pm PT

There are two ways to participate in PyTorch Ecosystem Day:

  1. Poster Exhibition from the PyTorch ecosystem and industry communities covering a variety of topics. Posters are available for viewing throughout the duration of the event. To be part of the poster exhibition, please see below for submission details. If your poster is accepted, we highly recommend tending your poster during one of the morning or evening sessions or both!

  2. Breakout Sessions are 40-min sessions freely designed by the community. The breakouts can be talks, demos, tutorials or discussions. Note: you must have an accepted poster to apply for the breakout sessions.

Call for posters now open! Submit your proposal today! Please send us the title and summary of your projects, tools, and libraries that could benefit PyTorch researchers in academia and industry, application developers, and ML engineers for consideration. The focus must be on academic papers, machine learning research, or open-source projects. Please no sales pitches. Deadline for submission is March 18, 2021.

Visit pytorchecosystemday.fbreg.com for more information and we look forward to welcoming you to PyTorch Ecosystem Day on April 21st!

Read More

New PyTorch library releases including TorchVision Mobile, TorchAudio I/O, and more

Today, we are announcing updates to a number of PyTorch libraries, alongside the PyTorch 1.8 release. The updates include new releases for the domain libraries including TorchVision, TorchText and TorchAudio as well as new version of TorchCSPRNG. These releases include a number of new features and improvements and, along with the PyTorch 1.8 release, provide a broad set of updates for the PyTorch community to build on and leverage.

Some highlights include:

  • TorchVision – Added support for PyTorch Mobile including Detectron2Go (D2Go), auto-augmentation of data during training, on the fly type conversion, and AMP autocasting.
  • TorchAudio – Major improvements to I/O, including defaulting to sox_io backend and file-like object support. Added Kaldi Pitch feature and support for CMake based build allowing TorchAudio to better support no-Python environments.
  • TorchText – Updated the dataset loading API to be compatible with standard PyTorch data loading utilities.
  • TorchCSPRNG – Support for cryptographically secure pseudorandom number generators for PyTorch is now stable with new APIs for AES128 ECB/CTR and CUDA support on Windows.

Please note that, starting in PyTorch 1.6, features are classified as Stable, Beta, and Prototype. Prototype features are not included as part of the binary distribution and are instead available through either building from source, using nightlies or via compiler flag. You can see the detailed announcement here.

TorchVision 0.9.0

[Stable] TorchVision Mobile: Operators, Android Binaries, and Tutorial

We are excited to announce the first on-device support and binaries for a PyTorch domain library. We have seen significant appetite in both research and industry for on-device vision support to allow low latency, privacy friendly, and resource efficient mobile vision experiences. You can follow this new tutorial to build your own Android object detection app using TorchVision operators, D2Go, or your own customer operators and model.

[Stable] New Mobile models for Classification, Object Detection and Semantic Segmentation

We have added support for the MobileNetV3 architecture and provided pre-trained weights for Classification, Object Detection and Segmentation. It is easy to get up and running with these models, just import and load them as you would any torchvision model:

import torch
import torchvision

# Classification
x = torch.rand(1, 3, 224, 224)
m_classifier = torchvision.models.mobilenet_v3_large(pretrained=True)
m_classifier.eval()
predictions = m_classifier(x)

# Quantized Classification
x = torch.rand(1, 3, 224, 224)
m_classifier = torchvision.models.quantization.mobilenet_v3_large(pretrained=True)
m_classifier.eval()
predictions = m_classifier(x)

# Object Detection: Highly Accurate High Resolution Mobile Model
x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
m_detector = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
m_detector.eval()
predictions = m_detector(x)

# Semantic Segmentation: Highly Accurate Mobile Model
x = torch.rand(1, 3, 520, 520)
m_segmenter = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(pretrained=True)
m_segmenter.eval()
predictions = m_segmenter(x)

These models are highly competitive with TorchVision’s existing models on resource efficiency, speed, and accuracy. See our release notes for detailed performance metrics.

[Stable] AutoAugment

AutoAugment is a common Data Augmentation technique that can increase the accuracy of Scene Classification models. Though the data augmentation policies are directly linked to their trained dataset, empirical studies show that ImageNet policies provide significant improvements when applied to other datasets. We’ve implemented 3 policies learned on the following datasets: ImageNet, CIFA10 and SVHN. These can be used standalone or mixed-and-matched with existing transforms:

from torchvision import transforms

t = transforms.AutoAugment()
transformed = t(image)


transform=transforms.Compose([
   transforms.Resize(256),
   transforms.AutoAugment(),
   transforms.ToTensor()])

Other New Features for TorchVision

  • [Stable] All read and decode methods in the io.image package now support:
    • Palette, Grayscale Alpha and RBG Alpha image types during PNG decoding
    • On-the-fly conversion of image from one type to the other during read
  • [Stable] WiderFace dataset
  • [Stable] Improved FasterRCNN speed and accuracy by introducing a score threshold on RPN
  • [Stable] Modulation input for DeformConv2D
  • [Stable] Option to write audio to a video file
  • [Stable] Utility to draw bounding boxes
  • [Beta] Autocast support in all Operators
    Find the full TorchVision release notes here.

TorchAudio 0.8.0

I/O Improvements

We have continued our work from the previous release to improve TorchAudio’s I/O support, including:

  • [Stable] Changing the default backend to “sox_io” (for Linux/macOS), and updating the “soundfile” backend’s interface to align with that of “sox_io”. The legacy backend and interface are still accessible, though it is strongly discouraged to use them.
  • [Stable] File-like object support in both “sox_io” backend, “soundfile” backend and sox_effects.
  • [Stable] New options to change the format, encoding, and bits_per_sample when saving.
  • [Stable] Added GSM, HTK, AMB, AMR-NB and AMR-WB format support to the “sox_io” backend.
  • [Beta] A new functional.apply_codec function which can degrade audio data by applying audio codecs supported by “sox_io” backend in an in-memory fashion.
    Here are some examples of features landed in this release:
# Load audio over HTTP
with requests.get(URL, stream=True) as response:
    waveform, sample_rate = torchaudio.load(response.raw)
 
# Saving to Bytes buffer as 32-bit floating-point PCM
buffer_ = io.BytesIO()
torchaudio.save(
    buffer_, waveform, sample_rate,
    format="wav", encoding="PCM_S", bits_per_sample=16)
 
# Apply effects while loading audio from S3
client = boto3.client('s3')
response = client.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
waveform, sample_rate = torchaudio.sox_effects.apply_effect_file(
    response['Body'],
    [["lowpass", "-1", "300"], ["rate", "8000"]])
 
# Apply GSM codec to Tensor
encoded = torchaudio.functional.apply_codec(
    waveform, sample_rate, format="gsm")

Check out the revamped audio preprocessing tutorial, Audio Manipulation with TorchAudio.

[Stable] Switch to CMake-based build

In the previous version of TorchAudio, it was utilizing CMake to build third party dependencies. Starting in 0.8.0, TorchaAudio uses CMake to build its C++ extension. This will open the door to integrate TorchAudio in non-Python environments (such as C++ applications and mobile). We will continue working on adding example applications and mobile integrations.

[Beta] Improved and New Audio Transforms

We have added two widely requested operators in this release: the SpectralCentroid transform and the Kaldi Pitch feature extraction (detailed in “A pitch extraction algorithm tuned for automatic speech recognition”). We’ve also exposed a normalization method to Mel transforms, and additional STFT arguments to Spectrogram. We would like to ask our community to continue to raise feature requests for core audio processing features like these!

Community Contributions

We had more contributions from the open source community in this release than ever before, including several completely new features. We would like to extend our sincere thanks to the community. Please check out the newly added CONTRIBUTING.md for ways to contribute code, and remember that reporting bugs and requesting features are just as valuable. We will continue posting well-scoped work items as issues labeled “help-wanted” and “contributions-welcome” for anyone who would like to contribute code, and are happy to coach new contributors through the contribution process.

Find the full TorchAudio release notes here.

TorchText 0.9.0

[Beta] Dataset API Updates

In this release, we are updating TorchText’s dataset API to be compatible with PyTorch data utilities, such as DataLoader, and are deprecating TorchText’s custom data abstractions such as Field. The updated datasets are simple string-by-string iterators over the data. For guidance about migrating from the legacy abstractions to use modern PyTorch data utilities, please refer to our migration guide.

The text datasets listed below have been updated as part of this work. For examples of how to use these datasets, please refer to our end-to-end text classification tutorial.

  • Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
  • Sequence tagging: UDPOS, CoNLL2000Chunking
  • Translation: IWSLT2016, IWSLT2017
  • Question answer: SQuAD1, SQuAD2

Find the full TorchText release notes here.

[Stable] TorchCSPRNG 0.2.0

We released TorchCSPRNG in August 2020, a PyTorch C++/CUDA extension that provides cryptographically secure pseudorandom number generators for PyTorch. Today, we are releasing the 0.2.0 version and designating the library as stable. This release includes a new API for encrypt/decrypt with AES128 ECB/CTR as well as CUDA 11 and Windows CUDA support.

Find the full TorchCSPRNG release notes here.

Thanks for reading, and if you are excited about these updates and want to participate in the future of PyTorch, we encourage you to join the discussion forums and open GitHub issues.

Cheers!

Team PyTorch

Read More

PyTorch 1.8 Release, including Compiler and Distributed Training updates, and New Mobile Tutorials

We are excited to announce the availability of PyTorch 1.8. This release is composed of more than 3,000 commits since 1.7. It includes major updates and new features for compilation, code optimization, frontend APIs for scientific computing, and AMD ROCm support through binaries that are available via pytorch.org. It also provides improved features for large-scale training for pipeline and model parallelism, and gradient compression. A few of the highlights include:

  1. Support for doing python to python functional transformations via torch.fx;
  2. Added or stabilized APIs to support FFTs (torch.fft), Linear Algebra functions (torch.linalg), added support for autograd for complex tensors and updates to improve performance for calculating hessians and jacobians; and
  3. Significant updates and improvements to distributed training including: Improved NCCL reliability; Pipeline parallelism support; RPC profiling; and support for communication hooks adding gradient compression.
    See the full release notes here.

Along with 1.8, we are also releasing major updates to PyTorch libraries including TorchCSPRNG, TorchVision, TorchText and TorchAudio. For more on the library releases, see the post here. As previously noted, features in PyTorch releases are classified as Stable, Beta and Prototype. You can learn more about the definitions in the post here.

New and Updated APIs

The PyTorch 1.8 release brings a host of new and updated API surfaces ranging from additional APIs for NumPy compatibility, also support for ways to improve and scale your code for performance at both inference and training time. Here is a brief summary of the major features coming in this release:

[Stable] Torch.fft support for high performance NumPy style FFTs

As part of PyTorch’s goal to support scientific computing, we have invested in improving our FFT support and with PyTorch 1.8, we are releasing the torch.fft module. This module implements the same functions as NumPy’s np.fft module, but with support for hardware acceleration and autograd.

[Beta] Support for NumPy style linear algebra functions via torch.linalg

The torch.linalg module, modeled after NumPy’s np.linalg module, brings NumPy-style support for common linear algebra operations including Cholesky decompositions, determinants, eigenvalues and many others.

[Beta] Python code Transformations with FX

FX allows you to write transformations of the form transform(input_module : nn.Module) -> nn.Module, where you can feed in a Module instance and get a transformed Module instance out of it.

This kind of functionality is applicable in many scenarios. For example, the FX-based Graph Mode Quantization product is releasing as a prototype contemporaneously with FX. Graph Mode Quantization automates the process of quantizing a neural net and does so by leveraging FX’s program capture, analysis and transformation facilities. We are also developing many other transformation products with FX and we are excited to share this powerful toolkit with the community.

Because FX transforms consume and produce nn.Module instances, they can be used within many existing PyTorch workflows. This includes workflows that, for example, train in Python then deploy via TorchScript.

Below is an FX transform example:

import torch
import torch.fx

def transform(m: nn.Module,
             tracer_class : type = torch.fx.Tracer) -> torch.nn.Module:
   # Step 1: Acquire a Graph representing the code in `m`
  
   # NOTE: torch.fx.symbolic_trace is a wrapper around a call to
   # fx.Tracer.trace and constructing a GraphModule. We'll
   # split that out in our transform to allow the caller to
   # customize tracing behavior.
   graph : torch.fx.Graph = tracer_class().trace(m)
  
   # Step 2: Modify this Graph or create a new one
   graph = ...
  
   # Step 3: Construct a Module to return
   return torch.fx.GraphModule(m, graph)

You can read more about FX in the official documentation. You can also find several examples of program transformations implemented using torch.fx here. We are constantly improving FX and invite you to share any feedback you have about the toolkit on the forums or issue tracker.

Distributed Training

The PyTorch 1.8 release added a number of new features as well as improvements to reliability and usability. Concretely, support for: Stable level async error/timeout handling was added to improve NCCL reliability; and stable support for RPC based profiling. Additionally, we have added support for pipeline parallelism as well as gradient compression through the use of communication hooks in DDP. Details are below:

[Beta] Pipeline Parallelism

As machine learning models continue to grow in size, traditional Distributed DataParallel (DDP) training no longer scales as these models don’t fit on a single GPU device. The new pipeline parallelism feature provides an easy to use PyTorch API to leverage pipeline parallelism as part of your training loop.

[Beta] DDP Communication Hook

The DDP communication hook is a generic interface to control how to communicate gradients across workers by overriding the vanilla allreduce in DistributedDataParallel. A few built-in communication hooks are provided including PowerSGD, and users can easily apply any of these hooks to optimize communication. Additionally, the communication hook interface can also support user-defined communication strategies for more advanced use cases.

Additional Prototype Features for Distributed Training

In addition to the major stable and beta distributed training features in this release, we also have a number of prototype features available in our nightlies to try out and provide feedback. We have linked in the draft docs below for reference:

  • (Prototype) ZeroRedundancyOptimizer – Based on and in partnership with the Microsoft DeepSpeed team, this feature helps reduce per-process memory footprint by sharding optimizer states across all participating processes in the ProcessGroup gang. Refer to this documentation for more details.
  • (Prototype) Process Group NCCL Send/Recv – The NCCL send/recv API was introduced in v2.7 and this feature adds support for it in NCCL process groups. This feature will provide an option for users to implement collective operations at Python layer instead of C++ layer. Refer to this documentation and code examples to learn more.
  • (Prototype) CUDA-support in RPC using TensorPipe – This feature should bring consequent speed improvements for users of PyTorch RPC with multiple-GPU machines, as TensorPipe will automatically leverage NVLink when available, and avoid costly copies to and from host memory when exchanging GPU tensors between processes. When not on the same machine, TensorPipe will fall back to copying the tensor to host memory and sending it as a regular CPU tensor. This will also improve the user experience as users will be able to treat GPU tensors like regular CPU tensors in their code. Refer to this documentation for more details.
  • (Prototype) Remote Module – This feature allows users to operate a module on a remote worker like using a local module, where the RPCs are transparent to the user. In the past, this functionality was implemented in an ad-hoc way and overall this feature will improve the usability of model parallelism on PyTorch. Refer to this documentation for more details.

PyTorch Mobile

Support for PyTorch Mobile is expanding with a new set of tutorials to help new users launch models on-device quicker and give existing users a tool to get more out of our framework. These include:

Our new demo apps also include examples of image segmentation, object detection, neural machine translation, question answering, and vision transformers. They are available on both iOS and Android:

In addition to performance improvements on CPU for MobileNetV3 and other models, we also revamped our Android GPU backend prototype for broader models coverage and faster inferencing:

Lastly, we are launching the PyTorch Mobile Lite Interpreter as a prototype feature in this release. The Lite Interpreter allows users to reduce the runtime binary size. Please try these out and send us your feedback on the PyTorch Forums. All our latest updates can be found on the PyTorch Mobile page

[Prototype] PyTorch Mobile Lite Interpreter

PyTorch Lite Interpreter is a streamlined version of the PyTorch runtime that can execute PyTorch programs in resource constrained devices, with reduced binary size footprint. This prototype feature reduces binary sizes by up to 70% compared to the current on-device runtime in the current release.

Performance Optimization

In 1.8, we are releasing the support for benchmark utils to enable users to better monitor performance. We are also opening up a new automated quantization API. See the details below:

(Beta) Benchmark utils

Benchmark utils allows users to take accurate performance measurements, and provides composable tools to help with both benchmark formulation and post processing. This expected to be helpful for contributors to PyTorch to quickly understand how their contributions are impacting PyTorch performance.

Example:

from torch.utils.benchmark import Timer

results = []
for num_threads in [1, 2, 4]:
    timer = Timer(
        stmt="torch.add(x, y, out=out)",
        setup="""
            n = 1024
            x = torch.ones((n, n))
            y = torch.ones((n, 1))
            out = torch.empty((n, n))
        """,
        num_threads=num_threads,
    )
    results.append(timer.blocked_autorange(min_run_time=5))
    print(
        f"{num_threads} thread{'s' if num_threads > 1 else ' ':<4}"
        f"{results[-1].median * 1e6:>4.0f} us   " +
        (f"({results[0].median / results[-1].median:.1f}x)" if num_threads > 1 else '')
    )

1 thread     376 us   
2 threads    189 us   (2.0x)
4 threads     99 us   (3.8x)

(Prototype) FX Graph Mode Quantization

FX Graph Mode Quantization is the new automated quantization API in PyTorch. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make the model compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx).

Hardware Support

[Beta] Ability to Extend the PyTorch Dispatcher for a new backend in C++

In PyTorch 1.8, you can now create new out-of-tree devices that live outside the pytorch/pytorch repo. The tutorial linked below shows how to register your device and keep it in sync with native PyTorch devices.

[Beta] AMD GPU Binaries Now Available

Starting in PyTorch 1.8, we have added support for ROCm wheels providing an easy onboarding to using AMD GPUs. You can simply go to the standard PyTorch installation selector and choose ROCm as an installation option and execute the provided command.

Thanks for reading, and if you are excited about these updates and want to participate in the future of PyTorch, we encourage you to join the discussion forums and open GitHub issues.

Cheers!

Team PyTorch

Read More

The torch.fft module: Accelerated Fast Fourier Transforms with Autograd in PyTorch

The Fast Fourier Transform (FFT) calculates the Discrete Fourier Transform in O(n log n) time. It is foundational to a wide variety of numerical algorithms and signal processing techniques since it makes working in signals’ “frequency domains” as tractable as working in their spatial or temporal domains.

As part of PyTorch’s goal to support hardware-accelerated deep learning and scientific computing, we have invested in improving our FFT support, and with PyTorch 1.8, we are releasing the torch.fft module. This module implements the same functions as NumPy’s np.fft module, but with support for accelerators, like GPUs, and autograd.

Getting started

Getting started with the new torch.fft module is easy whether you are familiar with NumPy’s np.fft module or not. While complete documentation for each function in the module can be found here, a breakdown of what it offers is:

  • fft, which computes a complex FFT over a single dimension, and ifft, its inverse
  • the more general fftn and ifftn, which support multiple dimensions
  • The “real” FFT functions, rfft, irfft, rfftn, irfftn, designed to work with signals that are real-valued in their time domains
  • The “Hermitian” FFT functions, hfft and ihfft, designed to work with signals that are real-valued in their frequency domains
  • Helper functions, like fftfreq, rfftfreq, fftshift, ifftshift, that make it easier to manipulate signals

We think these functions provide a straightforward interface for FFT functionality, as vetted by the NumPy community, although we are always interested in feedback and suggestions!

To better illustrate how easy it is to move from NumPy’s np.fft module to PyTorch’s torch.fft module, let’s look at a NumPy implementation of a simple low-pass filter that removes high-frequency variance from a 2-dimensional image, a form of noise reduction or blurring:

import numpy as np
import numpy.fft as fft

def lowpass_np(input, limit):
    pass1 = np.abs(fft.rfftfreq(input.shape[-1])) < limit
    pass2 = np.abs(fft.fftfreq(input.shape[-2])) < limit
    kernel = np.outer(pass2, pass1)
    
    fft_input = fft.rfft2(input)
    return fft.irfft2(fft_input * kernel, s=input.shape[-2:])

Now let’s see the same filter implemented in PyTorch:

import torch
import torch.fft as fft

def lowpass_torch(input, limit):
    pass1 = torch.abs(fft.rfftfreq(input.shape[-1])) < limit
    pass2 = torch.abs(fft.fftfreq(input.shape[-2])) < limit
    kernel = torch.outer(pass2, pass1)
    
    fft_input = fft.rfft2(input)
    return fft.irfft2(fft_input * kernel, s=input.shape[-2:])

Not only do current uses of NumPy’s np.fft module translate directly to torch.fft, the torch.fft operations also support tensors on accelerators, like GPUs and autograd. This makes it possible to (among other things) develop new neural network modules using the FFT.

Performance

The torch.fft module is not only easy to use — it is also fast! PyTorch natively supports Intel’s MKL-FFT library on Intel CPUs, and NVIDIA’s cuFFT library on CUDA devices, and we have carefully optimized how we use those libraries to maximize performance. While your own results will depend on your CPU and CUDA hardware, computing Fast Fourier Transforms on CUDA devices can be many times faster than computing it on the CPU, especially for larger signals.

In the future, we may add support for additional math libraries to support more hardware. See below for where you can request additional hardware support.

Updating from older PyTorch versions

Some PyTorch users might know that older versions of PyTorch also offered FFT functionality with the torch.fft() function. Unfortunately, this function had to be removed because its name conflicted with the new module’s name, and we think the new functionality is the best way to use the Fast Fourier Transform in PyTorch. In particular, torch.fft() was developed before PyTorch supported complex tensors, while the torch.fft module was designed to work with them.

PyTorch also has a “Short Time Fourier Transform”, torch.stft, and its inverse torch.istft. These functions are being kept but updated to support complex tensors.

Future

As mentioned, PyTorch 1.8 offers the torch.fft module, which makes it easy to use the Fast Fourier Transform (FFT) on accelerators and with support for autograd. We encourage you to try it out!

While this module has been modeled after NumPy’s np.fft module so far, we are not stopping there. We are eager to hear from you, our community, on what FFT-related functionality you need, and we encourage you to create posts on our forums at https://discuss.pytorch.org/, or file issues on our Github with your feedback and requests. Early adopters have already started asking about Discrete Cosine Transforms and support for more hardware platforms, for example, and we are investigating those features now.

We look forward to hearing from you and seeing what the community does with PyTorch’s new FFT functionality!

Read More

Prototype Features Now Available – APIs for Hardware Accelerated Mobile and ARM64 Builds

Today, we are announcing four PyTorch prototype features. The first three of these will enable Mobile machine-learning developers to execute models on the full set of hardware (HW) engines making up a system-on-chip (SOC). This gives developers options to optimize their model execution for unique performance, power, and system-level concurrency.

These features include enabling execution on the following on-device HW engines:

  • DSP and NPUs using the Android Neural Networks API (NNAPI), developed in collaboration with Google
  • GPU execution on Android via Vulkan
  • GPU execution on iOS via Metal

This release also includes developer efficiency benefits with newly introduced support for ARM64 builds for Linux.

Below, you’ll find brief descriptions of each feature with the links to get you started. These features are available through our nightly builds. Reach out to us on the PyTorch Forums for any comment or feedback. We would love to get your feedback on those and hear how you are using them!

NNAPI Support with Google Android

The Google Android and PyTorch teams collaborated to enable support for Android’s Neural Networks API (NNAPI) via PyTorch Mobile. Developers can now unlock high-performance execution on Android phones as their machine-learning models will be able to access additional hardware blocks on the phone’s system-on-chip. NNAPI allows Android apps to run computationally intensive neural networks on the most powerful and efficient parts of the chips that power mobile phones, including DSPs (Digital Signal Processors) and NPUs (specialized Neural Processing Units). The API was introduced in Android 8 (Oreo) and significantly expanded in Android 10 and 11 to support a richer set of AI models. With this integration, developers can now seamlessly access NNAPI directly from PyTorch Mobile. This initial release includes fully-functional support for a core set of features and operators, and Google and Facebook will be working to expand capabilities in the coming months.

Links

PyTorch Mobile GPU support

Inferencing on GPU can provide great performance on many models types, especially those utilizing high-precision floating-point math. Leveraging the GPU for ML model execution as those found in SOCs from Qualcomm, Mediatek, and Apple allows for CPU-offload, freeing up the Mobile CPU for non-ML use cases. This initial prototype level support provided for on device GPUs is via the Metal API specification for iOS, and the Vulkan API specification for Android. As this feature is in an early stage: performance is not optimized and model coverage is limited. We expect this to improve significantly over the course of 2021 and would like to hear from you which models and devices you would like to see performance improvements on.

Links

ARM64 Builds for Linux

We will now provide prototype level PyTorch builds for ARM64 devices on Linux. As we see more ARM usage in our community with platforms such as Raspberry Pis and Graviton(2) instances spanning both at the edge and on servers respectively. This feature is available through our nightly builds.

We value your feedback on these features and look forward to collaborating with you to continuously improve them further!

Thank you,

Team PyTorch

Read More

Announcing PyTorch Developer Day 2020

Announcing PyTorch Developer Day 2020

Starting this year, we plan to host two separate events for PyTorch: one for developers and users to discuss core technical development, ideas and roadmaps called “Developer Day”, and another for the PyTorch ecosystem and industry communities to showcase their work and discover opportunities to collaborate called “Ecosystem Day” (scheduled for early 2021).

The PyTorch Developer Day (#PTD2) is kicking off on November 12, 2020, 8AM PST with a full day of technical talks on a variety of topics, including updates to the core framework, new tools and libraries to support development across a variety of domains. You’ll also see talks covering the latest research around systems and tooling in ML.

For Developer Day, we have an online networking event limited to people composed of PyTorch maintainers and contributors, long-time stakeholders and experts in areas relevant to PyTorch’s future. Conversations from the networking event will strongly shape the future of PyTorch. Hence, invitations are required to attend the networking event.

All talks will be livestreamed and available to the public.

Visit https://pytorchdeveloperday.fbreg.com/ to learn more. We look forward to welcoming you to PyTorch Developer Day on November 12th!

Thank you,

The PyTorch team

Read More

Adding a Contributor License Agreement for PyTorch

Adding a Contributor License Agreement for PyTorch

To ensure the ongoing growth and success of the framework, we’re introducing the use of the Apache Contributor License Agreement (CLA) for PyTorch. We care deeply about the broad community of contributors who make PyTorch such a great framework, so we want to take a moment to explain why we are adding a CLA.

Why Does PyTorch Need a CLA?

CLAs help clarify that users and maintainers have the relevant rights to use and maintain code contributed to an open source project, while allowing contributors to retain ownership rights to their code.

PyTorch has grown from a small group of enthusiasts to a now global community with over 1,600 contributors from dozens of countries, each bringing their own diverse perspectives, values and approaches to collaboration. Looking forward, clarity about how this collaboration is happening is an important milestone for the framework as we continue to build a stronger, safer and more scalable community around PyTorch.

The text of the Apache CLA can be found here, together with an accompanying FAQ. The language in the PyTorch CLA is identical to the Apache template. Although CLAs have been the subject of significant discussion in the open source community, we are seeing that using a CLA, and particularly the Apache CLA, is now standard practice when projects and communities reach a certain scale. Popular projects that have adopted some type of CLA include: Visual Studio Code, Flutter, TensorFlow, kubernetes, Ubuntu, Django, Python, Go, Android and many others.

What is Not Changing

PyTorch’s BSD license is not changing. There is no impact to PyTorch users. CLAs will only be required for new contributions to the project. For past contributions, no action is necessary. Everything else stays the same, whether it’s IP ownership, workflows, contributor roles or anything else that you’ve come to expect from PyTorch.

How the New CLA will Work

Moving forward, all contributors to projects under the PyTorch GitHub organization will need to sign a CLA to merge their contributions.

If you’ve contributed to other Facebook Open Source projects, you may have already signed the CLA, and no action is required. If you have not signed the CLA, a GitHub check will prompt you to sign it before your pull requests can be merged. You can reach the CLA from this link.

If you’re contributing as an individual, meaning the code is not something you worked on as part of your job, you should sign the individual contributor agreement. This agreement associates your GitHub username with future contributions and only needs to be signed once.

If you’re contributing as part of your employment, you may need to sign the corporate contributor agreement. Check with your legal team on filling this out. Also you will include a list of github ids from your company.

As always, we continue to be humbled and grateful for all your support, and we look forward to scaling PyTorch together to even greater heights in the years to come.

Thank you!

Team PyTorch

Read More