Optimizing Production PyTorch Models’ Performance with Graph Transformations

1. Introduction

PyTorch supports two execution modes [1]: eager mode and graph mode. In eager mode, operators in a model are immediately executed as they are encountered. In contrast, in graph mode, operators are first synthesized into a graph, which will then be compiled and executed as a whole. Eager mode is easier to use, more suitable for ML researchers, and hence is the default mode of execution. On the other hand, graph mode typically delivers higher performance and hence is heavily used in production.

Specifically, graph mode enables operator fusion [2], wherein one operator is merged with another to reduce/localize memory reads as well as total kernel launch overhead. Fusion can be horizontal—taking a single operation (e.g., BatchNorm) that is independently applied to many operands and merging those operands into an array; and vertical—merging a kernel with another kernel that consumes the output of the first kernel (e.g., Convolution followed by ReLU).

Torch.FX [3, 4] (abbreviated as FX) is a publicly available toolkit as part of the PyTorch package that supports graph mode execution. In particular, it (1) captures the graph from a PyTorch program and (2) allows developers to write transformations on the captured graph. It is used inside Meta to optimize the training throughput of production models. By introducing a number of FX-based optimizations developed at Meta, we demonstrate the approach of using graph transformation to optimize PyTorch’s performance for production.

2. Background

Embedding tables are ubiquitous in recommendation systems. Section 3 will discuss three FX transformations that optimize accesses to embedding tables. In this section, we provide some background on FX (Section 2.1) and embedding tables (Section 2.2).

2.1 FX

Figure 1 is a simple example adapted from [3] which illustrates using FX to transform a PyTorch program. It contains three steps: (1) capturing the graph from a program, (2) modifying the graph (in this example, all uses of RELU are replaced by GELU), and (3) generating a new program from the modified graph.

Figure 1: A FX example which replaces all uses of RELU by GELU in a PyTorch module.

The FX API [4] provides many more functionalities for inspecting and transforming PyTorch program graphs.
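
To make these three steps concrete, here is a minimal, self-contained sketch in the spirit of Figure 1 (the toy module below is illustrative and not taken from the figure):

import torch
import torch.fx as fx

class MyModule(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1.0

# Step 1: capture the graph from the program
traced = fx.symbolic_trace(MyModule())

# Step 2: modify the graph -- replace every call to torch.relu with GELU
for node in traced.graph.nodes:
    if node.op == "call_function" and node.target == torch.relu:
        node.target = torch.nn.functional.gelu
traced.graph.lint()

# Step 3: generate a new program from the modified graph
traced.recompile()
print(traced.code)  # the regenerated forward() now calls gelu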

2.2 Embedding Tables

Figure 2: Illustration of an embedding table for a sparse feature with batch size = 1

In a recommendation system, sparse features (e.g., User ID, Story ID) are represented by embedding tables. An embedding table E is an HxD matrix, where H is the hash size, D is the embedding dimension. Each row of E is a vector of floats. Feature hashing [5] is used to map a sparse feature to a list of indices to E, say [S1,S2, …, Sk], where 0<=Si<H. Its output value is computed as f(E[S1], E[S2], …, E[Sk]), where E[Si] is the vector at row Si, and f is called the pooling function, which is typically one of the following functions: sum, average, maximum. See Figure 2 for an illustration.

To fully utilize the GPU, sparse features are usually processed in a batch. Each entity in a batch has its own list of indices. If a batch has B entities, a naive representation has B lists of indices. A more compact representation is to combine the B lists of indices into a single list of indices and add a list of the lengths of indices (one length for each entity in the batch). For example, if a batch has 3 entities whose lists of indices are as follows:

  • Entity 1: indices = [10, 20]
  • Entity 2: indices = [5, 9, 77, 81]
  • Entity 3: indices = [15, 20, 45]

Then the indices and lengths for the entire batch will be:

  • Indices = [10, 20, 5, 9, 77, 81, 15, 20, 45]
  • Lengths = [2, 4, 3]

And the output of the embedding table lookup for the whole batch is a BxD matrix.
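
As a rough illustration of this layout (not tied to any Meta-internal library), PyTorch's built-in EmbeddingBag performs this kind of pooled lookup; it expects per-entity offsets, which can be derived from the lengths:

import torch

H, D, B = 100, 4, 3                              # hash size, embedding dimension, batch size
table = torch.nn.EmbeddingBag(H, D, mode="sum")  # pooling function f = sum

indices = torch.tensor([10, 20, 5, 9, 77, 81, 15, 20, 45])
lengths = torch.tensor([2, 4, 3])

# EmbeddingBag takes the start offset of each entity rather than its length
offsets = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)[:-1]])

pooled = table(indices, offsets)                 # output shape: (B, D) = (3, 4)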

3. Three FX Transformations

We have developed three FX transformations that accelerate accesses to embedding tables. Section 3.1 discusses a transformation that combines multiple small input tensors into a single big tensor; Section 3.2 a transformation that fuses multiple, parallel compute chains into a single compute chain; and Section 3.3 a transformation that overlaps communication with computation.

3.1 Combining Input Sparse Features

Recall that an input sparse feature in a batch is represented by two lists: a list of indices and a list of B lengths, where B is the batch size. In PyTorch, these two lists are implemented as two tensors. When a PyTorch model is run on a GPU, embedding tables are commonly stored in the GPU memory (which is closer to the GPU and has much higher read/write bandwidth than the CPU memory). To use an input sparse feature, its two tensors need to be first copied from CPU to GPU. However, each host-to-device memory copy requires a kernel launch, which is relatively expensive compared to the actual data transfer time. If a model uses many input sparse features, this copying could become a performance bottleneck (e.g., 1000 input sparse features would require copying 2000 tensors from host to device).

An optimization that reduces the number of host-to-device memcpy is to combine multiple input sparse features before sending them to the device. For instance, given the following three input features:

  • Feature_A: indices = [106, 211, 7], lengths = [2, 1]
  • Feature_B: indices = [52, 498, 616, 870, 1013], lengths = [3, 2]
  • Feature_C: indices = [2011, 19, 351, 790], lengths = [1, 3]

The combined form is:

  • Features_A_B_C: indices = [106, 211, 7, 52, 498, 616, 870, 1013, 2011, 19, 351, 790], lengths = [2, 1, 3, 2, 1, 3]

So, instead of copying 3×2=6 tensors from host to device, we only need to copy 2 tensors.
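
Below is a minimal sketch of this idea using the three features above (the actual Permute_and_Split op is internal to our stack, so torch.split stands in for it here):

import torch

# Per-feature inputs on the host, as in the example above
feature_indices = {
    "Feature_A": torch.tensor([106, 211, 7]),
    "Feature_B": torch.tensor([52, 498, 616, 870, 1013]),
    "Feature_C": torch.tensor([2011, 19, 351, 790]),
}
feature_lengths = {
    "Feature_A": torch.tensor([2, 1]),
    "Feature_B": torch.tensor([3, 2]),
    "Feature_C": torch.tensor([1, 3]),
}

# CPU side: combine all indices into one tensor and all lengths into another
combined_indices = torch.cat(list(feature_indices.values()))
combined_lengths = torch.cat(list(feature_lengths.values()))

# Only two host-to-device copies are now needed instead of six, e.g.:
# combined_indices = combined_indices.cuda(non_blocking=True)
# combined_lengths = combined_lengths.cuda(non_blocking=True)

# GPU side: recover the per-feature tensors (the role of the inserted Permute_and_Split op)
indices_per_feature = [t.numel() for t in feature_indices.values()]   # [3, 5, 4]
lengths_per_feature = [t.numel() for t in feature_lengths.values()]   # [2, 2, 2]
split_indices = torch.split(combined_indices, indices_per_feature)
split_lengths = torch.split(combined_lengths, lengths_per_feature)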

Figure 3(b) describes an implementation of this optimization, which has two components:

  • On the CPU side: The input pipeline is modified to combine all the indices of sparse features into a single tensor and similarly all the lengths into another tensor. Then the two tensors are copied to the GPU.
  • On the GPU side: Using FX, we insert a Permute_and_Split op into the model graph to recover the indices and lengths tensors of individual features from the combined tensors, and route them to the corresponding nodes downstream.

(a). Without the optimization

(b). With the optimization

Figure 3: Combining input sparse features

3.2 Horizontal Fusion of Computation Chains Starting with Accesses to Embedding Tables

In a production model, it is fairly common to have 10s of embedding tables residing on each GPU. For performance reasons, lookups to these tables are grouped together so that their outputs are concatenated into a single big tensor (see the red part in Figure 4(a)). To apply computations to individual feature outputs, a Split op is used to divide the big tensor into N smaller tensors (where N is the number of features), and then the desired computations are applied to each tensor. This is shown in Figure 4(a), where the computation applied to each feature output O is Tanh(LayerNorm(O)). All the computation results are concatenated back into a big tensor, which is then passed to downstream ops (Op1 in Figure 4(a)).

The main runtime cost here is the GPU kernel launch overhead. For instance, the number of GPU kernel launches in Figure 4(a) is 2*N + 3 (each oval in the figure is a GPU kernel). This could become a performance issue because execution times of LayerNorm and Tanh on the GPU are short compared to their kernel launch times. In addition, the Split op may create an extra copy of the embedding output tensor, consuming additional GPU memory.

We use FX to implement an optimization called horizontal fusion which dramatically reduces the number of GPU kernel launches (in this example, the optimized number of GPU kernel launches is 5, see Figure 4(b)). Instead of doing an explicit Split, we use the Add_middle_dim op to reshape the 2D embedding tensor of shape (B, NxD) to a 3D tensor of shape (B, N, D). A single LayerNorm is then applied to the last dimension of this tensor, and a single Tanh is applied to the LayerNorm’s result. At the end, we use the Remove_middle_dim op to reshape the Tanh’s result back to a 2D tensor. In addition, since Add_middle_dim and Remove_middle_dim only reshape the tensor without creating an extra copy, the amount of GPU memory consumed could be reduced as well.

(a). Without the optimization

(b). With the optimization

Figure 4: Horizontal fusion
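
As a sketch of the fused computation described above (Add_middle_dim and Remove_middle_dim are internal ops; plain view/reshape calls emulate them here, since a view does not copy the tensor):

import torch

B, N, D = 32, 10, 64
embedding_output = torch.randn(B, N * D)   # concatenated outputs of N embedding lookups

layer_norm = torch.nn.LayerNorm(D)

# Add_middle_dim: reshape (B, N*D) -> (B, N, D) without copying
x = embedding_output.view(B, N, D)

# One LayerNorm over the last dimension and one Tanh replace N separate LayerNorm/Tanh chains
x = torch.tanh(layer_norm(x))

# Remove_middle_dim: reshape (B, N, D) back to (B, N*D)
fused_output = x.reshape(B, N * D)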

3.3 Overlapping Computation with Communication

Training of a production recommendation model is typically done on a distributed GPU system. Since the capacity of the device memory per GPU is not big enough to hold all the embedding tables in the model, they need to be distributed among the GPUs.

Within a training step, a GPU needs to read/write feature values from/to the embedding tables on the other GPUs. This is known as all-to-all communication [6] and can be a major performance bottleneck.

We use FX to implement a transformation that can overlap computation with all-to-all communication. Figure 5(a) shows the example of a model graph which has embedding table accesses (EmbeddingAllToAll) and other ops. Without any optimization, they are sequentially executed on a GPU stream, as shown in Figure 5(b). Using FX, we break EmbeddingAllToAll into EmbeddingAllToAll_Request and EmbeddingAllToAll_Wait, and schedule independent ops in between them.

(a) Model graph

(b) Original execution order

(c) Optimized execution order

Figure 5: Overlapping Computation with Communication
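
A rough sketch of the request/wait split using torch.distributed (the function names mirror the ops in Figure 5 but are illustrative; the real transformation rewrites the captured FX graph):

import torch
import torch.distributed as dist

def training_step(local_embeddings, output_buffer, independent_compute):
    # EmbeddingAllToAll_Request: launch the all-to-all asynchronously and keep the handle
    work = dist.all_to_all_single(output_buffer, local_embeddings, async_op=True)

    # Independent ops scheduled between Request and Wait overlap with the communication
    other_result = independent_compute()

    # EmbeddingAllToAll_Wait: block until the exchanged embeddings have arrived
    work.wait()
    return output_buffer, other_result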

3.4 Summary

Table 1 summarizes the optimizations discussed in this section and the corresponding performance bottlenecks addressed.

Optimization                                  Performance Bottleneck Addressed
Combining Input Sparse Features               Host-to-device memory copy
Horizontal fusion                             GPU kernel launch overhead
Overlapping Computation with Communication    Embedding all-to-all access time

Table 1: Summary of the optimizations and the performance bottlenecks addressed

We have also developed other FX transformations which are not discussed in this section due to space limitations.

To discover which models would benefit from these transformations, we analyzed the performance data collected by MAIProf [7] from the models that run at Meta’s data centers. Altogether, these transformations provide up to 2-3x speedups over eager mode on a set of production models.

4. Concluding Remarks

The graph mode in PyTorch is preferred over the eager mode for production use for performance reasons. FX is a powerful tool for capturing and optimizing the graph of a PyTorch program. We demonstrate three FX transformations that are used to optimize production recommendation models inside Meta. We hope that this blog can motivate other PyTorch model developers to use graph transformations to boost their models’ performance.

References

[1] End-to-end Machine Learning Framework

[2] DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion

[3] Torch.FX: Practical Program Capture and Transformation for Deep Learning In Python, MLSys 2022.

[4] Torch.fx—PyTorch 1.12 documentation

[5] Feature Hashing for Large Scale Multitask Learning

[6] NVIDIA Collective Communication Library Documentation

[7] Performance Debugging of Production PyTorch Models at Meta

Read More

Introducing TorchMultimodal – a library for accelerating exploration in Multimodal AI

We are announcing TorchMultimodal Beta, a PyTorch domain library for training SoTA multi-task multimodal models at scale. The library provides composable building blocks (modules, transforms, loss functions) to accelerate model development, SoTA model architectures (FLAVA, MDETR, Omnivore) from published research, training and evaluation scripts, as well as notebooks for exploring these models. The library is under active development, and we’d love to hear your feedback! You can find more details on how to get started here.

Why TorchMultimodal?

Interest is rising around AI models that understand multiple input types (text, images, videos and audio signals), and optionally use this understanding to generate different forms of outputs (sentences, pictures, videos). Recent work from FAIR such as FLAVA, Omnivore and data2vec have shown that multimodal models for understanding are competitive with unimodal counterparts, and in some cases are establishing the new state-of-the art. Generative models such as Make-a-video and Make-a-scene are redefining what modern AI systems can do.

As interest in multimodal AI has grown, researchers are looking for tools and libraries to quickly experiment with ideas, and build on top of the latest research in the field. While the PyTorch ecosystem has a rich repository of libraries and frameworks, it’s not always obvious how components from these interoperate with each other, or how they can be stitched together to build SoTA multimodal models.

TorchMultimodal solves this problem by providing:

  • Composable and easy-to-use building blocks which researchers can use to accelerate model development and experimentation in their own workflows. These are designed to be modular, and can be easily extended to handle new modalities.

  • End-to-end examples for training and evaluating the latest models from research. These should serve as starting points for ongoing/future research, as well as examples for using advanced features such as integrating with FSDP and activation checkpointing for scaling up model and batch sizes.

Introducing TorchMultimodal

TorchMultimodal is a PyTorch domain library for training multi-task multimodal models at scale. In the repository, we provide:

  • Building Blocks. A collection of modular and composable building blocks like models, fusion layers, loss functions, datasets and utilities. Some examples include:

    • Contrastive Loss with Temperature. Commonly used function for training models like CLIP and FLAVA. We also include variants such as ImageTextContrastiveLoss used in models like ALBEF.

    • Codebook layers, which compress high-dimensional data by nearest-neighbor lookup in an embedding space and are a vital component of VQVAEs (provided as a model in the repository).

    • Shifted-window Attention. Window-based multi-head self-attention, which is a vital component of encoders like Swin 3D Transformers.

    • Components for CLIP. A popular model published by OpenAI which has proven to be extremely effective at learning text and image representations.

    • Multimodal GPT. An abstraction that extends OpenAI’s GPT architecture for multimodal generation when combined with the generation utility.

    • MultiHeadAttention. A critical component for attention-based models with support for fast auto-regressive decoding.

  • Examples. A collection of examples that show how to combine these building blocks with components and common infrastructure (Lightning, TorchMetrics) from across the PyTorch Ecosystem to replicate state-of-the-art models published in literature. We currently provide five examples, which include:

    • FLAVA [paper]. Official code for the paper accepted at CVPR, including a tutorial on finetuning FLAVA.

    • MDETR [paper]. Collaboration with authors from NYU to provide an example which alleviates interoperability pain points in the PyTorch ecosystem, including a notebook on using MDETR for phrase grounding and visual question answering.

    • Omnivore [paper]. First example in TorchMultimodal of a model which deals with Video and 3D data, including a notebook for exploring the model.

    • MUGEN [paper]. Foundational work for auto-regressive generation and retrieval, including demos for text-video generation and retrieval with a large-scale synthetic dataset enriched from OpenAI coinrun.

    • ALBEF [paper]. Code for the model, including a notebook for using this model for Visual Question Answering.

The following code snippet showcases an example usage of several TorchMultimodal components related to CLIP:


# CLIPTransform, clip_vit_l14 and ContrastiveLossWithTemperature come from
# TorchMultimodal; see the repository for the exact import paths.
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CocoCaptions

# instantiate clip transform
clip_transform = CLIPTransform()

# pass the transform to your dataset. Here we use coco captions
dataset = CocoCaptions(root=..., annFile=..., transforms=clip_transform)
dataloader = DataLoader(dataset, batch_size=16)

# instantiate model. Here we use clip with vit-L as the image encoder
model = clip_vit_l14()

# define loss and other things needed for training
clip_loss = ContrastiveLossWithTemperature()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
epochs = 1

# write your train loop
for _ in range(epochs):
    for batch_idx, batch in enumerate(dataloader):
        optimizer.zero_grad()
        image, text = batch
        image_embeddings, text_embeddings = model(image, text)
        loss = clip_loss(image_embeddings, text_embeddings)
        loss.backward()
        optimizer.step()

Apart from the code, we are also releasing a tutorial for fine-tuning multimodal foundation models, and a blog post (with code pointers) on how to scale up such models using techniques from PyTorch Distributed (FSDP and activation checkpointing). We hope such examples and tutorials will serve to demystify a number of advanced features available in the PyTorch ecosystem.

What’s Next?

While this is an exciting launch, there’s a lot more to come. The library is under development and we are working on adding some of the exciting developments in the space of diffusion models, and examples to showcase common trends from research. As you explore and use the library, we’d love to hear any feedback you might have! You can find more details on how to get started here.

Team

The primary contributors and developers of TorchMultimodal include Ankita De, Evan Smothers, Kartikay Khandelwal, Lan Gong, Laurence Rouesnel, Nahiyan Malik, Rafi Ayub and Yosua Michael Maranatha.

Read More

PyTorch Enterprise Support Program Update

On May 25, 2021, we announced the PyTorch Enterprise Support Program (ESP) that enabled providers to develop and offer tailored enterprise-grade support to their customers.

The program enabled Program-certified service providers to develop and offer tailored enterprise-grade support to their customers through the contribution of hotfixes and other improvements requested by PyTorch enterprise users who were developing models in production at scale for mission-critical applications. However, as we evaluated community feedback, we found that ongoing ESP support was not necessary at this time, and we will immediately divert these resources to other areas to improve the user experience for the entire community.

Today, we are removing the PyTorch long-term support (LTS 1.8.2) download link from the “Start Locally” option on the “Get Started” page in order to simplify the user experience. Previous versions of PyTorch, from the first public release up to the latest one, remain available for download. Please note that LTS 1.8.2 is only supported for Python while it is being deprecated. If there are any updates to ESP/LTS, we will cover them in future blogs.

Please reach out to marketing@pytorch.org with any questions.

Read More

Efficient Large-Scale Training with PyTorch FSDP and AWS

Cutting-edge AI models are becoming extremely large. The cost and overhead of training these models is increasing rapidly, and involves large amounts of engineering and guesswork to find the right training regime. FSDP reduces these costs significantly by enabling you to train much larger models with the same amount of resources. FSDP lowers the memory footprint on your GPUs, and is usable via a lightweight configuration that requires substantially less effort, typically with just a few lines of code.

The main performance gains in FSDP come from maximizing the overlap between network communication and model computation, and eliminating the memory redundancy inherent in traditional data parallel training (DDP). PyTorch FSDP can train models approximately 4x larger on the same server resources as DDP and 20x larger if we combine activation checkpointing and activation offloading.

Since PyTorch 1.12, FSDP is now in beta status, and has added a number of new features that can be tuned to further accelerate your model training.

In this series of blog posts, we will explain multiple performance optimizations you can run with FSDP to boost your distributed training speed and model sizes within the context of your available server resources. We use the HuggingFace T5 3B, 11B and DeepVit, in fine-tuning mode, as the running examples throughout the series.

As a preview of some of the optimizations discussed in this series, we show the before and after performance scaled in Flops below (Note that these results can vary based on your server resources and model architecture).


*T5 3B performance measured on AWS A100 and A10 servers: Original (no optimizations) vs. Tuned (with the applied optimizations)


*T5 11B performance measured on A100 servers: Original (no optimizations) vs. Tuned (with the applied optimizations)

In this first post, we will provide a quick overview of FSDP and how it can make training large- scale AI models more efficient. We will highlight briefly the multiple performance options available, and dive deeper into the details on these in upcoming posts. We will then conclude with an overview on how to leverage AWS parallel cluster for large- scale training with FSDP.

Optimization                     T5 Model    Throughput Improvement
Mixed Precision                  3B          5x
                                 11B         10x
Activation Checkpointing (AC)    3B          10x
                                 11B         100x
Transformer Wrapping Policy      3B          2x
                                 11B         Unable to run the experiment without the transformer wrapping policy
Full Shard Strategy              3B          1.5x
                                 11B         Not able to run with Zero2

Performance optimization gains on T5 models over non-optimized.

In our experiments with the T5 3B model, using the transformer wrapping policy resulted in >2x higher throughput measured in TFLOPS versus the default wrapping policy. Activation checkpointing resulted in a 10x improvement by reinvesting the freed memory from the checkpoints into a larger batch size. Mixed precision with BFloat16 resulted in a ~5x improvement versus FP32, and finally the full sharding strategy versus Zero2 (DDP) resulted in a 1.5x improvement.

We ran similar experiments for a larger model, T5 11B, but the larger model size resulted in some changes to the experiment space. Specifically, we found that two optimizations, transformer wrapping policy and activation checkpointing, were needed to enable us to run these experiments on 3 nodes (each node had 8 A100 gpus with 80 GB of memory). With these optimizations, we could fit a batch size of 50 and get higher throughput compared to removing each one of them. Thus rather than running on/off solely for a single optimization test as with the 3B model, the larger model experiments were done with 1 of 3 optimizations turned on/off while always running the other two in order to allow a usable batch size for both test states for each item.

Based on TFLOP comparisons, with the 11B model, we saw even more payoff from the optimizations. Mixed precision(~10x improvement) and activation checkpointing (~100x improvement) had a much larger impact with the 11B model compared to the 3B parameter model. With mixed precision we could fit ~2x larger batch sizes and with activation checkpointing >15x batch sizes (from 3 with no activation checkpointing to 50 with activation checkpointing) which translated into large throughput improvements.

We also have observed that for these larger models > 3B, using Zero2 sharding strategy would result in minimal room left in memory for the batch data, and had to go with very small batch sizes (e.g 1-2) that essentially makes full sharding strategy a necessity to enable fitting larger batches sizes.

Note – this tutorial assumes a basic understanding of FSDP. To learn more about basics of FSDP please refer to the getting started and advanced FSDP tutorials.

What is FSDP? How Does it Make Large-Scale Training More Efficient?

FSDP expands upon distributed data parallel by parallelizing not just the data, but also the model parameters, optimizer states, and gradients associated with the model. Specifically, each GPU only stores a subset of the entire model and the associated subset of optimizer states and gradients.

To show the evolution of distributed training, we can start from the beginning, where AI models were simply trained on a single GPU.

DDP (Distributed Data Parallel) was the initial step up from training with only a single GPU, and was an effort to address the data and model size growth, where multiple GPUs each housed their own copy of the same model. The gain here is that the data for each batch could be split and processed independently on each GPU, all at the same time, thus parallelizing the processing of the data set and increasing training speed as the number of GPUs grows. The tradeoff is the need to communicate the gradients between the GPUs to synchronize the models after the backward pass.

FSDP expands on scaling models by removing the redundancy of optimizer calculations and state storage, as well as the redundant storage of gradients and model parameters, that is present in DDP. This redundancy reduction, along with increased communication overlap, where model parameter communication takes place at the same time as model computation, is what allows FSDP to train much larger models with the same resources as DDP.

A key point is that this efficiency also allows for AI models that are larger than a single GPU to be trained. The model size available for training is now increased to the aggregate memory of all GPUs, rather than the size of a single GPU. (And as a point of note, FSDP can go beyond aggregated GPU memory by leveraging CPU memory as well, though we will not directly cover this aspect here).

As discussed in a previous blog post, with DDP the largest model that we could train on 32 A100 GPUs with 40 GB memory (4 nodes) was up to 3B parameters with a batch size of 128, with the help of activation checkpointing. By contrast, using FSDP we were able to train a model of up to 81B parameters by combining activation checkpointing with activation and parameter offloading. In another experiment, we benchmarked a 1T parameter model with FSDP using 512 GPUs.


For intuition on the parameter level workings of FSDP, below we show an animation detailing how the model parameters are sharded and communicated assuming a two GPU scenario and a simple 8 parameter model:


Above – the animations walk through the steps involved with the initial sharding of the model amongst ranks, and we start the all_gathers and forward pass.


We continue through the model with the forward pass. After each FSDP unit completes, non-locally owned params are dropped to free memory, and optionally activations can be checkpointed. This continues until we finish the forward pass and compute the loss.


During the backward pass, another all_gather is used to load the parameters and the gradients are computed. These gradients are then reduce_scattered so that the local owners of each param can aggregate and prepare to update the weights.


Finally, each rank passes the summed gradients through the optimizer states and updates the weights to complete the mini-batch.

With the model now distributed across the entire set of available GPUs, the logical question is how data moves through the model given this sharding of model parameters.

This is accomplished by FSDP coordinating with all GPUs to effectively share (communicate) the respective parts of the model. The model is decomposed into FSDP units and parameters within each unit are flattened and then sharded across all GPUs. Within each FSDP unit, GPU’s are assigned interleaving ownership of individual model parameters.

By interleaving, we mean the following – assuming 2 gpus with an id of 1 and 2, the FSDP unit ownership pattern would be [12121212], rather than a contiguous chunk of [111222].

During training, an all_gather is initiated and the locally owned model parameters within a FSDP unit are shared by the owner GPU with the other non-owners, when they need it, on a ‘just in time’ type basis. FSDP prefetches parameters to overlap all_gather communication with computation.

When those requested parameters arrive, the GPU uses the delivered parameters, in combination with the parameters it already owns, to create a fully populated FSDP unit. Thus there is a moment where each GPU hits peak memory usage while holding a fully populated FSDP unit.

It then processes the data through the FSDP unit, and drops the parameters it received from other GPU’s to free up memory for the next unit…the process continues over and over proceeding through the entire model to complete the forward pass. The process is then repeated (in general) for the backward pass.
(Note: this is a simplified description for understanding; there is additional complexity, but it should help construct a basic mental model of the FSDP process.)

This eliminates much of the memory redundancy present in DDP, but imposes the cost of higher amounts of network communication to shuttle these requested parameters back and forth amongst all the GPUs. Overlapping the communication timing with the computation taking place is the basis of many of the performance improvements we’ll discuss in this series. The key gains are frequently based on the fact that communication can often take place at the same time as computation. As you can surmise, having high communication speed is vital for FSDP performance.

How do I optimize my training with FSDP?

There are four main performance improvements we will cover – the transformer wrapper, activation checkpointing, mixed precision, and selecting the proper sharding strategy. The flowchart below will help as a checklist for tuning options that we will discuss in this post.

[Flowchart: checklist of the FSDP tuning options discussed in this post]

Wrapping policy – for transformers, use Transformer wrapping policy

The first performance optimization is leveraging the FSDP transformer wrapper for transformer models.

One of the pre-defined wrapping policies is size_based_autowrap_policy. With size_based_autowrap_policy, FSDP traverses the module structure from bottom to top and creates a new FSDP unit once the current unit has at least the ‘min_num_params’ specified in the size policy (this defaults to 1e8, or 100M). If a module cannot be made into an FSDP unit, FSDP will continue to check its parent module. This size-based wrapping policy may not be ideal for some model structures; the PyTorch distributed team is actively working on a new default wrapping policy for the next release, based on size as well as module execution order, so users can simply tune the size and achieve optimized performance.
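
For reference, a minimal sketch of using the size-based policy described above (assuming the torch.distributed.fsdp.wrap import path):

import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Start a new FSDP unit once roughly 100M parameters have accumulated
size_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=int(1e8))

model = FSDP(
    model,
    auto_wrap_policy=size_policy,
    device_id=torch.cuda.current_device(),
)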

In the current release, you can greatly improve your performance when running Transformer models by using the ‘transformer wrapper’. You will need to provide the appropriate layer class for your model. Here, layer class is the class that houses the Multi-Head Attention and Feed Forward Network.

FSDP will then form the FSDP units around the layer class rather than arbitrary breaks based on parameter size. By sharding the model around layer classes that are uniformly repeated within the transformer, FSDP can create uniform FSDP units that better balance the overlap of computation and communication. By contrast, size based wrapping can produce very uneven or skewed shards for models, which then have uneven matching of compute vs communication overlap. As discussed earlier, the main driver of FSDP high performance is the overlap of communication and computation, and hence why the Transformer wrapper provides improved performance. Note that the Transformer wrapper can also be used for non-transformer models if these models have a list of uniform layers.

Let’s compare the performance difference on a T5, 3B parameter model when running under the default wrapper and the transformer wrapper.

For default wrapping, we don’t need to take any action – we simply pass the model to FSDP as shown:

model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
)

In this case FSDP will simply wrap the whole model in a single FSDP unit.

Running on an NVIDIA A100-SXM4–40GB with 8 GPUs, we are able to reach 2.3 TFlops and 95% GPU memory utilization with a batch size of 14.

However, since T5 is a transformer model, we are better served to leverage the transformer wrapper for this model.

To use that, we need to isolate the layer class for the transformer, and then pass it in to create our transformer wrapper.

from transformers.models.t5.modeling_t5 import T5Block

And now we can create our Transformer wrapper:

import functools
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

transformer_auto_wrapper_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={
        T5Block,  # <--- your Transformer layer class
    },
)

With our model-aware wrapper ready, we can initialize FSDP:

# invoke FSDP with your transformer wrapper policy:
model = FSDP(
    model,
    auto_wrap_policy=transformer_auto_wrapper_policy,
    device_id=torch.cuda.current_device(),  # streaming init
)

Running this wrapped model, we can see some substantial performance gains.
We can fit nearly double the batch size, going to 28, and with better memory and communication efficiency, we see a TFlops increase to 5.07 from 2.3.

Thus, we’ve more than doubled our training throughput (2.19x) by providing greater model information to FSDP! The transformer wrapping policy results in more fine-grained and balanced FSDP units, each holding a layer class, which leads to a more effective communication-computation overlap.


Above: Graphical comparison of TFlops based on wrapper type

If you are training a Transformer model, it pays to configure your training with FSDP using the transformer wrapper. For more information on how to isolate your layer class, please see our in depth video on Transformer wrapping here, where we walk through a number of transformers showing where the layer class can be found.

Mixed precision – use BF16 if you have an Ampere architecture GPU

FSDP supports a flexible mixed precision policy that gives you granular control over parameters, gradients and buffer data types. This lets you easily leverage BFloat16 or FP16 to increase your training speed by up to 70%.

*Note that BFloat16 is only available on Ampere-type GPUs. On AWS this is available with p4d and g5 instances.

By way of comparison, we can show a 77% speed improvement when comparing fully tuned BFloat16 vs FP32 on an 8B DeepVit model.


We have obtained even greater acceleration using BFloat16 in fine-tuning a 3B HuggingFace T5 model as shown in the figures below. We observed that because of the lower precision the validation loss of BFloat16 is slightly behind in the first few epochs, but it is able to catch up and results in the same final accuracy as FP32.


To use mixed precision, we create a policy with our desired data types and pass it in during FSDP initialization.
To create our policy, we import the MixedPrecision class and then define the policy with our chosen dtypes:

from torch.distributed.fsdp import MixedPrecision

bfSixteen = MixedPrecision(
    # Parameter precision.
    param_dtype=torch.bfloat16,
    # Gradient communication precision.
    reduce_dtype=torch.bfloat16,
    # Buffer precision.
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    model,
    auto_wrap_policy=transformer_auto_wrapper_policy,
    mixed_precision=bfSixteen,
)

You can mix and match the precision for parameters, gradients and buffers as you prefer:

comboPolicy = MixedPrecision(
    # Param precision
    param_dtype=torch.bfloat16,
    # Gradient communication precision.
    reduce_dtype=torch.float32,
    # Buffer precision.
    buffer_dtype=torch.float32,
)

For training with FP16, you will need to also use the ShardedGradScaler, which we will cover in subsequent posts. For BFloat16, it is a drop-in replacement.
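
As a preview, here is a minimal sketch of the FP16 case with ShardedGradScaler (assuming the torch.distributed.fsdp.sharded_grad_scaler import path; dataloader and compute_loss are placeholders for your own training loop):

from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler

scaler = ShardedGradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)   # your forward pass and loss
    scaler.scale(loss).backward()       # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)              # unscale gradients and step the optimizer
    scaler.update()                     # adjust the scale factor for the next iteration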

AnyPrecision Optimizer – going beyond mixed precision with full BF16 training

Mixed precision training, both in FSDP and elsewhere, maintains the working weights in the reduced datatype (BF16 or FP16) while keeping the master weights in full FP32. The reason for the master weights in FP32 is that running in pure BF16 will result in ‘weight stagnation’, where very small weight updates are lost due to the lower precision, and the accuracy flatlines over time while FP32 weights can continue to improve from these small updates.

In order to resolve this dilemma, we can use the new AnyPrecision optimizer available in TorchDistX (Torch Distributed Experimental) that allows you to successfully train and keep the master weights in pure BF16 instead of FP32. In addition, unlike the typical storage of optimizer states in FP32, AnyPrecision is able to maintain states in pure BF16 as well.

AnyPrecision enables pure BF16 training by maintaining an extra buffer that tracks the precision lost during the weight updates and re-applies that during the next update…effectively resolving the weight stagnation issue without requiring FP32.

As a comparison of the throughput gains available with pure BF16 training using AnyPrecision, we ran experiments using FSDP with the T5 11B model with regular FP32 training, Mixed Precision training with BF16, and pure BF16 training using the AnyPrecision optimizer on 3 nodes with A100 gpus as mentioned previously.

[Chart: T5 11B throughput for FP32, BF16 mixed precision, and pure BF16 (AnyPrecision) training]
As shown above, training with AnyPrecision and pure BF16 resulted in 2x the throughput vs Mixed Precision, and over 20x improvement vs FP32.

The potential tradeoff is the impact on final accuracy – in the cases we tested, the accuracy was equal or better than FP32 due to a regularization effect from the slightly reduced precision, but your results may vary.

The AnyPrecision optimizer is available for you to test with here, and is a drop-in replacement for the AdamW optimizer.
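
A hedged sketch of swapping it in (this assumes TorchDistX exposes AnyPrecisionAdamW under torchdistx.optimizers; the keyword names below are assumptions and may differ across versions):

import torch
from torchdistx.optimizers import AnyPrecisionAdamW  # assumed import path

optimizer = AnyPrecisionAdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.01,
    # keep optimizer states in BF16 (assumed keyword names)
    momentum_dtype=torch.bfloat16,
    variance_dtype=torch.bfloat16,
    use_kahan_summation=True,  # compensation buffer that recovers the lost precision
)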

Activation checkpointing – increasing throughput by trading compute for memory

[Graph: throughput improvement from activation checkpointing]

FSDP supports activation checkpointing once the model has been sharded, and makes it easy to implement. The graph above shows ~4x throughput improvement using activation checkpointing.

Activation checkpointing is where the intermediate activations are freed during the forward pass, and a checkpoint is left as a placeholder. This generally increases available GPU memory by over 30%.

The tradeoff is that during the backward pass, these previously removed intermediate activations must be re-calculated again using information in the checkpoint (duplicate compute), but by leveraging the increased GPU memory, one can increase the batch size such that the net throughput can increase substantially.

# verify we have FSDP activation support ready by importing:
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
   checkpoint_wrapper,
   CheckpointImpl,
   apply_activation_checkpointing_wrapper,
)

To implement activation checkpointing, we first import the FSDP checkpointing functions, as shown above. We then declare our checkpoint wrapper type, which is non-reentrant, and create a check function to identify which layers to wrap, as follows:

from functools import partial

non_reentrant_wrapper = partial(
    checkpoint_wrapper,
    offload_to_cpu=False,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)

check_fn = lambda submodule: isinstance(submodule, T5Block)

apply_activation_checkpointing_wrapper(
    model, checkpoint_wrapper_fn=non_reentrant_wrapper, check_fn=check_fn
)

Important note – this must be run after the model has been initialized with FSDP.

Hopefully you’ve seen how some initial tuning with FSDP options can have a large impact on your training performance.

With that, we turn our attention from how to scale within FSDP, to how to scale your server hardware for FSDP using AWS.

Large-Scale Training with FSDP on AWS – for multi-node, prioritize a high-speed network

AWS provides several services that can be used to run distributed training with FSDP:
Amazon EC2 Accelerated Computing instances, AWS ParallelCluster, and Amazon Sagemaker.

In this series of blog posts, we used Amazon EC2 p4d instances in a single-instance multi-GPU configuration and in a multi-instance configuration using AWS ParallelCluster and SageMaker in order to run our training jobs.

Here, we’ll focus specifically on AWS ParallelCluster and provide an overview of how to utilize it for training purposes.

AWS ParallelCluster Setup

AWS ParallelCluster is an open source cluster management tool that makes it easy for you to deploy and manage High Performance Computing (HPC) clusters on AWS. AWS ParallelCluster uses yaml configuration files to provision all the necessary resources. It also supports multiple instance types, job submission queues, shared file systems like Amazon EFS (NFS) or Amazon FSx for Lustre, and job schedulers like AWS Batch and Slurm.


Workflow on Clusters

The high level idea is to have a cluster that has a head node which controls the compute nodes. The actual training job runs on the compute nodes. Overall steps to run a training job on a cluster are as follows:

  1. Set up an AWS ParallelCluster (discussed below)
  2. Connect to the head node, pull the training code, and set up the environment.
  3. Pull the data and place it in a shared folder that compute nodes can access (FSx Lustre drive).
  4. Run the training job using a job scheduler (in this case Slurm).

Setting up AWS ParallelCluster

To set up AWS ParallelCluster:

  1. Deploy a network stack. This step is optional since you could use your account default VPC and let AWS ParallelCluster create your subnets and security groups. However, we prefer to compartmentalize our desired network infrastructure and do this deployment via a CloudFormation stack.

    Since we deploy a public and a private subnet, we want to create them into an Availability Zone that contains our target instances, in this case p4d. We consult their availability in the region we use (us-east-1) through the following AWS CLI command:

    aws ec2 describe-instance-type-offerings --location-type availability-zone --filters Name=instance-type,Values=p4d.24xlarge --region us-east-1 --output table

    We see three availability zones containing p4d instances; we pick one of them (us-east-1c, yours may be different) when deploying our network stack. This can be done with the AWS Console or the AWS CLI. In our case we use the latter, as follows:

    aws cloudformation create-stack --stack-name VPC-Large-Scale --capabilities CAPABILITY_IAM --template-body file://VPC-Large-Scale.yaml --parameters ParameterKey=SubnetsAZ,ParameterValue=us-east-1c

    CloudFormation will deploy our new VPC, subnets, security groups and endpoints on our behalf. Once done, you can retrieve the IDs of the public and private subnets by querying the stack outputs and the values PublicSubnet and PrivateSubnet.

    For example, using the AWS CLI for the private subnet:

    aws cloudformation describe-stacks --stack-name VPC-Large-Scale --query "Stacks[0].Outputs[?OutputKey=='PrivateSubnet'].OutputValue" --output text

  2. Create the ParallelCluster. The cluster configuration file specifies the resources for our cluster. These resources include the instance type for the head node, compute nodes, access to S3 buckets, and the shared storage where our data will be located. We will use Amazon FSx for Lustre, which offers a fully managed shared storage service with Lustre.

    Here is an example of a cluster configuration file. We can use the AWS ParallelCluster CLI to create the cluster. Please note that the private and public subnet IDs will need to be replaced by the ones you retrieved earlier. You will be able to control the cluster using the AWS ParallelCluster CLI to start, stop, pause, etc.

    pcluster create-cluster --cluster-name my-hpc-cluster --cluster-configuration cluster.yaml
    
  3. SSH to the head node – once the cluster is ready, we can connect to the head node using the SSH protocol, pull our training code, and place the data in the shared storage specified in the cluster configuration file.

    pcluster ssh --cluster-name cluster -i your-key_pair
    
  4. Launch the training job – now that we have the data and training code, we can launch the Slurm job for training. Here is an example of a Slurm script to launch the job using torchrun.

More details on how to set up the cluster are out of the scope of this post; we will have a separate post on it.

What’s next?

With this post we provided a high-level overview of FSDP and how it efficiently scales distributed AI training. The flowchart included will help provide a checklist of the tuning options discussed, such as the transformer wrapper and activation checkpointing.
In the next posts, we will continue with the T5 model and go deeper into each of the topics above, specifically with sharding strategy and other optimizations to provide more insight and details. For now, a good reference for the sharding strategy is in our video tutorial here:

If you have questions or find an issue, please find the authors Less, Hamid and Geeta or open an issue on PyTorch github.

Special thanks to:

PyTorch Distributed team, Shen Li, Rohan Varma, Yanli Zhao, Andrew Gu, Anjali Sridhar, Ana Simoes, Pierre-Yves Aquilanti, Sundar Ranganathan, and the broader AWS team for supporting us with providing infrastructure and technical support for running the large scale experiments.

Resources:

FSDP video series

Getting started with FSDP

Advanced tutorial on FSDP

API documentation

Read More

Extending TorchVision’s Transforms to Object Detection, Segmentation & Video tasks

TorchVision is extending its Transforms API! Here is what’s new:

  • You can use them not only for Image Classification but also for Object Detection, Instance & Semantic Segmentation and Video Classification.
  • You can import directly from TorchVision several SoTA data-augmentations such as MixUp, CutMix, Large Scale Jitter and SimpleCopyPaste.
  • You can use new functional transforms for transforming Videos, Bounding Boxes and Segmentation Masks.

The interface remains the same to assist the migration and adoption. The new API is currently in Prototype and we would love to get early feedback from you to improve its functionality. Please reach out to us if you have any questions or suggestions.

Limitations of current Transforms

The stable Transforms API of TorchVision (aka V1) only supports single images. As a result it can only be used for classification tasks:

from torchvision import transforms
trans = transforms.Compose([
   transforms.ColorJitter(contrast=0.5),
   transforms.RandomRotation(30),
   transforms.CenterCrop(480),
])
imgs = trans(imgs)

The above approach doesn’t support Object Detection, Segmentation, or Classification transforms that require the use of Labels (such as MixUp & CutMix). This limitation made any non-classification Computer Vision tasks second-class citizens, as one couldn’t use the Transforms API to perform the necessary augmentations. Historically this made it difficult to train high-accuracy models using TorchVision’s primitives, and thus our Model Zoo lagged several points behind SoTA.

To circumvent this limitation, TorchVision offered custom implementations in its reference scripts that show-cased how one could perform augmentations in each task. Though this practice enabled us to train high accuracy classification, object detection & segmentation models, it was a hacky approach which made those transforms impossible to import from the TorchVision binary.

The new Transforms API

The Transforms V2 API supports videos, bounding boxes, labels and segmentation masks meaning that it offers native support for many Computer Vision tasks. The new solution is a drop-in replacement:

from torchvision.prototype import transforms
# Exactly the same interface as V1:
trans = transforms.Compose([
    transforms.ColorJitter(contrast=0.5),
    transforms.RandomRotation(30),
    transforms.CenterCrop(480),
])
imgs, bboxes, labels = trans(imgs, bboxes, labels)

The new Transform Classes can receive any arbitrary number of inputs without enforcing specific order or structure:

# Already supported:
trans(imgs)  # Image Classification
trans(videos)  # Video Tasks
trans(imgs_or_videos, labels)  # MixUp/CutMix-style Transforms
trans(imgs, bboxes, labels)  # Object Detection
trans(imgs, bboxes, masks, labels)  # Instance Segmentation
trans(imgs, masks)  # Semantic Segmentation
trans({"image": imgs, "box": bboxes, "tag": labels})  # Arbitrary Structure
# Future support:
trans(imgs, bboxes, labels, keypoints)  # Keypoint Detection
trans(stereo_images, disparities, masks)  # Depth Perception
trans(image1, image2, optical_flows, masks)  # Optical Flow

The Transform Classes make sure that they apply the same random transforms to all the inputs to ensure consistent results.

The functional API has been updated to support all necessary signal processing kernels (resizing, cropping, affine transforms, padding etc) for all inputs:

from torchvision.prototype.transforms import functional as F
# High-level dispatcher, accepts any supported input type, fully BC
F.resize(inpt, resize=[224, 224])
# Image tensor kernel
F.resize_image_tensor(img_tensor, resize=[224, 224], antialias=True)
# PIL image kernel
F.resize_image_pil(img_pil, resize=[224, 224], interpolation=BILINEAR)
# Video kernel
F.resize_video(video, resize=[224, 224], antialias=True)
# Mask kernel
F.resize_mask(mask, resize=[224, 224])
# Bounding box kernel
F.resize_bounding_box(bbox, resize=[224, 224], spatial_size=[256, 256])

The API uses Tensor subclassing to wrap input, attach useful meta-data and dispatch to the right kernel. Once the Datasets V2 work is complete, which makes use of TorchData’s Data Pipes, the manual wrapping of input won’t be necessary. For now, users can manually wrap the input by:

from torchvision.prototype import features
imgs = features.Image(images, color_space=ColorSpace.RGB)
vids = features.Video(videos, color_space=ColorSpace.RGB)
masks = features.Mask(target["masks"])
bboxes = features.BoundingBox(target["boxes"], format=BoundingBoxFormat.XYXY, spatial_size=imgs.spatial_size)
labels = features.Label(target["labels"], categories=["dog", "cat"])

In addition to the new API, we now provide importable implementations for several data augmentations that are used in SoTA research such as MixUp, CutMix, Large Scale Jitter, SimpleCopyPaste, AutoAugmentation methods and several new Geometric, Colour and Type Conversion transforms.

The API continues to support both PIL and Tensor backends for Images, single or batched input and maintains JIT-scriptability on the functional API. It allows deferring the casting of images from uint8 to float which can lead to performance benefits. It is currently available in the prototype area of TorchVision and can be imported from the nightly builds. The new API has been verified to achieve the same accuracy as the previous implementation.

Current Limitations

Though the functional API (kernels) remain JIT-scriptable and fully-BC, the Transform Classes, though they offer the same interface, can’t be scripted. This is because they use Tensor Subclassing and receive arbitrary number of inputs which are not supported by JIT. We are currently working to reduce the dispatching overhead of the new API and to improve the speed of existing kernels.

An end-to-end example

Here is an example of the new API using the following image. It works both with PIL images and Tensors:

import PIL
from torchvision import io, utils
from torchvision.prototype import features, transforms as T
from torchvision.prototype.transforms import functional as F
# Defining and wrapping input to appropriate Tensor Subclasses
path = "COCO_val2014_000000418825.jpg"
img = features.Image(io.read_image(path), color_space=features.ColorSpace.RGB)
# img = PIL.Image.open(path)
bboxes = features.BoundingBox(
    [[2, 0, 206, 253], [396, 92, 479, 241], [328, 253, 417, 332],
     [148, 68, 256, 182], [93, 158, 170, 260], [432, 0, 438, 26],
     [422, 0, 480, 25], [419, 39, 424, 52], [448, 37, 456, 62],
     [435, 43, 437, 50], [461, 36, 469, 63], [461, 75, 469, 94],
     [469, 36, 480, 64], [440, 37, 446, 56], [398, 233, 480, 304],
     [452, 39, 463, 63], [424, 38, 429, 50]],
    format=features.BoundingBoxFormat.XYXY,
    spatial_size=F.get_spatial_size(img),
)
labels = features.Label([59, 58, 50, 64, 76, 74, 74, 74, 74, 74, 74, 74, 74, 74, 50, 74, 74])
# Defining and applying Transforms V2
trans = T.Compose(
    [
        T.ColorJitter(contrast=0.5),
        T.RandomRotation(30),
        T.CenterCrop(480),
    ]
)
img, bboxes, labels = trans(img, bboxes, labels)
# Visualizing results
viz = utils.draw_bounding_boxes(F.to_image_tensor(img), boxes=bboxes)
F.to_pil_image(viz).show()

Development milestones and future work

Here is where we are in development:

  • Design API
  • Write Kernels for transforming Videos, Bounding Boxes, Masks and Labels
  • Rewrite all existing Transform Classes (stable + references) on the new API:
    • Image Classification
    • Video Classification
    • Object Detection
    • Instance Segmentation
    • Semantic Segmentation
  • Verify the accuracy of the new API for all supported Tasks and Backends
  • Speed Benchmarks and Performance Optimizations (in progress – planned for Dec)
  • Graduate from Prototype (planned for Q1)
  • Add support of Depth Perception, Keypoint Detection, Optical Flow and more (future)

We are currently in the process of Benchmarking each Transform Class and Functional Kernel in order to measure and improve their performance. The scope includes optimizing existing kernels which will be adopted from V1. Early findings indicate that some improvements might need to be upstreamed on the C++ kernels of PyTorch Core. Our plan is to continue iterating throughout Q4 to improve the speed performance of the new API and enhance it with additional SoTA transforms with the help of the community.

We would love to get early feedback from you to improve its functionality. Please reach out to us if you have any questions or suggestions.

Read More

New Library Updates in PyTorch 1.13

Summary

We are bringing a number of improvements to the current PyTorch libraries, alongside the PyTorch 1.13 release. These updates demonstrate our focus on developing common and extensible APIs across all domains to make it easier for our community to build ecosystem projects on PyTorch.

Along with 1.13, we are releasing updates to the PyTorch Libraries, please find them below.

TorchAudio

(Beta) Hybrid Demucs Model and Pipeline

Hybrid Demucs is a music source separation model that uses both spectrogram and time-domain features. It has demonstrated state-of-the-art performance in the Sony® Music DeMixing Challenge (https://arxiv.org/abs/2111.03600).

The TorchAudio v0.13 release includes the following features:

  • MUSDB_HQ Dataset, which is used in Hybrid Demucs training (docs)
  • Hybrid Demucs model architecture (docs)
  • Three factory functions suitable for different sample rate ranges
  • Pre-trained pipelines (docs)
  • SDR Results of pre-trained pipelines on MUSDB_HQ test set
  • Tutorial that steps through music source separation using the pretrained pipeline (docs)
Pipeline All Drums Bass Other Vocals
HDEMUCS_HIGH_MUSDB* 6.42 7.76 6.51 4.47 6.93
HDEMUCS_HIGH_MUSDB_PLUS** 9.37 11.38 10.53 7.24 8.32

* Trained on the training data of MUSDB-HQ dataset.
** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.

import torchaudio
from torchaudio.pipelines import HDEMUCS_HIGH_MUSDB_PLUS

bundle = HDEMUCS_HIGH_MUSDB_PLUS
model = bundle.get_model()
sources_list = model.sources

# Load a mixture track and separate it into its sources
mixture, samplerate = torchaudio.load("song.wav")
sources = model(mixture)
audios = dict(zip(sources_list, sources))

Special thanks to Alexandre Defossez for the guidance.

(Beta) Datasets and Metadata Mode for SUPERB Benchmark

TorchAudio adds support for various audio-related datasets used in downstream tasks for benchmarking self-supervised learning models. With the addition of several new datasets, there is now support for the downstream tasks in version 1 of the SUPERB benchmark, which can be found in the s3prl repository.

For these datasets, we also add metadata support through a get_metadata function, enabling faster dataset iteration or preprocessing without the need to load waveforms. The function returns the same features as __getitem__, except it returns the relative waveform path rather than the loaded waveform.
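
As a hedged sketch (assuming LIBRISPEECH is among the datasets with metadata support), iterating over metadata without decoding any audio might look like this:

# A hedged sketch: we assume LIBRISPEECH exposes the get_metadata function
# described above; the exact set of supported datasets may differ.
from torchaudio.datasets import LIBRISPEECH

dataset = LIBRISPEECH(root="./data", url="test-clean", download=True)
for n in range(len(dataset)):
    # Same fields as __getitem__, but the relative waveform path is returned
    # instead of the decoded waveform tensor.
    metadata = dataset.get_metadata(n)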

Datasets with metadata functionality

(Beta) Custom Language Model support in CTC Beam Search Decoding

TorchAudio released a CTC beam search decoder with KenLM language model support in release 0.12. This release adds functionality for creating custom Python language models that are compatible with the decoder, using the torchaudio.models.decoder.CTCDecoderLM wrapper.
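
As a rough, hedged illustration (not the exact tutorial code), a trivial custom language model that assigns a zero score to every token might look like the sketch below; the CTCDecoderLMState.child call and the (state, score) return convention follow our reading of the decoder interface and should be checked against the documentation.

from torchaudio.models.decoder import CTCDecoderLM, CTCDecoderLMState


class ZeroLM(CTCDecoderLM):
    """A trivial LM that scores every token as zero (sketch only)."""

    def __init__(self):
        CTCDecoderLM.__init__(self)

    def start(self, start_with_nothing: bool = False) -> CTCDecoderLMState:
        return CTCDecoderLMState()

    def score(self, state: CTCDecoderLMState, usr_token_idx: int):
        # Advance to the child state for this token and return (state, score).
        return state.child(usr_token_idx), 0.0

    def finish(self, state: CTCDecoderLMState):
        return state, 0.0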

For more information on using a custom language model, please refer to the documentation and tutorial.

(Beta) StreamWriter

torchaudio.io.StreamWriter is a class for encoding media, including audio and video. It can handle a wide variety of codecs, chunk-by-chunk encoding, and GPU encoding.

from torchaudio.io import StreamWriter

writer = StreamWriter("example.mp4")
writer.add_audio_stream(
    sample_rate=16_000,
    num_channels=2,
)
writer.add_video_stream(
    frame_rate=30,
    height=96,
    width=128,
    format="rgb24",
)
with writer.open():
    writer.write_audio_chunk(0, audio)
    writer.write_video_chunk(1, video)

For more information, refer to the documentation and the following tutorials

TorchData

For a complete list of changes and new features, please visit our repository’s 0.5.0 release note.

(Prototype) DataLoader2

DataLoader2 was introduced in the last release to execute the DataPipe graph, with support for dynamic sharding for multi-process/distributed data loading, multiple backend ReadingServices, and in-place modification of the DataPipe graph (e.g. shuffle control).

In this release, we further consolidated the API for DataLoader2, and detailed documentation is now available here. We continue to welcome early adopters and feedback, as well as potential contributors. If you are interested in trying it out, we encourage you to install the nightly version of TorchData.
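
As a minimal sketch of the consolidated API (the reading-service class name below follows the 0.5 prototype and may differ between releases), constructing a DataLoader2 over a DataPipe graph could look like this:

# A minimal sketch, assuming the MultiProcessingReadingService name from the
# torchdata 0.5 prototype; the exact class name may change between releases.
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

datapipe = IterableWrapper(range(100)).shuffle().sharding_filter().batch(8)
rs = MultiProcessingReadingService(num_workers=2)
dl = DataLoader2(datapipe, reading_service=rs)

for batch in dl:
    pass  # consume batches as usual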

(Beta) Data Loading from Cloud Service Providers

We extended our support to load data from additional cloud storage providers via DataPipes, now covering AWS, Google Cloud Storage, and Azure. A tutorial is also available. We are open to feedback and feature requests.
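
As a hedged sketch (the fsspec-based functional DataPipe names and the bucket URL below are assumptions), listing and opening files from S3 could look like this:

# A hedged sketch using fsspec-backed DataPipes; list_files_by_fsspec /
# open_files_by_fsspec are assumed functional names and the URL is a placeholder.
from torchdata.datapipes.iter import IterableWrapper

dp = (
    IterableWrapper(["s3://my-bucket/my-prefix/"])
    .list_files_by_fsspec()
    .open_files_by_fsspec(mode="rb")
)
for path, stream in dp:
    data = stream.read()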

We also performed a simple benchmark, comparing the performance of data loading from AWS S3 and attached volume on an AWS EC2 instance. The results are visible here.

torch::deploy (Beta)

torch::deploy is now in Beta! torch::deploy is a C++ library for Linux-based operating systems that allows you to run multiple Python interpreters in a single process. You can run your existing eager PyTorch models without any changes for production inference use cases. Highlights include:

  • Existing models work out of the box, with no need to modify your Python code to support tracing.
  • Full support for your existing Python environment including C extensions.
  • No need to cross process boundaries to load balance in multi-GPU serving environments.
  • Model weights can be shared between multiple Python interpreters.
  • A vastly improved installation and setup process.
torch::deploy::InterpreterManager manager(4);

// access one of the 4 interpreters
auto I = manager.acquireOne();

// run infer from your_model.py
I.global("your_model", "infer")({at::randn({10, 240, 320})});

Learn more here.

(Beta) CUDA/ROCm/CPU Backends

torch::deploy now links against standard PyTorch Python distributions so all accelerators that PyTorch core supports such as CUDA and AMD/HIP work out of the box.

(Prototype) aarch64/arm64 support

torch::deploy now has basic support for aarch64 Linux systems.

TorchEval

(Prototype) Introducing Native Metrics Support for PyTorch

TorchEval is a library built for users who want highly performant implementations of common metrics to evaluate machine learning models. It also provides an easy to use interface for building custom metrics with the same toolkit. Building your metrics with TorchEval makes running distributed training loops with torch.distributed a breeze.
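
As a minimal sketch (assuming the class-based torcheval.metrics.BinaryAccuracy API described in the TorchEval docs), computing a metric looks like this:

# A minimal sketch, assuming torcheval's class-based metric API.
import torch
from torcheval.metrics import BinaryAccuracy

metric = BinaryAccuracy()
metric.update(torch.tensor([0.9, 0.2, 0.8, 0.4]), torch.tensor([1, 0, 1, 1]))
print(metric.compute())  # fraction of correctly classified examples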

Learn more with our docs, see our examples, or check out our GitHub repo.

TorchMultimodal Release (Beta)

Please watch for upcoming blogs in early November that will introduce TorchMultimodal, a PyTorch domain library for training SoTA multi-task multimodal models at scale, in more detail; in the meantime, play around with the library and models through our tutorial.

TorchRec

(Prototype) Simplified Optimizer Fusion APIs

We’ve provided a simplified and more intuitive API for setting fused optimizer settings via apply_optimizer_in_backward. This new approach enables specifying optimizer settings on a per-parameter basis, and sharded modules will configure FBGEMM’s TableBatchedEmbedding modules accordingly. Additionally, this now lets TorchRec’s planner account for optimizer memory usage. This should alleviate reports of sharding jobs running out of memory when using Adam with a plan generated by the planner.
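
The following is a minimal, hedged sketch: the import path follows TorchRec's prototype API, and the EmbeddingBag here simply stands in for a model's sparse parameters.

# A hedged sketch of attaching fused optimizer settings to specific parameters.
import torch
from torchrec.optim.apply_optimizer_in_backward import apply_optimizer_in_backward

# An illustrative embedding module standing in for a TorchRec sparse arch.
emb = torch.nn.EmbeddingBag(num_embeddings=1000, embedding_dim=64)

apply_optimizer_in_backward(torch.optim.SGD, emb.parameters(), {"lr": 0.02})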

(Prototype) Simplified Sharding APIs

We’re introducing the shard API, which allows you to shard only the embedding modules within a model and provides an alternative to the current main entry point, DistributedModelParallel. This gives you finer-grained control over the rest of the model, which can be useful for customized parallelization logic and for inference use cases (which may not require any parallelization on the dense layers). We’re also introducing construct_module_sharding_plan, providing a simpler interface to the TorchRec sharder.

(Beta) Quantized Comms

Applying quantization or mixed precision to tensors in a collective call during model parallel training greatly improves training efficiency, with little to no effect on model quality. TorchRec now integrates with the quantized comms library provided by FBGEMM GPU and provides an interface to construct encoders and decoders (codecs) that surround the all_to_all and reduce_scatter collective calls in the output_dist of a sharded module. We also allow you to construct your own codecs to apply to your sharded module. The codecs provided by FBGEMM allow FP16, BF16, FP8, and INT8 compression, and you may use different quantizations for the forward and backward passes.

TorchSnapshot (Beta)

Along with PyTorch 1.13, we are releasing the beta version of TorchSnapshot, which is a performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind. Highlights include:

  • Performance: TorchSnapshot provides a fast checkpointing implementation employing various optimizations, including zero-copy serialization for most tensor types, overlapped device-to-host copy and storage I/O, parallelized storage I/O
  • Memory Use: TorchSnapshot’s memory usage adapts to the host’s available resources, greatly reducing the chance of out-of-memory issues when saving and loading checkpoints
  • Usability: Simple APIs that are consistent between distributed and non-distributed workloads
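
As a minimal sketch (assuming the torchsnapshot.Snapshot.take / restore entry points from the beta release), saving and restoring application state looks like this:

# A minimal sketch of taking and restoring a snapshot.
import torch
import torchsnapshot

model = torch.nn.Linear(8, 2)
app_state = {"model": model}

# Save a checkpoint (the same API is used for distributed and non-distributed jobs).
snapshot = torchsnapshot.Snapshot.take(path="/tmp/example_snapshot", app_state=app_state)

# ... later, restore the state in place ...
torchsnapshot.Snapshot(path="/tmp/example_snapshot").restore(app_state=app_state)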

Learn more with our tutorial.

TorchVision

We are happy to introduce torchvision v0.14 (release note). This version introduces a new model registration API to help users retrieve and list models and weights. It also includes new image and video classification models such as MViT, S3D, Swin Transformer V2, and MaxViT. Last but not least, we also have new primitives and augmentations such as the PolynomialLR scheduler and SimpleCopyPaste.

(Beta) Model Registration API

Following up on the multi-weight support API that was released on the previous version, we have added a new model registration API to help users retrieve models and weights. There are now 4 new methods under the torchvision.models module: get_model, get_model_weights, get_weight, and list_models. Here are examples of how we can use them:

import torchvision
from torchvision.models import get_model, get_model_weights, list_models


max_params = 5000000

tiny_models = []
for model_name in list_models(module=torchvision.models):
    weights_enum = get_model_weights(model_name)
    if len([w for w in weights_enum if w.meta["num_params"] <= max_params]) > 0:
        tiny_models.append(model_name)

print(tiny_models)
# ['mnasnet0_5', 'mnasnet0_75', 'mnasnet1_0', 'mobilenet_v2', ...]

model = get_model(tiny_models[0], weights="DEFAULT")
print(sum(x.numel() for x in model.state_dict().values()))
# 2239188

(Beta) New Video Classification Models

We added two new video classification models, MViT and S3D. MViT is a state of the art video classification transformer model which has 80.757% accuracy on the Kinetics400 dataset, while S3D is a relatively small model with good accuracy for its size. These models can be used as follows:

import torch
from torchvision.models.video import *

video = torch.rand(3, 32, 800, 600)
model = mvit_v2_s(weights="DEFAULT")
# model = s3d(weights="DEFAULT")
model.eval()
prediction = model(video)

Here is the table showing the accuracy of the new video classification models tested on the Kinetics400 dataset.

Model Acc@1 Acc@5
mvit_v1_b 81.474 95.776
mvit_v2_s 83.196 96.36
s3d 83.582 96.64

We would like to thank Haoqi Fan, Yanghao Li, Christoph Feichtenhofer and Wan-Yen Lo for their work on PyTorchVideo and their support during the development of the MViT model. We would like to thank Sophia Zhi for her contribution implementing the S3D model in torchvision.

(Stable) New Architecture and Model Variants

For Classification Models, we’ve added the Swin Transformer V2 architecture along with pre-trained weights for its tiny/small/base variants. In addition, we have added support for the MaxViT transformer. Here is an example on how to use the models:

import torch
from torchvision.models import *

image = torch.rand(1, 3, 224, 224)
model = swin_v2_t(weights="DEFAULT").eval()
# model = maxvit_t(weights="DEFAULT").eval()
prediction = model(image)

Here is the table showing the accuracy of the models tested on ImageNet1K dataset.

Model      Acc@1   Acc@1 change over V1   Acc@5   Acc@5 change over V1
swin_v2_t  82.072  +0.598                 96.132  +0.356
swin_v2_s  83.712  +0.516                 96.816  +0.456
swin_v2_b  84.112  +0.530                 96.864  +0.224
maxvit_t   83.700  -                      96.722  -

We would like to thank Ren Pang and Teodor Poncu for contributing the 2 models to torchvision.

(Stable) New Primitives & Augmentations

In this release we’ve added the SimpleCopyPaste augmentation in our reference scripts and we up-streamed the PolynomialLR scheduler to PyTorch Core. We would like to thank Lezwon Castelino and Federico Pozzi for their contributions. We are continuing our efforts to modernize TorchVision by adding more SoTA primitives, Augmentations and architectures with the help of our community. If you are interested in contributing, have a look at the following issue.
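
As a small example, the PolynomialLR scheduler that was upstreamed to PyTorch Core can be used directly from torch.optim.lr_scheduler:

import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=5, power=1.0)

for _ in range(5):
    optimizer.step()   # training step elided for brevity
    scheduler.step()   # with power=1.0 the LR decays linearly over total_iters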

Torch-TensorRT

(Prototype) TensorRT with FX2TRT frontend

Torch-TensorRT is the PyTorch integration for TensorRT, providing high performance inference on NVIDIA GPUs. Torch-TRT allows for optimizing models directly in PyTorch for deployment providing up to 6x performance improvement.

Torch-TRT is an AoT compiler which ingests an nn.Module or TorchScript module, optimizes compatible subgraphs in TensorRT & leaves the rest to run in PyTorch. This gives users the performance of TensorRT, but the usability and familiarity of Torch.

Torch-TensorRT is part of the PyTorch ecosystem, and was released as v1.0 in November ‘21. There are currently two distinct front-ends: Torchscript & FX. Each provides the same value proposition and underlying operation with the primary difference being the input & output formats (TS vs FX / Python).

The Torchscript front-end was included in v1.0 and should be considered stable. The FX front-end was first released in v1.2 and should be considered beta.


(Stable) Introducing Torch-TensorRT

Torch-TensorRT is an integration for PyTorch that leverages the inference optimizations of TensorRT on NVIDIA GPUs. It takes advantage of TensorRT optimizations, such as FP16 and INT8 reduced precision, graph optimization, and operation fusion, while offering a fallback to native PyTorch when TensorRT does not support the model subgraphs. Currently, there are two frontend paths in the library that help convert a PyTorch model to a TensorRT engine: one path is through TorchScript (TS) and the other is through the FX frontend. In other words, models are traced by either TS or FX into their IR graph and then converted to TensorRT from it.

Learn more with our tutorial.

TorchX

TorchX 0.3 updates include a new list API, experiment tracking, elastic training and improved scheduler support. There’s also a new Multi-Objective NAS tutorial using TorchX + Ax.

(Prototype) List

The newly added list command and API allow you to list recently launched jobs and their statuses for a given scheduler directly from within TorchX.

  • This removes the need for using secondary tools to list the jobs.
  • Full programmatic access to recent jobs for integration with custom tools.
$ torchx list -s kubernetes
APP HANDLE                                        APP STATUS
-----------------------------------------------   -----------------
kubernetes://torchx/default:train-f2nx4459p5crr   SUCCEEDED
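
For programmatic access, a hedged sketch is shown below; the Runner.list entry point follows the description above, and its exact return type is an assumption here.

# A hedged sketch of the programmatic counterpart to `torchx list`.
from torchx.runner import get_runner

runner = get_runner()
for job in runner.list("kubernetes"):
    print(job)  # recently launched app handles and their statuses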

Learn more with our documentation.

(Prototype) Tracker

TorchX Tracker is a new prototype library that provides a flexible and customizable experiment and artifact tracking interface. This allows you to track inputs and outputs for jobs across multiple steps to make it easier to use TorchX with pipelines and other external systems.

from torchx import tracker

# `lr`, `gamma`, and `parent_run_id` are placeholders for values from your job
app_run = tracker.app_run_from_env()
app_run.add_metadata(lr=lr, gamma=gamma)  # hyperparameters
app_run.add_artifact("model", "storage://path/mnist_cnn.pt")  # logs / checkpoints
app_run.add_source(parent_run_id, "model")  # lineage


(Prototype) Elastic Training and Autoscaling

Elasticity on Ray and Kubernetes – automatic scale up of distributed training jobs when using a supported scheduler. Learn more with our documentation.

(Prototype) Scheduler Improvements: IBM® Spectrum LSF

Added prototype support for the IBM Spectrum LSF scheduler.

(Beta) AWS Batch Scheduler

The AWS Batch scheduler integration is now in beta.

(Prototype) AnyPrecision Optimizer

A drop-in replacement for the AdamW optimizer that reduces GPU memory and enables two main features:

  • The ability to train the entire model pipeline in full BFloat16, with Kahan summation ensuring precision. This can improve training throughput, especially on huge models, through reduced memory and increased computation speed.
  • The ability to change the variance state to BFloat16. This can reduce the overall memory required for model training, with additional speed improvements.

Find more information here.

Read More

PyTorch 1.13 release, including beta versions of functorch and improved support for Apple’s new M1 chips.

We are excited to announce the release of PyTorch® 1.13 (release note)! This includes stable versions of BetterTransformer. We deprecated CUDA 10.2 and 11.3 and completed the migration to CUDA 11.6 and 11.7. Beta features include improved support for Apple M1 chips and functorch, a library that offers composable vmap (vectorization) and autodiff transforms, which is now included in-tree with the PyTorch release. This release is composed of over 3,749 commits and 467 contributors since 1.12.1. We want to sincerely thank our dedicated community for your contributions.

Summary:

  • The BetterTransformer feature set supports fastpath execution for common Transformer models during inference out-of-the-box, without the need to modify the model. Additional improvements include accelerated add+matmul linear algebra kernels for sizes commonly used in Transformer models, and the use of Nested Tensors is now enabled by default.

  • Timely deprecating older CUDA versions allows us to proceed with introducing the latest CUDA versions as they are released by Nvidia®, and hence allows support for C++17 in PyTorch and the new NVIDIA Open GPU Kernel Modules.

  • Previously, functorch was released out-of-tree in a separate package. After installing PyTorch, users can now import and use functorch without needing to install another package.

  • PyTorch is offering native builds for Apple® silicon machines that use Apple’s new M1 chip as a beta feature, providing improved support across PyTorch’s APIs.

Along with 1.13, we are also releasing major updates to the PyTorch libraries, more details can be found in this blog.

Stable Features

(Stable) BetterTransformer API

The BetterTransformer feature set, first released in PyTorch 1.12, is stable. PyTorch BetterTransformer supports fastpath execution for common Transformer models during Inference out-of-the-box, without the need to modify the model. To complement the improvements in Better Transformer, we have also accelerated add+matmul linear algebra kernels for sizes commonly used in Transformer models.

Reflecting the performance benefits for many NLP users, the use of Nested Tensors for Better Transformer is now enabled by default. To ensure compatibility, a mask check is performed to verify that a contiguous mask is supplied. In Transformer Encoder, the mask check for src_key_padding_mask may be suppressed by setting mask_check=False. This accelerates processing for users who can guarantee that only aligned masks are provided. Finally, better error messages are provided to diagnose incorrect inputs, together with improved diagnostics of why fastpath execution cannot be used.
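
As a minimal sketch (assuming mask_check is exposed as a TransformerEncoder constructor argument, per the description above), suppressing the check looks like this:

# A minimal sketch of disabling the src_key_padding_mask alignment check when
# all supplied masks are guaranteed to be aligned.
import torch

layer = torch.nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
encoder = torch.nn.TransformerEncoder(layer, num_layers=2, mask_check=False)

src = torch.rand(8, 32, 256)                          # (batch, seq, features)
padding_mask = torch.zeros(8, 32, dtype=torch.bool)   # aligned (contiguous) mask
out = encoder(src, src_key_padding_mask=padding_mask)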

Better Transformer is directly integrated into the PyTorch TorchText library, enabling TorchText users to transparently and automatically take advantage of BetterTransformer speed and efficiency performance. (Tutorial)

Figure: BetterTransformer fastpath execution is now stable and enables sparsity optimization using Nested Tensor representation as default

Introduction of CUDA 11.6 and 11.7 and deprecation of CUDA 10.2 and 11.3

Timely deprecating older CUDA versions allows us to proceed with introducing the latest CUDA versions as they are released by Nvidia®, and hence allows developers to use the latest CUDA features and benefit from correctness fixes provided by the latest version.

Decommissioning of CUDA 10.2: CUDA 11 is the first CUDA version to support C++17. Hence, decommissioning legacy CUDA 10.2 was a major step in adding support for C++17 in PyTorch. It also helps improve PyTorch code by eliminating legacy CUDA 10.2-specific instructions.

Decommissioning of CUDA 11.3 and introduction of CUDA 11.7 brings compatibility support for the new NVIDIA Open GPU Kernel Modules; another significant highlight is lazy loading support. CUDA 11.7 ships with cuDNN 8.5.0, which contains a number of optimizations accelerating transformer-based models, a 30% reduction in library size, and various improvements in the runtime fusion engine. Learn more about CUDA 11.7 with our release notes.

Beta Features

(Beta) functorch

Inspired by Google® JAX, functorch is a library that offers composable vmap (vectorization) and autodiff transforms. It enables advanced autodiff use cases that would otherwise be tricky to express in PyTorch. Examples include computing per-sample gradients, running ensembles of models on a single machine, and efficiently computing Jacobians and Hessians.


We’re excited to announce that, as a first step towards closer integration with PyTorch, functorch has moved inside the PyTorch library and no longer requires the installation of a separate functorch package. After installing PyTorch via conda or pip, you’ll be able to import functorch in your program. Learn more with our detailed instructions, nightly and release notes.
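
As a small example of composing these transforms after the in-tree move, per-sample gradients of a simple function can be computed with vmap(grad(f)):

import torch
from functorch import grad, vmap

def f(x):
    return torch.sin(x).sum()

x = torch.randn(5, 3)
per_sample_grads = vmap(grad(f))(x)  # gradient of f for each row of x
print(per_sample_grads.shape)        # torch.Size([5, 3])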

(Beta) Intel® VTune™ Profiler’s Instrumentation and Tracing Technology APIs (ITT) integration

PyTorch users are able to visualize the op-level timeline of PyTorch script execution in Intel® VTune™ Profiler when they need to analyze per-op performance with low-level performance metrics on Intel platforms.

import torch

# `model` and `input` stand for the workload being profiled
with torch.autograd.profiler.emit_itt():
    for i in range(10):
        torch.itt.range_push('step_{}'.format(i))
        model(input)
        torch.itt.range_pop()


Learn more with our tutorial.

(Beta) NNC: Add BF16 and Channels last support

TorchScript graph-mode inference performance on x86 CPU is boosted by adding channels-last and BF16 support to NNC. PyTorch users may benefit from the channels-last optimization on the most popular x86 CPUs and from the BF16 optimization on Intel Cooper Lake and Sapphire Rapids processors. A >2X geomean performance boost is observed on a broad set of vision models with these two optimizations on Intel Cooper Lake processors.

The performance benefit can be obtained with the existing TorchScript, channels-last, and BF16 autocast APIs, as shown in the code snippet below. We will migrate the optimizations in NNC to the new PyTorch DL compiler, TorchInductor.

import torch
import torchvision.models as models

model = models.resnet50(pretrained=True)
# Convert the model to channels-last
model = model.to(memory_format=torch.channels_last)
model.eval()
data = torch.rand(1, 3, 224, 224)
# Convert the data to channels-last
data = data.to(memory_format=torch.channels_last)
# Enable autocast to run with BF16
with torch.cpu.amp.autocast(), torch.no_grad():
    # Trace the model
    model = torch.jit.trace(model, torch.rand(1, 3, 224, 224))
    model = torch.jit.freeze(model)
    # Run the traced model
    model(data)

(Beta) Support for M1 Devices

Since v1.12, PyTorch has been offering native builds for Apple® silicon machines that use Apple’s new M1 chip as a prototype feature. In this release, we bring this feature to beta, providing improved support across PyTorch’s APIs.

We now run tests for all submodules except torch.distributed on M1 macOS 12.6 instances. With this improved testing, we were able to fix features such as cpp extension and convolution correctness for certain inputs.

To get started, just install PyTorch v1.13 on your Apple silicon Mac running macOS 12 or later with a native version (arm64) of Python. Learn more with our release notes.

Prototype Features

(Prototype) Arm® Compute Library (ACL) backend support for AWS Graviton

We achieved substantial improvements for CV and NLP inference on aarch64 CPUs by using the Arm Compute Library (ACL) to enable an ACL backend for PyTorch and torch-xla modules. Highlights include:

  • Enabled mkldnn + acl as the default backend for aarch64 torch wheel.
  • Enabled mkldnn matmul operator for aarch64 bf16 device.
  • Brought the TensorFlow xla+acl feature into torch-xla. We enhanced TensorFlow XLA with an Arm Compute Library runtime for aarch64 CPUs. These changes are included in TensorFlow master and the upcoming TF 2.10. Once the torch-xla repo is updated for that TensorFlow commit, it will have compiling support for torch-xla. We observed a ~2.5-3x improvement for MLPerf BERT inference compared to the torch 1.12 wheel on Graviton3.

(Prototype) CUDA Sanitizer

When enabled, the sanitizer begins to analyze low-level CUDA operations invoked as a result of the user’s PyTorch code to detect data race errors caused by unsynchronized data access from different CUDA streams. The errors found are then printed along with stack traces of faulty accesses, much like Thread Sanitizer does. An example of a simple error and the output produced by the sanitizer can be viewed here. It will be especially useful for machine learning applications, where corrupted data can be easy to miss for a human and the errors may not always manifest themselves; the sanitizer will always be able to detect them.
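
As a hedged sketch (the torch.cuda._sanitizer module path and enable_cuda_sanitizer() entry point follow the prototype documentation and may change), enabling the sanitizer from Python looks like this:

import torch
import torch.cuda._sanitizer as csan

csan.enable_cuda_sanitizer()

# Unsynchronized accesses to tensors from different CUDA streams are now
# reported with stack traces of the faulty accesses.
x = torch.zeros(1024, device="cuda")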

(Prototype) Limited Python 3.11 support

Binaries for Linux with Python 3.11 support are available to download via pip. Please follow the instructions on the get started page. Please note that Python 3.11 support is only a preview. In particular, features including Distributed, Profiler, FX and JIT might not be fully functional yet.

Read More

PyTorch’s Tracing Based Selective Build

Introduction

TL;DR: It can be challenging to run PyTorch on mobile devices, SBCs (Single Board Computers), and IoT devices. When compiled, the PyTorch library is huge and includes dependencies that might not be needed for the on-device use case.

To run a specific set of models on-device, we actually require only a small subset of the features in the PyTorch library. We found that using a PyTorch runtime generated using selective build can achieve up to 90% reduction in binary size (for the CPU and QuantizedCPU backends on an x86-64 build on Linux). In this blog, we share our experience of generating model-specific minimal runtimes using Selective Build and show you how to do the same.

Why is this important for app developers?

Using a PyTorch runtime generated by selective build can reduce the size of AI-powered apps by 30+ MB, a significant reduction for a typical mobile app! Making mobile applications more lightweight has many benefits: they are runnable on a wider variety of devices, consume less cellular data, and can be downloaded and updated faster on users' devices.

What does the Developer Experience look like?

This method can work seamlessly with any existing PyTorch Mobile deployment workflows. All you need to do is replace the general PyTorch runtime library with a runtime customized for the specific models you wish to use in your application. The general steps in this process are:

  1. Build the PyTorch Runtime in instrumentation mode (this is called an instrumentation build of PyTorch). This will record the used operators, kernels and features.
  2. Run your models through this instrumentation build by using the provided model_tracer binary. This will generate a single YAML file that stores all the features used by your model. These features will be preserved in the minimal runtime.
  3. Build PyTorch using this YAML file as input. This is the selective build technique, and it greatly reduces the size of the final PyTorch binary.
  4. Use this selectively-built PyTorch library to reduce the size of your mobile application!

Building the PyTorch Runtime in a special “instrumentation” mode (by passing the TRACING_BASED=1 build option) generates an instrumentation build runtime of PyTorch, along with a model_tracer binary. Running a model with this build allows us to trace the parts of PyTorch used by the model.

Figure 1: Instrumentation build of PyTorch

# Clone the PyTorch repo
git clone https://github.com/pytorch/pytorch.git
cd pytorch

# Build the model_tracer
USE_NUMPY=0 USE_DISTRIBUTED=0 USE_CUDA=0 TRACING_BASED=1 \
  python setup.py develop

Now this instrumentation build is used to run a model inference with representative inputs. The model_tracer binary observes parts of the instrumentation build that were activated during the inference run, and dumps it to a YAML file.

Figure 2: YAML file generated by running model(s) on an instrumentation build

# Generate YAML file
./build/bin/model_tracer \
  --model_input_path /tmp/path_to_model.ptl \
  --build_yaml_path /tmp/selected_ops.yaml

Now we build the PyTorch Runtime again, but this time using the YAML file generated by the tracer. The runtime now only includes those parts that are needed for this model. This is called “Selectively built PyTorch runtime” in the diagram below.

# Clean out cached configuration
make clean

# Build PyTorch using Selected Operators (from the YAML file)
# using the host toolchain, and use this generated library
BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN=1 \
USE_LIGHTWEIGHT_DISPATCH=0 \
BUILD_LITE_INTERPRETER=1 \
SELECTED_OP_LIST=/tmp/selected_ops.yaml \
TRACING_BASED=1 \
  ./scripts/build_mobile.sh

Figure 3: Selective Build of PyTorch and model execution on a selectively built PyTorch runtime

Show me the code!

We’ve put together a notebook to illustrate what the process above looks like in code using a simple PyTorch model.

For a more hands-on tutorial to deploy this on Android/iOS this tutorial should be helpful.

Technical FAQs

Why is Tracing needed for a Selective Build of PyTorch?

In PyTorch, CPU kernels can call other operators via the PyTorch Dispatcher. Simply including the set of root operators called directly by the model is not sufficient, as there might be many more operators called transitively under the hood. Running the model on representative inputs and observing the actual list of operators called (aka “tracing”) is the most accurate way of determining which parts of PyTorch are used.

Additionally, factors such as which dtypes a kernel should handle are also runtime features that depend on the actual inputs provided to the model. Hence, the tracing mechanism is extremely suitable for this purpose.

Which features can be selected (in or out) by using Tracing Based Selective Build?

The following features can be selected for the PyTorch runtime during the tracing based selective build process:

  1. CPU/QuantizedCPU kernels for PyTorch’s ATen Operators: If a PyTorch Operator is not needed by a model targeted at a selectively built runtime, then the registration of that CPU kernel is omitted in the runtime. This is controlled via Torchgen code-gen.
  2. Primary Operators: This is controlled by a macro named TORCH_SELECTIVE_SCHEMA (via templated selective build) that either selects a primary operator or de-selects it based on information in a generated header file.
  3. Code that handles specific dtypes in CPU kernels: This is performed by generating exception throws in specific case statements in the switch case generated by the macro AT_PRIVATE_CHECK_SELECTIVE_BUILD.
  4. Registration of Custom C++ Classes that extend PyTorch: This is controlled by the macro TORCH_SELECTIVE_CLASS, which can be used when registering Custom C++ Classes. The torch::selective_class_<> helper is to be used in conjunction with the macro TORCH_SELECTIVE_CLASS.

What is the structure of the YAML file used during the build?

The YAML file generated after tracing looks like the example below. It encodes all the elements of the “selectable” build feature as specified above.

include_all_non_op_selectives: false
build_features: []
operators:
    aten::add.Tensor:
        is_used_for_training: false
        is_root_operator: true
        include_all_overloads: false
    aten::len.t:
        is_used_for_training: false
        is_root_operator: true
        include_all_overloads: false
kernel_metadata:
    _local_scalar_dense_cpu:
    - Float
    add_stub:
    - Float
    copy_:
    - Bool
    - Byte
    mul_cpu:
    - Float
custom_classes: []

How exactly is code eliminated from the generated binary?

Depending on the specific scenario, there are 2 main techniques that are used to hint to the compiler and linker about unused and unreachable code. This code is then cleaned up by the compiler or linker as unreachable code.

[1] Unreferenced functions removed by the Linker

When a function that isn’t transitively referenced from any visible function is present in the compiled object files that are being linked together, the linker will remove it (if the right build flags are provided). This is leveraged in 2 scenarios by the selective build system.

Kernel Registration in the Dispatcher

If an operator’s kernel isn’t needed, then it isn’t registered with the dispatcher. An unregistered kernel means that the function is unreachable, and it will be removed by the linker.

Templated Selective Build

The general idea here is that a class template specialization is used to select a class that either captures a reference to a function or not (depending on whether it’s used) and the linker can come along and clean out the unreferenced function.

For example, in the code below, there’s no reference to the function “fn2”, so it will be cleaned up by the linker since it’s not referenced anywhere.

#include <vector>
#include <cstdio>

template <typename T, bool>
struct FunctionSelector {
    T fn_;
    FunctionSelector(T fn): fn_(fn) {}
    T get() { return this->fn_; }
};

// The "false" specialization of this class does NOT retain the argument passed
// to the class constructor, which means that the function pointer passed in
// is considered to be unreferenced in the program (unless it is referenced
// elsewhere).
template <typename T>
struct FunctionSelector<T, false> {
    FunctionSelector(T) {}
};

template <typename T>
FunctionSelector<T, true> make_function_selector_true(T fn) {
    return FunctionSelector<T, true>(fn);
}

template <typename T>
FunctionSelector<T, false> make_function_selector_false(T fn) {
    return FunctionSelector<T, false>(fn);
}

typedef void(*fn_ptr_type)();

std::vector<fn_ptr_type> fns;

template <typename T>
void add_fn(FunctionSelector<T, true> fs) {
    fns.push_back(fs.get());
}

template <typename T>
void add_fn(FunctionSelector<T, false>) {
    // Do nothing.
}

// fn1 will be kept by the linker since it is added to the vector "fns" at
// runtime.
void fn1() {
    printf("fn1\n");
}

// fn2 will be removed by the linker since it isn't referenced at all.
void fn2() {
    printf("fn2\n");
}

int main() {
    add_fn(make_function_selector_true(fn1));
    add_fn(make_function_selector_false(fn2));
}

[2] Dead Code Eliminated by the Compiler

C++ Compilers can detect dead (unreachable) code by analyzing the code’s control flow statically. For example, if there’s a code-path that comes after an unconditional exception throw, then all the code after it will be marked as dead code and not converted to object code by the compiler. Typically, compilers require the use of the -fdce flag to eliminate dead code.

In the example below, you can see that the C++ code on the left (in the red boxes) doesn’t have any corresponding generated object code on the right.

Figure 4: Dead Code Elimination by C++ Compilers

This property is leveraged in the bodies of PyTorch kernel implementations that have a lot of repeated code to handle multiple dtypes of a Tensor. A dtype is the underlying data-type that the Tensor stores elements of. This can be one of float, double, int64, bool, int8, etc…

Almost every PyTorch CPU kernel uses a macro of the form AT_DISPATCH_ALL_TYPES* that is used to substitute some code specialized for every dtype that the kernel needs to handle. For example:

AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
    kBool, kHalf, kBFloat16, dtype, "copy_kernel", [&] {
  cpu_kernel_vec(
      iter,
      [=](scalar_t a) -> scalar_t { return a; },
      [=](Vectorized<scalar_t> a) -> Vectorized<scalar_t> { return a; });
});

The macro AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3 internally has a switch-case statement that looks like the code in Figure-4 above. The tracing process records the dtypes triggered for the kernel tag “copy_kernel”, and the build process then inserts throw statements into every case statement that handles a dtype not required for this kernel tag.

This is how dtype selectivity is implemented in PyTorch’s Tracing Based Selective Build.

Conclusion

Tracing Based Selective Build is a practical and scalable approach to retaining only the parts of PyTorch that an application actually uses, including code whose usage static analysis cannot detect. Such code is usually highly data/input dependent in nature.

This article provides detailed insights into how Tracing Based Selective Build works under the hood, and the technical details related to its implementation. These techniques can also be applied to other applications and situations that can benefit from reduced binary size.

Read More

Scaling PyTorch models on Cloud TPUs with FSDP

Introduction

The research community has witnessed a lot of successes with large models across NLP, computer vision, and other domains in recent years. Many of these successes were enabled by Cloud TPUs – which are powerful hardware for distributed training. To support TPUs in PyTorch, the PyTorch/XLA library provides a backend for XLA devices (most notably TPUs) and lays the groundwork for scaling large PyTorch models on TPUs.

However, most existing modeling scaling tools in the PyTorch ecosystem assume GPU (or CPU) devices, often depend on specific features in CUDA, and do not work directly on TPUs. The lack of scaling tools makes it challenging to build large models that cannot fit into the memory of a single TPU chip.

To support model scaling on TPUs, we implemented the widely-adopted Fully Sharded Data Parallel (FSDP) algorithm for XLA devices as part of the PyTorch/XLA 1.12 release. We provide an FSDP interface with a similar high-level design to the CUDA-based PyTorch FSDP class while also handling several restrictions in XLA (see Design Notes below for more details). This FSDP interface allowed us to easily build models with e.g. 10B+ parameters on TPUs and has enabled many research explorations.

Using Fully Sharded Data Parallel (FSDP) in PyTorch/XLA

We provide a wrapper class XlaFullyShardedDataParallel over a given PyTorch model to shard its parameters across data-parallel workers. An example usage is as follows:

import torch
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP

model = FSDP(my_module)
optim = torch.optim.Adam(model.parameters(), lr=0.0001)
output = model(x, y)
loss = output.sum()
loss.backward()
optim.step()

Wrapping an nn.Module instance with XlaFullyShardedDataParallel enables the ZeRO-2 algorithm on it, where its gradients and the optimizer states are sharded for the entire training process. During its forward and backward passes, the full parameters of the wrapped module are first reconstructed from their corresponding shards for computation.

Nested FSDP wrapping can be used to further save memory. For nested FSDP, one should first wrap individual submodules with an inner FSDP before wrapping the base model with an outer FSDP. This allows the model to store only the full parameters of one individual layer at any given time, while the outer wrapper handles any leftover parameters, corresponding to the ZeRO-3 algorithm. Nested FSDP wrapping can be applied at any depth of submodules, and there can be more than 2 layers of nesting.
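
The sketch below illustrates this pattern; the Linear submodules are illustrative placeholders, and in practice the inner wrapping is applied to the large submodules of the real model (which should already live on the XLA device).

import torch
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP

class TwoLayerNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = FSDP(torch.nn.Linear(1024, 1024))  # inner wrapping
        self.layer2 = FSDP(torch.nn.Linear(1024, 1024))  # inner wrapping

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = FSDP(TwoLayerNet())  # outer wrapper picks up any leftover parameters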

Model checkpoint saving and loading for models and optimizers can be done as before by saving and loading their .state_dict(). Meanwhile, each training process should save its own checkpoint file of the sharded model parameters and optimizer states, and load the checkpoint file for the corresponding rank when resuming (regardless of ZeRO-2 or ZeRO-3, i.e. nested wrapping or not). A command line tool and a Python interface are provided to consolidate the sharded model checkpoint files into a full/unsharded model checkpoint file.
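
As a minimal sketch of the per-rank checkpointing described above (the file naming is illustrative, and model / optim refer to the FSDP-wrapped model and optimizer from the earlier example):

import torch
import torch_xla.core.xla_model as xm

rank, world_size = xm.get_ordinal(), xm.xrt_world_size()
ckpt = {"model": model.state_dict(), "optim": optim.state_dict()}
torch.save(ckpt, f"/tmp/final_ckpt_rank-{rank}-of-{world_size}.pth")

# When resuming, each rank loads its own shard:
ckpt = torch.load(f"/tmp/final_ckpt_rank-{rank}-of-{world_size}.pth")
model.load_state_dict(ckpt["model"])
optim.load_state_dict(ckpt["optim"])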

Gradient checkpointing (also referred to as “activation checkpointing” or “rematerialization”) is another common technique for model scaling and can be used in conjunction with FSDP. We provide checkpoint_module, a wrapper function over a given nn.Module instance for gradient checkpointing (based on torch_xla.utils.checkpoint.checkpoint).
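
A hedged one-liner combining the two techniques (the checkpoint_module import path is an assumption based on the description above):

import torch
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP, checkpoint_module

# Activations for this layer are rematerialized during the backward pass.
layer = FSDP(checkpoint_module(torch.nn.Linear(1024, 1024)))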

The MNIST and ImageNet examples below provide illustrative usages of (plain or nested) FSDP, saving and consolidation of model checkpoints, as well as gradient checkpointing.

Starting examples of FSDP in PyTorch/XLA

Training MNIST and ImageNet with FSDP

MNIST and ImageNet classification can often be used as starting points to build more complicated deep learning models. We provide the following FSDP examples on these two datasets:

A comparison of them with the vanilla data-parallel examples of MNIST and ImageNet illustrates how to adapt a training script to use FSDP. A major distinction to keep in mind is that when stepping the optimizer on an FSDP-wrapped model, one should directly call optimizer.step() instead of xm.optimizer_step(optimizer). The latter reduces the gradients across ranks, which is not what we need in FSDP, where the gradients are already reduced and sharded (from a reduce-scatter op in its backward pass).

Installation

FSDP is available in PyTorch/XLA 1.12 and newer nightly releases. Please refer to https://github.com/pytorch/xla#-available-images-and-wheels for a guide on installation as well as Cloud TPU allocation. Then clone the PyTorch/XLA repo on a TPU VM as follows:

mkdir -p ~/pytorch && cd ~/pytorch
git clone --recursive https://github.com/pytorch/xla.git
cd ~/

Train MNIST on v3-8 TPU

It reaches around 98.9% accuracy after 2 epochs:

python3 ~/pytorch/xla/test/test_train_mp_mnist_fsdp_with_ckpt.py \
  --batch_size 16 --drop_last --num_epochs 2 \
  --use_nested_fsdp

The script above automatically tests consolidation of the sharded model checkpoints at the end. You can also manually consolidate the sharded checkpoint files via

python3 -m torch_xla.distributed.fsdp.consolidate_sharded_ckpts \
  --ckpt_prefix /tmp/mnist-fsdp/final_ckpt \
  --ckpt_suffix "_rank-*-of-*.pth"

Train ImageNet with ResNet-50 on v3-8 TPU

It reaches around 75.9% accuracy after 100 epochs, the same as what one would get without using FSDP; download and preprocess the ImageNet-1k dataset to /datasets/imagenet-1k, then run:

python3 ~/pytorch/xla/test/test_train_mp_imagenet_fsdp.py \
  --datadir /datasets/imagenet-1k --drop_last \
  --model resnet50 --test_set_batch_size 64 --eval_interval 10 \
  --lr 0.4 --batch_size 128 --num_warmup_epochs 5 \
  --lr_scheduler_divide_every_n_epochs 30 --lr_scheduler_divisor 10 \
  --num_epochs 100 \
  --use_nested_fsdp

You can also explore other options in these two examples, such as --use_gradient_checkpointing to apply gradient checkpointing (i.e. activation checkpointing) on the ResNet blocks, or --compute_dtype bfloat16 to perform forward and backward passes in bfloat16 precision.

Examples on large-scale models

When building large models on TPUs, we often need to be aware of the memory constraints (e.g. 16 GB per core in TPU v3 and 32 GB per chip in TPU v4). For large models that cannot fit into a single TPU memory or the host CPU memory, one should use nested FSDP to implement the ZeRO-3 algorithm and interleave submodule construction with inner FSDP wrapping, so that the full model never needs to be stored in memory during construction.

We illustrate these cases in https://github.com/ronghanghu/ptxla_scaling_examples, which provides examples of training a Vision Transformer (ViT) model with 10B+ parameters on a TPU v3 pod (with 128 cores) as well as other cases.

Design Notes

One might wonder why we need to develop a separate FSDP class in PyTorch/XLA instead of directly reusing PyTorch’s FSDP class or extending it to the XLA backend. The main motivation behind a separate FSDP class in PyTorch/XLA is that the native PyTorch’s FSDP class heavily relies on CUDA features that are not supported by XLA devices, while XLA also has several unique characteristics that need special handling. These distinctions require a different implementation of FSDP that would be much easier to build in a separate class.

Changes in API calls

One prominent distinction is that the native PyTorch FSDP is built upon separate CUDA streams for asynchronous execution in eager mode, while PyTorch/XLA runs in lazy mode and also does not support streams. In addition, TPU requires that all devices homogeneously run the same program. As a result, in the PyTorch/XLA FSDP implementation, CUDA calls and per-process heterogeneity need to be replaced by XLA APIs and alternative homogeneous implementations.

Tensor Storage Handling

Another prominent distinction is how to free a tensor's storage, which is much harder in XLA than in CUDA. To implement ZeRO-3, one needs to free the storage of full parameters after a module's forward pass, so that the next module can reuse this memory buffer for subsequent computation. PyTorch's FSDP accomplishes this on CUDA by freeing the actual storage of a parameter p via p.data.storage().resize_(0). However, XLA tensors do not have this .storage() handle, given that the XLA HLO IRs are completely functional and do not provide any ops to deallocate a tensor or resize its storage. Below the PyTorch interface, only the XLA compiler can decide when to free the TPU device memory corresponding to an XLA tensor, and a prerequisite is that the memory can only be released when the tensor object gets deallocated in Python, which cannot happen in FSDP because these parameter tensors are referenced as module attributes and also saved by PyTorch autograd for the backward pass.

Our solution to this issue is to split a tensor’s value properties from its autograd Variable properties, and to free a nn.Parameter tensor by setting its .data attribute to a dummy scalar of size 1. This way the actual data tensor for the full parameter gets dereferenced in Python so that XLA can recycle its memory for other computation, while autograd can still trace the base nn.Parameter as a weak reference to the parameter data. To get this to work, one also needs to handle views over the parameters as views in PyTorch also hold references to its actual data (this required fixing a shape-related issue with views in PyTorch/XLA).
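
A simplified, Python-level sketch of this trick (illustrative only, not the FSDP source) is:

import torch

def free_full_param(p: torch.nn.Parameter) -> None:
    # Point .data at a size-1 placeholder so the full tensor can be released
    # by XLA, while the nn.Parameter object (and autograd's weak reference to
    # it) stays alive.
    p.data = p.data.new_zeros(1)

def restore_full_param(p: torch.nn.Parameter, full_value: torch.Tensor) -> None:
    p.data = full_value  # e.g. the all-gathered full parameter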

Working with XLA compiler

The solution above should be enough to free full parameters if the XLA compiler faithfully preserves the operations and their execution order in our PyTorch program. But there is another problem – XLA attempts to optimize the program to speed up its execution by applying common subexpression elimination (CSE) to the HLO IRs. In a naive implementation of FSDP, the XLA compiler typically eliminates the 2nd all-gather in the backward pass to reconstruct the full parameters when it sees that it is a repeated computation from the forward pass, and directly holds and reuses the full parameters we want to free up after the forward pass. To guard against this undesired compiler behavior, we introduced the optimization barrier op into PyTorch/XLA and used it to stop eliminating the 2nd all-gather. This optimization barrier is also applied to a similar case of gradient checkpointing to prevent CSE between forward and backward passes that could eliminate the rematerialization.

In the future, if the distinctions between CUDA and XLA become not as prominent as mentioned above, it could be worth considering a merge of the PyTorch/XLA FSDP with the native PyTorch FSDP to have a unified interface.

Acknowledgments

Thanks to Junmin Hao from AWS for reviewing the PyTorch/XLA FSDP pull request. Thanks to Brian Hirsh from the Meta PyTorch team for support on the PyTorch core issues. Thanks to Isaack Karanja, Will Cromar, and Blake Hechtman from Google for support on GCP, XLA, and TPU issues.

Thanks to Piotr Dollar, Wan-Yen Lo, Alex Berg, Ryan Mark, Kaiming He, Xinlei Chen, Saining Xie, Shoubhik Debnath, Min Xu, and Vaibhav Aggarwal from Meta FAIR for various TPU-related discussions.

Read More

Performance Debugging of Production PyTorch Models at Meta

1. Meta’s AI Performance Profiling (MAIProf)

Figure 1: A simplified illustration of Meta’s AI performance profiling (MAIProf) infrastructure.

Figure 1 gives a simplified illustration of the AI performance profiling infrastructure at Meta. ML research and performance engineers submit a profiling request for a training job through the User Portal to the Profiling Service, which subsequently broadcasts the request to all the GPU hosts running the training job. When the Monitoring Daemon on a GPU host receives the profiling request, it notifies the Kineto GPU tracer (built on top of NVIDIA’s libcupti) inside the PyTorch program corresponding to the training job. As a result, Kineto traces are collected and uploaded to the Object Store asynchronously (in more detail: one Kineto trace is collected for each individual GPU, and each is treated and stored as a blob; an example will be given in Section 2). Meanwhile, MAIProf also collects a variety of aggregated performance metrics: the Monitoring Daemon on every GPU host continuously reads performance counters from NVIDIA’s DCGM/NVML and logs them to a Time Series DB.

Once both trace and metrics collections are completed, the Profiling Service will automatically download traces from the Object Store for trace analysis and performance metrics from the Time Series DB for metric analysis. Finally, an overall profiling report with detailed and insightful analysis is delivered to the user.

To serve production uses, we deliberately made the following design choices for MAIProf:

  • No source-code change required in the PyTorch models: profiling is triggered by sampling the execution of an unmodified model for a user-specified amount of time.
  • Provide a holistic view of performance: MAIProf performs system-wide analysis that covers both CPU and GPU. Under the hood, it invokes various CPU tools (e.g., Python tracer, Autograd Observer) and GPU tools (e.g., Kineto, DCGM) and correlates their results.
  • Provide multiple tools that target a wide range of AI practitioners: At Meta, there are engineers with different backgrounds who may need to tune their AI workload performance. Some of them are AI experts while others are general software engineers. Therefore, MAIProf provides a variety of tools for different levels of performance debugging, from high-level automatic trace comprehension to low-level trace analysis.
  • Support distributed GPU profiling: MAIProf can collect profiling data from multiple hosts, each with multiple GPUs. It then shows a combined view/analysis of the entire system.
  • Highly scalable: MAIProf is built as a service on top of existing infrastructures in Meta data centers, such as a scalable storage system called Manifold. Its profiling capability can be easily scaled by adding more machines to the service pool as workloads increase.

2. Case Study: Optimizing a Protection PyTorch Model

To be concrete, we use a case study on a protection PyTorch model used in production. First, we discuss our steps for identifying the performance bottlenecks in the model with MAIProf. Then we describe the corresponding optimizations applied and their impacts.

2.1 Performance Bottlenecks

Step 1:

Inspect the CPU and GPU utilization on the same timeline, as shown in Figure 2.

Figure 2: CPU usage over time (the top) vs. GPU usage over time (the bottom).

The first performance anomaly we noticed in Figure 2 is the pattern: “GPU-idle, GPU-active, GPU-idle, GPU-active …” throughout the training. Overall, the GPU is idle for more than half of the training time (this is bad for performance because the GPU is a higher-performance device and so we want it to be utilized as much as possible).

Step 2:

Collect a Python function call trace on the CPU with MAIProf while the GPU is idle, which is shown in Figure 3.

Figure 3: A Python call trace.

The Python trace shows that most of the CPU time is spent inside a Python function sharded_iterrows(). From the source code of the model, we learned that this function processes a big feature table in parallel. The number of worker threads used is controlled by a configurable parameter (num_worker_threads). Also, after investigating how the feature table is generated, we understood the performance anomaly: the training dataset is too large to fit in the CPU memory all at once; it needs to be broken into multiple sub-datasets, each of which has sufficient data for running 10 epochs. Consequently, a new sub-dataset needs to be read from disk into memory every 10 epochs, during which the GPU is totally idle.

Step 3:

Collect GPU performance metrics, which are shown in Figure 4.

Figure 4: GPU performance metrics in MAIProf.

We made the following observations from Figure 4:

  • The streaming multiprocessor (SM) runs the model’s CUDA kernels. Its utilization [1] is 9.1%, indicating that the parallel compute units on the GPU are not well utilized.
  • Tensor Core utilization is 0, meaning that Tensor Core (the mixed-precision compute unit on GPU) [2] is not used at all.
  • Max GPU memory utilization is 47.13%, indicating that more than half of the GPU memory is left unused.

Step 4:

Collect a GPU trace (aka Kineto trace) of the training loop as shown in Figure 5.

Figure 5: A GPU trace (aka Kineto trace) of the training loop.

Since commonly used PyTorch functions are already annotated, their names are automatically shown on the trace. With them, we can roughly divide the trace into the four phases in a training iteration: (1) data loading, (2) forward pass, (3) backward pass, (4) gradient optimization (note: In Figure 5, the “optimizer” phase is from the previous batch while the other three phases are from the current batch).

2.2 Optimizations

We performed four simple optimizations that target the bottlenecks identified above, each requiring only a change in a config parameter or at most a few source lines. They are listed in Figure 6.

Optimization | Amount of changes | Bottlenecks addressed
Tune num_worker_threads by trying a few possible values within the number of CPU cores on each host. | 1 source line | GPU totally idle time
Double the batch sizes | 2 config parameters | GPU memory under-utilization
Use automatic mixed precision in PyTorch | 13 source lines | Zero Tensor Core utilization
Use the multi-tensor optimizer in PyTorch | 1 source line | Many small GPU kernels in the optimizer

Figure 6: Four simple optimizations applied.
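
As an illustration of two of these changes (automatic mixed precision and the multi-tensor, i.e. foreach, optimizer implementation), a toy training loop, not the production model's code, might look like this:

import torch

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, foreach=True)  # multi-tensor path
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    data = torch.randn(8, 16, device="cuda")
    target = torch.randint(0, 4, (8,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # run forward/backward in mixed precision
        loss = torch.nn.functional.cross_entropy(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()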

3. Concluding Remarks

Performance tuning for PyTorch in production environments is increasingly important. A capable performance-debugging tool is a key to this process. We demonstrate with a case study on a production model that MAIProf is a powerful infrastructure for identifying optimization opportunities.

At Meta, MAIProf has been used by hundreds of engineers, from performance novices to experts, to identify many more types of bottlenecks. These include slow data loading, small and/or slow GPU kernels, and distributed training issues such as load imbalance and excessive communication. MAIProf covers major classes of models, including recommendation, vision, and natural language processing. In summary, it is now an indispensable tool for tuning the performance of production PyTorch workloads.

References

[1] https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm

[2] https://www.nvidia.com/en-us/data-center/tensor-cores/

Read More