PyTorch 2.1: automatic dynamic shape compilation, distributed checkpointing

We are excited to announce the release of PyTorch® 2.1 (release note)! PyTorch 2.1 offers automatic dynamic shape support in torch.compile, torch.distributed.checkpoint for saving/loading distributed training jobs on multiple ranks in parallel, and torch.compile support for the NumPy API.

In addition, this release offers numerous performance improvements (e.g. CPU inductor improvements, AVX512 support, scaled-dot-product-attention support) as well as a prototype release of torch.export, a sound full-graph capture mechanism, and torch.export-based quantization.

Along with 2.1, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog. 

This release is composed of 6,682 commits and 784 contributors since 2.0. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.1.  More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Summary: 

  • torch.compile now includes automatic support for detecting and minimizing recompilations due to tensor shape changes using automatic dynamic shapes.
  • torch.distributed.checkpoint enables saving and loading models from multiple ranks in parallel, as well as resharding due to changes in cluster topology.
  • torch.compile can now compile NumPy operations via translating them into PyTorch-equivalent operations.
  • torch.compile now includes improved support for Python 3.11.
  • New CPU performance features include inductor improvements (e.g. bfloat16 support and dynamic shapes), AVX512 kernel support, and scaled-dot-product-attention kernels.
  • torch.export, a sound full-graph capture mechanism, is introduced as a prototype feature, as well as torch.export-based quantization.
  • torch.sparse now includes prototype support for semi-structured (2:4) sparsity on NVIDIA® GPUs.

Features by status:

  • Stable: (none)
  • Beta: Automatic Dynamic Shapes; torch.distributed.checkpoint; torch.compile + NumPy; torch.compile + Python 3.11; torch.compile + autograd.Function; third-party device integration: PrivateUse1
  • Prototype: torch.export(); torch.export-based Quantization; semi-structured (2:4) sparsity; cpp_wrapper for torchinductor
  • Performance Improvements: AVX512 kernel support; CPU optimizations for scaled-dot-product-attention (SDPA); CPU optimizations for bfloat16

*To see a full list of public 2.1, 2.0, and 1.13 feature submissions click here.

Beta Features

[Beta] Automatic Dynamic Shapes

Dynamic shapes is functionality built into torch.compile that can minimize recompilations by tracking and generating code based on the symbolic shape of a tensor rather than the static shape (e.g. [B, 128, 4] rather than [64, 128, 4]). This allows torch.compile to generate a single kernel that can work for many sizes, at only a modest cost to efficiency. Dynamic shapes has been greatly stabilized in PyTorch 2.1, and is now automatically enabled if torch.compile notices recompilation due to varying input shapes. You can disable automatic dynamic shapes by passing dynamic=False to torch.compile, or by setting torch._dynamo.config.automatic_dynamic_shapes = False.

In PyTorch 2.1, we have shown good performance with dynamic shapes enabled on a variety of model types, including large language models, on both CUDA and CPU.
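
The recompilation behavior described above can be pictured with a toy model. The sketch below is an illustration only and not how torch.compile is implemented: `ToyCompiler` and all of its attributes are hypothetical names, and "compiling" simply means adding an entry to a cache. It compiles a static kernel for the first shape, and on the first shape change it promotes the changed dimension to a symbol so later sizes reuse the same entry.

```python
# Toy model of automatic dynamic shapes. NOT torch.compile's implementation;
# every name below is hypothetical and exists only for illustration.

class ToyCompiler:
    def __init__(self):
        self.cache = {}             # shape key -> label standing in for a compiled kernel
        self.first_shape = None     # first concrete shape observed
        self.dynamic_dims = set()   # dimensions that have been promoted to symbols
        self.compile_count = 0

    def _key(self, shape):
        # Dimensions marked dynamic are replaced by a symbol, so many sizes share one key.
        return tuple("s%d" % i if i in self.dynamic_dims else d
                     for i, d in enumerate(shape))

    def run(self, shape):
        if self.first_shape is None:
            self.first_shape = shape
        key = self._key(shape)
        if key not in self.cache:
            # A miss after the first compile means a dimension changed:
            # promote every dimension that differs from the first shape.
            for i, (old, new) in enumerate(zip(self.first_shape, shape)):
                if old != new:
                    self.dynamic_dims.add(i)
            key = self._key(shape)
            if key not in self.cache:
                self.cache[key] = "kernel for %s" % (key,)
                self.compile_count += 1
        return self.cache[key]


compiler = ToyCompiler()
for batch in (64, 32, 16, 8):
    compiler.run((batch, 128, 4))

# One static compile for [64, 128, 4], one symbolic compile for [s0, 128, 4].
print(compiler.compile_count)
```

In this toy, four different batch sizes cost two compilations rather than four, which is the trade the section describes: one more general kernel instead of one kernel per size.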

For more information on dynamic shapes, see this documentation.

[Beta] torch.distributed.checkpoint

torch.distributed.checkpoint enables saving and loading models from multiple ranks in parallel. In addition, checkpointing automatically handles fully-qualified-name (FQN) mappings across models and optimizers, enabling load-time resharding across differing cluster topologies.
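
To make load-time resharding concrete, here is a conceptual sketch in plain Python. It does not use or mirror the torch.distributed.checkpoint API: `shard`, `save` and `load` are hypothetical helpers, tensors are flat lists, and the "checkpoint" is an in-memory dict. The point it illustrates is that keying shards by fully-qualified name plus offset lets a job saved on four ranks be loaded on two.

```python
# Conceptual sketch only; these helpers are hypothetical and are not part of
# torch.distributed.checkpoint.

def shard(values, world_size):
    """Split a flat list into world_size contiguous chunks."""
    chunk = -(-len(values) // world_size)  # ceiling division
    return [values[r * chunk:(r + 1) * chunk] for r in range(world_size)]

def save(state, world_size):
    """Each rank contributes only its own shard, stored under (fqn, rank) with its offset."""
    checkpoint = {}
    for fqn, values in state.items():
        offset = 0
        for rank, piece in enumerate(shard(values, world_size)):
            checkpoint[(fqn, rank)] = (offset, piece)
            offset += len(piece)
    return checkpoint

def load(checkpoint, fqn, rank, new_world_size):
    """Reassemble a tensor by FQN using the saved offsets, then re-slice it for a new topology."""
    pieces = sorted(v for (name, _), v in checkpoint.items() if name == fqn)
    full = [x for _, piece in pieces for x in piece]
    return shard(full, new_world_size)[rank]


state = {"layer.weight": list(range(8))}
checkpoint = save(state, world_size=4)          # saved by 4 ranks
print(load(checkpoint, "layer.weight", 0, 2))   # loaded by rank 0 of 2
print(load(checkpoint, "layer.weight", 1, 2))   # loaded by rank 1 of 2
```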

For more information, see torch.distributed.checkpoint documentation and tutorial.

[Beta] torch.compile + NumPy

torch.compile now understands how to compile NumPy operations via translating them into PyTorch-equivalent operations.  Because this integration operates in a device-agnostic manner, you can now GPU-accelerate NumPy programs – or even mixed NumPy/PyTorch programs – just by using torch.compile.

Please see this section in the torch.compile FAQ for more information about torch.compile + NumPy interaction, and follow the PyTorch Blog for a forthcoming blog about this feature.

[Beta] torch.compile + Python 3.11

torch.compile previously only supported Python versions 3.8-3.10. Users can now optimize models with torch.compile in Python 3.11.

[Beta] torch.compile + autograd.Function

torch.compile can now trace and optimize the backward function of user-defined autograd Functions, which unlocks training optimizations for models that make heavier use of extension mechanisms.

[Beta] Improved third-party device support: PrivateUse1

Third-party device types can now be registered to PyTorch using the privateuse1 dispatch key.  This allows device extensions to register new kernels to PyTorch and to associate them with the new key, allowing user code to work equivalently to built-in device types.  For example, to register “my_hardware_device”, one can do the following:

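import torch

# Rename the PrivateUse1 backend, then generate the matching Tensor/Module helper methods.
torch.rename_privateuse1_backend("my_hardware_device")
torch.utils.generate_methods_for_privateuse1_backend()
# The lines below need a device extension that has registered kernels for this key.
x = torch.randn((2, 3), device='my_hardware_device')
y = x + x # run add kernel on 'my_hardware_device'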

To validate this feature, the OSS team from Ascend NPU has successfully integrated torch_npu into PyTorch as a plug-in through the PrivateUse1 functionality.

For more information, please see the PrivateUse1 tutorial here.

Prototype Features

[Prototype] torch.export()

torch.export() provides a sound tracing mechanism to capture a full graph from a PyTorch program based on new technologies provided by PT2.0.

Users can extract a clean representation (Export IR) of a PyTorch program in the form of a dataflow graph, consisting of mostly straight-line calls to PyTorch operators. Export IR can then be transformed, serialized, saved to file, transferred, and loaded back for execution in an environment with or without Python.

For more information, please see the tutorial here.

[Prototype] torch.export-based Quantization

torch.ao.quantization now supports post-training static quantization on PyTorch2-based torch.export flows.  This includes support for built-in XNNPACK and X64Inductor Quantizer, as well as the ability to specify one’s own Quantizer.

For an explanation of post-training static quantization with torch.export, see this tutorial. For quantization-aware training for static quantization with torch.export, see this tutorial.

For an explanation on how to write one’s own Quantizer, see this tutorial.

[Prototype] semi-structured (2:4) sparsity for NVIDIA® GPUs

torch.sparse now supports creating and accelerating compute over semi-structured sparse (2:4) tensors.  For more information on the format, see this blog from NVIDIA. A minimal example introducing semi-structured sparsity is as follows:

import torch
from torch import nn
from torch.sparse import to_sparse_semi_structured
 
x = torch.rand(64, 64).half().cuda()
mask = torch.tensor([0, 0, 1, 1]).tile((64, 16)).cuda().bool()
linear = nn.Linear(64, 64).half().cuda()

linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight.masked_fill(~mask, 0)))
linear(x)

To learn more, please see the documentation and accompanying tutorial.

[Prototype] cpp_wrapper for torchinductor

cpp_wrapper can reduce the Python overhead for invoking kernels in torchinductor by generating the kernel wrapper code in C++. This feature is still in the prototype phase; it does not support all programs that successfully compile in PT2 today. Please file issues if you discover limitations for your use case to help us prioritize.

The API to turn this feature on is:

import torch
import torch._inductor.config as config
config.cpp_wrapper = True

For more information, please see the tutorial.

Performance Improvements

AVX512 kernel support

In PyTorch 2.0, AVX2 kernels would be used even if the CPU supported AVX512 instructions.  Now, PyTorch defaults to using AVX512 CPU kernels if the CPU supports those instructions, equivalent to setting ATEN_CPU_CAPABILITY=avx512 in previous releases.  The previous behavior can be enabled by setting ATEN_CPU_CAPABILITY=avx2.

CPU optimizations for scaled-dot-product-attention (SDPA)

Previous versions of PyTorch provided optimized CUDA implementations for transformer primitives via torch.nn.functional.scaled_dot_product_attention.  PyTorch 2.1 includes optimized FlashAttention-based CPU routines.

See the documentation here.

CPU optimizations for bfloat16

PyTorch 2.1 includes CPU optimizations for bfloat16, including improved vectorization support and torchinductor codegen.


New Library Updates in PyTorch 2.1

Summary

We are bringing a number of improvements to the current PyTorch libraries, alongside the PyTorch 2.1 release. These updates demonstrate our focus on developing common and extensible APIs across all domains to make it easier for our community to build ecosystem projects on PyTorch. 

Along with 2.1, we are also releasing a series of beta updates to the PyTorch domain libraries including TorchAudio and TorchVision. Please find the list of the latest stable versions and updates below.

Latest Stable Library Versions (Full List)*

  • TorchArrow 0.1.0
  • TorchAudio 2.1
  • TorchData 0.7.0
  • TorchRec 0.4.0
  • TorchServe 0.7.1
  • TorchText 0.16.0
  • TorchVision 0.16
  • TorchX 0.5.0
  • PyTorch on XLA Devices 1.14

*To see prior versions or (unstable) nightlies, click on versions in the top left menu above ‘Search Docs’.

TorchAudio

TorchAudio v2.1 introduces the following new features and backward-incompatible changes:

[Beta] A new API to apply filters, effects and codecs

`torchaudio.io.AudioEffector` can apply filters, effects and encodings to waveforms in an online or offline fashion. You can use it as a form of augmentation.

Please refer to https://pytorch.org/audio/2.1/tutorials/effector_tutorial.html for the usage and examples.

[Beta] Tools for Forced alignment

New functions and a pre-trained model for forced alignment were added. `torchaudio.functional.forced_align` computes alignment from an emission and `torchaudio.pipelines.MMS_FA` provides access to the model trained for multilingual forced alignment in the MMS: Scaling Speech Technology to 1000+ languages project.

Please refer to https://pytorch.org/audio/2.1/tutorials/ctc_forced_alignment_api_tutorial.html for the usage of the `forced_align` function, and https://pytorch.org/audio/2.1/tutorials/forced_alignment_for_multilingual_data_tutorial.html for how one can use `MMS_FA` to align transcripts in multiple languages.

[Beta] TorchAudio-Squim : Models for reference-free speech assessment

Model architectures and pre-trained models from the paper TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio were added.

You can use the pre-trained models `torchaudio.pipelines.SQUIM_SUBJECTIVE` and `torchaudio.pipelines.SQUIM_OBJECTIVE`. They can estimate the various speech quality and intelligibility metrics (e.g. STOI, wideband PESQ, Si-SDR, and MOS). This is helpful when evaluating the quality of speech generation models, such as Text-to-Speech (TTS).

Please refer to https://pytorch.org/audio/2.1/tutorials/squim_tutorial.html for the details.

[Beta] CUDA-based CTC decoder

`torchaudio.models.decoder.CUCTCDecoder` performs CTC beam search on CUDA devices. The beam search is fast. It eliminates the need to move data from the CUDA device to the CPU when performing automatic speech recognition. With PyTorch’s CUDA support, it is now possible to run the entire speech recognition pipeline on CUDA.

Please refer to https://pytorch.org/audio/master/tutorials/asr_inference_with_cuda_ctc_decoder_tutorial.html for details.

[Prototype] Utilities for AI music generation

We are working to add utilities that are relevant to music AI. Since the last release, the following APIs were added to the prototype.

Please refer to respective documentation for the usage.

New recipes for training models

Recipes for Audio-visual ASR, multi-channel DNN beamforming and TCPGen context-biasing were added.

Please refer to the recipes

Update to FFmpeg support

The version of supported FFmpeg libraries was updated. TorchAudio v2.1 works with FFmpeg 6, 5 and 4.4. Support for 4.3, 4.2 and 4.1 has been dropped.

Please refer to https://pytorch.org/audio/2.1/installation.html#optional-dependencies for the detail of the new FFmpeg integration mechanism.

Update to libsox integration

TorchAudio now depends on libsox installed separately from torchaudio. Sox I/O backend no longer supports file-like objects. (This is supported by FFmpeg backend and soundfile.)

Please refer to https://pytorch.org/audio/master/installation.html#optional-dependencies for the details.

TorchRL

Our RLHF components make it easy to build an RLHF training loop with limited RL knowledge. TensorDict enables an easy interaction between datasets (e.g., HF datasets) and RL models. The new algorithms we provide deliver a wide range of solutions for offline RL training, which is more data efficient.

Through RoboHive and IsaacGym, TorchRL now provides a built-in interface with hardware (robots), tying training at scale with policy deployment on device. Thanks to SMAC, VMAS, and PettingZoo and related MARL-oriented losses, TorchRL is now fully capable of training complex policies in multi-agent settings.

New algorithms

  • [BETA] We integrate some RLHF components and examples: we provide building blocks for data formatting in RL frameworks, reward model design, specific transforms that enable efficient learning (e.g. KL correction) and training scripts.
  • [Stable] New algorithms include Decision Transformers, CQL, and multi-agent losses such as MAPPO and QMixer.

New features

  • [Stable] New transforms such as Visual Cortex 1 (VC1), a foundational model for RL.
  • We widened the panel of libraries covered by TorchRL:
    • [Beta] IsaacGym, a powerful GPU-based simulator by NVIDIA that allows interaction with and rendering of thousands of vectorized environments.
    • [Stable] PettingZoo, a multi-agent library by the Farama Foundation.
    • [Stable] SMAC-v2, the new Starcraft Multi-agent simulator
    • [Stable] RoboHive, a collection of environments/tasks simulated with the MuJoCo physics engine.

Performance improvements

We provide faster data collection through refactoring and the integration of SB3 and Gym asynchronous environment execution. We also made our value functions faster to execute.

TorchRec

[Prototype] Zero Collision / Managed Collision Embedding Bags

A common constraint in recommender systems is that the sparse id input range is larger than the number of embeddings the model can learn for a given parameter size. To resolve this issue, the conventional solution is to hash sparse ids into the same size range as the embedding table. This ultimately leads to hash collisions, with multiple sparse ids sharing the same embedding space. We have developed a performant alternative algorithm that attempts to address this problem by tracking the N most common sparse ids and ensuring that they have a unique embedding representation. The module is defined here and an example can be found here.
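
The contrast between plain hashing and tracking the most common ids can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the TorchRec module: `hashed_slots`, `managed_slots` and `colliding` are hypothetical helpers, hashing is a simple modulo, and ids outside the top N all share one leftover row.

```python
from collections import Counter

# Toy illustration; these helpers are hypothetical and are not TorchRec APIs.

def hashed_slots(ids, table_size):
    """Conventional approach: map every sparse id to a row by modulo hashing."""
    return {i: i % table_size for i in set(ids)}

def managed_slots(ids, table_size):
    """Give the most frequent ids a row of their own; all other ids share the last row."""
    top = [i for i, _ in Counter(ids).most_common(table_size - 1)]
    slots = {i: row for row, i in enumerate(top)}
    for i in set(ids):
        slots.setdefault(i, table_size - 1)
    return slots

def colliding(slots, ids_of_interest):
    """Return the ids of interest that share a row with at least one other id."""
    rows = Counter(slots.values())
    return sorted(i for i in ids_of_interest if rows[slots[i]] > 1)


# Three frequent ids and three rare ones, all congruent modulo 4.
ids = [7] * 5 + [11] * 4 + [3] * 3 + [15, 19, 23]
frequent = [3, 7, 11]

print(colliding(hashed_slots(ids, 4), frequent))   # frequent ids collide under hashing
print(colliding(managed_slots(ids, 4), frequent))  # frequent ids keep unique rows
```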

[Prototype] UVM Caching – Prefetch Training Pipeline

For tables where on-device memory is insufficient to hold the entire embedding table, it is common to leverage a caching architecture where part of the embedding table is cached on device and the full embedding table is on host memory (typically DDR SDRAM). However, in practice, cache misses are common, and hurt performance due to the relatively high latency of going to host memory. Building on TorchRec’s existing data pipelining, we developed a new Prefetch Training Pipeline to avoid these cache misses by prefetching the relevant embeddings for the upcoming batch from host memory, effectively eliminating cache misses in the forward path.
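
A small simulation shows why prefetching removes misses from the forward path. It is a sketch, not the TorchRec pipeline: `ToyCache` and `run` are hypothetical names, the cache has no capacity limit or eviction, and the prefetch runs sequentially here whereas a real pipeline would overlap it with the current step.

```python
# Toy simulation; ToyCache and run are hypothetical and are not TorchRec APIs.

class ToyCache:
    """Device-side cache in front of a host-memory embedding table (no eviction)."""

    def __init__(self, host_table):
        self.host = host_table   # full table, standing in for host memory
        self.device = {}         # cached rows, standing in for device memory
        self.misses = 0

    def prefetch(self, ids):
        # Copy the rows an upcoming batch will need before its forward pass starts.
        for i in ids:
            self.device.setdefault(i, self.host[i])

    def lookup(self, ids):
        rows = []
        for i in ids:
            if i not in self.device:
                self.misses += 1              # slow path: fetch from host on demand
                self.device[i] = self.host[i]
            rows.append(self.device[i])
        return rows

def run(batches, host_table, use_prefetch):
    """Run lookups for every batch and return the number of forward-path misses."""
    cache = ToyCache(host_table)
    if use_prefetch and batches:
        cache.prefetch(batches[0])
    for step, batch in enumerate(batches):
        if use_prefetch and step + 1 < len(batches):
            cache.prefetch(batches[step + 1])
        cache.lookup(batch)
    return cache.misses


host = {i: [float(i)] for i in range(10)}
batches = [[0, 1], [2, 3], [1, 4]]
print(run(batches, host, use_prefetch=False))
print(run(batches, host, use_prefetch=True))
```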

TorchVision 

Transforms and augmentations

Major speedups

The new transforms in torchvision.transforms.v2 are now 10%-40% faster than before! This is mostly achieved thanks to 2X-4X improvements made to v2.Resize(), which now supports native uint8 tensors for Bilinear and Bicubic mode. Output results are also now closer to PIL’s! Check out our performance recommendations to learn more.

Additionally, torchvision now ships with libjpeg-turbo instead of libjpeg, which should significantly speed up the jpeg decoding utilities (read_image, decode_jpeg), and avoid compatibility issues with PIL.

CutMix and MixUp

Long-awaited support for the CutMix and MixUp augmentations is now here! Check our tutorial to learn how to use them.

Towards stable V2 transforms

In the previous release 0.15 we BETA-released a new set of transforms in torchvision.transforms.v2 with native support for tasks like segmentation, detection, or videos. We have now stabilized the design decisions of these transforms and made further improvements in terms of speedups, usability, new transforms support, etc.

We’re keeping the torchvision.transforms.v2 and torchvision.tv_tensors namespaces as BETA until 0.17 out of precaution, but we do not expect disruptive API changes in the future.

Whether you’re new to Torchvision transforms, or you’re already experienced with them, we encourage you to start with Getting started with transforms v2 in order to learn more about what can be done with the new v2 transforms.

Browse our main docs for general information and performance tips. The available transforms and functionals are listed in the API reference. Additional information and tutorials can also be found in our example gallery, e.g. Transforms v2: End-to-end object detection/segmentation example or How to write your own v2 transforms.

[BETA] MPS support

The nms and roi-align kernels (roi_align, roi_pool, ps_roi_align, ps_roi_pool) now support MPS. Thanks to Li-Huai (Allan) Lin for this contribution!

TorchX

Schedulers

  • [Prototype] Kubernetes MCAD Scheduler: Integration for easily scheduling jobs on Multi-Cluster-Application-Dispatcher (MCAD)

  • AWS Batch 

    • Add privileged option to enable running containers on EFA-enabled instances with elevated networking permissions

TorchX Tracker

  • [Prototype] MLFlow backend for TorchX Tracker: in addition to the fsspec-based tracker, TorchX can use an MLFlow instance to track metadata/experiments

Components

  • dist.spmd component to support Single-Process-Multiple-Data style applications

Workspace

  • Add ability to access image and workspace path from Dockerfile while building docker workspace

The release includes a number of other bug fixes.

To learn more about TorchX visit https://pytorch.org/torchx/latest/

TorchText and TorchData

As of September 2023 we have paused active development of TorchText and TorchData as we re-evaluate how we want to serve the needs of the community in this space.


How to Build an Interactive Chat-Generation Model using DialoGPT and PyTorch


The focus on interactive chat-generation (or conversational response-generation) models has greatly increased in the past several months. Conversational response-generation models such as ChatGPT and Google Bard have taken the AI world by storm. The purpose of interactive chat generation is to answer various questions posed by humans, and these AI-based models use natural language processing (NLP) to generate conversations almost indistinguishable from those generated by humans.

This article showcases a code sample on how to create interactive chats based on a pre-trained DialoGPT model from Hugging Face with the addition of the Intel® Extension for PyTorch to perform dynamic quantization on the model.

Get Started

Why DialoGPT?

DialoGPT (Dialogue Generative Pre-trained Transformer) is a large-scale, pre-trained dialogue-response-generation model trained on 147M conversation-like exchanges pulled out from Reddit comment chains and discussion threads. DialoGPT was proposed by Microsoft in 2019. The main goal was to create open-domain chatbots capable of producing natural responses to a variety of conversational topics. The conversational response-generation systems that leverage DialoGPT generate more applicable, resourceful, diverse, and context-specific replies.

DialoGPT Architecture

DialoGPT architecture is based on the GPT-2 model. It is formulated as an autoregressive language model and uses a multi-layer transformer as the model architecture. GPT-2 was proposed by OpenAI. GPT-2 models are trained on general text data whereas DialoGPT is trained on Reddit discussion threads.

Let’s look at the GPT-2 architecture. There are two types of blocks in general transformer architecture:

  • Encoder – contains self-attention layer and feed-forward neural network
  • Decoder – similar to encoder, but the self-attention layer is masked

The self-attention layer allows a position to peek at tokens to the right of the current word (the successive words in text), whereas the masked self-attention layer prevents that from happening.

self-attention layer vs masked self-attention layer
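
The difference between the two layers comes down to which positions each token is allowed to attend to. The following plain-Python sketch illustrates that idea only; `attention_mask` and `masked_softmax` are hypothetical helpers rather than functions from GPT-2 or any library, and the scores are made-up zeros so the effect of the mask is easy to see.

```python
import math

# Illustrative helpers; the names are hypothetical.

def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) or not causal for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to disallowed positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0] * 3 for _ in range(3)]   # uniform scores for three tokens

# Encoder-style self-attention: every token sees every other token.
print(masked_softmax(scores, attention_mask(3, causal=False)))

# Decoder-style masked self-attention: token i sees only tokens 0..i.
print(masked_softmax(scores, attention_mask(3, causal=True)))
```

With the causal mask the first token attends only to itself, the second splits its weight over the first two tokens, and no token places any weight on a position to its right.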

GPT-2 is built using transformer decoder blocks. This means that the following layers are used in the architecture:

  1. Embedding Layer – responsible for converting input text into embeddings (each word is converted to a fixed-length vector representation)
  2. Transformer Decoder – includes multiple decoder blocks with masked self-attention and feed forward neural network layers
  3. Output Layer – responsible for converting embeddings obtained from the decoder into words

GPT-2 architecture (and DialoGPT architecture) is shown below.

GPT-2 architecture

As the model is based on the transformer architecture, it is prone to repetition and to copying its inputs. To reduce repetition, we can use Top-K sampling and Top-p sampling.

  • Top-K sampling – filters the K most likely next words and redistributes the probability mass among only those K next words.
  • Top-p sampling – rather than selecting only the most likely K words, selects the smallest possible set of words whose cumulative probability exceeds the probability p.

The probability mass is then redistributed among the words in the set. As a result, the size of the set of words can be dynamically increased and decreased based on the probability distribution of the next word.
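
Both filters can be sketched directly from the descriptions above. This is a minimal plain-Python illustration on a made-up four-word distribution, not the implementation used by any particular library; `top_k_filter` and `top_p_filter` are hypothetical names.

```python
# Illustrative helpers operating on a {token: probability} dict; names are hypothetical.

def top_k_filter(probs, k):
    """Keep the k most likely tokens and redistribute the probability mass among them."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability
    exceeds p, then redistribute the probability mass among them."""
    keep, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        keep.append(token)
        cumulative += probs[token]
        if cumulative > p:
            break
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}


next_word = {"the": 0.5, "a": 0.25, "cat": 0.125, "dog": 0.125}

print(top_k_filter(next_word, k=2))    # always exactly two candidates
print(top_p_filter(next_word, p=0.7))  # two candidates are enough to exceed 0.7
print(top_p_filter(next_word, p=0.9))  # all four are needed to exceed 0.9
```

The last two calls show the dynamic set size: the same function keeps two words for p=0.7 and four for p=0.9, whereas Top-K always keeps K.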

Quantization using Intel® Extension for PyTorch

What is Quantization?

Quantization is a systematic reduction of the precision of all or several layers within the model. This means a higher-precision type, such as the single-precision floating-point (FP32) mostly used in deep learning, is converted into a lower-precision type such as FP16 (16 bits) or INT8 (8 bits).

This helps achieve:

  • lower memory bandwidth
  • lower storage
  • higher performance with minimal-to-zero accuracy loss

Quantization is especially important with large models such as those based on the Transformer architecture like BERT or GPT.

There are two types of quantization:

  • Static – Static quantization quantizes the weights and activations of the model. This quantization is used when both memory bandwidth and compute savings are important.
  • Dynamic – In dynamic quantization, the weights are quantized ahead of time, but the activations are dynamically quantized during inference.
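
The two schemes differ only in when the activation scale is chosen. The sketch below illustrates this with symmetric INT8 quantization on plain Python lists. It is a simplified assumption-laden example, not the Intel Extension for PyTorch implementation: the helper names, the symmetric [-127, 127] range and the sample values are all made up for illustration.

```python
# Simplified symmetric INT8 quantization; helper names and values are hypothetical.

def dynamic_scale(values):
    """Choose a scale so the largest magnitude maps to 127 (1.0 if all values are zero)."""
    return max(abs(v) for v in values) / 127 or 1.0

def quantize(values, scale):
    """Round each value to the nearest INT8 step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in q_values]


# Weights are known ahead of time, so both schemes quantize them once.
weights = [0.5, -1.27, 0.02]
q_weights = quantize(weights, dynamic_scale(weights))

# Static: the activation scale is fixed in advance from calibration data.
calibration = [0.9, -1.0, 0.4]
static_act_scale = dynamic_scale(calibration)

# Dynamic: the activation scale is recomputed from each input at inference time.
activations = [2.0, -0.5, 1.0]
dynamic_act_scale = dynamic_scale(activations)

q_static = quantize(activations, static_act_scale)     # 2.0 exceeds the calibrated range and is clamped
q_dynamic = quantize(activations, dynamic_act_scale)   # the scale adapts to this input

print(q_weights)
print(dequantize(q_static, static_act_scale)[0])
print(dequantize(q_dynamic, dynamic_act_scale)[0])
```

In this toy, the static scale clips an activation that falls outside the calibration range, while the dynamic scale recovers it at the cost of computing the scale during inference.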

Intel Extension for PyTorch: The Intel Extension extends PyTorch with up-to-date features and optimizations for an extra performance boost on Intel® hardware. Learn how to install it standalone or get it as part of the Intel® AI Analytics Toolkit.

The extension can be loaded as a Python* module or linked as a C++ library. Python users can enable it dynamically by importing intel_extension_for_pytorch.

  • This CPU tutorial gives detailed information about Intel Extension for PyTorch for Intel CPUs. Source code is available at the master branch.
  • This GPU tutorial gives detailed information about Intel Extension for PyTorch for Intel GPUs. Source code is available at the xpu-master branch.

How to perform dynamic quantization using Intel Extension for PyTorch?

Here are the steps to quantize an existing FP32 model to an INT8 model using dynamic quantization:

  1. Prepare quantization configuration – We can use the default dynamic quantization configuration with ipex.quantization.default_dynamic_qconfig.
  2. Prepare the FP32 model by using the ipex.quantization.prepare method (provide input parameters such as the FP32 model to quantize, the prepared configuration, example inputs, and whether the quantization should be done in place).
  3. Convert the model from FP32 to INT8 – Use ipex.quantization.convert method for conversion. The input model will be the model prepared in step 2.

We also encourage you to check out the Intel® Neural Compressor tool that automates popular model-compression technologies such as quantization, pruning, and knowledge distillation across multiple deep learning frameworks.

Code Sample

The following steps are implemented in the code sample:

  1. Load model and tokenizer: The Transformers library (check out Intel® Extension for Transformers) and the Auto Classes available in the Hugging Face Main Classes are used in this step. These allow us to automatically find the relevant model by the given name. They also make it easy to change the model without major changes in the code on the developer’s side, as shown below:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model)
    model = AutoModelForCausalLM.from_pretrained(model)
    

    The model parameter, passed to both the tokenizer and the model initialization, is just the path to the pre-trained DialoGPT model. In this sample, we are using ‘microsoft/DialoGPT-large’. If you have limited resources, you can use the ‘microsoft/DialoGPT-medium’ or ‘microsoft/DialoGPT-small’ models and receive comparable results.

  2. Perform dynamic quantization of the model:
    1. Create the configuration using the default dynamic quantization configuration from Intel Extension for PyTorch library.
    2. Prepare the model.
    3. Convert the model from FP32 to INT8.
      The steps are explained in detail in the above section.
  3. Response generation: The first step in response generation is to encode the input sentence as shown in the code below:
    new_input_ids = tokenizer.encode(input(">> You:") + tokenizer.eos_token, return_tensors='pt')
    

    In this sample, we want our model to save history, so we are adding input sentences in the form of tokens to the chat history:

    bot_input_ids = torch.cat([chat_history_ids, new_input_ids], dim=-1) if chat_round > 0 else new_input_ids
    

    The text generation can be done by the model.generate function, where we can specify all important parameters like saved chat history, length of the response in tokens, and usage of both Top-K and Top-p sampling.

    chat_history_ids = model.generate(bot_input_ids, do_sample=True, max_length=2000, top_k=50, top_p=0.95, pad_token_id=tokenizer.eos_token_id) 
    

    The last step is to decode and print the response:

  4. Preparation for interactive conversation: After response generation, the last step is to add interaction. This can be done by using a simple for loop. Based on the initialized tokenizer, model, and empty chat history, responses are generated for a number of rounds:
    for chat_round in range(n):
        chat_history_ids = generate_response(
            tokenizer,
            model,
            chat_round,
            chat_history_ids
        )
    

    An example of interactive chat generation will look like the one shown in the picture below.

An example of interactive chat generation

What’s Next?

Get started with interactive chat-generation models using Intel Extension for PyTorch and DialoGPT. Download and try the Intel AI Analytics Toolkit and Intel Extension for PyTorch for yourself to build various end-to-end AI applications.

We encourage you to also check out and incorporate Intel’s other AI/ML Framework optimizations and end-to-end portfolio of tools into your AI workflow and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions.

For more details about the new 4th Gen Intel® Xeon® Scalable processors, visit Intel’s AI Solution Platform portal where you can learn how Intel is empowering developers to run end-to-end AI pipelines on these powerful CPUs.

Useful resources

Explore more AI code samples

See all code samples


Announcing PyTorch Docathon H2 2023

We are excited to announce that we will be holding a Docathon for PyTorch on November 1, 2023! This event is an opportunity for our community to come together and improve the quality of our documentation.

During the Docathon, we will focus on updating and improving existing content, as well as adding new tutorials and docstrings. We encourage all members of the community to participate and contribute their expertise to make our documentation even better. This is a great opportunity to learn and collaborate together.

Check out our previous docathon success story here.

Why Participate

One of the best things about the Docathon is that you can make a tangible, positive impact on the quality of documentation in real time. This collaborative event brings together diverse team members from various companies, backgrounds, and roles, united to work towards a common goal. This event not only fosters team building and knowledge sharing but also presents an opportunity for individuals to acquire new skills, such as writing, editing, and utilizing documentation tools. Participating in a docathon can be particularly beneficial for team members who may lack experience in these areas.

And of course all participants will be recognized for their contributions. Top participants will receive special awards.

Event Details

  • Nov 1: Kick-off
  • Nov 1 – Nov 12: Submissions and Feedback
  • Nov 13 – Nov 15: Final Reviews
  • Nov 15: Winner Announcements

Details for the Docathon will be announced at the kick-off call on November 1.

To participate in the Docathon and receive updates about the event, register here: RSVP

We are excited to see the improvements that will come out of this Docathon, and we look forward to your participation!


Inside the Matrix: Visualizing Matrix Multiplication, Attention and Beyond


Use 3D to visualize matrix multiplication expressions, attention heads with real weights, and more.

Matrix multiplications (matmuls) are the building blocks of today’s ML models. This note presents mm, a visualization tool for matmuls and compositions of matmuls.

Because mm uses all three spatial dimensions, it helps build intuition and spark ideas with less cognitive overhead than the usual squares-on-paper idioms, especially (though not only) for visual/spatial thinkers.

And with three dimensions available for composing matmuls, along with the ability to load trained weights, we can visualize big, compound expressions like attention heads and observe how they actually behave, using mm.

mm is fully interactive, runs in the browser or notebook iframes and keeps its complete state in the URL, so links are shareable sessions (the screenshots and videos in this note all have links that open the visualizations in the tool). This reference guide describes all of the available functionality.

We’ll first introduce the visualization approach, build intuition by visualizing some simple matmuls and expressions, then dive into some more extended examples:

  1. Pitch – why is this way of visualizing better?
  2. Warmup – animations – watching the canonical matmul decompositions in action
  3. Warmup – expressions – a quick tour of some fundamental expression building blocks
  4. Inside an attention head – an in-depth look at the structure, values and computation behavior of a couple of attention heads from GPT2 via NanoGPT
  5. Parallelizing attention – visualizing attention head parallelization with examples from the recent Blockwise Parallel Transformer paper
  6. Sizes in an attention layer – what do the MHA and FFN halves of an attention layer look like together, when we visualize a whole layer as a single structure? How does the picture change during autoregressive decoding?
  7. LoRA – a visual explanation of this elaboration of the attention head architecture
  8. Wrapup – next steps and call for feedback

1 Pitch

mm’s visualization approach is based on the premise that matrix multiplication is fundamentally a three-dimensional operation.

In other words this:

matrix multiplication is fundamentally a three-dimensional operation

is a sheet of paper trying to be this (open in mm):

wrap the matmul around a cube

When we wrap the matmul around a cube this way, the correct relationships between argument shapes, result shape and shared dimensions all fall into place.

Now the computation makes geometric sense: each location i, j in the result matrix anchors a vector running along the depth dimension k in the cube’s interior, where the horizontal plane extending from row i in L and the vertical plane extending from column j in R intersect. Along this vector, pairs of (i, k) (k, j) elements from the left and right arguments meet and are multiplied; the resulting products are then summed along k, and the sum is deposited in location i, j of the result.

(Jumping ahead momentarily, here’s an animation.)

This is the intuitive meaning of matrix multiplication:

  1. project two orthogonal matrices into the interior of a cube
  2. multiply the pair of values at each intersection, forming a grid of products
  3. sum along the third orthogonal dimension to produce a result matrix.
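The three steps above can be written out literally. The following is a minimal pure-Python sketch on nested lists, not how mm or PyTorch compute anything; the function name `matmul_via_cube` and the small example matrices are made up for illustration.

```python
def matmul_via_cube(L, R):
    """Matmul as: project into a cube, multiply at intersections, sum along k."""
    n_i, n_k, n_j = len(L), len(R), len(R[0])
    assert all(len(row) == n_k for row in L), "L's width must equal R's height"
    # Steps 1 and 2: fill the cube interior with products,
    # cube[i][k][j] = L[i][k] * R[k][j].
    cube = [[[L[i][k] * R[k][j] for j in range(n_j)]
             for k in range(n_k)]
            for i in range(n_i)]
    # Step 3: sum along the depth axis k to produce the result matrix.
    return [[sum(cube[i][k][j] for k in range(n_k)) for j in range(n_j)]
            for i in range(n_i)]


L = [[1, 2, 3],
     [4, 5, 6]]   # 2 x 3
R = [[1, 0],
     [0, 1],
     [2, 2]]      # 3 x 2
result = matmul_via_cube(L, R)
print(result)
```

The intermediate `cube` is exactly the grid of products that the visualization shows in the cube’s interior; a practical implementation would never materialize it.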

For orientation, the tool displays an arrow in the cube’s interior that points towards the result matrix, with a blue vane coming from the left argument and a red vane coming from the right argument. The tool also displays white guidelines to indicate the row axis of each matrix, though they’re faint in this screenshot.

The layout constraints are straightforward:

  • left argument and result must be adjoined along their shared height (i) dimension
  • right argument and result must be adjoined along their shared width (j) dimension
  • left and right arguments must be adjoined along their shared (left width/right height) dimension, which becomes the matmul’s depth (k) dimension

This geometry gives us a solid foundation for visualizing all the standard matmul decompositions, and an intuitive basis for exploring nontrivially complex compositions of matmuls, as we’ll see below.

2 Warmup – animations

Before diving into some more complex examples, we’ll run through a few intuition builders to get a feel for how things look and feel in this style of visualization.

2a Dot product

First, the canonical algorithm – computing each result element by taking the dot product of the corresponding left row and right column. What we see in the animation is the sweep of multiplied value vectors through the cube’s interior, each delivering a summed result at the corresponding position.

Here, L has blocks of rows filled with 1 (blue) or -1 (red); R has column blocks filled similarly. k is 24 here, so the result matrix (L @ R) has blue values of 24 and red values of -24 (open in mm – long click or control-click to inspect values):
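As a rough check of the numbers above, here is a small pure-Python sketch of the dot-product decomposition. Only k = 24 comes from the example; the result size `N` and block size `BLOCK` are arbitrary choices for illustration.

```python
K = 24      # depth dimension, as in the example above
BLOCK = 4   # hypothetical block height/width
N = 8       # hypothetical result size (N x N)


def sign(index):
    # Alternate +1 / -1 every BLOCK rows (or columns).
    return 1 if (index // BLOCK) % 2 == 0 else -1


L = [[sign(i)] * K for i in range(N)]                # N x K, row blocks of 1 / -1
R = [[sign(j) for j in range(N)] for _ in range(K)]  # K x N, column blocks of 1 / -1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Canonical algorithm: result[i][j] is the dot product of row i of L
# with column j of R.
columns_of_R = list(zip(*R))
result = [[dot(L[i], columns_of_R[j]) for j in range(N)] for i in range(N)]
values = sorted({v for row in result for v in row})
print(values)
```

Every dot product sums 24 terms that are all +1 or all -1, which is why only the two values 24 and -24 appear.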

2b Matrix-vector products

A matmul decomposed into matrix-vector products looks like a vertical plane (a product of the left argument with each column of the right argument) painting columns onto the result as it sweeps horizontally through the cube’s interior (open in mm):

Observing the intermediate values of a decomposition can be quite interesting, even in simple examples.

For instance, note the prominent vertical patterns in the intermediate matrix-vector products when we use randomly initialized arguments – reflecting the fact that each intermediate is a column-scaled replica of the left argument (open in mm):

2c Vector-matrix products

A matmul decomposed into vector-matrix products looks like a horizontal plane painting rows onto the result as it descends through the cube’s interior (open in mm):

Switching to randomly initialized arguments, we see patterns analogous to those we saw with matrix-vector products – only this time the patterns are horizontal, corresponding to the fact that each intermediate vector-matrix product is a row-scaled replica of the right argument.
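The “scaled replica” observation for both decompositions can be sketched in a few lines of pure Python. The helper names below are hypothetical, and the matrices are tiny integers chosen so the arithmetic is easy to follow.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec_intermediate(L, R, j):
    """Products behind result column j: L with column k scaled by R[k][j]."""
    return [[L[i][k] * R[k][j] for k in range(len(R))] for i in range(len(L))]


def vecmat_intermediate(L, R, i):
    """Products behind result row i: R with row k scaled by L[i][k]."""
    return [[L[i][k] * R[k][j] for j in range(len(R[0]))] for k in range(len(R))]


L = [[1, 2], [3, 4], [5, 6]]    # 3 x 2
R = [[7, 8, 9], [10, 11, 12]]   # 2 x 3
full = matmul(L, R)

# Summing each intermediate along k recovers one column (or row) of the result.
col_1 = [sum(row) for row in matvec_intermediate(L, R, 1)]
row_2 = [sum(col) for col in zip(*vecmat_intermediate(L, R, 2))]
print(col_1, row_2)
```

The first intermediate has the shape of L and the second has the shape of R, which matches the vertical and horizontal patterns described above.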

When thinking about how matmuls express the rank and structure of their arguments, it’s useful to envision both of these patterns happening simultaneously in the computation (open in mm):

Here’s one more intuition builder using vector-matrix products, showing how the identity matrix functions exactly like a mirror set at a 45-degree angle to both its counterargument and the result (open in mm):

2d Summed outer products

The third planar decomposition is along the k axis, computing the matmul result by a pointwise summation of vector outer products. Here we see the plane of outer products sweeping the cube “from back to front”, accumulating into the result (open in mm):

Using randomly initialized matrices with this decomposition, we can see not just values but rank accumulate in the result, as each rank-1 outer product is added to it.
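To make the rank accumulation concrete, the sketch below sums outer products one at a time and reports the rank after each step. It uses hand-picked integer matrices rather than random ones so the ranks are certain, and a small exact-arithmetic `rank` helper written for this illustration.

```python
from fractions import Fraction


def outer(u, v):
    return [[a * b for b in v] for a in u]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def rank(M):
    """Rank by Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            f = rows[i][c] / rows[r][c]
            rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


L = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
R = [[1, 2, 3], [0, 1, 4], [0, 0, 1]]

acc = [[0] * 3 for _ in range(3)]
ranks = []
for k in range(3):
    column_k = [row[k] for row in L]             # column k of L
    acc = add(acc, outer(column_k, R[k]))        # add one rank-1 outer product
    ranks.append(rank(acc))
print(ranks, acc)
```

Each step adds at most one to the rank, so a matmul with depth k can never produce a result of rank greater than k.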

Among other things this builds intuition for why “low-rank factorization” – i.e. approximating a matrix by constructing a matmul whose arguments are small in the depth dimension – works best when the matrix being approximated is low rank. We’ll return to this with LoRA in a later section (open in mm):

3 Warmup – expressions

How can we extend this visualization approach to compositions of matmuls? Our examples so far have all visualized a single matmul L @ R of some matrices L and R – what about when L and/or R are themselves matmuls, and so on transitively?

It turns out we can extend the approach nicely to compound expressions. The key rules are simple: the subexpression (child) matmul is another cube, subject to the same layout constraints as the parent, and the result face of the child is simultaneously the corresponding argument face of the parent, like a covalently shared electron.

Within these constraints, we’re free to arrange the faces of a child matmul however we like. Here we use the tool’s default scheme, which generates alternating convex and concave cubes – this layout works well in practice to maximize use of space and minimize occlusion. (Layouts are completely customizable, however – see the reference for details.)

In this section we’ll visualize some of the key building blocks we find in ML models, to gain fluency in the visual idiom and to see what intuitions even simple examples can give us.

3a Left-associative expressions

We’ll look at two expressions of the form (A @ B) @ C, each with its own distinctive shape and character. (Note: mm adheres to the convention that matrix multiplication is left-associative and writes this simply as A @ B @ C.)

First we’ll give A @ B @ C the characteristic FFN shape, in which the “hidden dimension” is wider than the “input” or “output” dimensions. (Concretely in the context of this example, this means that the width of B is greater than the widths of A or C.)

As in the single matmul examples, the floating arrows point towards the result matrix, blue vane coming from the left argument and red vane from right argument (open in mm):


Next we’ll visualize A @ B @ C with the width of B narrower than that of A or C, giving it a bottleneck or “autoencoder” shape (open in mm):


This pattern of alternating convex and concave blocks extends to chains of arbitrary length: for example this multilayer bottleneck (open in mm):


3b Right-associative expressions

Next we’ll visualize a right-associative expression A @ (B @ C).

In the same way left-associative expressions extend horizontally – sprouting from the left argument of the root expression, so to speak – right-associative chains extend vertically, sprouting from the root’s right argument.

One sometimes sees an MLP formulated right-associatively, i.e. with columnar input on the right and weight layers running right to left. Using the matrices from the 2-layer FFN example pictured above – suitably transposed – here’s what that looks like, with C now playing the role of the input, B the first layer and A the second layer (open in mm):
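The “suitably transposed” step can be checked with a small pure-Python sketch. The names `X`, `W1` and `W2` and their values are made up; no activation function is included, matching the plain matmul chains discussed here.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


X = [[1, 2], [3, 4], [5, 6]]              # 3 x 2: batch of row inputs
W1 = [[1, 0, 2, 1], [0, 1, 1, 3]]         # 2 x 4: wide hidden dimension
W2 = [[1, 0], [0, 1], [1, 1], [2, 0]]     # 4 x 2

# Left-associative form: (X @ W1) @ W2, inputs as rows on the left.
left = matmul(matmul(X, W1), W2)

# Right-associative form A @ (B @ C): C is the columnar input,
# B the first layer and A the second layer.
A, B, C = transpose(W2), transpose(W1), transpose(X)
right = matmul(A, matmul(B, C))
print(left, right)
```

The two results are transposes of each other, so the two formulations describe the same computation laid out along different axes.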


Aside: in addition to the color of the arrow vanes (blue for left, red for right), a second visual cue for distinguishing left and right arguments is their orientation: the rows of the left argument are coplanar with those of the result – they stack along the same axis (i). Both cues tell us for example that B is the left argument to (B @ C) above.

3c Binary expressions

For a visualization tool to be useful beyond simple didactic examples, visualizations need to remain legible as expressions get more complicated. A key structural component in real-world use cases is binary expressions – matmuls with subexpressions on both the left and right.

Here we’ll visualize the simplest such expression shape, (A @ B) @ (C @ D) (open in mm):


3d Quick aside: partitioning and parallelism

A full presentation of this topic is out of scope for this note, though we’ll see it in action later in the context of attention heads. But as a warmup, two quick examples should give a sense of how this style of visualization makes reasoning about parallelizing compound expressions very intuitive, via the simple geometry of partitioning.

In the first example we’ll apply the canonical “data parallel” partitioning to the left-associative multilayer bottleneck example above. We partition along i, segmenting the initial left argument (“batch”) and all intermediate results (“activations”), but none of the subsequent arguments (“weights”) – the geometry making it obvious which participants in the expression are segmented and which remain whole (open in mm):
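A minimal sketch of this partitioning, in pure Python with made-up integer matrices: the batch is split along i, every block is pushed through the whole chain with the full weights, and the per-block results are concatenated. The function names are hypothetical, and the loop over blocks stands in for separate workers.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def chain(X, weights):
    """Left-associative chain X @ W1 @ W2 @ ..."""
    for W in weights:
        X = matmul(X, W)
    return X


def data_parallel_chain(X, weights, num_blocks):
    size = -(-len(X) // num_blocks)  # ceiling division
    blocks = [X[s:s + size] for s in range(0, len(X), size)]
    out = []
    for block in blocks:             # each block could run on its own worker
        out.extend(chain(block, weights))  # weights are used whole, never sliced
    return out


X = [[(3 * i + j) % 5 - 2 for j in range(4)] for i in range(6)]  # 6 x 4 batch
W1 = [[1, 0], [0, 1], [1, 1], [2, -1]]                           # 4 x 2
W2 = [[1, 2, 0, 1], [0, 1, 1, -1]]                               # 2 x 4
W3 = [[1, 1], [0, 2], [-1, 0], [1, 1]]                           # 4 x 2

reference = chain(X, [W1, W2, W3])
partitioned = data_parallel_chain(X, [W1, W2, W3], num_blocks=3)
print(partitioned == reference)
```

No communication between blocks is needed until the results are gathered, which is the property the geometry makes visible.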


The second example would (for me, anyway) be much harder to build intuition about without clear geometry to support it: it shows how a binary expression can be parallelized by partitioning the left subexpression along its j axis, the right subexpression along its i axis, and the parent expression along its k axis (open in mm):
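Here is one way to sketch that second partitioning in pure Python. The matrices and the name `split_k_binary` are invented for illustration; the point is only that slicing B by columns and C by rows yields partial results that sum to the full answer.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split_k_binary(A, B, C, D, num_blocks):
    """(A @ B) @ (C @ D), partitioned along the parent's k axis."""
    k = len(C)                       # parent depth = width of A @ B = height of C @ D
    size = -(-k // num_blocks)       # ceiling division
    total = None
    for s in range(0, k, size):
        B_blk = [row[s:s + size] for row in B]  # left subexpression, sliced along its j
        C_blk = C[s:s + size]                   # right subexpression, sliced along its i
        partial = matmul(matmul(A, B_blk), matmul(C_blk, D))
        total = partial if total is None else add(total, partial)
    return total


A = [[1, 2, 0], [0, 1, 1]]                          # 2 x 3
B = [[1, 0, 2, 1], [0, 1, 1, 0], [2, 1, 0, 1]]      # 3 x 4
C = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [2, 0, 1]]    # 4 x 3
D = [[1, 2], [0, 1], [1, 0]]                        # 3 x 2

reference = matmul(matmul(A, B), matmul(C, D))
partitioned = split_k_binary(A, B, C, D, num_blocks=2)
print(partitioned == reference)
```

Unlike the data-parallel case, each partial result has the full output shape, so the blocks have to be summed rather than concatenated.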


4 Inside an Attention Head

Let’s look at a GPT2 attention head – specifically layer 5, head 4 of the “gpt2” (small) configuration (layers=12, heads=12, embed=768) from NanoGPT, using OpenAI weights via HuggingFace. Input activations are taken from a forward pass on an OpenWebText training sample of 256 tokens.

There’s nothing especially unusual about this head; I chose it mainly because it computes a fairly common attention pattern and lives in the middle of the model, where activations have become structured and show some interesting texture. (Aside: in a subsequent note I’ll present an attention head explorer that lets you visualize all layers and heads of this model, along with some travel notes.)

Open in mm (may take a few seconds to fetch model weights)


4a Structure

The entire attention head is visualized as a single compound expression, starting with input and ending with projected output. (Note: to keep things self-contained we do per-head output projection as described in Megatron-LM.)

The computation contains six matmuls:

Q = input @ wQ        // 1
K_t = wK_t @ input_t  // 2
V = input @ wV        // 3
attn = sdpa(Q @ K_t)  // 4
head_out = attn @ V   // 5
out = head_out @ wO   // 6

A thumbnail description of what we’re looking at:

  • the blades of the windmill are matmuls 1, 2, 3 and 6: the former group are the in-projections from input to Q, K and V; the latter is the out-projection from attn @ V back to the embedding dimension.
  • at the hub is the double matmul that first calculates attention scores (convex cube in back), then uses them to produce output tokens from the values vector (concave cube in front). Causality means that the attention scores form a lower triangle.
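For readers who prefer code to pictures, the six matmuls can be sketched in pure Python as follows. This is not NanoGPT’s implementation: the sizes and random weights are placeholders, and `causal_softmax` is an assumed reading of `sdpa` (scale by the square root of the head width, mask future positions, normalize each row).

```python
import math
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def causal_softmax(scores, d_head):
    """Assumed sdpa: scaled, causally masked, row-wise softmax."""
    out = []
    for i, row in enumerate(scores):
        visible = [s / math.sqrt(d_head) for s in row[:i + 1]]  # keys 0..i only
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out


def attention_head(x, wQ, wK_t, wV, wO):
    d_head = len(wQ[0])
    Q = matmul(x, wQ)                               # 1
    K_t = matmul(wK_t, transpose(x))                # 2
    V = matmul(x, wV)                               # 3
    attn = causal_softmax(matmul(Q, K_t), d_head)   # 4
    head_out = matmul(attn, V)                      # 5
    return attn, matmul(head_out, wO)               # 6


rng = random.Random(0)


def rand(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


seq, embed, d_head = 5, 4, 2    # toy sizes, far smaller than GPT2's
x = rand(seq, embed)
attn, out = attention_head(x, rand(embed, d_head), rand(d_head, embed),
                           rand(embed, d_head), rand(d_head, embed))
print(len(out), len(out[0]))
```

The zeros appended in `causal_softmax` are what produce the lower-triangular attention matrix mentioned above.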

But I’d encourage exploring this example in the tool itself, rather than relying on the screenshot or the video below to convey just how much signal can be absorbed from it – both about its structure and the actual values flowing through the computation.

4b Computation and Values

Here’s an animation of the attention head computation. Specifically, we’re watching

sdpa(input @ wQ @ K_t) @ V @ wO

(i.e., matmuls 1, 4, 5 and 6 above, with K_t and V precomputed) being computed as a fused chain of vector-matrix products: each item in the sequence goes all the way from input through attention to output in one step. More on this animation choice in the later section on parallelization, but first let’s look at what the values being computed tell us.
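A sketch of what “fused chain of vector-matrix products” means for a single sequence position, in pure Python. The function `fused_row`, the toy values, and the softmax details (scaling by the square root of the head width, causal prefix) are assumptions made for illustration; K_t and V are simply given, standing in for the precomputed matrices.

```python
import math


def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(a * m for a, m in zip(v, col)) for col in zip(*M)]


def fused_row(i, x_row, wQ, K_t, V, wO):
    """Take sequence item i from input to output in one pass."""
    q = vecmat(x_row, wQ)                      # matmul 1, one row of Q
    scores = vecmat(q, K_t)[:i + 1]            # matmul 4, causal prefix only
    scale = math.sqrt(len(q))
    m = max(scores)
    weights = [math.exp((s - m) / scale) for s in scores]
    total = sum(weights)
    attn_row = [w / total for w in weights]    # one row of attn
    head_row = vecmat(attn_row, V[:i + 1])     # matmul 5
    return vecmat(head_row, wO)                # matmul 6


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # 3 tokens, embed = 2
wQ = [[0.5, -0.5], [0.25, 1.0]]
K_t = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]      # precomputed, d_head x seq
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]       # precomputed, seq x d_head
wO = [[2.0, 0.0], [1.0, 1.0]]

out = [fused_row(i, x[i], wQ, K_t, V, wO) for i in range(len(x))]
print(out[0])
```

Each call touches only its own input row plus the shared K_t, V and weights, so the rows can be computed in any order or in parallel.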

Open in mm

There’s a lot of interesting stuff going on here.

  • Before we even get to the attention calculation, it’s quite striking how low-rank Q and K_t are. Zooming in on the Q @ K_t vector-matrix product animation, the situation is even more vivid: a significant number of channels (embedding positions) in both Q and K look more or less constant across the sequence, implying that the useful attention signal is potentially driven by only a smallish subset of the embedding. Understanding and exploiting this phenomenon is one of the threads we’re pulling on as part of the SysML ATOM transformer efficiency project.
  • Perhaps most familiar is the strong-but-not-perfect diagonal that emerges in the attention matrix. This is a common pattern, showing up in many of the attention heads of this model (and those of many transformers). It produces localized attention: the value tokens in the small neighborhood immediately preceding an output token’s position largely determine that output token’s content pattern.
  • However, the size of this neighborhood and the influence of individual tokens within it vary nontrivially – this can be seen both in the off-diagonal frost in the attention grid, and in the fluctuating patterns of the attn[i] @ V vector-matrix product plane as it descends the attention matrix on its way through the sequence.
  • But note that the local neighborhood isn’t the only thing that’s attracting attention: the leftmost column of the attention grid, corresponding to the first token of the sequence, is entirely filled with nonzero (but fluctuating) values, meaning every output token will be influenced to some degree by the first value token.
  • Moreover there’s an inexact but discernible oscillation in attention score dominance between the current token neighborhood and the initial token. The period of the oscillation varies, but broadly speaking starts short and then lengthens as one travels down the sequence (evocatively correlated with the quantity of candidate attention tokens for each row, given causality).
  • To get a feel for how (attn @ V) is formed, it’s important not to focus on attention in isolation – V is an equal player. Each output item is a weighted average of the entire V vector: at the limit when attention is a perfect diagonal, attn @ V is simply an exact copy of V. Here we see something more textured: visible banding where particular tokens have scored high over a contiguous subsequence of attention rows, superimposed on a matrix visibly similar to V but with some vertical smearing due to the fat diagonal. (Aside: per the mm reference guide, long-clicking or control-clicking will reveal the actual numeric values of visualized elements.)
  • Bear in mind that since we’re in a middle layer (5), the input to this attention head is an intermediate representation, not the original tokenized text. So the patterns seen in the input are themselves thought-provoking – in particular, the strong vertical threads are particular embedding positions whose values are uniformly high magnitude across long stretches of the sequence – sometimes almost the entire thing.
  • Interestingly, though, the first vector in the input sequence is distinctive, not only breaking the pattern of these high-magnitude columns but carrying atypical values at almost every position (aside: not visualized here, but this pattern is repeated over multiple sample inputs).

Note: apropos of the last two bullet points, it’s worth reiterating that we’re visualizing computation over a single sample input. In practice I’ve found that each head has a characteristic pattern it will express consistently (though not identically) over a decent collection of samples (and the upcoming attention head browser will provide a collection of samples to play with), but when looking at any visualization that includes activations, it’s important to bear in mind that a full distribution of inputs may influence the ideas and intuitions it provokes in subtle ways.

Finally, one more pitch to explore the animation directly!

4c Heads are different in interesting ways

Before we move on, here’s one more demonstration of the usefulness of simply poking around a model to see how it works in detail.

This is another attention head from GPT2. It behaves quite differently from layer 5, head 4 above – as one might expect, given that it’s in a very different part of the model. This head is in the very first layer: layer 0, head 2 (open in mm, may take a few seconds to load model weights):


Things to note:

  • This head spreads attention very evenly. This has the effect of delivering a relatively unweighted average of V (or rather, the appropriate causal prefix of V) to each row in attn @ V, as can be seen in this animation: as we move down the attention score triangle, the attn[i] @ V vector-matrix product is small fluctuations away from being simply a downscaled, progressively revealed copy of V.
  • attn @ V has striking vertical uniformity – in large columnar regions of the embedding, the same value patterns persist over the entire sequence. One can think of these as properties shared by every token.
  • Aside: on the one hand one might expect some uniformity in attn @ V given the effect of very evenly spread attention. But each row has been constructed from only a causal subsequence of V rather than the whole thing – why is that not causing more variation, like a progressive morphing as one moves down the sequence? By visual inspection V isn’t uniform along its length, so the answer must lie in some more subtle property of its distribution of values.
  • Finally, this head’s output is even more vertically uniform after out-projection – the strong impression being that the bulk of the information being delivered by this attention head consists of properties which are shared by every token in the sequence. The composition of its output projection weights reinforces this intuition.
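The effect of evenly spread attention is easy to reproduce in a toy setting. In this pure-Python sketch, perfectly uniform causal attention is an idealization (the real head is only close to uniform), and the values in `V` are made up.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def uniform_causal_attention(n):
    """Row i spreads weight 1 / (i + 1) evenly over positions 0..i."""
    return [[1 / (i + 1)] * (i + 1) + [0.0] * (n - i - 1) for i in range(n)]


V = [[2.0, 8.0],
     [4.0, 0.0],
     [6.0, 4.0]]

attn = uniform_causal_attention(len(V))
out = matmul(attn, V)

# Row i of attn @ V is the plain average of the causal prefix V[0..i].
prefix_means = [[sum(col) / (i + 1) for col in zip(*V[:i + 1])]
                for i in range(len(V))]
print(out)
print(prefix_means)
```

At the other extreme, replacing `attn` with an identity matrix returns V unchanged, which is the perfect-diagonal limit.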

Overall, it’s hard to resist the idea that the extremely regular, highly structured information this attention head produces might be obtained by computational means that are a bit… less lavish. Of course this isn’t an unexplored area, but the specificity and richness of signal of the visualized computation has been useful in generating new ideas, and reasoning about existing ones.

4d Revisiting the pitch: invariants for free

Stepping back, it’s worth reiterating that the reason we can visualize nontrivially compound operations like attention heads and have them remain intuitive is that important algebraic properties – like how argument shapes are constrained, or which parallelization axes intersect which operations – don’t require additional thinking: they arise directly from the geometry of the visualized object, rather than being additional rules to keep in mind.

For example, in these attention head visualizations it’s immediately obvious that

  • Q and attn @ V are the same length, K and V are the same length, and the lengths of these pairs are independent of each other
  • Q and K are the same width, V and attn @ V are the same width, and the widths of these pairs are independent of each other.

These properties are true by construction, as a simple consequence of which parts of the compound structure the constituents inhabit and how they are oriented.
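The same invariants can be stated as shape arithmetic. The helper names and sizes below are hypothetical; the sketch only tracks (rows, columns) pairs and computes no values.

```python
def matmul_shape(a, b):
    """Shape of a @ b, or an error if the shared dimension disagrees."""
    if a[1] != b[0]:
        raise ValueError(f"shared dimension mismatch: {a} @ {b}")
    return (a[0], b[1])


def attention_shapes(q_len, kv_len, d_qk, d_v):
    Q = (q_len, d_qk)
    K_t = (d_qk, kv_len)     # K is (kv_len, d_qk): same width as Q
    V = (kv_len, d_v)        # same length as K
    attn = matmul_shape(Q, K_t)
    out = matmul_shape(attn, V)
    return {"Q": Q, "K_t": K_t, "V": V, "attn": attn, "attn @ V": out}


# Lengths (10 vs 20) and widths (8 vs 16) of the two pairs vary independently.
general = attention_shapes(q_len=10, kv_len=20, d_qk=8, d_v=16)

# A single query row against a longer key/value sequence.
one_row = attention_shapes(q_len=1, kv_len=256, d_qk=64, d_v=64)
print(general)
print(one_row)
```

In code these constraints are checks that can fail; in the visualization they cannot be violated, because the shared dimensions are literally shared edges.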

This “properties for free” benefit can be especially useful when exploring variations on a canonical structure – an obvious example being the one-row-high attention matrix in autoregressive token-at-a-time decoding (open in mm):


5 Parallelizing attention

In the animation of layer 5, head 4 above, we visualize 4 of the 6 matmuls in the attention head as a fused chain of vector-matrix products, confirming the geometric intuition that the entire left-associative chain from input to output is laminar along the shared i axis, and can be parallelized.

5a Example: partitioning along i

To parallelize the computation in practice, we would partition the input into blocks along the i axis. We can visualize this partition in the tool, by specifying that a given axis be partitioned into a particular number of blocks – in these examples we’ll use 8, but there’s nothing special about that number.

Among other things, this visualization makes clear that wQ (for in-projection), K_t and V (for attention) and wO (for out-projection) are needed in their entirety by each parallel computation, since they’re adjacent to the partitioned matrices along those matrices’ unpartitioned dimensions (open in mm):


5b Example: double partitioning

As an example of partitioning along multiple axes, we can visualize some recent work which innovates in this space (Blockwise Parallel Transformer, building on work done in e.g. Flash Attention and its antecedents).

First, BPT partitions along i as described above – and actually extends this horizontal partitioning of the sequence into chunks all the way through the second (FFN) half of the attention layer as well. (We’ll visualize this in a later section.)

To fully attack the context length problem, a second partitioning is then added to MHA – that of the attention calculation itself (i.e., a partition along the j axis of Q @ K_t). The two partitions together divide attention into a grid of blocks (open in mm):


This visualization makes clear

  • the effectiveness of this double partitioning as an attack on the context length problem, since we’ve now visibly partitioned every occurrence of sequence length in the attention calculation
  • the “reach” of this second partitioning: it’s clear from the geometry that the in-projection computations of K and V can be partitioned along with the core double matmul

Note one subtlety: the visual implication here is that we can also parallelize the subsequent matmul attn @ V along k and sum the partial results split-k style, thus parallelizing the entire double matmul. But the row-wise softmax in sdpa() adds the requirement that each row have all its segments normalized before the corresponding row of attn @ V can be computed, adding an extra row-wise step between the attention calculation and the final matmul.
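The subtlety can be sketched in pure Python. This is a simplified illustration, not the BPT or Flash Attention algorithm: it omits the causal mask and the running-maximum trick those methods use for numerical stability, and the function name and toy values are invented. It shows only the ordering constraint: block-wise scores first, then a row-wise normalizer that needs every block, then a split-k sum of partial products.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def blockwise_attention(Q, K_t, V, block):
    scale = math.sqrt(len(Q[0]))
    starts = range(0, len(V), block)
    # Attention calculation, one block of key positions (j axis of Q @ K_t) at a time.
    exp_blocks = []
    for s in starts:
        K_blk = [row[s:s + block] for row in K_t]
        exp_blocks.append([[math.exp(v / scale) for v in row]
                           for row in matmul(Q, K_blk)])
    # Extra row-wise step: each row's normalizer needs all of its segments.
    row_sums = [sum(sum(blk[i]) for blk in exp_blocks) for i in range(len(Q))]
    # Final matmul, split-k style: per-block partial products, summed.
    out = [[0.0] * len(V[0]) for _ in Q]
    for s, blk in zip(starts, exp_blocks):
        attn_blk = [[e / row_sums[i] for e in row] for i, row in enumerate(blk)]
        partial = matmul(attn_blk, V[s:s + block])
        out = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(out, partial)]
    return out


Q = [[0.5, -1.0], [1.0, 0.0], [0.2, 0.3]]
K_t = [[1.0, 0.0, -0.5, 0.3], [0.0, 1.0, 0.5, -0.2]]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5]]

unpartitioned = blockwise_attention(Q, K_t, V, block=4)   # a single block
partitioned = blockwise_attention(Q, K_t, V, block=2)     # two blocks along j
print(partitioned)
```

Because the normalizer is a per-row scalar, it could equally be applied after summing the partial products; that reordering is a design choice this sketch does not explore.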

6 Sizes in an Attention Layer

The first (MHA) half of an attention layer is famously computationally demanding because of its quadratic complexity, but the second (FFN) half is demanding in its own right due to the width of its hidden dimension, typically 4 times that of the model’s embedding dimension. Visualizing the biomass of a full attention layer can be useful in building intuition about how the two halves of the layer compare to each other.

6a Visualizing the full layer

Below is a full attention layer with the first half (MHA) in the background and the second (FFN) in the foreground. As usual, arrows point in the direction of computation.

Notes:

  • This visualization doesn’t depict individual attention heads, but instead shows the unsliced Q/K/V weights and projections surrounding a central double matmul. Of course this isn’t a faithful visualization of the full MHA operation – but the goal here is to give a clearer sense of the relative matrix sizes in the two halves of the layer, rather than the relative amounts of computation each half performs. (Also, randomized values are used rather than real weights.)
  • The dimensions used here are downsized to keep the browser (relatively) happy, but the proportions are preserved (from NanoGPT’s small config): model embedding dimension = 192 (from 768), FFN embedding dimension = 768 (from 3072), sequence length = 256 (from 1024), although sequence length is not fundamental to the model. (Visually, changes in sequence length would appear as changes in the width of the input blades, and consequently in the size of the attention hub and the height of the downstream vertical planes.)

Open in mm:


6b Visualizing the BPT partitioned layer

Revisiting Blockwise Parallel Transformer briefly, here we visualize BPT’s parallelization scheme in the context of an entire attention layer (with individual heads elided per above). In particular, note how the partitioning along i (of sequence blocks) extends through both MHA and FFN halves (open in mm):


6c Partitioning the FFN

The visualization suggests an additional partitioning, orthogonal to the ones described above – in the FFN half of the attention layer, splitting the double matmul (attn_out @ FFN_1) @ FFN_2, first along j for attn_out @ FFN_1, then along k in the subsequent matmul with FFN_2. This partition slices both layers of FFN weights, reducing the capacity requirements of each participant in the computation at the cost of a final summation of the partial results.
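A pure-Python sketch of this FFN partition follows. The ReLU between the two layers is an assumption added here (the expression above is written as plain matmuls); it is included to show that an elementwise nonlinearity on the hidden dimension does not disturb the partition. All names and values are illustrative.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def relu(M):
    # Assumed elementwise activation, not part of the expression in the text.
    return [[max(0, v) for v in row] for row in M]


def ffn(x, W1, W2):
    return matmul(relu(matmul(x, W1)), W2)


def partitioned_ffn(x, W1, W2, num_blocks):
    hidden = len(W2)
    size = -(-hidden // num_blocks)  # ceiling division
    total = None
    for s in range(0, hidden, size):
        W1_blk = [row[s:s + size] for row in W1]  # slice FFN_1 along j
        W2_blk = W2[s:s + size]                   # slice FFN_2 along k
        partial = matmul(relu(matmul(x, W1_blk)), W2_blk)
        total = partial if total is None else add(total, partial)  # final summation
    return total


x = [[1, -2], [0, 3], [2, 1]]                 # 3 tokens, embed = 2
W1 = [[1, -1, 2, 0], [0, 1, -1, 2]]           # embed x hidden (2 x 4)
W2 = [[1, 0], [2, 1], [0, -1], [1, 1]]        # hidden x embed (4 x 2)

reference = ffn(x, W1, W2)
partitioned = partitioned_ffn(x, W1, W2, num_blocks=2)
print(partitioned == reference)
```

Each participant holds only its slice of both weight matrices, which is the capacity saving described above; the price is the summation of full-sized partial outputs.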

Here’s what this partition looks like applied to an otherwise unpartitioned attention layer (open in mm):


And here it is applied to a layer partitioned a la BPT (open in mm):


6d Visualizing token-at-a-time decoding

During autoregressive token-at-a-time decoding, the query vector consists of a single token. It’s instructive to have a mental picture of what an attention layer looks like in that situation – a single embedding row working its way through an enormous tiled plane of weights.

Aside from emphasizing the sheer immensity of weights compared to activations, this view is also evocative of the notion that K_t and V function like dynamically generated layers in a 6-layer MLP, although the mux/demux computations of MHA itself (papered over here, per above) make the correspondence inexact (open in mm):


7 LoRA

The recent LoRA paper (LoRA: Low-Rank Adaptation of Large Language Models) describes an efficient finetuning technique based on the idea that weight deltas introduced during finetuning are low-rank. Per the paper, this “allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation […], while keeping the pre-trained weights frozen.”

7a The basic idea

In a nutshell, the key move is to train the factors of a weight matrix rather than the matrix itself: replace an I x J weights tensor with a matmul of an I x K tensor and a K x J tensor, holding K to some small number.

If K is small enough the size win can be huge, but the tradeoff is that lowering it lowers the rank of what the product can express. As a quick illustration of both the size savings and the structuring effect on the result, here’s a matmul of random 128 x 4 left and 4 x 128 right arguments – a.k.a. a rank-4 factorization of a 128 x 128 matrix. Notice the vertical and horizontal patterning in L @ R (open in mm):
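The size win for the 128 x 4 and 4 x 128 example is simple arithmetic, sketched here with a hypothetical helper:

```python
def factorization_sizes(i, j, k):
    """Element counts for a full I x J matrix vs. its I x K and K x J factors."""
    full = i * j
    factored = i * k + k * j
    return full, factored


full, factored = factorization_sizes(128, 128, 4)
print(full, factored, full // factored)
```

So the two factors together hold 1,024 values against 16,384 for the full matrix, a 16x reduction, while the product they can express has rank at most 4.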


7b Applying LoRA to an attention head

The way LoRA applies this factoring move to the fine tuning process is to

  • create a low-rank factorization for each weight tensor to be fine-tuned and train the factors, keeping the original weights frozen
  • after fine tuning, multiply each pair of low-rank factors to get a matrix in the shape of the original weights tensor, and add it to the original pretrained weights tensor
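The merge step in the second bullet can be sketched in pure Python. The matrices are tiny made-up integers and the training of the factors is not shown; `lora_merge` is a hypothetical name, not an API from the LoRA paper or any library.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def lora_merge(W, A, B):
    """Fold the low-rank update A @ B into the frozen weights W."""
    return add(W, matmul(A, B))


W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]   # frozen pretrained weights, 3 x 3
A = [[1], [0], [2]]                     # trained factor, 3 x 1
B = [[0, 1, -1]]                        # trained factor, 1 x 3
x = [[1, 2, 3], [0, 1, 0]]              # some input rows

merged = lora_merge(W, A, B)

# Before merging: original path plus the low-rank side path.
unmerged_out = add(matmul(x, W), matmul(matmul(x, A), B))
# After merging: a single matmul with the same shape as the original weights.
merged_out = matmul(x, merged)
print(merged_out == unmerged_out)
```

After the merge the model has exactly the original architecture and cost, since the factors no longer exist as separate matmuls.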

The following visualization shows an attention head with the weight tensors wQ, wK_t, wV, wO replaced by low rank factorizations wQ_A @ wQ_B, etc. Visually, the factor matrices show up as low fences along the edges of the windmill blades (open in mm – spacebar stops the spin):

8 Wrapup

8a Call for feedback

I’ve found this way of visualizing matmul expressions extremely helpful for building intuition and reasoning about not just matrix multiplication itself, but also many aspects of ML models and their computation, from efficiency to interpretability.

If you try it out and have suggestions or comments, I definitely want to hear, either in the comments here or in the repo.

8b Next steps

  • There’s a GPT2 attention head explorer built on top of the tool which I’m currently using to inventory and classify the attention head traits found in that model. (This was the tool I used to find and explore the attention heads in this note.) Once complete I plan to post a note with the inventory.
  • As mentioned up top, embedding these visualizations in Python notebooks is dead simple. But session URLs can get… unwieldy, so it will be useful to have Python-side utilities for constructing them from configuration objects, similar to the simple JavaScript helpers used in the reference guide.
  • If you’ve got a use case you think might benefit from visualizations like this but it’s not obvious how to use the tool to do it, get in touch! I’m not necessarily looking to expand its core visualization capabilities that much further (right tool for the job, etc.), but e.g. the API for driving it programmatically is pretty basic, there’s plenty that can be done there.



One Year of PyTorch Foundation

It’s been one year since we announced the formation of the PyTorch Foundation! 🎉

In its inaugural year, the PyTorch Foundation made a significant impact by launching PyTorch 2.0, growing contributors and adding new member companies. We’re grateful to our founding members for their support to move the foundation forward.

A few milestones in the past year include:

💻 Over 600,000 repositories on GitHub
✅ 60% of AI implementations choosing PyTorch
📈 More than 20% year over year growth in new repositories
🤝 Over 12,000 commits since last year

And a look at what the foundation has been up to this past year:

PyTorch project timeline

We look forward to growing our community for the years to come through supporting our contributors, democratizing the AI field, and creating new innovations.

We invite you to join us at this year’s PyTorch Conference on October 16-17 in San Francisco. Conference registration is filling up quickly, so take advantage of your chance to be part of this exciting event.

Join us to stay informed about the latest announcements and have the opportunity to connect with both the founding members and new additions to the PyTorch community.

With thanks and gratitude,
The PyTorch Foundation Team

Accelerated CPU Inference with PyTorch Inductor using torch.compile

Story at a Glance

  • Although the PyTorch* Inductor C++/OpenMP* backend has enabled users to take advantage of modern CPU architectures and parallel processing, it has lacked optimizations, resulting in the backend performing worse than eager mode in terms of end-to-end performance.
  • Intel optimized the Inductor backend using a hybrid strategy that classified operations into two categories: Conv/GEMM and non-Conv/GEMM element-wise and reduction ops.
  • For popular deep learning models, this hybrid strategy demonstrates promising performance improvements compared to eager mode and improves the C++/OpenMP backend’s efficiency and reliability for PyTorch models.

Inductor Backend Challenges

The PyTorch Inductor C++/OpenMP backend enables users to take advantage of modern CPU architectures and parallel processing to accelerate computations.

However, during the early stages of its development, the backend lacked some optimizations, which prevented it from fully utilizing the CPU computation capabilities. As a result, for most models the C++/OpenMP backend performed worse than eager mode in terms of end-to-end performance, with 45% of TorchBench, 100% of Hugging Face, and 75% of TIMM models performing worse than eager mode.

In this post, we highlight Intel’s optimizations to the Inductor CPU backend, including the technologies and results.

We optimized the backend by using a hybrid strategy that classified operations into two categories: Conv/GEMM and non-Conv/GEMM element-wise and reduction ops. Post-op fusion and weight prepacking using the oneDNN performance library were utilized to optimize the former, while explicit vectorization in C++ codegen was used to optimize the latter.

This hybrid strategy demonstrated promising performance improvements compared to eager mode, particularly on popular deep learning models such as Inductor Hugging Face, Inductor TorchBench and Inductor TIMM. Overall, Intel’s optimizations improve the C++/OpenMP backend’s efficiency and reliability for PyTorch models.

Figure 1. Performance Speedup Ratio Trend

Performance Status of Intel Hybrid Optimizations

With the hybrid optimizations, the C++/OpenMP backend shows promising performance improvements compared to eager mode. We measured the performance of the three Inductor benchmark suites (TorchBench, Hugging Face, and TIMM), and the results are as follows. (Note: we publish our performance data twice per week on GitHub.)

Overall, these optimizations help to ensure that the C++/OpenMP backend provides efficient and reliable support for PyTorch models.

Passrate

+----------+------------+-------------+-------------+
| Compiler | torchbench | huggingface | timm_models |
+----------+------------+-------------+-------------+
| inductor | 93%, 56/60 | 96%, 44/46  | 100%, 61/61 |
+----------+------------+-------------+-------------+

Geometric mean speedup (Single-Socket Multi-threads)

+----------+------------+-------------+-------------+
| Compiler | torchbench | huggingface | timm_models |
+----------+------------+-------------+-------------+
| inductor |   1.39x    |    1.20x    |    1.73x    |
+----------+------------+-------------+-------------+
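
For readers unfamiliar with the metric, a geometric mean speedup is computed from per-model speedup ratios (eager latency divided by compiled latency) rather than from an arithmetic average. The sketch below uses made-up latencies and hypothetical helper names; it is not the benchmark harness behind the tables.

```python
import math


def speedup(eager_ms: float, compiled_ms: float) -> float:
    """Speedup of the compiled run over eager mode (> 1.0 means faster)."""
    return eager_ms / compiled_ms


def geomean(ratios):
    """Geometric mean: exponential of the average of the logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical (eager_ms, compiled_ms) latencies for three models.
latencies = [(10.0, 10.0), (10.0, 5.0), (10.0, 2.5)]
ratios = [speedup(e, c) for e, c in latencies]
print(f"{geomean(ratios):.2f}x")
```

Unlike an arithmetic mean, the geometric mean is not dominated by a single model with a very large speedup, which is why it is a common choice for summarizing ratios across a suite.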

Individual Model Performance

Figure 2. TorchBench FP32 Performance (Single-Socket Multi-threads)

Figure 3. Hugging Face FP32 Performance (Single-Socket Multi-thread)

Figure 4. TIMM FP32 Performance (Single-Socket Multi-threads)

Geometric mean speedup (Single-core Single-thread)

+----------+------------+-------------+-------------+
| Compiler | torchbench | huggingface | timm_models |
+----------+------------+-------------+-------------+
| inductor |   1.29x    |    1.15x    |    1.37x    |
+----------+------------+-------------+-------------+

Figure 5. TorchBench FP32 Performance (Single-Socket Single-thread)

Figure 6. Hugging Face FP32 Performance (Single-Socket Single Thread)

Figure 7. TIMM FP32 Performance (Single-Socket Single-thread)

Technical Deep Dive

Now, let’s take a closer look at the two primary optimizations used in the Inductor C++/OpenMP backend:

  1. weight prepacking and post-operation fusion via oneDNN library
  2. explicit vectorization in Inductor C++ codegen

Weight Prepacking & Post-op Fusion via oneDNN

The oneDNN library (shorthand for Intel® oneAPI Deep Neural Network Library) provides a range of post-op fusions (i.e., fusing convolution and matmul with their consecutive operations) that can benefit popular models. The Intel® Extension for PyTorch has implemented most of these fusions and has achieved significant performance improvements. As a result, we have upstreamed all of the fusions that have been applied in Intel’s PyTorch extension to Inductor, enabling a wider range of models to benefit from these optimizations. We have defined these fusions as operators under the mkldnn namespace. This allows the Python module to invoke these mkldnn operations directly.

Currently, the defined fused operations are as follows. You can find these defined fused operations at RegisterMkldnnOpContextClass.cpp.

  • _linear_pointwise: Fuses Linear and its post-unary element-wise operations
  • _linear_pointwise.binary: Fuses Linear and its post-binary element-wise operations
  • _convolution_pointwise: Fuses Convolution and its post-unary element-wise operations
  • _convolution_pointwise.binary: Fuses Convolution and its post-binary element-wise operations

The detailed fusion patterns are defined in the mkldnn.py file:

  • convolution/linear + sigmoid/hardsigmoid/tanh/hardtanh/hardswish/leaky_relu/gelu/relu/relu6/silu
  • convolution/linear + add/add_/iadd/sub/sub_

On the Inductor side, we apply these fusions to the FX graph before it is lowered. We have defined mkldnn_fuse_fx as the entry point to apply all the fusions. The code snippet for this is as follows:

def mkldnn_fuse_fx(gm: torch.fx.GraphModule, example_inputs):
    ...
    gm = fuse_unary(gm)
    gm = fuse_binary(gm)
    ...
    if config.cpp.weight_prepack:
        gm = pack_module(gm)
    return gm

In the mkldnn_fuse_fx function, we apply fusion on the FX graph that hasn’t been lowered yet. To fuse convolution/linear and its consecutive elementwise operations, we invoke fuse_unary and fuse_binary as follows:

   gm = fuse_unary(gm)
   gm = fuse_binary(gm)
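
To make the idea of post-op fusion concrete, here is a deliberately simplified sketch that works on a flat list of op names instead of an FX graph. The function name fuse_unary_toy and the op sets are hypothetical; this is not Inductor's fuse_unary, only an illustration of merging a convolution/linear with the unary op that immediately follows it.

```python
# Toy model of post-op fusion over a linear sequence of op names.
# This is NOT Inductor's fuse_unary; it only illustrates the idea.
FUSIBLE_PRODUCERS = {"convolution", "linear"}
UNARY_POST_OPS = {"relu", "gelu", "sigmoid", "tanh", "silu"}


def fuse_unary_toy(ops):
    """Merge each conv/linear with an immediately following unary op."""
    fused = []
    i = 0
    while i < len(ops):
        op = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if op in FUSIBLE_PRODUCERS and nxt in UNARY_POST_OPS:
            fused.append(f"{op}_pointwise({nxt})")
            i += 2  # consumed the producer and its post-op
        else:
            fused.append(op)
            i += 1
    return fused


print(fuse_unary_toy(["convolution", "relu", "linear", "gelu", "softmax"]))
```

A real pass must also check that the intermediate result has no other users before fusing; the flat list here sidesteps that question.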

In addition to the post-op fusion, we apply weight prepacking to improve the Conv/GEMM performance further:

   gm = pack_module(gm)

Weight prepacking involves rearranging the weight tensor in a blocked layout, which:

  • can improve vectorization and cache reuse compared to plain formats like NCHW or NHWC;
  • can help avoid weight reordering at runtime, which can reduce overhead and improve performance; and
  • increases memory usage as the tradeoff.

For these reasons, we provide the config.cpp.weight_prepack flag in Inductor to give users more control over this optimization, allowing them to enable it based on their specific needs.
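
As a rough illustration of what a blocked layout means, the sketch below packs a small row-major weight matrix into blocks of output channels once, then reuses the packed copy for a matrix-vector product. The helper names and the block size are assumptions made for this example; oneDNN's actual blocked formats are more elaborate.

```python
def pack_weight(w, block):
    """Rearrange an (O x I) weight into (O / block) x I x block blocks."""
    out_ch, in_ch = len(w), len(w[0])
    assert out_ch % block == 0, "sketch assumes O is a multiple of block"
    return [
        [[w[ob * block + o][i] for o in range(block)] for i in range(in_ch)]
        for ob in range(out_ch // block)
    ]


def matvec_plain(w, x):
    """Reference result using the original row-major layout."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def matvec_packed(packed, x):
    """Each inner step reads `block` adjacent weights, as a SIMD lane group would."""
    y = []
    for blk in packed:
        acc = [0.0] * len(blk[0])
        for i, lanes in enumerate(blk):
            for o, wv in enumerate(lanes):
                acc[o] += wv * x[i]
        y.extend(acc)
    return y


w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
x = [1.0, -1.0]
packed = pack_weight(w, block=2)  # done once, ahead of time
assert matvec_packed(packed, x) == matvec_plain(w, x)
```

The packed copy is an extra buffer held next to (or instead of) the original weight, which is the memory cost mentioned above.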

Explicit Vectorization in Inductor C++ Codegen

Vectorization is a key optimization technique that can significantly improve the performance of numerical computations. By utilizing SIMD (Single Instruction, Multiple Data) instructions, vectorization enables multiple computations to be performed simultaneously on a single processor core, which can lead to significant performance improvements.

In the Inductor C++/OpenMP backend, we use Intel® AVX2 and Intel® AVX-512 ISA (Instruction Set Architecture) options for vectorization by leveraging the aten vectorization library to facilitate the implementation. Aten vectorization supports multiple platforms, including x86 and Arm, as well as multiple data types. It can be extended to support other ISAs easily by adding more VecISA sub-classes. This allows Inductor to easily support other platforms and data types in the future.

Due to differences in platforms, the C++/OpenMP backend of Inductor starts by detecting the CPU features to determine the vectorization bit width at the beginning of code generation. By default, if the machine supports both AVX-512 and AVX2, the backend will choose 512-bit vectorization.
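
The selection rule described above can be sketched as follows. The helper names are hypothetical and the flag strings follow /proc/cpuinfo conventions; the real detection in Inductor may check additional feature flags.

```python
# Hypothetical helper: pick the widest supported vector width in bits.
# Flag names follow /proc/cpuinfo conventions ("avx512f", "avx2").
def pick_vector_bits(cpu_flags):
    flags = set(cpu_flags)
    if "avx512f" in flags:
        return 512  # prefer AVX-512 when both are available
    if "avx2" in flags:
        return 256
    return None  # no supported ISA: generate scalar kernels only


def lanes(bits, dtype_bytes=4):
    """Number of elements per vector register for a given element size."""
    return bits // (8 * dtype_bytes)


bits = pick_vector_bits(["sse4_2", "avx2", "avx512f"])
print(bits, lanes(bits))
```

With 512-bit vectors and 4-byte floats, each register holds 16 elements, which matches the stride of 16 in the vectorized code shown later in this post.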

If the hardware supports vectorization, the C++/OpenMP backend first detects whether the loop body can be vectorized. There are primarily three scenarios in which we are not able to generate a vectorized kernel:

  1. Loop body lacks vector intrinsics support, e.g., rand and atomic_add.
  2. Loop body lacks efficient vector intrinsics support, e.g., non-contiguous load/store.
  3. Data types for which vectorization is not yet supported (work in progress), e.g., integer, double, half, and bfloat16.

To address this issue, the C++/OpenMP backend uses CppVecKernelChecker to detect whether all operations in a particular loop body can be vectorized or not. In general, we classified the operations into two categories by identifying if they depend on the context.

For most elementwise operations such as add, sub, relu, vectorization is straightforward, and their execution does not depend on context.

However, certain other operations have more complex semantics, and whether they can be vectorized depends on their context, which must be determined through static analysis.

For example, let’s consider the where operation that takes in mask, true_value, and false_value while the mask value is loaded from a uint8 tensor. The fx graph could be as follows:

graph():
    %ops : [#users=9] = placeholder[target=ops]
    %get_index : [#users=1] = call_module[target=get_index](args = (index0,), kwargs = {})
    %load : [#users=1] = call_method[target=load](args = (%ops, arg1_1, %get_index), kwargs = {})
    %to_dtype : [#users=1] = call_method[target=to_dtype](args = (%ops, %load, torch.bool), kwargs = {})
    ...
    %where : [#users=1] = call_method[target=where](args = (%ops, %to_dtype, %to_dtype_2, %to_dtype_3), kwargs = {})

uint8 is a general-purpose data type: it can be used for computation and is not limited to serving as a Boolean mask. Hence, we need to analyze its context statically. In particular, CppVecKernelChecker checks whether a uint8 tensor is used only by to_dtype and whether that to_dtype is used only by where. If so, the loop can be vectorized. Otherwise, it falls back to the scalar version. The generated code could be as follows:

Scalar Version

auto tmp0 = in_ptr0[i1 + (17*i0)];
auto tmp3 = in_ptr1[i1 + (17*i0)];
auto tmp1 = static_cast<bool>(tmp0);
auto tmp2 = static_cast<float>(-33.0);
auto tmp4 = tmp1 ? tmp2 : tmp3;
tmp5 = std::max(tmp5, tmp4);

Vectorization Version

float g_tmp_buffer_in_ptr0[16] = {0};
// Convert the flag to float for vectorization. 
flag_to_float(in_ptr0 + (16*i1) + (17*i0), g_tmp_buffer_in_ptr0, 16);
auto tmp0 = at::vec::Vectorized<float>::loadu(g_tmp_buffer_in_ptr0);
auto tmp3 = at::vec::Vectorized<float>::loadu(in_ptr1 + (16*i1) + (17*i0));
auto tmp1 = (tmp0);
auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(-33.0));
auto tmp4 = decltype(tmp2)::blendv(tmp3, tmp2, tmp1);
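
The context rule for uint8 masks can be sketched on a toy graph. The graph encoding (a dict of node name to op and inputs) and the function names are assumptions made for this example; CppVecKernelChecker operates on Inductor's own IR and is more involved.

```python
from collections import defaultdict


# Toy graph: node name -> (op, input node names). Not the real Inductor IR.
def users_of(graph):
    users = defaultdict(list)
    for name, (_, inputs) in graph.items():
        for src in inputs:
            users[src].append(name)
    return users


def uint8_load_is_mask_only(graph, load_name):
    """True if the uint8 load feeds only to_dtype nodes that feed only where."""
    users = users_of(graph)
    casts = users[load_name]
    if not casts:
        return False
    for cast in casts:
        if graph[cast][0] != "to_dtype_bool":
            return False
        consumers = users[cast]
        if not consumers or any(graph[c][0] != "where" for c in consumers):
            return False
    return True


mask_graph = {
    "load": ("load_uint8", []),
    "to_dtype": ("to_dtype_bool", ["load"]),
    "a": ("load_float", []),
    "b": ("load_float", []),
    "where": ("where", ["to_dtype", "a", "b"]),
}
# Same graph, but the uint8 value is also used in arithmetic.
arith_graph = dict(mask_graph, extra=("add", ["load", "a"]))

print(uint8_load_is_mask_only(mask_graph, "load"))   # True: vectorize
print(uint8_load_is_mask_only(arith_graph, "load"))  # False: scalar fallback
```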

In addition to context analysis, the C++/OpenMP backend also incorporates several other vectorization-related optimizations. These include:

  • Tiled kernel implementation for supporting transpose load – cpp.py
  • Data type demotion based on value range – cpp.py
  • Replacement of sleef implementation with oneDNN/oneMKL implementation for optimizing aten vectorization – #94577, #92289, #91613

In summary, we examined vectorization optimization in Inductor C++ backend for FP32 training and inference of 150 benchmark models with 90% of inference kernels and 71% of training kernels being vectorized.

In terms of inference, a total of 28,185 CPP kernels were generated, with 25,579 (90%) of them being vectorized, while the remaining 10% were scalar. As for training, 103,084 kernels were generated, with 73,909 (71%) being vectorized and 29% not vectorized.

The results indicate that the vectorization of inference kernels is quite impressive (there is still some work to be done in training kernels since we just started to work on the training). The remaining non-vectorized kernels are analyzed in different categories, highlighting the next steps to improve vectorization coverage: index-related operations, int64 support, vertical reduction, vectorization with fallback, and more.

In addition, we also optimized the C++/OpenMP backend with other optimizations like buffer-reuse and CppWrapper.

Future Work

As the next step, we will continue optimizing the C++/OpenMP backend and extend it to support more data types. This includes:

  1. Improve vectorization coverage
  2. Support and optimize low precision kernel including BF16, FP16, Quantization
  3. Training optimization
  4. Loop tiling
  5. Autotune
  6. Further fusion optimization of Conv/GEMM kernels.
  7. Explore alternative codegen paths: clang/llvm/triton

Summary

The Inductor C++/OpenMP backend is a flexible and efficient backend for the CPU. This blog describes the optimizations used in the C++/OpenMP backend of Inductor for inference and training of three benchmark suites: TorchBench, Hugging Face and TIMM. The primary optimizations include weight prepacking and post-operation fusion via the oneDNN library, as well as explicit vectorization in Inductor C++ codegen using AVX2 and AVX-512 instructions.

The results show that 90% of inference kernels and 71% of training kernels are vectorized, indicating impressive vectorization for inference and room for improvement in training. In addition, we applied other optimizations like buffer-reuse and CppWrapper. We will continue to work on the future work items listed above to further improve performance.

Acknowledgements

The results presented in this blog post are the culmination of a collaborative effort between the Intel PyTorch team and Meta. We would like to express our sincere gratitude to @jansel, @desertfire, and @Chillee for their invaluable contributions and unwavering support throughout the development process. Their expertise and dedication have been instrumental in achieving the optimizations and performance improvements discussed here.

Configuration Details

Hardware Details

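+------------------+----------------------------------------------------------------+
| Item             | Value                                                          |
+------------------+----------------------------------------------------------------+
| Manufacturer     | Amazon EC2                                                     |
| Product Name     | c6i.16xlarge                                                   |
| CPU Model        | Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz                  |
| Installed Memory | 128GB (1x128GB DDR4 3200 MT/s [Unknown])                       |
| OS               | Ubuntu 22.04.2 LTS                                             |
| Kernel           | 5.19.0-1022-aws                                                |
| Microcode        | 0xd000389                                                      |
| GCC              | gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0                      |
| GLIBC            | ldd (Ubuntu GLIBC 2.35-0ubuntu3.1) 2.35                        |
| Binutils         | GNU ld (GNU Binutils for Ubuntu) 2.38                          |
| Python           | Python 3.10.6                                                  |
| OpenSSL          | OpenSSL 3.0.2 15 Mar 2022 (Library: OpenSSL 3.0.2 15 Mar 2022) |
+------------------+----------------------------------------------------------------+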

Software Details

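+-------------------+----------------+-------------+
| SW                | Nightly commit | Main commit |
+-------------------+----------------+-------------+
| PyTorch           | a977a12        | 0b1b063     |
| Torchbench        | /              | a0848e19    |
| torchaudio        | 0a652f5        | d5b2996     |
| torchtext         | c4ad5dd        | 79100a6     |
| torchvision       | f2009ab        | b78d98b     |
| torchdata         | 5cb3e6d        | f2bfd3d     |
| dynamo_benchmarks | fea73cb        | /           |
+-------------------+----------------+-------------+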

Configuration

  • Intel OpenMP
  • Jemalloc – oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1
  • Single-Socket Multi-threads: #of Instances: 1; Cores/Instance: 32
  • Single-Core Single-thread: #of Instances: 1; Cores/Instance: 1

Graphcore Joins the PyTorch Foundation as a General Member

Graphcore logo

The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Graphcore has joined as a general member.

Graphcore is a UK-based company that specializes in designing and manufacturing AI accelerators, hardware and software specifically tailored for artificial intelligence and machine learning workloads.

“We’re thrilled that PyTorch is the leading framework for development on the Graphcore platform,” said Executive Director of the PyTorch Foundation Ibrahim Haddad. “Graphcore has played an important role in the hardware and open source space, and we look forward to their continued contributions to PyTorch.”

Graphcore has contributed to the PyTorch ecosystem by developing integrations to run on their IPU hardware. These integrations enable researchers and practitioners to use their preferred frameworks while taking advantage of Graphcore’s specialized hardware.

“At Graphcore we’re truly aligned with PyTorch’s objective of reducing the barrier of entry to AI practitioners. By supporting a native PyTorch software environment for IPUs we are giving developers access to new underlying hardware, designed from the ground up for AI, to help unlock new AI techniques to improve efficiency or performance and to drive breakthroughs in AI research and applications, with the same user-friendly PyTorch framework they know and expect. We look forward to contributing to and growing the global AI community as an active member of the PyTorch Foundation and are proud to be the first general member,” said Anthony Barbier, Software Frameworks Lead at Graphcore.

To learn more about how you can be a part of the PyTorch Foundation, visit our website.

About Graphcore

Graphcore compute systems are accelerating the AI revolution. Powered by the groundbreaking Intelligence Processing Unit (IPU), Graphcore delivers leading-edge AI performance with unprecedented efficiency. IPUs are used around the world by organisations building their intelligent compute capabilities, including AI-centric startups, large multinational corporations and both public and private research institutions. Graphcore is backed by some of the world’s leading investors and has attracted more than $700m of funding. The company is based in Bristol, UK, with offices across Europe, Asia and North America.

About PyTorch Foundation

The PyTorch Foundation is a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem. The PyTorch Foundation is supported by its members and leading contributors to the PyTorch open source project. The Foundation leverages resources provided by members and contributors to enable community discussions and collaboration.

About The Linux Foundation

The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects are critical to the world’s infrastructure including Linux, Kubernetes, Node.js, ONAP, PyTorch, RISC-V, SPDX, OpenChain, and more. The Linux Foundation focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org. The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see its trademark usage page. Linux is a registered trademark of Linus Torvalds.

Automated trace collection and analysis

In this blog, we share how we enabled the collection and analysis of PyTorch Profiler traces for training workloads without any user side code instrumentation. We leveraged Dynolog – an open source daemon for CPU and GPU telemetry to collect PyTorch Profiler traces, and analyzed the collected traces using Holistic Trace Analysis – an open source library for analyzing PyTorch Profiler traces. This toolchain has allowed engineers at Meta to accelerate their performance optimization workflows. The keystone to our solution was implementing pre and post hooks for the base Optimizer class in PyTorch. We demo PyTorch trace collection using Dynolog in a short video.

Problem

Software developers at Meta run a large number of distributed training runs daily. In order to ensure that GPUs are being used effectively it is necessary to measure and analyze GPU performance for all jobs. Moreover, developers need the capability to introspect models and understand how CPUs and GPUs interact to debug performance issues. Developers build initial prototypes using a handful of GPUs and the production versions scale out to hundreds or thousands of GPUs, serving numerous business use cases such as generative AI, recommendation systems, ad ranking etc.

Given the scale at Meta, it is necessary to have toolchains for performance measurement and monitoring which have low overhead and operate seamlessly with each other, to maintain high developer efficiency.

In this blog, we describe how we use the PyTorch Profiler, Dynolog (a telemetry daemon) and Holistic Trace Analysis (a performance debugging library) to collect traces without any user side code instrumentation and analyze them to identify jobs with low GPU utilization.

Solution

The diagram below shares an overview of how the toolchain works together.

  1. User launches a PyTorch application.
  2. A training service or user triggers a profiling session using the Dynolog CLI which sends a request over the network to the Dynolog daemon.
  3. Dynolog daemon relays the profiling configuration to the PyTorch application, setting it temporarily in a profiling mode.
  4. PyTorch Profiler collects a trace and stores it to the database (e.g., network file system or S3 bucket).
  5. The collected traces are then analyzed using Holistic Trace Analysis (HTA).

Figure 1: Dynolog, PyTorch Profiler and HTA toolchain workflow

Let’s dig a bit deeper into each of the components.

Dynolog

Dynolog is a lightweight monitoring daemon for heterogeneous CPU-GPU systems. It supports continuous monitoring of performance metrics from the CPU (utilization, network bandwidth, instructions/second) and GPU (SM Occupancy, DRAM bandwidth, GPU power draw). Additionally, Dynolog exposes APIs to collect deep-dive profiling data that can be accessed via the dyno CLI.

One of the chief integrations Dynolog offers is interfacing with the PyTorch Profiler. This enables on-demand remote tracing using a single command to trace thousands of servers. This can be accomplished by using the dyno gputrace command.

PyTorch Profiler

GPU kernels execute asynchronously, and GPU-side support is needed to create the trace. NVIDIA provides this visibility via the CUPTI library. Kineto is the subsystem within Profiler that interfaces with CUPTI. The PyTorch Profiler leverages the Kineto library to collect GPU traces. To enable automated profiling of training workloads at scale without any user side code instrumentation we made a few fundamental changes to PyTorch. These changes enable trace collection without any user intervention.

  • Registration: First, we modified PyTorch to register with the Dynolog daemon on start up. This feature is switched on by setting the environment variable KINETO_USE_DAEMON=True. With this environment variable set to True, the PyTorch Profiler periodically polls Dynolog to check for on-demand tracing requests.
  • Iteration hooks: Then, we implemented pre and post hooks for the base Optimizer class. This allowed us to annotate start/end of training iterations. The profiler is then aware of the iteration count and can safely capture a fixed number of iterations in the trace.
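
The iteration-hook idea can be sketched without PyTorch. ToyOptimizer and IterationTracer below are hypothetical stand-ins written for this example; the method names mirror register_step_pre_hook and register_step_post_hook on torch.optim.Optimizer, but the hook signature is simplified and this is not the profiler's actual implementation.

```python
class ToyOptimizer:
    """Stand-in for an optimizer base class with step pre/post hooks."""

    def __init__(self):
        self._pre_hooks, self._post_hooks = [], []

    def register_step_pre_hook(self, hook):
        self._pre_hooks.append(hook)

    def register_step_post_hook(self, hook):
        self._post_hooks.append(hook)

    def step(self):
        for hook in self._pre_hooks:
            hook(self)
        # ... the parameter update would happen here ...
        for hook in self._post_hooks:
            hook(self)


class IterationTracer:
    """Hypothetical profiler shim: records a fixed number of iterations."""

    def __init__(self, iterations_to_capture):
        self.remaining = iterations_to_capture
        self.iteration = 0
        self.events = []

    def on_step_start(self, optimizer):
        if self.remaining > 0:
            self.events.append(("start", self.iteration))

    def on_step_end(self, optimizer):
        if self.remaining > 0:
            self.events.append(("end", self.iteration))
            self.remaining -= 1
        self.iteration += 1


opt, tracer = ToyOptimizer(), IterationTracer(iterations_to_capture=2)
opt.register_step_pre_hook(tracer.on_step_start)
opt.register_step_post_hook(tracer.on_step_end)
for _ in range(5):
    opt.step()
print(tracer.events)
```

Because the hooks live on the optimizer base class, the training script itself needs no profiling code: every call to step() marks an iteration boundary.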

Holistic Trace Analysis (HTA)

ML researchers and engineers often struggle to computationally scale up their models as they are unaware of the performance bottlenecks in their workloads. Large distributed training jobs could generate thousands of traces, containing way too much data for a human to inspect. This is where Holistic Trace Analysis comes in. HTA is an open source library for performance analysis – it takes as input PyTorch Profiler traces and up-levels the performance information contained in them. Its goal is to help researchers and engineers achieve the best performance from the hardware stack. To aid performance debugging HTA provides the following features (partial list):

  • Temporal Breakdown: Breakdown of GPU time in terms of time spent in computation, communication, memory events, and idle time on a single node and across all ranks.
  • Idle Time Breakdown: Breakdown of GPU idle time into waiting for the host, waiting for another kernel or attributed to an unknown cause.
  • Kernel Breakdown: Find kernels with the longest duration on each rank.
  • Kernel Duration Distribution: Distribution of average time taken by longest kernels across different ranks.
  • Communication Computation Overlap: Calculate the percentage of time when communication overlaps computation.
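
To give a feel for one of these metrics, the sketch below computes a communication/computation overlap percentage from kernel time intervals. It assumes the metric is the share of communication time during which a computation kernel is also running, and it uses made-up intervals and hypothetical function names rather than HTA's API.

```python
def total_length(intervals):
    """Sum of interval durations; intervals are (start, end) pairs."""
    return sum(end - start for start, end in intervals)


def overlap_length(a, b):
    """Time covered by both lists (each list is assumed non-overlapping)."""
    return sum(
        max(0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )


def comm_comp_overlap_pct(compute, comm):
    comm_time = total_length(comm)
    if comm_time == 0:
        return 0.0
    return 100.0 * overlap_length(compute, comm) / comm_time


# Hypothetical kernel intervals on one rank, in microseconds.
compute = [(0, 40), (60, 100)]
comm = [(30, 70)]
print(comm_comp_overlap_pct(compute, comm))
```

A higher percentage means more communication is hidden behind computation, so the GPU spends less time blocked on collectives.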

We invite you to check out these Jupyter notebooks to see what HTA can do for you. If you are a first time user we recommend starting with the trace_analysis_demo notebook.

To summarize, Dynolog allows us to collect PyTorch Profiler traces on-the-fly in a scalable manner. Furthermore, by leveraging HTA we can automate performance analysis and identify bottlenecks. At Meta, we use the Dynolog, PyTorch Profiler and HTA toolchain to accelerate our performance optimization workflows.

Demo

We share a screencast showcasing trace collection without any user side code instrumentation for a toy PyTorch program. The demo runs in a docker container and the trace collection is triggered using Dynolog. HTA can be used to subsequently analyze the collected trace.

FAQs

Q. What else can dyno gputrace do for me?

The dyno gputrace command supports several custom PyTorch Profiler options:

  • capturing python stacks
  • memory profiling
  • record input shapes

Please run dyno gputrace --help for all the options.

Q. Does Dynolog collect hardware performance metrics?

Dynolog can also be used for always-on monitoring:

  • It incorporates out-of-box GPU performance monitoring for NVIDIA GPUs using DCGM.
  • Dynolog provides basic Linux kernel performance metrics including CPU, network and IO resource usage.
  • Dynolog manages hardware performance counters for micro-architecture specific events related to CPU Cache, TLBs etc on Intel and AMD CPUs.

Q. How can I build the Docker image used in the demo?

The dockerfile is available here. Use the command below to build the Docker image.

docker build -f /path/to/dynolog_repo/dynolog_hta.dockerfile -t <image_name:tag> .

Q. How can I run the docker image?

You can refer to this cheat sheet to run the Docker image.

Acknowledgements

We would like to thank Adnan Aziz, Jay Chae, Aaron Shi, Taylor Robie, Zachary Jones, William Sumendap, Jakob Johnson, Hao Wang, David Carrillo Cisneros, Alston Tang and Parth Malani for supporting this work.

PyTorch/XLA SPMD: Scale Up Model Training and Serving with Automatic Parallelization

Today, we are delighted to announce PyTorch/XLA SPMD: the integration of GSPMD into PyTorch with an easy to use API. PyTorch developers seeking superior performance and scale can train and serve the largest neural networks while maximizing utilization of AI accelerators, such as Google Cloud TPUs.

Introduction

GSPMD is an automatic parallelization system for ML workloads. The XLA compiler transforms the single device program into a partitioned one with proper collectives, based on the user provided sharding hints. This allows developers to write PyTorch programs as if they are on a single large device without any custom sharded computation and/or collective communication ops to scale models.

PyTorch/XLA SPMD allows PyTorch users to parallelize their ML workloads with GSPMD with less effort and with better performance. Some of the key highlights are:

  • Better developer experience. Everything happens with a few sharding annotations from the user, and PyTorch/XLA SPMD achieves comparable performance to the most efficient PyTorch sharding implementation (see the Examples and Results section below). PyTorch/XLA SPMD separates the task of programming an ML model from the challenge of parallelization. Its automated approach to model sharding frees up the user from implementing the sharded version of ops with proper collectives in place.
  • A single API that enables a large variety of parallelism algorithms (including data parallelism, fully sharded data parallelism, spatial partitioning, tensor and pipeline parallelism, as well as combinations of these algorithms) for different ML workloads and model architectures.
  • Industry-leading performance in large model training. PyTorch/XLA SPMD brings the powerful XLA GSPMD to PyTorch, enabling users to harness the full power of Google Cloud TPUs.
  • Enabling PyTorch and JAX developers to take advantage of the same underlying XLA API to scale models.

Key Concepts

The key concepts behind the sharding annotation API are: 1) Mesh, 2) Partition Spec, and 3) mark_sharding API to express sharding intent using Mesh and Partition Spec. A more detailed design overview is available as a user guide here.

Mesh

For a given cluster of devices, a physical mesh is a representation of the interconnect topology.

We derive a logical mesh based on this topology to create sub-groups of devices which can be used for partitioning different axes of tensors in a model. We apply sharding annotations to map the program across the logical mesh; this automatically inserts communication collectives in the program graph to support functional correctness (see the figure below).

SPMD on PyTorch/XLA

We abstract logical mesh with Mesh API. The axes of the logical Mesh can be named. Here is an example:

import numpy as np
import torch_xla.runtime as xr
from torch_xla.experimental.xla_sharding import Mesh

# Assuming you are running on a TPU host that has 8 devices attached
num_devices = xr.global_runtime_device_count()
# mesh shape will be (4,2) in this example
mesh_shape = (num_devices // 2, 2)
device_ids = np.array(range(num_devices))
# axis_names 'x' and 'y' are optional
mesh = Mesh(device_ids, mesh_shape, ('x', 'y'))

mesh.get_logical_mesh()
>> array([[0, 1],
          [2, 3],
          [4, 5],
          [6, 7]])
mesh.shape()
>> OrderedDict([('x', 4), ('y', 2)])

Partition Spec

partition_spec has the same rank as the input tensor. Each dimension describes how the corresponding input tensor dimension is sharded across the device mesh (logically defined by mesh_shape). partition_spec is a tuple of device_mesh dimension index, None, or a tuple of mesh dimension indices. The index can be an int or str if the corresponding mesh dimension is named. This specifies how each input rank is sharded (index to mesh_shape) or replicated (None).

# Provide optional mesh axis names and use them in the partition spec
mesh = Mesh(device_ids, (4, 2), ('data', 'model'))
partition_spec = ('model', 'data')
xs.mark_sharding(input_tensor, mesh, partition_spec)

We support all three types of sharding described in the original GSPMD paper. For instance, one can specify partial replication like this:

# Provide optional mesh axis names and use them in the partition spec
mesh = Mesh(device_ids, (2, 2, 2), ('x', 'y', 'z'))

# evenly shard across x and z and replicate among y
partition_spec = ('x', 'z')  # equivalent to ('x', None, 'z')
xs.mark_sharding(input_tensor, mesh, partition_spec)
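
Partial replication can be pictured as grouping devices by their coordinates along the sharded mesh axes: devices that differ only along the replicated axis hold the same shard. The sketch below is a plain-Python illustration under the assumption that device IDs are assigned to the (2, 2, 2) mesh in row-major order; replica_groups is a hypothetical helper, not a PyTorch/XLA function.

```python
from itertools import product


def replica_groups(mesh_shape, axis_names, sharded_axes):
    """Group device ids that hold identical shards (illustrative only)."""
    groups = {}
    coords = product(*(range(size) for size in mesh_shape))
    for device_id, coord in enumerate(coords):
        # Keep only the coordinates along axes that actually shard the tensor.
        key = tuple(c for name, c in zip(axis_names, coord) if name in sharded_axes)
        groups.setdefault(key, []).append(device_id)
    return groups


# Shard across 'x' and 'z', replicate across 'y'.
groups = replica_groups((2, 2, 2), ('x', 'y', 'z'), {'x', 'z'})
for shard_coord, devices in sorted(groups.items()):
    print(shard_coord, devices)
```

Under these assumptions there are four distinct shards, each stored on two devices that differ only in their 'y' coordinate.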

Simple Example With Sharding Annotation

Users can annotate native PyTorch tensors using the mark_sharding API (src). This takes a torch.Tensor as input and returns an XLAShardedTensor as output.

def mark_sharding(t: Union[torch.Tensor, XLAShardedTensor], mesh: Mesh, partition_spec: Tuple[Union[int, None]]) -> XLAShardedTensor

The mark_sharding API takes a user-defined logical mesh and partition_spec and generates a sharding annotation for the XLA compiler. The sharding specification is attached to the XLATensor, as well as to the original input tensor. Here is a simple usage example from the [RFC] that illustrates how the sharding annotation API works:

import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.experimental.xla_sharding as xs
from torch_xla.experimental.xla_sharded_tensor import XLAShardedTensor
from torch_xla.experimental.xla_sharding import Mesh

# Enable XLA SPMD execution mode.
xr.use_spmd()

# Device mesh: together with the partition spec and the input tensor shape,
# it defines the individual shard shape.
num_devices = xr.global_runtime_device_count()
mesh_shape = (2, num_devices // 2)  # 2x4 on v3-8, 2x2 on v4-8
device_ids = np.array(range(num_devices))
mesh = Mesh(device_ids, mesh_shape, ('x', 'y'))

t = torch.randn(8, 4).to(xm.xla_device())

# Mesh partitioning: each device holds 1/num_devices of the input
# (1/8 on a 2x4 mesh, 1/4 on a 2x2 mesh)
partition_spec = (0, 1)
m1_sharded = xs.mark_sharding(t, mesh, partition_spec)
assert isinstance(m1_sharded, XLAShardedTensor)
# Note that the sharding annotation is also applied in place to t

We can annotate different tensors in the PyTorch program to enable different parallelism techniques, as described in the comments below:

# Sharding annotate the linear layer weights.
model = SimpleLinear().to(xm.xla_device())
xs.mark_sharding(model.fc1.weight, mesh, partition_spec)

# Training loop
model.train()
for step, (data, target) in enumerate(loader):
  # Assumes `loader` returns data, target on XLA device
  optimizer.zero_grad()
  # Annotate the input data with a sharding spec; any input
  # dimension can be sharded. Sharding the batch dimension enables
  # data parallelism, sharding the feature dimension enables
  # spatial partitioning.
  xs.mark_sharding(data, mesh, partition_spec)
  output = model(data)
  loss = loss_fn(output, target)
  # Compute gradients before the optimizer step
  loss.backward()
  optimizer.step()
  xm.mark_step()

More complete unit test cases and integration test examples are available in the PyTorch/XLA repo.

Results

Performance

We measured the performance of PyTorch/XLA SPMD using a GPT-2 model (src) and compared it with user-mode FSDP.

Here, SPMD applies the same sharding scheme as the FSDP plot (i.e. 1D sharding). Users are expected to achieve better MFU results by exploring more advanced SPMD sharding schemes.

SPMD vs. FSDP

We use Model FLOPS Utilization (MFU) as a metric for comparison. MFU is “the ratio of the observed throughput relative to the theoretical maximum throughput of a system operating at peak FLOPs” (PaLM paper).

flops_per_step = 6 * global_batch_size * seq_len * num_params
model_flops_utilization = flops_per_step / step_time(s) / chip_count / flops_per_chip

This estimate assumes that the input dimensionality is much larger than the input sequence length (d_model >> seq_len). If this assumption is violated, the self-attention FLOPs become significant and this expression underestimates the true MFU.
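
The two expressions above translate directly into a small helper. This is a sketch for illustration: the function name and all input values below are made up and are not measurements from the experiments in this post; flops_per_chip should be replaced with the peak FLOPs of the actual accelerator.

```python
def model_flops_utilization(global_batch_size, seq_len, num_params,
                            step_time_s, chip_count, flops_per_chip):
    """MFU estimate from the formula above; ignores self-attention FLOPs."""
    flops_per_step = 6 * global_batch_size * seq_len * num_params
    return flops_per_step / step_time_s / chip_count / flops_per_chip


# Placeholder inputs, chosen only to exercise the formula.
mfu = model_flops_utilization(
    global_batch_size=64,
    seq_len=1024,
    num_params=2e9,        # hypothetical 2B-parameter model
    step_time_s=2.0,       # hypothetical seconds per training step
    chip_count=8,
    flops_per_chip=100e12, # placeholder peak FLOPs per chip
)
print(f"{mfu:.1%}")  # 49.2% with these made-up inputs
```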

Scalability

One of the core benefits of SPMD is flexible partitioning, which can be used to reduce accelerator memory (HBM) usage and improve scalability. For scalability analysis, we present two studies: 1) we examine the peak HBM usage across 4 model sizes using Hugging Face transformers (GPT-2) as the base implementation; 2) we examine the peak HBM usage with spatial partitioning.

Peak HBM Utilization

The figure above shows that the peak memory footprint of the unsharded 2B-parameter model stands at 26GB (red dashed line). Sharding model weights (model parallelism) reduces the peak memory footprint and thus enables larger model training with a given TPU pod slice. In these experiments, we achieved up to 39.75% MFU on a 4B-parameter model on Google Cloud TPU v4-16.

We also ran an input batch scalability test using spatial partitioning and a simple ResNet50 example (src) on Cloud TPU v4-8. The input batch is commonly sharded across the batch dimension for data parallelism (DDP, FSDP), but PyTorch/XLA SPMD also enables sharding across input feature dimensions for spatial sharding. As shown in the figure below, one can push the per-device batch size to 512 with spatial partitioning, which is not possible with other data parallelism techniques.

Batch size scaling with spatial partitioning

The Road Forward for PyTorch/XLA SPMD

We are ecstatic about what’s ahead for PyTorch/XLA and invite the community to join us. SPMD is still experimental, and we continuously add new features to it. In future releases, we plan to address async dataloading, partially replicated sharding, and other improvements. We’d love to hear from you, answer your questions about PyTorch/XLA SPMD, and learn how you use SPMD.

Cheers!

The PyTorch/XLA Team at Google
