
The Path to Achieve Ultra-Low Inference Latency With LLaMA 65B

Background & State of the Art

In the natural language processing (NLP) space, language models are designed to generate a token (e.g. word) using a sequence of past input tokens. Large Language Models (LLMs) are the latest deep learning innovation in this space built to generate text in a human-like fashion. These models generally use transformers to improve their attention over a large sequence of input tokens.

LLaMA, open sourced by Meta AI, is a powerful foundation LLM trained on over 1T tokens. LLaMA is competitive with many best-in-class models such as GPT-3, Chinchilla, and PaLM. LLaMA (13B) outperforms GPT-3 (175B), highlighting its ability to extract more performance from each model parameter.

In this blog post, we use LLaMA as an example model to demonstrate the capabilities of PyTorch/XLA for LLM inference. We discuss how the computation techniques and optimizations discussed here improve inference latency by 6.4x on 65B parameter LLaMA models powered by Google Cloud TPU v4 (v4-16).

Model Overview

We demonstrate the performance capabilities of PyTorch/XLA on LLaMA, the latest LLM from Meta. We showcase performance optimizations on a series of common LLaMA configurations. Note that a 175B parameter LLaMA configuration is not publicly available; for the 175B parameter model mentioned below, we apply the OPT 175B model configuration to the LLaMA code base. Unless stated otherwise, in all configurations we use max_seq_len=256 and dtype=bfloat16 for weights and activations.

Table 1: Model Configurations Explored in this article

LLaMA Model Hyper Parameters

# Parameters    Dimensions    N Heads    N Layers    Max Seq Len
7B              4,096         32         32          256
33B             6,656         52         60          256
65B             8,192         64         80          256
175B            12,288        96         96          256

Performance Challenges of LLMs

LLMs have a few properties that make them challenging for compiler optimizations. (a) LLMs use autoregressive decoding to generate the next token based on the previous ones; this means prompt tensors and caches have dynamic shapes. (b) LLMs must work with variable input prompt lengths without triggering recompilation due to input tensor shape changes; input tensors must be properly bucketized and padded to avoid recompilation. (c) LLMs often require more memory than a single TPU (or GPU) device can support; a model-sharding scheme is required to fit the model across a distributed compute architecture. For instance, a LLaMA model with 65B parameters can fit on a v4-16 Cloud TPU, which is comparable to 8 A100 GPUs. (d) Running LLMs in production can be expensive; one way to improve performance per total cost of ownership (Perf/TCO) is via quantization, which can potentially reduce hardware requirements.

Inference Tech Stack in PyTorch/XLA

Our goal is to offer the AI community a high performance inference stack. PyTorch/XLA integrates with TorchDynamo, PjRt, OpenXLA, and various model parallelism schemes. TorchDynamo eliminates tracing overhead at runtime, PjRt enables efficient host-device communication, and PyTorch/XLA traceable collectives enable model and data parallelism on LLaMA via TorchDynamo. To reproduce our LLaMA inference results, please use our custom torch and torch-xla wheels. PyTorch/XLA 2.1 will support the features discussed in this post by default.

Parallel Computing

FairScale Sharding

LLaMA uses FairScale model sharding API (fairscale.nn.model_parallel.layers). We built an equivalent representation of this API using PyTorch/XLA communication collective (CC) ops such as all-reduce to communicate program state (e.g. activations) between accelerators. TorchDynamo does not fully support capturing CC ops currently (a.k.a. traceable collectives). Without this support, a TorchDynamo FX graph would be cut at every device communication, meaning at every model layer. Graph cuts lead to performance loss as the underlying XLA compiler loses full graph optimization opportunities. To resolve this, we offer PyTorch/XLA traceable collectives by integrating the dispatcher collectives into our existing CC APIs. The difference is we don’t need to insert c10d.wait() ops after collectives, given the lazy execution nature of PyTorch/XLA. With support for traceable collectives, PyTorch/XLA allows singular FX graph generation in TorchDynamo.
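
To give a sense of how a sharded layer communicates activations with a CC op, here is a minimal, hypothetical sketch of a row-parallel linear layer that sums its partial outputs with the PyTorch/XLA all-reduce collective. It is not the actual FairScale-equivalent API; the class and shape names are illustrative.

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

class RowParallelLinear(nn.Module):
    # Hypothetical sketch: each device holds a slice of the weight along the input dimension.
    def __init__(self, in_features_per_shard, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features_per_shard))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x_shard):
        partial = torch.matmul(x_shard, self.weight.t())  # local partial result
        # Sum partial results across devices; with traceable collectives this op can be
        # captured by TorchDynamo without cutting the FX graph at every layer.
        return xm.all_reduce(xm.REDUCE_SUM, partial)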

Autoregressive Decoding on PyTorch/XLA

LLMs use autoregressive decoding to feed the previously generated token back as input when predicting the next token. Autoregressive decoding leads to unbounded dynamic shape problems, which in turn cause recompilation for every prompt. We optimized the LLaMA autoregressive decoder to operate on fixed shapes, updating the KV-cache, output sequences, and attention masks in place during every token generation. With a combination of padding, masking, and index ops, we avoided excessive graph recompilation, thereby achieving efficient autoregressive decoding.

KV-Cache Optimization

LLaMA implements autoregressive decoding with KV-cache. For every generated token, the KV-cache stores the attention key/value activations of each Transformer layer. Thus, upon decoding a new token, the key/values of prior tokens no longer need recomputation.

In LLaMA, the KV-cache tensor slices are updated in-place; this leads to recompilation events every time a token is generated. To address this issue, we use index tensors and tensor.index_copy() ops to replace the in-place slice updates. Attention masks and output sequences also benefit from the same optimization.
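
The sketch below illustrates the idea with a hypothetical, fixed-shape key cache updated via index_copy_; tensor names and shapes are illustrative and do not mirror the LLaMA code.

import torch

max_seq_len, n_heads, head_dim = 256, 32, 128
cache_k = torch.zeros(1, max_seq_len, n_heads, head_dim)  # pre-allocated, fixed shape

def update_kv_cache(cache_k, new_k, pos):
    # new_k: [1, 1, n_heads, head_dim] holds the key for the token at position `pos`.
    # index_copy_ writes it in place while keeping the cache shape constant, so the
    # traced graph stays identical from step to step and no recompilation is triggered.
    return cache_k.index_copy_(1, torch.tensor([pos]), new_k)

cache_k = update_kv_cache(cache_k, torch.randn(1, 1, n_heads, head_dim), pos=5)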

Input Prompt Optimization

Variable length input prompts are common in LLM applications. This property causes input tensor shape dynamism and in turn recompilation events. When processing a prompt to fill the KV-cache, we either (a) process the input prompt token-by-token, or (b) process the whole prompt in one iteration. The pros and cons of each method are:

  1. Pre-compile 1 graph and process a prompt token-by-token
    • Practical: 1 graph is compiled during warm-up
    • Slow: O(L) to process an input prompt length L – a disadvantage for long prompts
  2. Pre-compile all graphs with input lengths ranging from 1 to max_seq_len (e.g. 2,048)
    • Impractical: pre-compile and cache max_seq_len graphs during warm-up time
    • Fast: 1 graph execution to process the full prompt

We introduce prompt length bucketization, an optimization to strike a balance between the two alternatives. We define a set of ascending bucket sizes, (b0, b1, b2, …, b(B-1)), and then pre-compile program graphs with input sizes according to these bucket values, (G0, G1, G2, …, G(B-1)); B is the number of buckets. For a given input prompt, we round up the prompt length to the closest bucket value bn, pad the sequence, and use Gn to process the prompt in one iteration. The computation on the padding tokens is discarded. For prompts larger than the largest bucket size, we process them section-by-section.

The optimal bucket sizes should be determined by the prompt length distribution in a target application. Here, we adopt bucket lengths of 128, 256, 384, and 512. Any input prompt with up to 2,047 tokens requires at most 4 graph executions. For example, a 1,500-token input prompt with a generation length of 256 requires 260 graph executions: 4 to process the input and 256 to generate the output.
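
A minimal sketch of the bucketization step, assuming the bucket values above and a hypothetical pad token id (prompts longer than the largest bucket would be split into sections first):

import torch

BUCKETS = [128, 256, 384, 512]
PAD_ID = 0  # hypothetical pad token id

def bucketize_prompt(token_ids):
    length = len(token_ids)
    # Round the prompt length up to the closest pre-compiled bucket size.
    bucket = next(b for b in BUCKETS if b >= length)
    padded = token_ids + [PAD_ID] * (bucket - length)
    # The real length is kept so computation on the padding tokens can be discarded.
    return torch.tensor([padded]), length

tokens, real_len = bucketize_prompt(list(range(200)))  # a 200-token prompt maps to the 256 bucket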

Quantization

Quantization reduces the number of bits necessary to represent a value; it reduces the bandwidth to communicate data across multiple accelerator nodes (via collectives) and lowers the hardware requirements to serve a specific model size.

Normally, with BF16 weights, a 175B parameter model would consume about 351GB of memory, and therefore require a v4-32 instance to accommodate the model. By quantizing the weights to INT8, we reduced the model size by roughly 50%, allowing it to run on a smaller v4-16 instance. Because LLaMA shards model activations, quantization offers negligible communication gain.

In our experiments, we quantized the linear layers. Since LLaMA model checkpoints are not publicly available, and our goal is to evaluate performance, the quantized model is initialized with random weights. Recent literature, such as AWQ and Integer or Floating Point?, offers insights into the performance properties of LLaMA under various low-bit quantization schemes.

Effect of Batch Size on Quantization Performance

TPU v4 is programmed to run matmul on the Matrix Multiply Unit (MXU) when the model batch size (BS) > 1. For BS = 1, matmul runs on the Vector Processor Unit (VPU). Since MXU is more efficient than VPU, INT8 quantization gains performance at BS>1. See Performance Analysis section for details.

Op Support

Occasionally, new models introduce new mathematical operations that require PyTorch/XLA to extend its supported op set for compilation. For LLaMA, we added support for multinomial.

Methodology

LLaMA works on PyTorch/XLA out of the box with LazyTensorCore. We use this configuration as a baseline for our follow-up analysis. All experiments assume 256-token input prompts. In the absence of a publicly available model checkpoint, we used random tensor initialization for this inference stack optimization effort. A model checkpoint is not expected to change the latency results discussed here.

Model Sizing

Assuming N is the number of parameters, dimensions is the hidden size, n_layers is the number of layers, and n_heads is the number of attention heads, the equation below approximates the model size. See the Model Overview section for details.

N = (dimensions)^2 * n_layers * 12

n_heads doesn’t affect N, but the following relation holds for the open-sourced model configurations.

dimensions = 128 * n_heads

Cache Sizing

Both model parameters and the cache layers in the Attention block contribute to memory consumption. Since the default LLaMA model uses BF16 weights, the memory consumption calculation in this section is based on BF16 weights.

The size of the cache layer is calculated by cache_size = max_batch_size * max_seq_len * dimensions. max_batch_size = 1 and max_seq_len = 256 are used as an example configuration in the following calculations. There are 2 cache layers in each Attention block. So, the total LLaMA cache size (in Bytes) is total_cache_size = n_layers * 2 * cache_size * (2 bytes).

TPU v4 Hardware Sizing

Each TPU v4 chip has 32GB of available High-Bandwidth Memory (HBM). Table 2 has the details on memory consumption and the number of required TPU chips to hold a LLaMA model.
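
As a rough cross-check of this arithmetic, the sketch below combines the approximate sizing formulas from the previous two sections with the 32GB-per-chip figure. The approximate parameter counts come out slightly below the nominal sizes listed in Table 2, but the derived chip counts match.

import math

configs = {  # dimensions, n_layers (from Table 1)
    "7B":   (4096, 32),
    "33B":  (6656, 60),
    "65B":  (8192, 80),
    "175B": (12288, 96),
}
max_batch_size, max_seq_len, bytes_per_value = 1, 256, 2  # BF16 = 2 bytes per value

for name, (dim, n_layers) in configs.items():
    n_params = dim ** 2 * n_layers * 12                    # N = dimensions^2 * n_layers * 12
    cache_size = max_batch_size * max_seq_len * dim        # values per cache layer
    total_bytes = (n_params + n_layers * 2 * cache_size) * bytes_per_value
    total_gb = total_bytes / 1e9
    print(f"{name}: ~{total_gb:.1f} GB -> {math.ceil(total_gb / 32)} TPU v4 chip(s)")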

Table 2: LLaMA TPU v4 HBM requirements (i.e. TPU v4 chip requirements)

# Parameters    Parameter Memory (MB)    Cache Memory (MB)    Total (GB)    Min # of TPU v4 Chips
7B              14,000                   134                  14.128        1
33B             66,000                   408                  66.41         3
65B             130,000                  671                  130.67        5
175B            350,000                  1,208                351.21        11

Metrics

Below are useful metrics to measure inference speed, assuming T is the total time, B is the batch size, and L is the decoded sequence length.

Latency Definition

Latency is the time it takes to get the decoded result at target length L, regardless of the batch size B. Latency represents how long the user should wait to get the response from the generation model.

Latency = T (s)

Per-token latency

One step of autoregressive decoding generates a token for each sample in the batch. Per-token latency is the average time for that one step.

Per-token latency = T / L (s/token)

Throughput

Throughput measures how many tokens are generated per unit time. While it is not a useful metric for evaluating online serving, it is useful for measuring the speed of batch processing.

Throughput = B * L / T (tokens/s)

To minimize confusion and misinterpretation, it’s better to avoid metrics like T / (B * L), which mixes latency and throughput.
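
A tiny helper, using the definitions above (the numbers in the example call are purely illustrative):

def inference_metrics(T, B, L):
    # T: total time in seconds, B: batch size, L: decoded sequence length
    return {
        "latency_s": T,                         # Latency = T
        "per_token_latency_s": T / L,           # Per-token latency = T / L
        "throughput_tokens_per_s": B * L / T,   # Throughput = B * L / T
    }

print(inference_metrics(T=3.7, B=1, L=256))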

Results

Figure 1 shows latency / token results for LLaMA 7B to 175B models. In each case, the model is run on a range of TPU v4 configurations. For instance, LLaMA 7B shows 4.7ms/token and 3.8ms/token on v4-8 and v4-16 respectively. For more comparison, visit the HuggingFace LLM performance leaderboard.

In the absence of the features discussed in this blog post, LLaMA 65B running on v4-32 delivers 120ms/token instead of the 14.5ms/token obtained here, an 8.3x speedup. As discussed earlier, developers are encouraged to try our custom torch and torch-xla wheels to reproduce the LLaMA inference results shared here.

Figure 1: LLaMA Inference Performance on TPU v4 hardware

PyTorch/XLA:GPU performance is better than PyTorch:GPU eager and similar to PyTorch Inductor. PyTorch/XLA:TPU performance is superior to PyTorch/XLA:GPU. In the near future, XLA:GPU will deliver optimizations that bring parity with XLA:TPU. The single A100 configuration only fits LLaMA 7B, and the 8-A100 doesn’t fit LLaMA 175B.

Figure 2: LLaMA Inference Performance on GPU A100 hardware

As the batch size increases, we observe a sublinear increase in per-token latency, highlighting the tradeoff between hardware utilization and latency.

Figure 3: LLaMA Inference Performance across different batch sizes

Our studies suggest the impact of maximum sequence input length (max_seq_len) on inference latency is relatively minimal. We attribute this to the sequential and iterative nature of token generation. The small difference in performance can be due to KV cache access latency changes as the storage size increases.

Figure 4: LLaMA Inference Performance across different prompt lengths

LLMs are often memory-bound applications; thus, by quantizing model parameters we enable loading and executing a larger tensor on MXUs per unit time (i.e. HBM ⇒ CMEM and CMEM ⇒ MXU data movement). Figure 5 shows that INT8 weight-only quantization offers a 1.6x-1.9x speedup, allowing a larger model to run on a given hardware configuration.

When BS=1, INT8 tensors are dispatched to the VPU, which is smaller than the MXU (see the TPU v4 paper); otherwise, the MXU is used. As a result, when BS=1, quantization memory bandwidth gains are offset by the lack of MXU utilization. When BS>1, however, memory gains deliver superior latency on the quantized model. For example, in the case of the 175B parameter LLaMA, v4-16 with quantization and v4-32 without quantization deliver similar performance. Note we do not provide FP8 comparisons because PyTorch does not yet offer this data type.

Figure 5: LLaMA Inference Performance vs. weight-only quantization. The missing blue bars suggest the model size doesn’t fit in the specified TPU hardware.

Figure 6 demonstrates the steady performance advantage of PyTorch/XLA as the input prompt length grows from 10 tokens to 1,500 tokens. This strong scaling capability suggests minimal PyTorch/XLA recompilation events enabling a wide range of real-world applications. In this experiment, the maximum length is 2,048 and maximum generation length is 256.

Figure 6: LLaMA Inference Performance vs. Input Prompt Length

Final Thoughts

We are ecstatic about what’s ahead for PyTorch/XLA and invite the community to join us. PyTorch/XLA is developed fully in open source. So, please file issues, submit pull requests, and send RFCs to GitHub so that we can openly collaborate. You can also try out PyTorch/XLA for yourself on various XLA devices including TPUs and GPUs.

Cheers,
The PyTorch/XLA Team at Google
#PoweredByPyTorch

Read More

Optimized PyTorch 2.0 Inference with AWS Graviton processors

New generations of CPUs offer significant performance improvement in machine learning (ML) inference due to specialized built-in instructions. Combined with their flexibility, high speed of development, and low operating cost, these general-purpose processors offer an alternative ML inference solution to other existing hardware solutions.

AWS, Arm, Meta, and others helped optimize the performance of PyTorch 2.0 inference for Arm-based processors. As a result, we are delighted to announce that Arm-based AWS Graviton instance inference performance for PyTorch 2.0 is up to 3.5 times the speed for ResNet-50 compared to the previous PyTorch release, and up to 1.4 times the speed for BERT, making Graviton-based instances the fastest compute optimized instances on AWS for these models (see the following graph).

Image 1: Relative speed improvement achieved by upgrading from PyTorch version 1.13 to 2.0 (higher is better). The performance is measured on c7g.4xlarge instances.

As shown in the next graph, we measured up to 50% cost savings for PyTorch inference with Graviton3-based c7g instances across Torch Hub ResNet-50 and multiple Hugging Face models compared to comparable x86-based compute optimized Amazon EC2 instances. For that graph, we first measured the cost per million inference for the five instance types. Then, we normalized the cost per million inference results to a c5.4xlarge instance, which is the baseline measure of “1” on the Y-axis of the chart.

Image 2: Relative cost of PyTorch inference running on different AWS instances (lower is better).
Source: AWS ML Blog on Graviton PyTorch2.0 inference performance.

Similar to the preceding inference cost comparison graph, the following graph shows the model p90 latency for the same five instance types. We normalized the latency results to the c5.4xlarge instance, which is the baseline measure of “1” on the Y-axis of the chart. The c7g.4xlarge (AWS Graviton3) model inference latency is up to 50% better than the latencies measured on c5.4xlarge, c6i.4xlarge, and c6a.4xlarge.

Image 3: Relative latency (p90) of PyTorch inference running on different AWS instances (lower is better).
Source: AWS ML Blog on Graviton PyTorch2.0 inference performance.

Optimization details

PyTorch supports Compute Library for the Arm® Architecture (ACL) GEMM kernels via the oneDNN backend (previously called “MKL-DNN”) for AArch64 platforms. The optimizations are primarily for PyTorch ATen CPU BLAS, ACL kernels for fp32 and bfloat16, and oneDNN primitive caching. There are no frontend API changes, so no changes are required at the application level to get these optimizations working on Graviton3-based instances.

PyTorch level optimizations

We extended the ATen CPU BLAS interface to accelerate more operators and tensor configurations via the oneDNN backend for the aarch64 platform. The following diagram highlights (in orange) the optimized components that improved PyTorch inference performance on the aarch64 platform.

Image 4: PyTorch software stack highlighting (in orange) the components optimized for inference performance improvement on AArch64 platform

ACL kernels and BFloat16 FPmath mode

The ACL library provides Neon and SVE optimized GEMM kernels for both fp32 and bfloat16 formats. These kernels improve SIMD hardware utilization and reduce end-to-end inference latencies. The bfloat16 support in Graviton3 allows efficient deployment of models trained using bfloat16, fp32, and Automatic Mixed Precision (AMP). Standard fp32 models use bfloat16 kernels via oneDNN FPmath mode without model quantization, providing up to two times faster performance compared to existing fp32 model inference without bfloat16 FPmath support. For more details on ACL GEMM kernel support, refer to the Arm Compute Library github.
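
As a quick sanity check before benchmarking, a small sketch to confirm that the oneDNN (mkldnn) backend is available and that the DNNL_DEFAULT_FPMATH_MODE environment variable (set in the "How to take advantage of the optimizations" section below) is visible to the process:

import os
import torch

# oneDNN is exposed in PyTorch as the mkldnn backend
assert torch.backends.mkldnn.is_available(), "oneDNN (mkldnn) backend not available"
print("torch version:", torch.__version__)
print("DNNL_DEFAULT_FPMATH_MODE =", os.environ.get("DNNL_DEFAULT_FPMATH_MODE", "<unset>"))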

Primitive Caching

The following call sequence diagram shows how ACL operators are integrated into oneDNN backend. As shown in the diagram, ACL objects are handled as oneDNN resources instead of the primitive objects. This is because the ACL objects are stateful and mutable. Since the ACL objects are handled as resource objects, they are not cacheable with the default primitive caching feature supported in oneDNN. We implemented primitive caching at ideep operator level for “convolution”, “matmul” and “inner product” operators to avoid redundant GEMM kernel initialization and tensor allocation overhead.

Image 5: Call sequence diagram showing how the Compute Library for the Arm® Architecture (ACL) GEMM kernels are integrated into oneDNN backend

How to take advantage of the optimizations

Install the PyTorch 2.0 wheel from the official repo and set environment variables to enable the additional optimizations.

# Install Python
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Upgrade pip3 to the latest version
python3 -m pip install --upgrade pip

# Install PyTorch and extensions
python3 -m pip install torch
python3 -m pip install torchvision torchaudio torchtext

# Turn on Graviton3 optimization
export DNNL_DEFAULT_FPMATH_MODE=BF16
export LRU_CACHE_CAPACITY=1024

Running an inference

You can use PyTorch torchbench to measure the CPU inference performance improvements, or to compare different instance types.

# Pre-requisite:
# pip install PyTorch2.0 wheels and set the above mentioned environment variables

# Clone PyTorch benchmark repo
git clone https://github.com/pytorch/benchmark.git

# Setup ResNet-50 benchmark
cd benchmark
python3 install.py resnet50

# Install the dependent wheels
python3 -m pip install numba

# Run ResNet-50 inference in jit mode. On successful completion of the inference runs,
# the script prints the inference latency and accuracy results
python3 run.py resnet50 -d cpu -m jit -t eval --use_cosine_similarity

Performance Analysis

Now, we will analyze the inference performance of ResNet-50 on Graviton3-based c7g instance using PyTorch profiler. We run the code below with PyTorch 1.13 and PyTorch 2.0 and run the inference for a few iterations as a warmup before measuring the performance.

# Turn on Graviton3 optimization (set in the shell before launching Python)
export DNNL_DEFAULT_FPMATH_MODE=BF16
export LRU_CACHE_CAPACITY=1024
import torch
from torchvision import models
sample_input = [torch.rand(1, 3, 224, 224)]
eager_model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model = torch.jit.script(eager_model, example_inputs=[sample_input, ])

model = model.eval()
model = torch.jit.optimize_for_inference(model)

with torch.no_grad():
    # warmup runs
    for i in range(10):
        model(*sample_input)
    prof = torch.profiler.profile(
      on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs'), record_shapes=True, with_stack=True)
    # profile after warmup
    prof.start()
    model(*sample_input)
    prof.stop()

We use tensorboard to view results of the profiler and analyze model performance.

Install PyTorch Profiler Tensorboard plugin as follows

pip install torch_tb_profiler

Launch the tensorboard using

tensorboard --logdir=./logs

Open the following URL in the browser to view the profiler output. The profiler supports ‘Overview’, ‘Operator’, ‘Trace’ and ‘Module’ views to get insight into the inference execution.

http://localhost:6006/#pytorch_profiler

The following diagram is the profiler ‘Trace’ view which shows the call stack along with the execution time of each function. In the profiler, we selected the forward() function to get the overall inference time. As shown in the diagram, the inference time for the ResNet-50 model on Graviton3-based c7g instance is around 3 times faster in PyTorch 2.0 compared to PyTorch 1.13.

Image 6: Profiler Trace view: Forward pass wall duration on PyTorch 1.13 and PyTorch 2.0

The next diagram is the ‘Operator’ view which shows the list of PyTorch operators and their execution time. Similar to the preceding Trace view, the Operator view shows that the operator host duration for the ResNet-50 model on Graviton3-based c7g instance is around 3 times faster in PyTorch 2.0 compared to PyTorch 1.13.

Image 7: Profiler Operator view: Forward operator Host duration on PyTorch 1.13 and PyTorch 2.0

Benchmarking Hugging Face models

You can use the Amazon SageMaker Inference Recommender utility to automate performance benchmarking across different instances. With Inference Recommender, you can find the real-time inference endpoint that delivers the best performance at the lowest cost for a given ML model. We collected the preceding data using the Inference Recommender notebooks by deploying the models on production endpoints. For more details on Inference Recommender, refer to the amazon-sagemaker-examples GitHub repo. We benchmarked the following models for this post: ResNet50 image classification, DistilBERT sentiment analysis, RoBERTa fill mask, and RoBERTa sentiment analysis.

Conclusion

For PyTorch 2.0, the Graviton3-based C7g instance is the most cost-effective compute optimized Amazon EC2 instance for inference. These instances are available on SageMaker and Amazon EC2. The AWS Graviton Technical Guide provides the list of optimized libraries and best practices that will help you achieve cost benefit with Graviton instances across different workloads.

If you find use cases where similar performance gains are not observed on Graviton, please open an issue on the aws-graviton-getting-started github to let us know about it. We will continue to add more performance improvements to make AWS Graviton-based instances the most cost-effective and efficient general purpose processor for inference using PyTorch.

Acknowledgments

We would like to thank Ali Saidi (Sr. Principal Engineer) and Csaba Csoma (Sr. Manager, Software Development) from AWS, Ashok Bhat (Sr. Product Manager), Nathan Sircombe (Sr. Engineering Manager) and Milos Puzovic (Principal Software Engineer) from Arm for their support during the Graviton PyTorch inference optimization work. We would also like to thank Geeta Chauhan (Engineering Leader, Applied AI) from Meta for her guidance on this blog.

About the authors

Sunita Nadampalli is a ML Engineer and Software Development Manager at AWS.

Ankith Gunapal is an AI Partner Engineer at Meta(PyTorch).

Read More

🎉 PyTorch Docathon H1 2023 Wrap-up 🎉

Thank you to all who participated in our first ever PyTorch Docathon, the results have been nothing short of amazing! We want to extend our sincerest gratitude to all the participants who made this event a resounding success. Your passion, talent, and hard work have left an indelible mark on the PyTorch documentation.

The virtual Docathon ran from May 31 through June 15, with more than 230 registrants and more than 110 participants joining the Docathon Slack channel; the energy and enthusiasm were palpable. Entrants were judged on the difficulty of their submissions, which resulted in over 40 merged pull requests, the publication of four new tutorials, and the addition of one new example.

We want to give a special shout-out to our top contributors, who went above and beyond during this event. Your dedication and expertise have been invaluable in enhancing the PyTorch documentation and empowering developers worldwide. See the full list of contributors here.

Meet the top contributors:

As we bring this Docathon to a close, we encourage each and every one of you to stay inspired and keep contributing to PyTorch documentation and code, and pushing the boundaries of what’s possible with PyTorch. Your collective efforts are shaping the landscape of deep learning and fostering innovation in the AI community.

Team PyTorch

Read More

Join the PyTorch Foundation: Membership Now Open

In September 2022, we welcomed PyTorch to the Linux Foundation from Meta, which formed the PyTorch Foundation with founding members AMD, Amazon Web Services (AWS), Google, Meta, Microsoft, and NVIDIA.

Since then, we’ve seen significant growth, including a 39% increase in commits across all repositories, a 27% increase in unique contributors, and a 12% increase in community contributions – all in the last 90 days! We’re grateful to our founding members for their support to move the foundation forward.

Today, we’re announcing that membership is now open to join the PyTorch Foundation.

As a member of the PyTorch Foundation, you’ll have access to resources that allow you to be stewards of stable, secure, and long-lasting codebases. You can collaborate on training and certification programs, local and regional events, open source developer tooling, academic research, and guides to help new users and contributors have a productive experience.

The PyTorch Foundation’s goal is to help end users navigate the PyTorch ecosystem, recruit talent, and adopt PyTorch and support open source AI technologies successfully.

Why join as a member

Being a part of the PyTorch Foundation grants opportunities to help build the future of end-to-end machine learning frameworks alongside your industry peers.

Membership benefits include:

  • Gain technical traction and insight for your organization’s products by immersing your teams with other industry leaders.
  • Influence technical priorities, approaches, and code.
  • Support the PyTorch project community by helping fund programs and services that the project and its community rely on.
  • Engage with the PyTorch project ecosystem, network with fellow members, and contribute to building and maintaining an engaging and strong PyTorch ecosystem.
  • Provide thought leadership and participate in unique, wide-reaching networking and marketing programs expanding industry awareness as PyTorch amplifies member progress.
  • Retain, attract, and increase engineering skills and employees and build your innovation partner network, supply chain, and customer pipeline.
  • As an active member of the PyTorch community, you can deepen your engagement and leadership in local and industry developer networks and conferences.

How to join

Premier members must submit an application to be considered for board level membership. General and associate members are welcome to join automatically. See below for specific tiering and details on each type of membership.

Premier Members

Premier members are the highest tier. They will appoint one voting representative to any subcommittees or activities of the PTF Governing Board, receive prominent placement in displays of membership including the website, landscape, and marketing materials, get exclusive live webinars with PyTorch online programs, and receive everything included within a “general” membership. The annual fee is $150,000 plus an LF Silver Membership.

General Members

General members will participate in all marketing, community and thought leadership opportunities, as well as discounts on event sponsorships and training courses. General members also have the opportunity to be considered for a PTF board position. The annual fee is dependent on the size of your organization. More details can be found here.

Associate Members

Associate members are free to join and will receive support and participation opportunities with the PyTorch Foundation team. More information can be found here.

Hear from our founding members

AMD

“AMD strongly believes in and supports an open software ecosystem. We are very proud to be a founding member of the PyTorch Foundation, helping to develop an open and collaborative community for AI and ML. AI and ML have the opportunity to impact everything we do, and the work done through the PyTorch Foundation is critical in developing an open framework that is vendor neutral and helps democratize AI for all.”

AWS

“AWS is a firm believer in the PyTorch Foundation mission to develop AI and deep learning tools through open collaboration. Our customers use PyTorch every day to build, train, and deploy machine learning models on AWS. Through our involvement, AWS is supporting innovation and helping to make open source tooling more accessible to our customers and the broader community.”

Google

“The AI revolution is upon us and it’s being built on PyTorch. With new applications like ChatGPT and Stable Diffusion built on PyTorch, the wave of generative AI continues to be felt across every facet of society. We at Google are excited to be a founding member of the PyTorch Foundation and we’re excited for the opportunity to work closely with other leaders in AI to help grow this amazing and innovative community.”

Meta

“Meta has a long history of putting open science at the core of our work in AI and PyTorch is no exception. PyTorch was built from the ground up with an open source, community-first philosophy. We transitioned PyTorch to the PyTorch Foundation because we believe this approach enables the fastest progress in building and deploying new systems that will address real-world needs and answer fundamental questions about the nature of intelligence. With the PyTorch Foundation, the entire AI community is positioned to push the field forward in countless exciting new ways.”

Microsoft

“Microsoft believes strongly in PyTorch and it’s been an honor to be a founding member of the PyTorch Foundation. Internally, we use PyTorch extensively, and an outgrowth of that is the Azure Container for PyTorch, which provides deep optimization for PyTorch development, including ONNX Runtime, DeepSpeed, and Nebula to greatly reduce training cost and accelerate training times on Azure Machine Learning. As part of our ongoing commitment to open source machine learning platforms, we look forward to partnering with industry leaders to continue contributing to the advancement of PyTorch.”

NVIDIA

“As a leading Python-based AI framework, PyTorch has been fundamental to the development of LLMs and GenAI. NVIDIA’s goal is to deepen our collaboration with the open-source AI community as part of the PyTorch Foundation, and help build the next wave of advanced, energy efficient, and cost-effective applications with accelerated computing.”

Join today

We are excited to see the PyTorch Foundation continue to grow alongside the community through neutral governance and support. We hope you’ll join us as a member!

Read More

Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0

As part of PyTorch 2.0 release, an accelerated implementation of the attention mechanism as part of the “Better Transformer” project (and known in PyTorch as Accelerated Transformers) has been added natively into PyTorch as torch.nn.functional.scaled_dot_product_attention. This implementation leverages fused kernels from FlashAttention and Memory-efficient attention, and supports both training and inference.

We also release a notebook showcasing an example of this integration here.

After seeing 20-30% speedups at inference for diffusion models, we went ahead and implemented an integration with 🤗 Transformers models through the 🤗 Optimum library. Similar to the previous integration for encoder models, the integration replaces modules from Transformers with efficient implementations that use torch.nn.functional.scaled_dot_product_attention. The usage is as follows:

import torch
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained("gpt2-large", torch_dtype=torch.float16)

model = BetterTransformer.transform(model)

# do your inference or training here

# if training and want to save the model
model = BetterTransformer.reverse(model)
model.save_pretrained("fine_tuned_model")
model.push_to_hub("fine_tuned_model")

Summarizing our findings below about torch.nn.functional.scaled_dot_product_attention:

  • It is most useful for fitting larger models, sequence lengths, or batch sizes on given hardware during training.
  • Memory footprint savings on GPU during training range from 20% to 110%+.
  • Speedups during training range from 10% to 70%.
  • Speedups during inference range from 5% to 20%.
  • Standalone, for small head dimensions, scaled_dot_product_attention speedups go up to 3x, memory savings go as high as 40x (depending on the sequence length).

You may be surprised by the wide range of memory savings and speedups. In this blog post, we discuss our benchmarks, where this feature shines and upcoming improvements in future PyTorch releases.

In the next release of transformers, you will just need to install the proper version of optimum and run the following to convert your model using the BetterTransformer API:

model = model.to_bettertransformer()

You can already try this feature out by installing transformers from source.

Benchmark and usage with 🤗 Transformers

torch.nn.functional.scaled_dot_product_attention is usable with any architecture that uses standard attention, and notably replaces the boilerplate code below:

# native scaled_dot_product_attention is equivalent to the following:
import math
import torch

def eager_sdpa(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None):
    L, S = query.size(-2), key.size(-2)
    scale_factor = 1 / math.sqrt(query.size(-1)) if scale is None else scale
    attn_mask = torch.ones(L, S, dtype=torch.bool).tril(diagonal=0) if is_causal else attn_mask
    if attn_mask is not None and attn_mask.dtype == torch.bool:
        # convert a boolean mask into an additive bias (-inf where masked out)
        attn_mask = torch.zeros(L, S, dtype=query.dtype).masked_fill(~attn_mask, float("-inf"))
    attn_bias = 0 if attn_mask is None else attn_mask
    attn_weight = torch.softmax(query @ key.transpose(-2, -1) * scale_factor + attn_bias, dim=-1)
    attn_weight = torch.dropout(attn_weight, dropout_p, train=True)
    return attn_weight @ value

In the 🤗 Optimum integration with Transformers models, the following architectures are supported for now: gpt2, gpt-neo, gpt-neox, gptj, t5, bart, codegen, pegasus, opt, LLaMA, blenderbot, m2m100. You can expect this list to be extended in the near future!

To validate the benefits from the native scaled dot-product attention, we ran inference and training benchmarks, whose results are presented below.

Inference benchmark on a single A10G GPU, AWS g5.4xlarge instance

Training benchmark on a single A10G GPU, AWS g5.4xlarge instance

Training benchmark on a single A100-SXM4-80GB, Nvidia DGX

From this benchmark, the most interesting finding is that native SDPA allows for the usage of longer sequence lengths and batch sizes without running into out-of-memory issues. Moreover, up to 20% speedups can be seen during inference, and even larger ones during training.

As seen on the training benchmarks, it appears that smaller head dimension brings higher speedups and memory savings, which we will discuss in the following section.

The implementation supports multi-GPU settings as well, thanks to the 🤗 Accelerate library, by passing device_map="auto" to the from_pretrained method. Here are some results for training on two A100-SXM4-80GB.

Training benchmark on two A100-SXM4-80GB, Nvidia DGX, using 🤗 Accelerate library for distributed training

Note that some kernels support only the sm_80 compute capability (which is the one from A100 GPUs), which limits usability on a wide range of hardware, notably if the head dimension is not a power of two. For example, as of PyTorch 2.0.0 during training, opt-2.7b (headdim=80) and gpt-neox-20b (headdim=96) cannot dispatch to a kernel using flash attention, unless run on an A100 GPU. Better kernels may be developed in the future: https://github.com/pytorch/pytorch/issues/98140#issuecomment-1518101895

Flash Attention, Memory-efficient attention & math differences

The native scaled_dot_product_attention relies on three possible backend implementations: flash attention, memory-efficient attention, and the so-called math implementation which provides a hardware-neutral fallback for all PyTorch platforms.

When fused kernels are available for a given problem size, flash-attention or memory-efficient attention will be used, effectively allowing for a lower memory footprint, as in the memory-efficient attention case O(N) memory allocations are done on the GPU global memory instead of the classic O(N^2) for the traditional eager attention implementation. With flash attention, a reduced number of memory accesses (read and writes) is expected, hence both giving speedups and memory savings.

The “math” implementation is simply an implementation using the PyTorch’s C++ API. Interesting to note in this implementation is that the query and key tensors are scaled individually for numerical stability, thus launching two aten::div operations instead of possibly only one in an eager implementation that does not contain this optimization for numerical stability.

Head dimension influence on speedups, memory savings

Benchmarking torch.nn.functional.scaled_dot_product_attention, we notice a decrease in the speedup / memory gains as the head dimension increases. This is an issue for some architectures like EleutherAI/gpt-neo-2.7B, that has a relatively large head dimension of 128, or EleutherAI/gpt-j-6B (and derived models as PygmalionAI/pygmalion-6b) that has a head dimension of 256 (that actually currently do not dispatch on fused kernels as the head dimension is too large).

This trend can be seen in the figures below, where torch.nn.functional.scaled_dot_product_attention is benchmarked standalone versus the above eager implementation. Moreover, we use the torch.backends.cuda.sdp_kernel context manager to force the usage of, respectively, the math, flash attention, and memory-efficient attention implementations.
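
For reference, a small sketch of forcing a specific backend with that context manager in PyTorch 2.0 (shapes are illustrative only):

import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) — half precision on GPU so fused kernels can dispatch
q = torch.randn(8, 16, 512, 64, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Force the flash attention backend; swap the flags to benchmark math or memory-efficient attention.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)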

Using memory-efficient attention SDP kernel (forward-only), A100

Using math (without dropout), A100

Using flash attention SDP kernel (without dropout), A100

Using memory-efficient attention SDP kernel (without dropout), A100

We see that for the same problem size, be it for inference-only or training, the speedup decreases with higher head dimension, e.g. from 3.4x for headdim=8 to 1.01x for headdim=128 using flash attention kernel.

The reduced memory saving is expected with larger head dimensions. Recall the standard attention computation:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, with Q, K, V of shape N x d

Due to the intermediate computations, the global memory footprint is 2 * N * N + N * d in this standard step by step computation. Memory-efficient attention proposes to iteratively update the softmax renormalization constant and moving its computation at the very end, allowing for only a constant output memory allocation N * d.

Thus, the memory saving ratio is 2 * N / d + 1, which decreases with larger head dimension.
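
Spelling out that ratio from the two footprints given above, in LaTeX:

\[
\text{memory saving ratio} \;=\; \frac{2N^2 + Nd}{Nd} \;=\; \frac{2N}{d} + 1
\]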

In flash attention, the tradeoff is between the head dimension d and the shared memory size M of a GPU streaming multiprocessor, with a total number of memory accesses of O(N² * d²/M). Thus, the memory accesses scale quadratically in the head dimension, contrary to the standard attention that scales linearly. The reason is that in flash attention, for larger head dimension d, the key and value K, V need to be split into more blocks to fit into shared memory, and in turn each block needs to load the full query Q and output O.

Thus, the highest speedups for flash attention are in a regime where the ratio d² / M is small enough.

Current limitations as of PyTorch 2.0.0

Absence of a scale argument

As of PyTorch 2.0.0, torch.nn.functional.scaled_dot_product_attention has no scale argument and uses the default scaling 1 / sqrt(d_k), where d_k is the head dimension.

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

However, some architectures, such as OPT or T5, do not use a scaling in the attention, which as of PyTorch 2.0.0 forces them to artificially rescale before the scaled_dot_product_attention call. This introduces an unnecessary overhead, as an additional multiplication is necessary, on top of unneeded divisions in the attention.

A fix for this issue has been merged in PyTorch repository.

Support of flash attention / memory-efficient attention with custom mask

As of PyTorch 2.0.0, when passing a custom attention mask, flash attention and memory-efficient attention can not be used. In this case, scaled_dot_product_attention automatically dispatches to the C++ implementation.

However, as we have seen, some architectures require a custom attention mask, such as T5, which uses positional bias. Moreover, in the case of a batch size larger than one where some inputs may be padded, a custom attention mask also needs to be passed. For this latter case, an alternative would be to use NestedTensor, which SDPA supports.

This limited support for custom masks thus limits the benefits from SDPA in these specific cases, although we can hope for an extended support in the future.

Note that xformers, from which PyTorch’s SDPA partially takes inspiration, currently supports arbitrary attention masks: https://github.com/facebookresearch/xformers/blob/658ebab39545f180a6075385b3897921623d6c3b/xformers/ops/fmha/cutlass.py#L147-L156 . HazyResearch implementation of flash attention also supports an equivalent implementation of padding, as a cumulative sequence length array is used along with packed query/key/values – similar in essence to NestedTensor.

In conclusion

Using torch.nn.functional.scaled_dot_product_attention is a free-lunch optimization: it makes your code more readable, uses less memory, and is faster in most common cases.

Although the implementation in PyTorch 2.0.0 still has minor limitations, inference and training already massively benefit from SDPA in most cases. We encourage you to use this native implementation, be it to train or deploy your PyTorch models, and for 🤗 Transformers models as a one-line transformation!

In the future, we would like to adapt the API to enable users to use SDPA in encoder-based models as well.

We thank Benjamin Lefaudeux, Daniel Haziza and Francisco Massa for their advice on the head dimension influence, as well as Michael Gschwind, Christian Puhrsch and Driss Guessous for their feedback on the blog post!

Benchmark reproduction

The benchmark presented in this post was done using torch==2.0.0, transformers==4.27.4, accelerate==0.18.0 and optimum==1.8.0.

The benchmarks can be easily reproduced using the scripts for inference, training for 🤗 Transformers models, and standalone SDPA.

Read More

PyTorch Conference 2023: Join us in San Francisco October 16-17

We’re thrilled to announce the upcoming PyTorch Conference 2023! On October 16-17, the conference will showcase PyTorch 2.0, the next-generation release of the popular machine learning framework. As part of the Linux Foundation, the PyTorch Foundation Conference continues the tradition of bringing together leading researchers, developers, and academic communities to advance the education and development of end-to-end machine learning.

The conference agenda features an engaging lineup of events, including an opening reception, community and partner discussions, informative panels, poster sessions, enlightening use cases and community stories, as well as discussions on the latest trends in machine learning and deep learning development and deployment.

Call for Proposals

We are now accepting speaker proposals for the conference until July 21. The program committee will carefully review all submissions, and selected speakers will be notified by August 8. We strongly encourage both experienced and first-time speakers to submit their proposals. This conference provides an excellent opportunity to connect with the PyTorch community, share your ideas, and showcase your work.

When preparing your proposal, please consider the following guidelines:

  • What are you hoping to get from your presentation?
  • What do you expect the audience to gain from your presentation?
  • How will your presentation help better the open source ecosystem?

To help you shape your proposal, here are some suggested topics for the conference:

  • Deployments on AWS, Azure
  • Use cases and real-world applications
  • Foundational models
  • AI practices
  • Production considerations
  • PyTorch 2.X features and updates
  • Training techniques and best practices
  • Inference methodologies
  • Hardware advancements and optimizations
  • Edge computing applications
  • Scalability solutions
  • Latest research breakthroughs
  • Optimization strategies
  • Extending PyTorch through customizations and plugins

We kindly request that you refrain from submitting sales or marketing pitches and avoid discussing unlicensed or closed-source technologies. Such talks tend to detract from the integrity of our events and are not well-received by conference attendees.

Register Today

Registration is now open! Get your ticket today and secure your spot: https://events.linuxfoundation.org/pytorch-conference/register/

Thank you for your interest, and we look forward to a successful PyTorch Conference 2023!

Read More

Language Identification: Building an End-to-End AI Solution using PyTorch

Language Identification is the process of identifying the primary language from multiple audio input samples. In natural language processing (NLP), language identification is an important problem and a challenging issue. There are many language-related tasks such as entering text on your phone, finding news articles you enjoy, or discovering answers to questions that you may have. All these tasks are powered by NLP models. To decide which model to invoke at a particular point in time, we must perform language identification.

This article presents an in-depth solution and code sample for language identification using Intel® Extension for PyTorch, which is a version of the popular PyTorch AI framework optimized for use on Intel® processors, and Intel® Neural Compressor, which is a tool to accelerate AI inference without sacrificing accuracy.

The code sample demonstrates how to train a model to perform language identification using the Hugging Face SpeechBrain* toolkit and optimize it using the Intel® AI Analytics Toolkit (AI Kit). The user can modify the code sample and identify up to 93 languages using the Common Voice dataset.

Proposed Methodology for Language Identification

In the proposed solution, the user will use an Intel AI Analytics Toolkit container environment to train a model and perform inference leveraging Intel-optimized libraries for PyTorch. There is also an option to quantize the trained model with Intel Neural Compressor to speed up inference.

Dataset

The Common Voice dataset is used; for this code sample, specifically Common Voice Corpus 11.0 for Japanese and Swedish. This dataset is used to train an Emphasized Channel Attention, Propagation and Aggregation Time Delay Neural Network (ECAPA-TDNN), which is implemented using the Hugging Face SpeechBrain library. Time Delay Neural Networks (TDNNs), aka one-dimensional Convolutional Neural Networks (1D CNNs), are multilayer artificial neural network architectures used to classify patterns with shift invariance and model context at each layer of the network. ECAPA-TDNN is a new TDNN-based speaker-embedding extractor for speaker verification; it is built upon the original x-vector architecture and puts more emphasis on channel attention, propagation, and aggregation.

Implementation

After downloading the Common Voice dataset, the data is preprocessed by converting the MP3 files into WAV format to avoid information loss and separated into training, validation, and testing sets.

A pretrained VoxLingua107 model is retrained with the Common Voice dataset using the Hugging Face SpeechBrain library to focus on the languages of interest. VoxLingua107 is a speech dataset used for training spoken language recognition models that work well with real-world and varying speech data. This dataset contains data for 107 languages. By default, Japanese and Swedish are used, and more languages can be included. This model is then used for inference on the testing dataset or a user-specified dataset. Also, there is an option to utilize SpeechBrain’s Voice Activity Detection (VAD) where only the speech segments from the audio files are extracted and combined before samples are randomly selected as input into the model. This link provides all the necessary tools to perform VAD. To improve performance, the user may quantize the trained model to integer-8 (INT8) using Intel Neural Compressor to decrease latency.

Training

Copies of the training scripts are added to the current working directory: create_wds_shards.py creates the WebDataset shards, train.py performs the actual training procedure, and train_ecapa.yaml configures the training options. The script to create WebDataset shards and the YAML file are patched to work with the two languages chosen for this code sample.

In the data preprocessing phase, the prepareAllCommonVoice.py script is executed to randomly select a specified number of samples and convert the input from MP3 to WAV format. Here, 80% of these samples will be used for training, 10% for validation, and 10% for testing. At least 2,000 input samples are recommended; this is also the default value.

In the next step, WebDataset shards are created from the training and validation datasets. This stores the audio files as tar files which allows writing purely sequential I/O pipelines for large-scale deep learning in order to achieve high I/O rates from local storage—about 3x-10x faster compared to random access.
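
As a rough illustration of this step (not the actual create_wds_shards.py implementation; file names, keys, and the sample list are hypothetical), shards could be written with the webdataset ShardWriter:

import webdataset as wds

# hypothetical list of (wav_path, language_label) pairs prepared in the previous step
samples = [("commonvoice_wav/ja_0001.wav", "ja"), ("commonvoice_wav/sv_0001.wav", "sv")]

with wds.ShardWriter("train-shards/shard-%06d.tar", maxcount=1000) as sink:
    for i, (wav_path, lang_label) in enumerate(samples):
        with open(wav_path, "rb") as f:
            # each record becomes a set of files inside the tar, read back sequentially at training time
            sink.write({"__key__": f"sample{i:08d}", "wav": f.read(), "cls": lang_label})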

The user then modifies the YAML file to set the largest WebDataset shard number, the number of output neurons (equal to the number of languages of interest), the number of epochs to train over the entire dataset, and the batch size. The batch size should be decreased if the CPU or GPU runs out of memory while running the training script.

In this code sample, the training script is executed on the CPU, so "cpu" is passed as an input parameter when running the script. The configurations defined in train_ecapa.yaml are also passed as parameters.

The command to run the script to train the model is:

python train.py train_ecapa.yaml --device "cpu"

In the future, the training script train.py will be designed to work for Intel® GPUs such as the Intel® Data Center GPU Flex Series, Intel® Data Center GPU Max Series, and Intel® Arc™ A-Series with updates from Intel Extension for PyTorch.

Run the training script to learn how to train the models and execute the training script. The 4th Generation Intel® Xeon® Scalable Processor is recommended for this transfer learning application because of its performance improvements through its Intel® Advanced Matrix Extensions (Intel® AMX) instruction set.

After training, checkpoint files are available. These files are used to load the model for inference.

Inference

Inference Pipeline

The crucial step before running inference is to patch the SpeechBrain library’s pretrained interfaces.py file so that PyTorch TorchScript* can be run to improve the runtime. TorchScript requires the output of the model to be only tensors.

Users can choose to run inference using the testing set from Common Voice or their own custom data in WAV format. The following are the options the inference scripts (inference_custom.py and inference_commonVoice.py) can be run with:

Input Option              Description
-p                        Specify the data path.
-d                        Specify the duration of each wave sample. The default value is 3.
-s                        Specify the number of sample waves. The default value is 100.
--vad                     (inference_custom.py only) Enable the VAD model to detect active speech. The VAD option will identify speech segments in the audio file and construct a new .wav file containing only the speech segments. This improves the quality of speech data used as input into the language identification model.
--ipex                    Run inference with optimizations from Intel Extension for PyTorch. This option will apply optimizations to the pretrained model. Using this option should result in performance improvements related to latency.
--ground_truth_compare    (inference_custom.py only) Enable comparison of prediction labels to ground truth values.
--verbose                 Print additional debug information, like latency.

The path to the data must be specified. By default, 100 randomly selected 3-second audio samples from the original audio file are used as input to the language identification model.

A small Convolutional Recurrent Deep Neural Network (CRDNN) pretrained on the LibriParty dataset is used to process audio samples and output the segments where speech activity is detected. This can be used in inference with the --vad option.

As shown in the figure below, the CRDNN model outputs the timestamps where speech is detected, and these timestamps are used to construct a new, shorter audio file containing only speech. Sampling from this new audio file gives a better prediction of the primary language spoken.
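
Conceptually, the --vad path does something like the following sketch, which uses SpeechBrain’s pretrained LibriParty CRDNN VAD; the file names here are placeholders and the actual inference_custom.py may differ in detail.

import torch
import torchaudio
from speechbrain.pretrained import VAD

# Pretrained CRDNN VAD trained on the LibriParty dataset
vad = VAD.from_hparams(source="speechbrain/vad-crdnn-libriparty",
                       savedir="pretrained_models/vad-crdnn-libriparty")

# Start/end times (in seconds) of the detected speech segments
boundaries = vad.get_speech_segments("sample.wav")

# Keep only the speech segments and write them out as a new, shorter WAV file
signal, sample_rate = torchaudio.load("sample.wav")
segments = [signal[:, int(start * sample_rate):int(end * sample_rate)]
            for start, end in boundaries.tolist()]
torchaudio.save("sample_speech_only.wav", torch.cat(segments, dim=1), sample_rate)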

Audio wave file visualization

Run the inference script yourself. An example command for running inference:

python inference_custom.py -p data_custom -d 3 -s 50 --vad

This will run inference on data you provide located inside the data_custom folder. This command performs inference on 50 randomly selected 3-second audio samples with voice activity detection.

If you want to run the code sample for other languages, download the Common Voice Corpus 11.0 datasets for those languages.

Optimizations with Intel Extension for PyTorch and Intel Neural Compressor

PyTorch

The Intel extension expands PyTorch with up-to-date features and optimizations for an extra performance boost on Intel hardware. Check out how to install Intel Extension for PyTorch. The extension can be loaded as a Python module or linked as a C++ library. Python users can enable it dynamically by importing intel_extension_for_pytorch.

  • The CPU tutorial gives detailed information about Intel Extension for PyTorch for Intel CPUs. Source code is available at the master branch.
  • The GPU tutorial gives detailed information about Intel Extension for PyTorch for Intel GPUs. Source code is available at the xpu-master branch.

To optimize the model for inference using Intel Extension for PyTorch, the --ipex option can be passed in. The model is optimized using the plug-in, and TorchScript speeds up inference by running PyTorch in graph mode. The command to run with this optimization is:

python inference_custom.py -p data_custom -d 3 -s 50 --vad --ipex --verbose

Note: The --verbose option is required to view the latency measurements.
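
Under the hood, the --ipex path amounts to roughly the following sketch (names such as example_signal are placeholders, and the sample’s actual code may differ):

import torch
import intel_extension_for_pytorch as ipex

def optimize_for_inference(model, example_signal):
    model.eval()
    model = ipex.optimize(model)  # apply Intel Extension for PyTorch optimizations
    with torch.no_grad():
        # TorchScript runs PyTorch in graph mode, which speeds up inference
        traced = torch.jit.trace(model, example_signal, strict=False)
        traced = torch.jit.freeze(traced)
    return traced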

Auto-mixed precision such as bfloat16 (BF16) support will be added in a future release of the code sample.

Intel Neural Compressor

This is an open-source Python library that runs on CPUs or GPUs, which:

  • Performs model quantization to reduce the model size and increase the speed of deep learning inference for deployment.
  • Automates popular methods such as quantization, compression, pruning, and knowledge distillation across multiple deep-learning frameworks.
  • Is part of the AI Kit

The model can be quantized from float32 (FP32) precision to integer-8 (INT8) by running the quantize_model.py script while passing in the path to the model and a validation dataset. The following code can be used to load this INT8 model for inference:

from neural_compressor.utils.pytorch import load

# The original FP32 model (self.language_id) is required to load the INT8 weights
self.model_int8 = load("./lang_id_commonvoice_model_INT8", self.language_id)
signal = self.language_id.load_audio(data_path)
prediction = self.model_int8(signal)

Note that the original model is required when loading the quantized model. The command to quantize the trained model from FP32 to INT8 by using quantize_model.py is:

python quantize_model.py -p ./lang_id_commonvoice_model -datapath $COMMON_VOICE_PATH/commonVoiceData/commonVoice/dev
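
For reference, a post-training quantization flow with Intel Neural Compressor generally looks something like the sketch below; the configuration and calibration details here are assumptions, and the real quantize_model.py may differ.

from neural_compressor import PostTrainingQuantConfig, quantization

def quantize_to_int8(model, calib_dataloader, out_dir="./lang_id_commonvoice_model_INT8"):
    conf = PostTrainingQuantConfig()  # default post-training static quantization
    q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
    # The saved artifacts are later loaded with neural_compressor.utils.pytorch.load
    q_model.save(out_dir)
    return q_model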

What’s Next?

Try out the above code sample by upgrading the hardware to a 4th Generation Intel Xeon Scalable Processor with Intel AMX and identify up to 93 different languages from Common Voice datasets.

We encourage you to learn more about and incorporate Intel’s other AI/ML Framework optimizations and end-to-end portfolio of tools into your AI workflow. Also, visit the AI & ML page covering Intel’s AI software development resources for preparing, building, deploying, and scaling your AI solutions.

For more details about the new 4th Gen Intel Xeon Scalable processors, visit Intel’s AI Solution Platform portal where you can learn how Intel is empowering developers to run end-to-end AI pipelines on these powerful CPUs.

Useful resources

Explore more AI code samples

See all code samples

Read More

Announcing PyTorch Docathon 2023


We are excited to announce the first ever PyTorch Docathon! The Docathon is a hackathon-style event focused on improving the documentation by enlisting the help of the community. Documentation is a crucial aspect of any technology and by improving the documentation, we can make it easier for users to get started with PyTorch, help them understand how to use its features effectively, and ultimately accelerate research to production in the field of machine learning.

WHY PARTICIPATE

Low Barrier to Entry

Many open-source projects require extensive knowledge of the codebase and prior contributions to participate in any sort of hackathon event. The Docathon, on the other hand, is designed for newcomers. We do expect familiarity with Python and basic knowledge of PyTorch and ML. But don’t fret, there are some tasks related to website issues that won’t require even that.

Tangible Results

One of the best things about the Docathon is that you can see the results of your efforts in real time. Improving documentation can have a huge impact on a project’s usability and accessibility and you’ll be able to see those improvements firsthand. Plus having tangible results can be a great motivator to keep contributing.

Collaborative Environment

The Docathon is a collaborative event which means you’ll have the opportunity to work with other contributors and PyTorch maintainers on improving the documentation. This can be a great way to learn from others, share ideas, and build connections.

Learning Opportunities

Finally, even if you are not an expert in PyTorch, the Docathon can be a great learning experience. You’ll have the opportunity to explore the PyTorch modules and test some of the tutorials on your machine as well as in the CI.

EVENT DETAILS

  • May 31: Kick-off
  • May 31 – June 11: Submissions and Feedback
  • June 12 – June 13: Final Reviews
  • June 15: Winner Announcements

Details for the Docathon will be announced at the kick-off stream on May 31.

Please register to join this year’s event: RSVP

Read More

Accelerated Image Segmentation using PyTorch

Using Intel® Extension for PyTorch to Boost Image Processing Performance

PyTorch delivers great CPU performance, and it can be further accelerated with Intel® Extension for PyTorch. I trained an AI image segmentation model using PyTorch 1.13.1 (with ResNet34 + UNet architecture) to identify roads and speed limits from satellite images, all on the 4th Gen Intel® Xeon® Scalable processor.

I will walk you through the steps to work with a satellite image dataset called SpaceNet5 and how I optimized the code to make deep learning workloads feasible on CPUs just by flipping a few key switches.

Before we get started, some housekeeping…

The code accompanying this article is available in the examples folder in the Intel Extension for PyTorch repository. I borrowed heavily from the City-Scale Road Extraction from Satellite Imagery (CRESI) repository. I adapted it for the 4th Gen Intel Xeon processors with PyTorch optimizations and Intel Extension for PyTorch optimizations. In particular, I was able to piece together a workflow using the notebooks here.

You can find the accompanying talk I gave on YouTube.

I also highly recommend these articles for a detailed explanation of how to get started with the SpaceNet5 data:

I referenced two Hugging Face blogs by Julien Simon; he ran his tests on the AWS instance r7iz.metal-16xl:

The potential cost savings from using a CPU instance instead of a GPU instance on the major cloud service providers (CSP) can be significant. The latest processors are still being rolled out to the CSPs, so I’m using a 4th Gen Intel Xeon processor that is hosted on the Intel® Developer Cloud (you can sign up for the Beta here: cloud.intel.com).

On AWS, you can select from the r7iz.* EC2 instances after you sign up for the preview here (Figure 1). At the time of writing, the new AI-acceleration engine, Intel® Advanced Matrix Extensions (Intel® AMX), is only available on bare metal but it should soon be enabled on the virtual machines.

Figure 1. List of 4th Gen Xeon instances on AWS EC2 (image by author)

On Google Cloud* Platform, you can select from the 4th Gen Xeon Scalable processors C3 VMs (Figure 2).

Figure 2. List of 4th Gen Intel Xeon Scalable processor instances on Google Cloud Platform (image by author)

Hardware Introduction and Optimizations

The 4th Gen Intel Xeon processors were released in January 2023, and the bare-metal instance I am using has two sockets (each with 56 physical cores), 504 GB of memory, and Intel AMX acceleration. I installed a few key libraries in the backend to control and monitor the sockets, memory, and cores that I am using on the CPU:

numactl (with sudo apt-get install numactl)

libjemalloc-dev (with sudo apt-get install libjemalloc-dev)

intel-openmp (with conda install intel-openmp)

gperftools (with conda install gperftools -c conda-forge)

Both PyTorch and Intel Extension for PyTorch have helper scripts so that one does not need to explicitly use intel-openmp and numactl, but they do need to be installed in the backend. In case you want to set them up for other work, here is what I used for OpenMP* …

export OMP_NUM_THREADS=36
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

… where OMP_NUM_THREADS is the number of threads allocated to the job, KMP_AFFINITY controls thread affinity settings (including how threads are packed close to each other and whether threads are pinned), and KMP_BLOCKTIME sets the time in milliseconds that an idle thread should wait before going to sleep.

Here’s what I used for numactl

numactl -C 0-35 --membind=0 train.py

…where -C specifies which cores to use and --membind instructs the program to only use one socket (socket 0 in this case).

SpaceNet Data

I am using a satellite image dataset from the SpaceNet 5 Challenge. Different cities can be downloaded for free from an AWS S3 bucket:

aws s3 ls s3://spacenet-dataset/spacenet/SN5_roads/tarballs/ --human-readable
2019-09-03 20:59:32    5.8 GiB SN5_roads_test_public_AOI_7_Moscow.tar.gz
2019-09-24 08:43:02    3.2 GiB SN5_roads_test_public_AOI_8_Mumbai.tar.gz
2019-09-24 08:43:47    4.9 GiB SN5_roads_test_public_AOI_9_San_Juan.tar.gz
2019-09-14 13:13:26   35.0 GiB SN5_roads_train_AOI_7_Moscow.tar.gz
2019-09-14 13:13:34   18.5 GiB SN5_roads_train_AOI_8_Mumbai.tar.gz

You can use the following commands to download and unpack a file:

aws s3 cp s3://spacenet-dataset/spacenet/SN5_roads/tarballs/SN5_roads_train_AOI_7_Moscow.tar.gz .
tar -xvzf ~/spacenet5data/moscow/SN5_roads_train_AOI_7_Moscow.tar.gz

Dataset Preparation

I used the Moscow satellite image dataset, which consists of 1,352 images of 1,300 by 1,300 pixels with corresponding street labels in separate text files. The dataset contains both 8-band multispectral images and 3-band RGB images. Figure 3 shows four sample RGB satellite images and their corresponding generated masks. I used the speed_masks.py script from the CRESI repository to generate the segmentation masks.

Figure 3. Satellite image 3-channel RGB chips from Moscow (top row) and corresponding pixel segmentation masks with varying speed limits (bottom row) (image by author)

There is a JSON configuration file that must be updated for all remaining components: training/validation split, training, and inference. An example configuration can be found here. I perform an 80:20 training/validation split, making sure to point to the correct folder of satellite images and corresponding masks for training. The configuration parameters are explained in more detail in the notebook under examples in the Intel Extension for PyTorch GitHub repository here.

Training a ResNet34 + UNet Model

I made some changes to the cresi code described below in order to run on a CPU and optimize the training. To run natively on a CPU, replace self.model = nn.DataParallel(model).cuda() with self.model = nn.DataParallel(model) in the train.py script. In the 01_train.py script, remove torch.randn(10).cuda().

To optimize training, add import intel_extension_for_pytorch as ipex to the import statements in the train.py script. Then, just after the model and optimizer are defined as follows:

self.model = nn.DataParallel(model)
self.optimizer = optimizer(self.model.parameters(), lr=config.lr)

Add the ipex.optimize line to use BF16 precision, instead of FP32:

self.model, self.optimizer = ipex.optimize(self.model,
    optimizer=self.optimizer, dtype=torch.bfloat16)

Add a line to do mixed-precision training just before running a forward pass and calculating the loss function:

with torch.cpu.amp.autocast():
    if verbose:
        print("input.shape, target.shape:", input.shape, target.shape)
    output = self.model(input)
    meter = self.calculate_loss_single_channel(output, target, meter, training, iter_size)

Now that we have optimized our training code, we can move onto training our model.

Like the winner of the SpaceNet 5 competition, I trained a ResNet34 encoder + UNet decoder model. The encoder is initialized with ImageNet-pretrained weights, and the backbone is left completely unfrozen during training. The training can be run with the 01_train.py script, but in order to control the use of hardware I used a helper script. There are actually two helper scripts: one that comes with stock PyTorch and one that comes with Intel Extension for PyTorch. They both accomplish the same thing; the stock PyTorch script is torch.backends.xeon.run_cpu, and the Intel Extension for PyTorch script is ipexrun.

Here is what I ran in the command-line:

python -m torch.backends.xeon.run_cpu --ninstances 1 \
  --ncores_per_instance 32 \
  --log_path /home/devcloud/spacenet5data/moscow/v10_xeon4_devcloud22.04/logs/run_cpu_logs \
  /home/devcloud/cresi/cresi/01_train.py \
  /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json --fold=0

ipexrun --ninstances 1 \
  --ncore_per_instance 32 \
  /home/devcloud/cresi/cresi/01_train.py \
  /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json --fold=0

In both cases, I am asking PyTorch to run training on one socket with 32 cores. Upon running, I get a printout of what environment variables get set in the backend to understand how PyTorch is using the hardware:

INFO - Use TCMalloc memory allocator
INFO - OMP_NUM_THREADS=32
INFO - Using Intel OpenMP
INFO - KMP_AFFINITY=granularity=fine,compact,1,0
INFO - KMP_BLOCKTIME=1
INFO - LD_PRELOAD=/home/devcloud/.conda/envs/py39/lib/libiomp5.so:/home/devcloud/.conda/envs/py39/lib/libtcmalloc.so
INFO - numactl -C 0-31 -m 0 /home/devcloud/.conda/envs/py39/bin/python -u 01_train.py configs/ben/v10_xeon4_baseline_ben.json --fold=0

During training, I make sure that my total loss function is decreasing (i.e., the model is converging on a solution).

Inference

After training a model, we can start to make predictions from satellite images alone. In the eval.py inference script, add import intel_extension_for_pytorch as ipex to the import statements. After loading the PyTorch model, use Intel Extension for PyTorch to optimize the model for BF16 inference:

model = torch.load(
    os.path.join(path_model_weights, 'fold{}_best.pth'.format(fold)),
    map_location=lambda storage, loc: storage)
model.eval()
model = ipex.optimize(model, dtype=torch.bfloat16)

Just prior to running prediction, add two lines for mixed precision:

with torch.no_grad():
    with torch.cpu.amp.autocast():
        for data in pbar:
            samples = torch.autograd.Variable(data['image'], volatile=True)
            predicted = predict(model, samples, flips=self.flips)

To run inference, we can use the 02_eval.py script. Now that we have a trained model, we can make predictions on satellite images (Figure 4). We can see that it does seem to map the roads closely to the image!

Moscow satellite image and accompanying prediction of roads

Figure 4. Moscow satellite image and accompanying prediction of roads (image by author)

I realize that the model I’ve trained is overfit to the Moscow image data and probably won’t generalize well to other cities. However, the winning solution to this challenge used data from six cities (Las Vegas, Paris, Shanghai, Khartoum, Moscow, Mumbai) and performs well on new cities. In the future, one thing that would be worth testing is training on all six cities and running inference on another city to reproduce their results.

Note on Post-Processing

There are further post-processing steps that can be performed to add the mask as graph features to maps. You can read more about the post-processing steps here:

The SpaceNet 5 Baseline — Part 3: Extracting Road Speed Vectors from Satellite Imagery

Post-processing scripts

Conclusions

In summary, we:

  • Created 1,352 image training masks (with speed limits) to correspond to our training satellite image data (from .geojson text file labels)
  • Defined our configuration file for training and inference
  • Split up our data into training and validation sets
  • Optimized our code for CPU training, including using Intel Extension for PyTorch and BF16
  • Trained a performant ResNet34 + UNet model on a 4th Gen Intel Xeon CPU
  • Ran initial inference to see the prediction of a speed limit mask

You can find detailed benchmarks for the 4th Gen Intel Xeon CPU here.

Next Steps

Extend the optimizations on an Intel CPU by using the Intel Extension for PyTorch:

pip install intel-extension-for-pytorch

git clone https://github.com/intel/intel-extension-for-pytorch

Get in touch with me on LinkedIn if you have any more questions!

More information about the Intel Extension for PyTorch can be found here.

Get the Software

I encourage you to check out Intel’s other AI Tools and Framework optimizations and learn about the open, standards-based oneAPI multiarchitecture, multivendor programming model that forms the foundation of Intel’s AI software portfolio.

For more details about 4th Gen Intel Xeon Scalable processor, visit AI Platform where you can learn about how Intel is empowering developers to run high-performance, efficient end-to-end AI pipelines.

PyTorch Resources

Read More

Introducing Hidet: A Deep Learning Compiler for Efficient Model Serving

Hidet is a powerful deep learning compiler that simplifies the process of implementing high-performing deep learning operators on modern accelerators (e.g., NVIDIA GPUs). With the new feature of torch.compile(...) in PyTorch 2.0, integrating a novel compiler into PyTorch is easier than ever – Hidet now can be used as a torch.compile(...) backend to accelerate PyTorch models, making it an attractive option for PyTorch users who want to improve the inference performance of their models, especially for those who also need to implement extremely optimized custom operators.

Using Hidet to Compile A PyTorch Model

To use Hidet in PyTorch, you need to first install the hidet package via pip:

pip install hidet

Hidet is integrated with PyTorch as a torch.compile(...) backend following the Custom Backends tutorial. You can specify hidet as the backend when you compile a model (note: this requires PyTorch 2.0 or later):

torch.compile(..., backend='hidet')

Hidet converts the given PyTorch model in the torch.fx.Graph format into its internal graph representation and conducts a series of optimizations. Hidet provides a few options to configure the optimizations. For example, we can use hidet.torch.dynamo_config.use_tensor_core(True) to allow Hidet to generate CUDA kernels that leverage the Tensor Cores on NVIDIA GPUs, and use hidet.torch.dynamo_config.search_space(2) to allow Hidet to search for the best operator schedule specific to your hardware and input sizes. More configurations can be found in Hidet’s documentation.

Here’s a complete example of how to use Hidet to compile and optimize a pre-trained ResNet50 model from torchvision:

import hidet
import torch

# Load a pre-trained ResNet50 model
x = torch.randn(1, 3, 224, 224, device='cuda').half()
model = torch.hub.load(
    'pytorch/vision:v0.6.0', 'resnet50', pretrained=True
).cuda().half().eval()

# Configure hidet to use tensor core and enable tuning
hidet.torch.dynamo_config.use_tensor_core(True)
hidet.torch.dynamo_config.search_space(2) 

# Compile the model using Hidet
model_opt = torch.compile(model, backend='hidet')

# Check correctness
torch.testing.assert_close(actual=model_opt(x), expected=model(x), rtol=1e-2, atol=1e-2)

# Benchmark
from hidet.utils import benchmark_func
print('eager: {:2f}'.format(benchmark_func(lambda: model(x))))
print('hidet: {:2f}'.format(benchmark_func(lambda: model_opt(x))))

We encourage you to try out the above script on your own NVIDIA GPU(s)! If you run this script on an AWS g5.2xlarge instance, you should get results like those shown in the following figure. Hidet achieves the speedup because it can automatically fuse multiple operators, tune operator schedules, and use CUDA Graphs to reduce framework-level overhead. More results can be found in the ASPLOS’23 publication of Hidet (vs. PyTorch 1.11) and our performance tracking (vs. PyTorch 2.0).

Eager vs Hidet latency

Using Hidet Script to Write Custom Operators

Hidet Script is one approach to implementing tensor operators in Python. The following example shows how to implement a naive matrix multiplication using Hidet Script and integrate it as a PyTorch operator.

import torch
import hidet


def matmul(m_size, n_size, k_size):
    from hidet.lang import f32, attr
    from hidet.lang.cuda import threadIdx, blockIdx, blockDim

    with hidet.script_module() as script_module:
        @hidet.script
        def matmul(
            a: f32[m_size, k_size],
            b: f32[k_size, n_size],
            c: f32[m_size, n_size]
        ):
            attr.cuda_grid_dim = ((m_size + 31) // 32, (n_size + 31) // 32)
            attr.cuda_block_dim = (32, 32)
            i = threadIdx.x + blockIdx.x * blockDim.x
            j = threadIdx.y + blockIdx.y * blockDim.y
            if i < m_size and j < n_size:
                c[i, j] = 0.0
                for k in range(k_size):
                    c[i, j] += a[i, k] * b[k, j]

    ir_module = script_module.ir_module()
    func = hidet.driver.build_ir_module(ir_module)
    return func


class NaiveMatmul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        m, k = a.shape
        k, n = b.shape
        c = torch.empty([m, n], dtype=a.dtype, device=a.device)
        func = matmul(m, n, k)
        func(a, b, c)
        return c


a = torch.randn([3, 4], device='cuda')
b = torch.randn([4, 5], device='cuda')
c = NaiveMatmul.apply(a, b)
cc = torch.matmul(a, b)
torch.testing.assert_close(c, cc)

More optimizations can be applied, see the example in our documentation to learn more.

Hidet Script vs. Triton: Triton greatly simplifies the CUDA programming by introducing the tile-based programming model where the parallel execution unit is thread blocks instead of threads. However, this simplification also prevents the tensor program developers from manipulating the fine-grained computation and memory resources (e.g., warps, shared memory) in their preferred ways. It would be challenging to implement an optimization that requires fine-grained control of these resources using Triton if it has not been implemented by the Triton compiler itself. Hidet Script, on the other hand, simplifies tensor programming while still enabling users to implement their own optimizations with extensive flexibility. It’s worth noting that the more granular control of Hidet Script also brings added complexity compared to Triton.

More about Hidet

Hidet originates from a research project led by the EcoSystem lab at the University of Toronto (UofT) and AWS. The authors propose a new way, named the task-mapping programming paradigm, to construct tensor programs. It aims to simplify tensor programming without sacrificing any optimization opportunities. Hidet is now an open-source project, jointly supported by CentML and the EcoSystem lab, that aims to provide an efficient solution for end-to-end inference on modern accelerators (e.g., NVIDIA GPUs).

Additional Resources

Acknowledgement

We would like to thank Jerry Park, Mark Saroufim, Jason Liang and Helen Suk for their valuable help on preparing the blog post and feedback on the text. We also would like to thank Nikita Shulga, Jason Ansel, and Dmytro Dzhulgakov for reviewing and improving our PyTorch PR73873 on the 3rd-party dynamo backend registration.

Read More