ExecuTorch Beta: On-Device AI and LLMs, Stability, and Acceleration with Partners

TLDR

  • ExecuTorch has achieved Beta status with the release of v0.4, providing stable APIs and runtime, as well as extensive kernel coverage.
  • ExecuTorch is the recommended on-device inference engine for Llama 3.2 1B/3B models, offering enhanced performance and memory efficiency for both original and quantized models.
  • There has been a significant increase in adoption and ecosystem growth for ExecuTorch, and the focus is now on improving reliability, performance, and coverage for non-CPU backends as the next steps.

Current On-Device AI Market

The on-device AI market has been rapidly expanding and is revolutionizing the way we interact with technology. It is unlocking new experiences, enabling personalization, and reducing latency. Traditionally, computer vision and speech recognition have been the primary use cases for on-device AI, particularly in IoT, industrial applications, and mobile devices. However, the emergence of Large Language Models (LLMs) has made Generative AI the fastest growing sector in AI, highlighting the importance of on-device Generative AI. IDC forecasts that close to 1 billion GenAI-capable smartphones will be shipped worldwide by 2028.

LLMs are not only getting smaller but more powerful. This has led to the creation of a new class of applications that leverage multiple models for intelligent agents and streamlined workflows. The community is rapidly adopting and contributing to these new models, with quantized versions being created within hours of model release. Several leading technology companies are investing heavily in small LLMs, even deploying Low-Rank Adaptation (LoRA) at scale on-device to transform user experiences.

However, this rapid progress comes at a cost. The fragmentation of our on-device AI landscape creates complexity and inefficiency when going from model authoring to edge deployment. This is where PyTorch’s ExecuTorch comes in – our Beta announcement marks an important milestone in addressing these challenges and empowering developers to create innovative, AI-powered applications.

What’s New Today

It’s been exactly one year since we first open sourced ExecuTorch, six months since Alpha release, and today, we’re excited to announce three main developments:

1. Beta. ExecuTorch has reached Beta status starting from v0.4! It is now widely adopted and used in production environments across Meta. Through this adoption process we’ve identified and addressed feature gaps, improved stability, and expanded kernel and accelerator coverage. These improvements make us confident to promote ExecuTorch from Alpha to Beta status, and we are happy to welcome the community to adopt it in their own production settings. Here are three concrete enhancements:

  1. Developers can write application code and include the latest ExecuTorch as a dependency, updating when needed with a clean API contract. This is possible due to our API stabilization efforts, as well as our explicit API lifecycle and backwards compatibility policy.
  2. Running ExecuTorch on CPUs has reached the necessary performance, portability and coverage. In particular, we have implemented more than 85% of all core ATen operators in our portable CPU kernels library, so that running a model on ExecuTorch just works in most cases and missing ops are the exception rather than the norm. Moreover, we integrated and extensively tested our XNNPACK delegate for high performance on a wide range of CPU architectures; it is used in a number of production cases today.
  3. In addition to the low-level ExecuTorch components for greater portability, we built extensions and higher-level abstractions for common use cases, such as developer tooling for on-device debugging and profiling, and the Module.h extension to simplify deployment on mobile devices.

2. On-Device Large-Language Models (LLMs). There has been a growing interest in the community to deploy Large Language Models (LLMs) on edge devices, as it offers improved privacy and offline capabilities. However, these models are quite large, pushing the limits of what is possible. Fortunately, ExecuTorch can support these models, and we’ve enhanced the overall framework with numerous optimizations.

  • ExecuTorch is the recommended framework to run the latest Llama models on-device with excellent performance today. The Llama 3.2 1B/3B models are well-suited for mobile deployment, especially the official quantized 1B/3B releases from Meta, which provide a great balance between performance, accuracy, and size. When deploying the Llama 3.2 1B/3B quantized models, decode latency improved by 2.5x and prefill latency improved by 4.2x on average, while model size decreased by 56% and memory usage was reduced by 41% on average, benchmarked on an Android OnePlus 12 device (we’ve also verified similar relative performance on the Samsung S24+ for 1B and 3B, and the Samsung S22 for 1B). For the Llama 3.2 1B quantized model, for example, ExecuTorch achieves 50.2 tokens/s for decode and 260 tokens/s for prefill on the OnePlus 12, using the latest CPU kernels from the XNNPACK and Kleidi libraries. These quantized models allow developers to integrate LLMs into memory- and power-constrained devices while still maintaining quality and safety.
  • One of the value propositions of ExecuTorch is being able to use accelerators on mobile devices seamlessly. ExecuTorch has also showcased Llama running with even greater performance on accelerators, including the Apple MPS backend, the Qualcomm AI accelerator, and the MediaTek AI accelerator.
  • There has been growing community and industry interest in multimodal LLMs that go beyond text, evidenced by Meta’s Llama 3.2 11B/90B vision models and open-source models like Llava. So far we have enabled the Llava 1.5 7B model on phones via ExecuTorch, making many optimizations, notably reducing runtime memory from 11GB all the way down to 5GB.

3. Ecosystem and Community Adoption
Now that ExecuTorch is in Beta, it is mature enough to be used in production. It is being increasingly used at Meta across various product surfaces. For instance, ExecuTorch already powers various ML inference use cases across Meta’s Ray-Ban Meta Smart Glasses and Quest 3 VR headsets as well as Instagram and WhatsApp.

We also partnered with Hugging Face to provide native ExecuTorch support for models exported using torch.export. This collaboration ensures exported artifacts can be directly lowered and run efficiently on various mobile and edge devices. Models like gemma-2b and phi3-mini are already supported, and support for more foundation models is in progress.
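
To make this flow concrete, here is a minimal sketch of capturing a PyTorch module with torch.export and converting it to an ExecuTorch .pte program, based on the publicly documented executorch.exir APIs; the tiny Linear model and file name are placeholders, and backend-specific lowering (for example to the XNNPACK delegate) is omitted.

import torch
from torch.export import export
from executorch.exir import to_edge

# A stand-in model; any eager nn.Module with example inputs works the same way.
model = torch.nn.Linear(4, 2).eval()
example_inputs = (torch.randn(1, 4),)

# 1) Capture the model with torch.export, 2) convert it to the Edge dialect,
# 3) serialize an ExecuTorch program that the on-device runtime can load.
exported_program = export(model, example_inputs)
executorch_program = to_edge(exported_program).to_executorch()

with open("model.pte", "wb") as f:
    f.write(executorch_program.buffer)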

With stable APIs and GenAI support, we’re excited to build and grow ExecuTorch with the community. The on-device AI community is growing rapidly and finding ways to adopt ExecuTorch across various fields. For instance, ExecuTorch is being used in a mobile app built by Digica to streamline inventory management in hospitals. As another example, Software Mansion developed EraserAI, an app that removes unwanted objects from a photo with EfficientSAM running on-device with ExecuTorch via the Core ML delegate.

Towards General Availability (GA):
Since the original release of ExecuTorch alpha, we’ve seen growing interest within the community in using ExecuTorch in various production environments. To that end, we have made great progress towards more stable and mature APIs, and have made a significant investment in community support, adoption, and contributions to ExecuTorch. As we get closer to GA, we are investing our efforts in the following areas:

  • Non-CPU backends: Bringing non-CPU backends to even greater robustness, coverage and performance is our next goal. From day one of our original launch, we have partnered with Apple (for Core ML and MPS), Arm (for the Ethos-U NPU) and Qualcomm (for the Hexagon NPU) on accelerator integration with ExecuTorch, and we have since expanded our partnerships to MediaTek (NPU) and Cadence (Xtensa DSP). We’re also building Vulkan GPU integration in-house. In terms of feature coverage, we’ve implemented the core functionalities with our partners, ensured seamless integration with our developer tooling, and showcased successful LLM integration with many of the accelerators. Our next big step is to thoroughly validate the performance and reliability of the system in real-world, production use cases. This stage will help us fine-tune the experience and ensure the stability needed for smooth operations.

  • Benchmarking infra: As part of our ongoing testing efforts, we’ve developed a benchmarking infrastructure along with a public dashboard to showcase our progress toward on-device model inference benchmarking. This allows us to transparently track and display model coverage across various backends, giving our community real-time insights into how we’re advancing towards our goals.

We’re excited to share these developments with you and look forward to continued improvements in collaboration with our partners and the community! We welcome community contributions to help us make ExecuTorch the clear choice for deploying AI and LLM models on-device. We invite you to start using ExecuTorch in your on-device projects, or, even better, consider contributing to it. You can also report any issues on our GitHub page.

Read More

PyTorch 2.5 Release Blog

We are excited to announce the release of PyTorch® 2.5 (release note)! This release features a new cuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100 or newer GPUs. In addition, regional compilation of torch.compile offers a way to reduce the cold start-up time for torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in an LLM) without recompilations. Finally, the TorchInductor CPP backend offers solid performance speedups with numerous enhancements like FP16 support, the CPP wrapper, AOT-Inductor mode, and max-autotune mode.

This release is composed of 4095 commits from 504 contributors since PyTorch 2.4. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.5. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

As well, please check out our new ecosystem projects releases with TorchRec and TorchFix.

Beta

  • CuDNN backend for SDPA
  • torch.compile regional compilation without recompilations
  • TorchDynamo added support for exception handling & MutableMapping types
  • TorchInductor CPU backend optimization

Prototype

  • FlexAttention
  • Compiled Autograd
  • Flight Recorder
  • Max-autotune Support on CPU with GEMM Template
  • TorchInductor on Windows
  • FP16 support on CPU path for both eager mode and TorchInductor CPP backend
  • Autoload Device Extension
  • Enhanced Intel GPU support

*To see a full list of public feature submissions click here.

BETA FEATURES

[Beta] CuDNN backend for SDPA

The cuDNN “Fused Flash Attention” backend was landed for torch.nn.functional.scaled_dot_product_attention. On NVIDIA H100 GPUs this can provide up to 75% speed-up over FlashAttentionV2. This speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
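
Below is a minimal sketch of explicitly selecting the cuDNN SDPA backend, assuming the torch.nn.attention.sdpa_kernel context manager and SDPBackend enum available in recent PyTorch releases; the tensor shapes are placeholders. On H100 or newer GPUs the backend is also eligible by default, so the context manager is only needed to force or verify the choice.

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# (batch, heads, sequence, head_dim) half-precision tensors on GPU
q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Restrict SDPA to the cuDNN fused attention backend for this region.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)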

[Beta] torch.compile regional compilation without recompilations

Regional compilation without recompilations is enabled via torch._dynamo.config.inline_inbuilt_nn_modules, which defaults to True in 2.5+. This option allows users to compile a repeated nn.Module (e.g. a transformer layer in an LLM) without recompilations. Compared to compiling the full model, this option can result in smaller compilation latencies, with only 1%-5% performance degradation.
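
As a rough illustration, here is a sketch of the regional-compilation pattern on a toy model; the Block and Model classes are made up for this example, and each repeated layer is compiled individually so identical layers can reuse the same compiled code instead of recompiling.

import torch
import torch.nn as nn

# Defaults to True in 2.5+, shown here only for clarity.
torch._dynamo.config.inline_inbuilt_nn_modules = True

class Block(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class Model(nn.Module):
    def __init__(self, n_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(Block() for _ in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = Model()
# Compile the repeated region (each block) rather than the full model.
for layer in model.layers:
    layer.compile()

out = model(torch.randn(4, 256))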

See the tutorial for more information.

[Beta] TorchInductor CPU backend optimization

This feature advances Inductor’s CPU backend optimization, including CPP backend code generation and FX fusions with customized CPU kernels. The Inductor CPU backend supports vectorization of common data types and all Inductor IR operations, along with static and symbolic shapes. It is compatible with both Linux and Windows and supports the default Python wrapper, the CPP wrapper, and AOT-Inductor mode.

Additionally, it extends the max-autotune mode of the GEMM template (prototyped in 2.5), offering further performance gains. The backend supports various FX fusions, lowering to customized kernels such as oneDNN for Linear/Conv operations and SDPA. The Inductor CPU backend consistently achieves performance speedups across three benchmark suites—TorchBench, Hugging Face, and timms—outperforming eager mode in 97.5% of the 193 models tested.

PROTOTYPE FEATURES

[Prototype] FlexAttention

We’ve introduced a flexible API that enables implementing various attention mechanisms such as Sliding Window, Causal Mask, and PrefixLM with just a few lines of idiomatic PyTorch code. This API leverages torch.compile to generate a fused FlashAttention kernel, which eliminates extra memory allocation and achieves performance comparable to handwritten implementations. Additionally, we automatically generate the backwards pass using PyTorch’s autograd machinery. Furthermore, our API can take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.
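
For a flavor of the API, here is a small sketch using torch.nn.attention.flex_attention with a causal score_mod, mirroring the examples from the FlexAttention materials; the tensor shapes are placeholders and the call is wrapped in torch.compile to generate the fused kernel.

import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Keep scores on or below the diagonal; mask out future positions.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Compiling flex_attention produces the fused FlashAttention-style kernel.
flex_attention = torch.compile(flex_attention)
out = flex_attention(q, k, v, score_mod=causal)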

For more information and examples, please refer to the official blog post and Attention Gym.

[Prototype] Compiled Autograd

Compiled Autograd is an extension to the PT2 stack allowing the capture of the entire backward pass. Unlike the backward graph traced by AOT dispatcher, Compiled Autograd tracing is deferred until backward execution time, which makes it impervious to forward pass graph breaks, and allows it to record backward hooks into the graph.
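
A minimal sketch of turning it on, assuming the torch._dynamo.config.compiled_autograd flag used in the tutorial; the toy model and loss are placeholders.

import torch

# Ask Dynamo to also capture the backward pass at backward-execution time.
torch._dynamo.config.compiled_autograd = True

model = torch.nn.Linear(16, 1)

@torch.compile
def train_step(x):
    loss = model(x).sum()
    loss.backward()
    return loss

loss = train_step(torch.randn(8, 16))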

Please refer to the tutorial for more information.

[Prototype] Flight Recorder

Flight recorder is a new debugging tool that helps debug stuck jobs. The tool works by continuously capturing information about collectives as they run. Upon detecting a stuck job, the information can be used to quickly identify misbehaving ranks/machines along with code stack traces.

For more information please refer to the following tutorial.

[Prototype] Max-autotune Support on CPU with GEMM Template

Max-autotune mode for the Inductor CPU backend in torch.compile profiles multiple implementations of operations at compile time and selects the best-performing one. This is particularly beneficial for GEMM-related operations, using a C++ template-based GEMM implementation as an alternative to the ATen-based approach with oneDNN and MKL libraries. We support FP32, BF16, FP16, and INT8 with epilogue fusions for x86 CPUs. We’ve seen up to 7% geomean speedup on the dynamo benchmark suites and up to 20% boost in next-token latency for LLM inference.
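
For example, here is a sketch of enabling max-autotune for a CPU workload; the model is a placeholder, and the first call pays the autotuning cost at compile time.

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

# "max-autotune" profiles candidate implementations (including the C++
# template-based GEMM) at compile time and picks the fastest for this CPU.
compiled = torch.compile(model, mode="max-autotune")

x = torch.randn(64, 1024)
out = compiled(x)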

For more information please refer to the tutorial.

[Prototype] TorchInductor CPU on Windows

The Inductor CPU backend in torch.compile now works on Windows. We currently support MSVC (cl), clang (clang-cl) and the Intel compiler (icx-cl) for Inductor on Windows.

See the tutorial for more details.

[Prototype] FP16 support on CPU path for both eager mode and TorchInductor CPP backend

Float16 is a commonly used reduced-precision floating point type for performance improvements in neural network inference and training. As of this release, float16 is supported on the CPU path for both eager mode and TorchInductor.
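
A minimal sketch of exercising float16 on CPU in both modes; the tiny model is a placeholder, and the same half-precision module is run eagerly and through torch.compile.

import torch

model = torch.nn.Linear(128, 128).eval().half()   # float16 weights on CPU
x = torch.randn(32, 128, dtype=torch.float16)

with torch.no_grad():
    eager_out = model(x)                     # eager-mode float16 on the CPU path
    compiled_out = torch.compile(model)(x)   # TorchInductor CPP backend, float16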

[Prototype] Autoload Device Extension

PyTorch now supports autoloading for out-of-tree device extensions, streamlining integration by eliminating the need for manual imports. This feature, enabled through the torch.backends entrypoint, simplifies usage by ensuring seamless extension loading, while allowing users to disable it via an environment variable if needed.
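
As a sketch of what an out-of-tree extension might declare, here is a hypothetical setup.py using the torch.backends entry-point group mentioned above; the package name and the _autoload hook are made-up placeholders, and the exact hook semantics follow the autoload tutorial.

from setuptools import setup

setup(
    name="torch_my_device",
    version="0.1",
    packages=["torch_my_device"],
    entry_points={
        # PyTorch discovers this entry point and imports the extension
        # automatically, so users no longer need a manual `import`.
        "torch.backends": [
            "torch_my_device = torch_my_device:_autoload",
        ],
    },
)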

See the tutorial for more information.

[Prototype] Enhanced Intel GPU support

Enhanced Intel GPU support is now available for both Intel® Data Center GPU Max Series and Intel® Client GPUs (Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics and Intel® Arc™ Graphics for dGPU parts), making it easier to accelerate your machine learning workflows on Intel GPUs in the PyTorch 2.5 release. We also enabled initial support for PyTorch on Windows for Intel® Client GPUs in this release.

  • Expanded the PyTorch hardware backend support matrix to include both Intel Data Center and Client GPUs.
  • Implemented SYCL* kernels to enhance coverage and execution of ATen operators on Intel GPUs, boosting performance in PyTorch eager mode.
  • Enhanced the Intel GPU backend of torch.compile to improve inference and training performance for a wide range of deep learning workloads.

These features are available through PyTorch preview and nightly binary PIP wheels. For more information regarding Intel GPU support, please refer to the documentation.
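
A small sketch of targeting an Intel GPU through the "xpu" device, assuming the torch.xpu module shipped with this support; the model is a placeholder and the code falls back to CPU if no XPU is present.

import torch

device = "xpu" if torch.xpu.is_available() else "cpu"

model = torch.nn.Linear(256, 256).to(device)
x = torch.randn(16, 256, device=device)

with torch.no_grad():
    out = model(x)

# torch.compile also works with the Intel GPU backend.
compiled = torch.compile(model)
out = compiled(x)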

Read More

The Path to Achieve PyTorch Performance Boost on Windows CPU

The Path to Achieve PyTorch Performance Boost on Windows CPU

The challenge of PyTorch’s lower CPU performance on Windows compared to Linux has been a significant issue, with multiple factors contributing to the disparity. Through our investigation we identified several causes of poor CPU performance on Windows, and two primary issues were pinpointed: the inefficiency of the default Windows malloc memory allocator and the absence of SIMD vectorization optimizations on the Windows platform. In this article, we show how PyTorch CPU performance on Windows has improved over previous releases and where it stands as of PyTorch 2.4.1.

Memory Allocation Optimization in PyTorch 2.1.2 and later

In versions prior to PyTorch 2.1.2, PyTorch relied on the operating system’s default malloc function for memory allocation. The default malloc memory allocation on the Windows platform was less efficient compared to the malloc implementation mechanism on the Linux platform, leading to increased memory allocation times and reduced performance. To address this, we have substituted the default Windows malloc with mimalloc, a more efficient memory allocator developed by Microsoft. This update, included with the release of PyTorch 2.1.2 and later, has significantly enhanced the CPU performance of PyTorch on Windows, as shown in Figure 1.1.

PyTorch CPU Performance Improvement on Windows with Memory Allocation Optimization

Figure 1.1: Relative throughput improvement achieved by upgrading from Windows PyTorch version 2.0.1 to 2.1.2 (higher is better).

The graph illustrates that with the release of PyTorch 2.1.2, there has been a notable enhancement in CPU performance on the Windows platform. The degree of improvement varies across different models, which can be attributed to the diverse mix of operations they perform and their corresponding memory access patterns. While the BERT model shows a modest performance gain, models like ResNet50 and MobileNet-v3 Large benefit from more pronounced improvements.

On a high-performance CPU, memory allocation becomes a performance bottleneck. This is also why addressing this issue has led to such significant performance improvements.

As shown in the graphs below, PyTorch CPU performance on Windows has improved significantly. However, there is still a noticeable gap compared to its performance on Linux. The absence of vectorization optimizations in the Windows build of PyTorch was a key factor in the remaining performance gap.

Windows vs Linux Performance on PyTorch 2.0.1

Figure 1.2: Relative performance of Windows vs Linux with PyTorch version 2.0.1 (higher is better).

Windows vs Linux Performance on PyTorch 2.1.2

Figure 1.3: Relative performance of Windows vs Linux with PyTorch version 2.1.2 (higher is better).

Vectorization Optimization in PyTorch 2.4.1 and later

Prior to PyTorch 2.4.1, the Windows build of PyTorch lacked the SIMD vectorization optimizations that the Linux build leveraged for improved performance. This discrepancy was due to integration issues on Windows with the SLEEF library (a SIMD Library for Evaluating Elementary Functions that provides a vectorized libm and DFT), which is essential for efficient trigonometric calculations. Through a collaborative effort with engineers from ARM and Qualcomm, these challenges were resolved, enabling the integration of SIMD into PyTorch for Windows. The PyTorch 2.4.1 update has thus significantly enhanced PyTorch’s CPU performance on Windows, as shown in Figure 2.1.

PyTorch CPU Performance Improvement on Windows with Vectorization Optimization

Figure 2.1: Relative throughput improvement achieved by upgrading from PyTorch CPU version 2.1.2 to 2.4.1 (higher is better).

As shown in the graph below, PyTorch CPU performance on Windows has caught up with its performance on Linux.

Windows vs Linux Performance on PyTorch 2.4.1

Figure 2.2: Relative performance of Windows vs Linux with PyTorch version 2.4.1 (higher is better).

CONCLUSION

From PyTorch 2.0.1 to PyTorch 2.4.1, the CPU performance gap between Windows and Linux has been continuously narrowing. We compared the ratio of CPU performance on Windows to CPU performance on Linux across different versions, and the results are shown in the following graph.

Windows vs Linux Performance on Different Versions of PyTorch

Figure 3: Performance ratio of Windows to Linux with different versions of PyTorch (higher is better).

The graph shows that with PyTorch 2.4.1, CPU performance on Windows has nearly converged with that on Linux, and on some models it has even surpassed Linux. For example, for the DistilBERT and RoBERTa models, the CPU performance ratio of Windows to Linux reached a remarkable 102%. However, certain models, including MobileNet-v3, still show a performance discrepancy. Intel engineers will continue to collaborate with Meta engineers to reduce the PyTorch CPU performance gap between Windows and Linux.

HOW TO TAKE ADVANTAGE OF THE OPTIMIZATIONS

Install PyTorch 2.4.1 or later for CPU on Windows from the official repository, and you will automatically benefit from the memory allocation and vectorization optimizations.
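
For example, one way to install a recent CPU build with pip, assuming the official CPU wheel index:

pip install torch --index-url https://download.pytorch.org/whl/cpu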

ACKNOWLEDGMENTS

The results presented in this blog post were achieved through the collaborative effort of the Intel PyTorch team and Meta. We would like to express our sincere gratitude to Xu Han, Jiong Gong, Haozhe Zhu, Mingfei Ma, Chuanqi Wang, Guobing Chen and Eikan Wang. Their expertise and dedication have been instrumental in achieving the optimizations and performance improvements discussed here. Thanks to Jiachen Pu from the community for his participation in the issue discussion and for suggesting the use of mimalloc. We would also like to thank Microsoft for providing such an easily integrated and performant memory allocation library. Thanks to Pierre Blanchard and Nathan Sircombe from ARM and Alex Reinking from Qualcomm for their contributions in overcoming the compatibility issues with integrating SLEEF into PyTorch for Windows. Finally, we want to thank Jing Xu, Weizhuo Zhang and Zhaoqiong Zheng for their contributions to this blog.

Product and Performance Information

The configurations in the table are collected with svr-info. Test by Intel on August 30, 2024.

Specification Configuration1 Configuration2
Name ThinkBook 14 G5+ IRH ThinkBook 14 G5+ IRH
Time Fri Aug 30 02:43:02 PM UTC 2024 Fri Aug 30 02:43:02 PM UTC 2024
System LENOVO LENOVO
Baseboard LENOVO LENOVO
Chassis LENOVO LENOVO
CPU Model 13th Gen Intel(R) Core(TM) i7-13700H 13th Gen Intel(R) Core(TM) i7-13700H
Microarchitecture Unknown Intel Unknown Intel
Sockets 1 1
Cores per Socket 14 14
Hyperthreading Enabled Enabled
CPUs 20 20
Intel Turbo Boost Enabled Enabled
Base Frequency 2.4GHz 2.4GHz
All-core Maximum Frequency 4.7GHz 4.7GHz
Maximum Frequency 4.8GHz 4.8GHz
NUMA Nodes 1 1
Prefetchers L2 HW: Enabled, L2 Adj.: Enabled, DCU HW: Enabled, DCU IP: Enabled L2 HW: Enabled, L2 Adj.: Enabled, DCU HW: Enabled, DCU IP: Enabled
PPINs
Accelerators DLB, DSA, IAA, QAT DLB, DSA, IAA, QAT
Installed Memory 32GB (8x4GB LPDDR4 7400 MT/s [5200 MT/s]) 32GB (8x4GB LPDDR4 7400 MT/s [5200 MT/s])
Hugepagesize 2048kb 2048kb
Transparent Huge Pages madvise madvise
Automatic NUMA Balancing Disabled Disabled
NIC “1. Raptor Lake PCH CNVi WiFi 2. Intel Corporation” “1. Raptor Lake PCH CNVi WiFi 2. Intel Corporation”
Disk Micron MTFDKBA512TFH 500G Micron MTFDKBA512TFH 500G
BIOS LBCN22WW LBCN22WW
Microcode 0x411c 0x411c
OS Windows 11 Desktop Ubuntu 23.10
Kernel OS Build 19045.4412 6.5.0-27-generic
TDP 200 watts 200 watts
Power & Perf Policy Normal Powersave (7) Normal Powersave (7)
Frequency Governor performance performance
Frequency Driver intel_pstate intel_pstate
Max C-State 9 9

Notices and Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Read More

PyTorch Foundation Technical Advisory Council Elects New Leadership

PyTorch Foundation Technical Advisory Council Elects New Leadership

We are pleased to announce the first-ever Chair and Vice Chair of the PyTorch Foundation’s Technical Advisory Council (TAC): Luca Antiga as the Chair and Jiong Gong as Vice Chair. Both leaders bring extensive experience and deep commitment to the PyTorch community, and they are set to guide the TAC in its mission to foster an open, diverse, and innovative PyTorch technical community.

Meet the New Leadership

Luca Antiga

Luca Antiga has been the CTO of Lightning AI since 2022. He is an early contributor to PyTorch core and co-authored “Deep Learning with PyTorch” (published by Manning). He started his journey as a researcher in bioengineering, and later co-founded Orobix, a company focused on building and deploying AI in production settings.

“I am looking forward to taking on the role of the chair of the PyTorch TAC,” says Luca. “As the TAC chair, I will ensure effective, timely topic selection and enhance visibility of technical needs from the board members and from the ecosystem at large. I will strive for directional, cohesive messaging throughout the transition of PyTorch from Meta to the Linux Foundation.”

Jiong Gong

Jiong Gong is a Principal Engineer and SW Architect for PyTorch Optimization from Intel. He serves as one of the PyTorch CPU module maintainers and is an active contributor to the TorchInductor CPU backend.

“I plan to further strengthen the collaboration between PyTorch developers and hardware vendors, promoting innovation and performance optimization across various hardware platforms, enhancing PyTorch ecosystem and streamlining the decision-making process,” says Jiong. “I am honored to serve as the vice chair of the TAC.”

What Does the TAC Do?

The PyTorch Foundation’s TAC provides a forum for technical communication, leadership, and collaboration for the PyTorch Foundation. The committee members are members of the PyTorch Foundation. The committee holds open meetings once a month that anyone in the community can attend. The committee provides thought leadership on technical topics, knowledge sharing, and a forum to discuss issues with other technical experts in the community.

New TAC Webpage

Stay connected with the PyTorch Foundation’s Technical Advisory Council (TAC) by visiting our new TAC webpage. Here you can find the TAC members, where to view upcoming meeting agendas, access presentations, attend public meetings, watch meeting recordings and participate in discussions on key technical topics.

Plus stay tuned on our blog for regular updates from the PyTorch Foundation TAC leadership.

Read More

PyTorch Conference 2024 Recap: On Fire 🔥

PyTorch Conference 2024 Recap: On Fire 🔥

The 2024 PyTorch Conference in San Francisco gathered nearly 1,500 AI researchers, developers, and enthusiasts. Over two days, the event featured engaging discussions, insightful keynotes, and hands-on sessions focused on artificial intelligence (AI) and advancements in PyTorch, the leading open-source machine learning framework. Attendees delved into the future of generative AI, Large Language Models (LLMs), and the crucial role open-source technology plays in driving AI innovation. Here’s a recap of the key themes, highlights, and major takeaways from this year’s conference.

Key Themes of the PyTorch Conference 2024

Three core themes emerged throughout the conference:

  1. Generative AI and LLMs: Many sessions focused on how PyTorch continues to evolve as a primary framework for Large Language Models and Generative AI applications. From scaling these models to optimizing their performance on various hardware platforms, the conference showcased the ongoing advancements and challenges in LLM architecture.
  2. Democratizing AI Through Open Source: One of the recurring themes was the importance of open source tools and communities in shaping the future of AI. PyTorch is committed to inclusivity, ease of use, and accessibility to developers of all levels, with a focus on bringing AI to an even larger global audience.
  3. Distributed and Edge Computing: Distributed computing and edge deployment appeared in many discussions, highlighting how PyTorch is being used to drive AI to the edge. The focus on edge accelerators, scalable training, and inference showcased how PyTorch enables the deployment of powerful models across diverse environments, from the cloud to on-device applications.

Watch the Sessions from PyTorch Conference

The PyTorch Conference featured keynote sessions from top AI leaders and interesting lightning talks. You can view all of the conference sessions on our YouTube channel.

PyTorch Conference Startup Showcase

New this year, the Startup Showcase was an exciting addition to the PyTorch Conference. Featuring early-stage founders pitching their AI startups to a panel of top venture capitalists, this event showcased the next generation of AI-driven innovation. The finalists for the inaugural PyTorch Conference Startup Showcase included Remix Inc., Cartesia, OpenBabylon, Remyx AI, A2 Labs, Inc., QuicSnap, Iso AI, CTGT, and Creao.ai, representing some of the most innovative AI/ML startups in the industry. Attendees got a front-row seat to see cutting-edge AI startups in action, while top VCs from the AI industry evaluated the pitches.

Congratulations to the PyTorch Conference Startup Showcase winner, CTGT! Deep learning can be opaque and biased, which limits its potential in crucial areas like healthcare and finance. CTGT is changing the game by enhancing data lineage in LLMs and cutting hallucinations. They’re empowering companies to create customized models using 500x less compute.

View the Startup Showcase

Mini-Summits

The DL Compiler Mini-Summit offered attendees a deep dive into the advances in deep learning (DL) compilers that are transforming AI workloads.

View the DL Compiler Mini-Summit

The Fine-Tuning Mini-Summit brought together a thriving community of researchers, developers, practitioners and hobbyists focused on topics ranging from memory efficiency, parameter-efficient fine-tuning and quantization to performance at scale and reproducible evaluations.

View the Fine-Tuning Mini-Summit

Major Takeaways from the PyTorch Conference 2024

  1. LLMs are Here to Stay: LLMs were a focal point of the event, reaffirming their pivotal role in the future of AI. As these models continue to scale, PyTorch remains the preferred framework for developing, training, and deploying them across various platforms and industries.
  2. Open Source Drives Innovation: A key takeaway from the conference was that open-source tools like PyTorch are vital for democratizing AI. This community-driven approach accelerates innovation, enabling researchers and developers globally to collaborate and contribute to faster advancements and more accessible AI technologies.
  3. Ethics and Sustainability Matter: The focus on ethical AI development was a significant takeaway. Talks on the inclusivity of computer vision models, the environmental impacts of AI infrastructure, and the need for transparent, unbiased AI models highlighted the growing importance of ethical considerations in the future of AI.
  4. PyTorch Expands Beyond the Cloud: With several sessions dedicated to edge AI and distributed computing, the conference showcased how PyTorch is expanding beyond cloud-based applications into edge devices and diverse computing environments. This shift is crucial as AI advances into areas like autonomous vehicles, mobile applications, and IoT devices.

Thank You to Our Sponsors

We would like to thank each of the sponsors that made the PyTorch Conference 2024 possible. These include:

Diamond Sponsors:

  • AMD
  • Cloud Native Computing Foundation
  • IBM
  • Intel – PyTorch
  • Lightning.ai
  • Meta – PyTorch

Platinum Sponsors:

  • Arm
  • Google
  • Lambda Labs
  • Nvidia

Silver Sponsors:

  • Anyscale – PyTorch
  • Baseten
  • Chainguard
  • Databricks
  • Fal
  • FuriosaAi
  • HPE
  • Jane Street
  • Microsoft – PyTorch
  • MinIO
  • Outerbounds
  • Together.AI

Bronze Sponsors:

  • d-Matrix
  • MemVerge
  • Perforated AI
  • Quansight
  • Rotational Labs
  • ScaleGenAI

Special Event Sponsors:

  • PyTorch Flare Party: Hugging Face
  • Startup Showcase: Mayfield
  • Diversity Scholarship: AWS
  • Women and Non-Binary in PyTorch Lunch: Google
  • Happy Hour Reception: Lightning.AI

Thank you for your continued support in advancing the PyTorch ecosystem and helping to shape the future of AI!

Save the Date

See you next year for the PyTorch Conference in San Francisco at the Palace of Fine Arts from October 22-23, 2025.

Read More

PyTorch Native Architecture Optimization: torchao

By Team PyTorch

We’re happy to officially launch torchao, a PyTorch native library that makes models faster and smaller by leveraging low bit dtypes, quantization and sparsity. torchao is an accessible toolkit of techniques written (mostly) in easy to read PyTorch code spanning both inference and training. This blog will help you pick which techniques matter for your workloads.

We benchmarked our techniques on popular GenAI models like Llama 3 and diffusion models and saw minimal drops in accuracy. Unless otherwise noted, the baselines are bf16 run on an A100 80GB GPU.

Our topline metrics for Llama 3 are:

For inference

  • 97% speedup for Llama 3 8B using autoquant with int4 weight only quantization and hqq
  • 73% peak VRAM reduction for Llama 3.1 8B at 128K context length with a quantized KV cache

For training

  • 50% speedup for Llama 3 70B pretraining using float8 training on H100
  • 30% peak VRAM reduction for Llama 3 8B using 4 bit quantized optimizers.

Our topline metrics for diffusion model inference are:

  • 53% speedup using float8 dynamic quantization inference with float8 row-wise scaling on flux1.dev on H100
  • 50% reduction in model VRAM for CogVideoX using int8 dynamic quantization

Below we’ll walk through some of the techniques available in torchao you can apply to your models for inference and training.

Inference

Our inference quantization algorithms work over arbitrary PyTorch models that contain nn.Linear layers. Weight-only and dynamic activation quantization for various dtypes and sparse layouts can be chosen using our top-level quantize_ API:

from torchao.quantization import (
    quantize_,
    int4_weight_only,
)

quantize_(model, int4_weight_only())

Sometimes quantizing a layer can make it slower because of overhead, so if you’d rather have us pick how to quantize each layer in a model for you, you can instead run:

model = torchao.autoquant(torch.compile(model, mode='max-autotune'))

The quantize_ API has a few different options depending on whether your model is compute bound or memory bound:

from torchao.quantization import (
    # Memory bound models
    int4_weight_only,
    int8_weight_only,

    # Compute bound models
    int8_dynamic_activation_int8_semi_sparse_weight,
    int8_dynamic_activation_int8_weight,

    # Device capability 8.9+
    float8_weight_only,
    float8_dynamic_activation_float8_weight,
)

We also have extensive benchmarks on diffusion models in collaboration with the HuggingFace diffusers team in diffusers-torchao, where we demonstrated a 53.88% speedup on Flux.1-Dev and a 27.33% speedup on CogVideoX-5b.

Our APIs are composable, so we have, for example, composed sparsity and quantization to bring a 5% speedup for ViT-H inference.

We can also do things like quantize weights to int4 and the KV cache to int8 to support Llama 3.1 8B at the full 128K context length in under 18.9GB of VRAM.

QAT

Post-training quantization, especially at less than 4 bits, can suffer from serious accuracy degradation. Using Quantization Aware Training (QAT) we’ve managed to recover up to 96% of the accuracy degradation on hellaswag. We’ve integrated this as an end-to-end recipe in torchtune with a minimal tutorial.

Training

Low precision compute and communications

torchao provides easy-to-use end-to-end workflows for reducing the precision of training compute and distributed communications, starting with float8 for `torch.nn.Linear` layers. Here is a one-liner to convert the compute GEMMs of your training run to float8:

from torchao.float8 import convert_to_float8_training
convert_to_float8_training(model)

For an e2e example of how to speed up LLaMa 3 70B pretraining by up to 1.5x with float8, see our README, and torchtitan’s blog and float8 recipe.

Performance and accuracy of float8 pretraining of LLaMa 3 70B, vs bfloat16


(source: https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359)

We are expanding our training workflows to more dtypes and layouts

  1. NF4 QLoRA in torchtune
  2. Prototype int8 training support
  3. Accelerated sparse 2:4 training

Low bit Optimizers

Inspired by Bits and Bytes, we’ve also added prototype support for 8-bit and 4-bit optimizers as a drop-in replacement for AdamW.

from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit
optim = AdamW8bit(model.parameters())

Integrations

We’ve been actively working on making sure torchao works well in some of the most important projects in open source.

  1. Huggingface transformers as an inference backend
  2. In diffusers-torchao as a reference implementation for accelerating diffusion models
  3. In HQQ for fast 4 bit inference
  4. In torchtune for PyTorch native QLoRA and QAT recipes
  5. In torchchat for post training quantization
  6. In SGLang for int4 and int8 post training quantization

Conclusion

If you’re interested in making your models faster and smaller for training or inference, we hope you’ll find torchao useful and easy to integrate.

pip install torchao

There are a lot of things we’re excited about next ranging from going lower than 4 bit, performant kernels for high-throughput inference, expanding to more layers, scaling types or granularities, MX hardware support and supporting more hardware backends. If any of the above sounds exciting you can follow our progress at: https://github.com/pytorch/ao

If you're interested in working on torchao, we've created a contributors guide, and if you have any questions, we hang out in the #torchao channel on discord.gg/cudamode.

Acknowledgements

We are fortunate to stand on the shoulders of giants and collaborate with some of the best people in open source. Thank you!

  1. Bits and Bytes for pioneering work in low bit optimizers and QLoRA
  2. Answer.ai for their engineering work to get FSDP and QLoRA composing
  3. Mobius Labs for the lovely back and forths on quantization algorithms and low bit kernels
  4. HuggingFace transformers for their help in battle testing and integrating our work
  5. HuggingFace diffusers for our collaboration on extensive benchmarks and best practices
  6. torch.compile so we could write our algorithms in pure PyTorch
  7. CUDA MODE for most of our early contributors

Read More

CUDA-Free Inference for LLMs

Challenges and Efforts in PyTorch Multi-Device Integration: Compatibility, Portability, and Integration Efficiencies

Introduction

As the demand for diverse hardware accelerators grows, the need for a robust and adaptable deep learning framework becomes increasingly critical. While working through the integration of new devices, several challenges have surfaced in the PyTorch ecosystem that can affect various hardware vendors. This blog aims to highlight these issues and propose solutions to enhance PyTorch's adaptability, portability, and resilience across different hardware platforms.

Improve Users’ Code Portability via Accelerator Autoloading

Currently, users face additional work when running their code on different accelerators. One such task is manually importing modules for out-of-tree devices. This requires users not only to understand the different usage patterns between accelerators but also to make their code aware of these differences. If you have projects originally running on GPU/CPU and want to migrate them to other accelerators, this can lead to significant work and potential frustration.

Examples of the extra imports:

# Case 1: Use HPU
import torch
import torchvision.models as models
import habana_frameworks.torch # <-- extra import
model = models.resnet50().eval().to("hpu")
input = torch.rand(128, 3, 224, 224).to("hpu")
output = model(input)

# Case 2: Use torch_npu
import torch
import torch_npu # <-- extra import
print(torch.ones(1, 2, device='npu'))

As a high-level machine learning framework, PyTorch’s ability to shield users from device differences is a competitive feature. Accelerator Autoloading allows users to continue using the familiar PyTorch device programming model without explicitly loading or importing device-specific extensions.

How does it work?

Accelerator autoloading uses Python's plugin architecture to automatically load device extensions via entry points in the PyTorch package.

Python entry points provide a standardized way for Python packages to expose and discover components or plugins within an application. Once the accelerator package declares an entry point in its setup.py, PyTorch can automatically initialize the accelerator module when import torch is called, giving users a consistent experience across different backend devices.

On the device side, the package only needs to declare the following entry point in its setup.py (using torch_npu as an example):

# setup.py
entry_points={
    'torch.backends': ['torch_npu = torch_npu:_autoload'],
}

When import torch is invoked, the accelerator module will be loaded automatically. This provides users with a consistent programming experience across out-of-tree devices, eliminating the need to be aware of differences between CUDA, HPU, and NPU.

# Case 1: Use HPU 
import torch 
import torchvision.models as models 
model = models.resnet50().eval().to("hpu") 
input = torch.rand(128, 3, 224, 224).to("hpu") 
output = model(input) 

# Case 2: Use torch_npu 
import torch 
print(torch.ones(1, 2, device='npu'))

Device Integration Optimization

What is PrivateUse1?

In PyTorch, the dispatcher is a crucial component of the framework’s backend that manages how operations are routed to the appropriate device-specific implementation. Dispatch keys are an integral part of this system, serving as identifiers that represent various execution contexts—such as the device (CPU, CUDA, XPU), layout (dense, sparse), and autograd functionality. These keys ensure that operations are directed to the correct implementation.

PrivateUse1 is a customizable device dispatch key (similar to CUDA, CPU, XPU, etc.) reserved for out-of-tree devices. It provides developers with a way to extend PyTorch's functionality without modifying the core framework, allowing for the integration of new devices, hardware accelerators, or other specialized computing environments.
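
To illustrate how an out-of-tree backend attaches to this key, here is a minimal sketch using PyTorch's public PrivateUse1 helpers (the "foo" device name and the empty placeholder module are made up; a real backend would also register its C++ kernels under the PrivateUse1 dispatch key):

import types
import torch

# Give the PrivateUse1 key a user-facing device name, e.g. "foo"
torch.utils.rename_privateuse1_backend("foo")

# Register a placeholder Python module to back torch.foo (a real backend
# registers its own module here, e.g. torch_npu.npu)
torch._register_device_module("foo", types.ModuleType("torch.foo"))

# Generate Tensor.foo() / Tensor.is_foo and related convenience methods
torch.utils.generate_methods_for_privateuse1_backend()

# Once kernels are registered, tensors can be created with device="foo":
# x = torch.empty(2, 2, device="foo")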

Why do we need PrivateUse1?

Internally, dispatch keys are represented as bit masks, where each bit represents whether a certain key is active. This bit mask representation is efficient for quick lookup and combination of keys, but it inherently limits the number of distinct keys (typically to 64 or fewer).

The current implementation of BackendComponent dispatch keys in PyTorch has encountered a critical bottleneck, which restricts the addition of new backends and, as a result, limits the expansion of the PyTorch ecosystem.

(Figure: dispatch key bit diagram)

In response to this challenge, a series of optimizations have been applied to the PrivateUse1 mechanism to enhance its capacity.

  • PrivateUse1 integration mechanism

    Initially reserved as fallback options, PrivateUse1, along with PrivateUse2 and PrivateUse3, were designed to be activated only when existing key resources became scarce.

    PrivateUse1 is now being developed to match the robustness and versatility of established keys like CUDA and CPU. Achieving this required deep integration across critical PyTorch modules. This integration wasn't just a simple switch; it involved significant updates to core components such as AMP (Automatic Mixed Precision), Autograd, Distributed Training, Checkpointing, DataLoader, Optimization, and Quantization.

(Figure: PrivateUse1 integration flow diagram)

The activation of PrivateUse1 was a massive collaborative effort, culminating in over 100 pull requests aimed at taking it from a placeholder to a fully operational dispatch key.

  • PrivateUse1 UT/CI Quality Assurance

    While unit tests are essential for ensuring quality during the development of the PrivateUse1 mechanism, they are not sufficient on their own to prevent new pull requests from inadvertently affecting existing functionality or compatibility of out-of-tree devices.

    To mitigate this risk, the community has added the pytorch_openreg module to the test suite. This module leverages a CPU backend to simulate interactions with accelerators, creating a controlled environment for rigorous testing. Once implemented, this will enable automatic execution of device-generic test cases whenever relevant code is updated, allowing us to quickly detect and address any potential issues affecting the PrivateUse1 integration mechanism.

  • Comprehensive Documentation

    By providing comprehensive and easy-to-understand documentation, we aim to lower the barrier to entry for developers and encourage wider adoption of the PrivateUse1 mechanism in the PyTorch ecosystem. This documentation includes:

    • Step-by-step guides for integrating new backends using PrivateUse1
    • Clear explanations of PrivateUse1’s functionality and benefits
    • Code examples and best practices for efficient implementation

These enhancements aim to improve the robustness and reliability of the PrivateUse1 mechanism, facilitating better integration of new backends and expanding the capabilities of PyTorch.

Compatibility Between Upstream and Downstream

Device-Generic Unit Tests

Most unit tests in PyTorch focus on CPU and CUDA devices, which limits participation from users with other hardware. To address this, there is a plan to modify PyTorch's unit testing framework to better support non-CUDA devices. This plan includes removing existing device restrictions, implementing dynamic data type loading, and generalizing decorators to accommodate a broader range of devices. Additionally, we aim to enforce the use of universal device code and expand distributed testing to support non-NCCL backends.
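
For context, PyTorch's existing device-generic test machinery already parametrizes tests over a device string, which is the mechanism these changes extend to out-of-tree backends. A simplified sketch (it relies on internal test utilities whose exact decorators may vary between releases):

import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests,
    dtypes,
)

class TestAddGeneric(TestCase):
    # The framework injects `device` (e.g. "cpu", "cuda:0", or an
    # out-of-tree device such as "npu") and `dtype` for each instantiation.
    @dtypes(torch.float32)
    def test_add(self, device, dtype):
        x = torch.ones(4, device=device, dtype=dtype)
        y = torch.full((4,), 2.0, device=device, dtype=dtype)
        self.assertEqual(x + y, torch.full((4,), 3.0, device=device, dtype=dtype))

# Generates TestAddGenericCPU, TestAddGenericCUDA, ... for the devices available
instantiate_device_type_tests(TestAddGeneric, globals())

if __name__ == "__main__":
    run_tests()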

Through these improvements, we hope to significantly increase test coverage and pass rates for non-CUDA devices, integrating them into PyTorch’s continuous integration process. Initial changes have already been implemented, paving the way for new hardware support and creating a reference template for other devices.

Ensuring Robust Device Integration through Automated Testing

To uphold the high standards of quality assurance in PyTorch, an independent build repository and daily continuous integration (CI) workflows have been established, focusing on smoke and integration testing.

The pytorch-integration-tests repository automates the testing of PyTorch's device-specific functionalities, ensuring that they operate correctly and efficiently across a variety of hardware platforms (NPUs and other specialized devices). In this repository, we are building a fully automated system that continuously validates PyTorch's compatibility with different hardware backends.

  • Automated Integration Tests: Run automated tests across different devices using GitHub Actions. This automation ensures that every change in the codebase is thoroughly tested against multiple hardware platforms, catching potential issues early in the development process.
  • Reusable Workflows: Workflows in this repository are modular and reusable, which streamlines the testing process. Developers can easily adapt these workflows to new devices or testing scenarios, making the system both flexible and scalable as PyTorch evolves.
  • Awareness of Out-of-Tree Devices: The repository surfaces the existence and behavior of all out-of-tree devices, keeping the community informed. This approach minimizes the risk of accidentally breaking downstream functionalities and provides fast feedback on changes.

Efforts to enhance multi-device integration are pivotal for PyTorch's adaptability in the evolving deep learning landscape. These initiatives not only benefit current users but also lower entry barriers for new hardware vendors and developers, fostering innovation in AI and machine learning. As PyTorch continues to evolve, its commitment to flexibility, robustness, and inclusivity positions it as a leading framework capable of meeting the diverse needs of the deep learning community.

Read More