PyTorch – Page 6 – Vedere AI

Accelerating Large Scale Training and Convergence with PyTorch Float8 Rowwise on Crusoe 2K H200s

April 28, 2025

by Meta and Crusoe PyTorch

Meta: Less Wright, Hamid Shojanazeri, Vasiliy Kuznetsov, Daniel Vega-Myhre, Gokul Nadathur, Will Constable, Tianyu Liu, Tristan Rice, Driss Guessous, Josh Fromm, Luca Wehrstedt, Jiecao Yu, Sandeep Parab
Crusoe: Ethan Petersen, Martin Cala, Chip Smith

Working with Crusoe.AI we were provided access to one of their new 2K H200 clusters in Iceland, which enabled us to showcase training accelerations of 34 – 43% at scale by leveraging TorchTitan’s HSDP2 and TorchAO’s new float8 rowwise, with comparable convergence and stability vs BF16.

In this post we detail the synergy of H200’s with PyTorch’s new Float8 rowwise training with TorchTitan’s FSDP2/HSDP2 and CP at scale.

Background – what is an H200?

H200’s are an ‘enhanced’ H100, offering the exact same compute as an H100, but with two additional improvements.

Larger global memory, 141GiB HBM3e vs the standard 80GiB HBM3
Memory bandwidth is ~43% faster with 4.8TB/s vs 3.35 TB/s. The faster memory transfer has an outsized effect on training speed, especially for PyTorch’s AsyncTP.

What is PyTorch Float8 rowwise?

Float 8 Rowwise is a finer grained resolution for Float8 vs the previous ‘tensor wise’ Float8. It is designed to ensure finer grained accuracy to support larger workloads that tend to become more sensitive to quantization at scale and as training progresses.

There are two key improvements with Float8 rowwise:

Each row now maintains its own scaling factor versus a single scaling factor for the entire tensor, thus improving quantization precision. Finer grained scaling per row helps reduce the effect of outliers (extreme values that force the quantization scaling factor to stretch and degrade the precision of the normally distributed values) and thus ensures better precision.
The scaling factor itself is now implemented by rounding down to the nearest power of 2. This has been shown to help reduce quantization errors when multiplying/dividing by the scaling factor as well as ensuring large values remain scaled to the same value in both the forward and backward passes.

Note that other large scale models have been trained using Float8 at 2K scale with a combination of 1×128 groupwise and 128×128 blockwise, with power of 2 scaling factors. They had the same goal of improving Float8’s precision for supporting large scale training.

Thus, Float8 rowwise offers a similar promise to enable Float8 for very large scale training, but we wanted to provide proof of stability and convergence at scale, which training on the Crusoe H200 2k cluster provided initial verification thereof.

Showcasing Float8 Rowwise Loss convergence vs BF16 at 1600 and 1920 GPU Scale:

In order to verify comparable loss convergence, we ran two separate runs at both 1920 and then 1600 (1.6k) gpu scale using TorchTitan and Lllama3 70B. The 1.6K GPU runs were set for 2.5k iterations, using TorchTitans’ HSDP2 and Context Parallel to enable 2D parallelism.

The loss convergence tests were run using Titan’s deterministic mode – this mode effectively freezes most potential sources of variation from run to run, and thus helps ensure that the only substantial change is what we want to test, namely the loss convergence and loss curves of BF16 vs Float8 Rowwise.

Note that deterministic mode also slows down training speed because various kernels will not be autotuned to maximize throughput (otherwise we risk using different kernels between runs and introducing variance).

Two runs were completed, one with BF16 and the other with Float8 Rowwise.

Both runs completed their assigned 2.5k iters without issue, showcasing the Crusoe cluster stability, with FP8 completing at exactly 24 hours and BF16 finishing after 31 hours, 19 minutes.

DType	Time / Iters	Loss

BF16	24 hours	3.15453
Float8 Rowwise	24 hours	2.86386

BF16	31 hours, 19 minutes / 2.5K	2.88109
Float8 Rowwise	24 hours / 2.5K	2.86386

At the 24 hour mark, Float8 completed 2.5K iterations showcasing the comparative speed up (even in deterministic mode) of float8 training. At the 24 hour mark, Float8 enabled a +9.21% relative improvement in loss compared to BF16 for the same 24 hours of large scale training time.

After 31 hours, 19 minutes, the BF16 run finally completed its 2.5k iters.

The final loss numbers:
BF16 = 2.88109
Float8 = 2.86386

From the loss curves we observed very similar curves at the first and last ⅓ and then a turbulent zone in the middle where both showed similar spikes, but with a slight skew to the relative timing of the spikes.

As a result of this, we can see that PyTorch’s Float8 rowwise offers similar convergence but over 33% speedup for the same amount of training time.

Long Term Training stability with Float8 Rowwise

Beyond showcasing comparable convergence, we also wanted to show longer term training stability with Float8 and thus we launched a 4 day, 15K run at 256 scale.

As shown above, Float8 training ran for over 100 hours with no issues, highlighting the long term stability of Float8 Rowwise.

Determinism in TorchTitan

To verify determinism and to see if the spikiness in the longer runs was from scale, we also ran a smaller run comprising of 2 runs of BF16, and 1 run of Float8 at 256 scale, and with HSDP2 only (i.e. without 2D Context parallel).

In this case both BF16 runs had identical curves and final loss, and we saw a similar spikiness zone for all three runs.

At the 2K iteration mark, both Float8 and BF16 ending at nearly identical points:
BF16 *2 = 3.28538
Float8 rowwise = 3.28203

The above result confirms that neither CP nor scale (2k) are responsible for spikiness in the loss as we saw similar effect at 256 scale as well. The most likely explanation for the loss spikes could be content distribution in the dataset.

For the sake of determinism, the experiments were run with a serialized C4 dataset (not shuffled), meaning the spikes could be from encountering new content within the dataset.

Net speedups at various Scales with Float8 rowwise:

We performed shorter runs at various GPU scales to understand how Float8 Rowwise would scale in terms of training acceleration as cluster sizes expanded. Doubling in scale from 960 to 1920, Float8 continued to deliver impressive training speedups, with a range of over 34-43% gains compared to BF16. We also want to note that scaling from 1k to 2k GPUs communication overhead likely kicked in and we observed a 4% hit on throughput with BF16.

As shown in the longer training runs at scale above, Float8 rowwise delivered substantial speedups with equal or even slightly improved loss endpoints while delivering 34% speedups at 1920 (DeepSeek) scale.

How can I use Float8 Rowwise in my training?

Float8 Rowwise is available now for you to use in your large scale training. It is packaged in TorchAO’s latest builds (0.9 and higher) and integrated into TorchTitan natively if you want to get up and running quickly.

To activate Float8 Rowwise in TorchTitan:

First enable the model converter to hotswap the nn.linears into float8 linear layers in your models .toml file – see line 29:

Secondly, specify the ‘rowwise’ float8 recipe – see line 72:

Note that you have three choices for the ‘recipe_name’:

rowwise which is the recommended default,
tensorwise (the older style float8) and
rowwise_with_gw_hp.

The gw_hp rowwise option keeps the gradients to the weights in BF16 precision during the backwards pass, and this can further enhance float8 precision for extremely sensitive workloads. But, it can ironically be a bit more performant than generic rowwise if the majority of the matmul sizes in your model are smaller (with an estimated tipping point at roughly 13-16K dimensions on H100).

Thus while we recommend rowwise as the default, it may be worth comparing with gw_hp on your model to verify which provides the best performance, with an upside of even greater precision.

By toggling the model converter on and off with a #, you can directly compare training acceleration between BF16 and Float8 Rowwise to understand the potential speedups for your own training.

Future Updates:

We’ll have an additional update coming showcasing multiple improvements for Pipeline Parallel and Async Distributed Checkpointing so please stay tuned.

Accelerate PyTorch 2.7 on Intel® GPUs

April 26, 2025

by Intel PyTorch Team PyTorch

PyTorch 2.7 continues to deliver significant functionality and performance enhancements on Intel® GPU architectures to streamline AI workflows. Application developers and researchers seeking to fine-tune, inference and develop PyTorch models on Intel GPUs will now have a consistent user experience across various operating systems, including Windows, Linux and Windows Subsystem for Linux (WSL2). This is made possible through improved installation, eager mode script debugging, a performance profiler, and graph model (torch.compile) deployment. As a result, developers have greater options with a unified GPU programming paradigm for both front-end and back-end development.

Incremental improvements of Intel GPU support in PyTorch

Since PyTorch 2.4, we’ve made steady improvements to Intel GPU support with each release. With PyTorch 2.7, we are excited to share that we have established a solid foundation to have Intel GPU work in both graph mode (torch.compile) and eager mode on Windows and Linux. This includes a wide range of Intel GPU products, many of which you may already access. We hope these enhancements will unlock more ubiquitous hardware for your AI research and development.

Over time, we have expanded Intel GPU Support across Windows and Linux, including these products:
Simpler installation of torch-xpu PIP wheels and an effortless setup experience.
High ATen operation coverage with SYCL and oneDNN for smooth eager mode support with functionality and performance.
Notable speedups with torch.compile through default TorchInductor and Triton backend, proved by measurable performance gains with Hugging Face, TIMM, and TorchBench benchmarks.

Check out the detailed advancements in these related release blogs: PyTorch 2.4, PyTorch 2.5, and PyTorch 2.6.

What’s New in PyTorch 2.7

These are the features in PyTorch 2.7 that were added to help accelerate performance on Intel GPUs.

Improve scaled dot-product attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs.
With the new SDPA optimization for Intel GPUs on PyTorch 2.7, Stable Diffusion float16 inference achieved up to 3x gain over PyTorch 2.6 release on Intel® Arc B580 Graphics and Intel® Core Ultra 7 Processor 258V with Intel® Arc Graphics 140V on eager mode. See Figure 1 below.

Figure 1. PyTorch 2.7 Stable Diffusion Performance Gains Over PyTorch 2.6

Enable torch.compile on Windows 11 for Intel GPUs, delivering the performance advantages over eager mode as on Linux. With this, Intel GPUs became the first accelerator to support torch.compile on Windows. Refer to Windows tutorial for details.
Graph model (torch.compile) is enabled in Windows 11 for the first time across Intel GPUs, delivering the performance advantages over eager mode as on Linux by PyTorch 2.7. The latest performance data was measured on top of PyTorch Dynamo Benchmarking Suite using Intel® Arc B580 Graphics on Windows showcase torch.compile speedup ratio over eager mode as shown in Figure 2. Both training and inference achieved similar significant improvements.

Figure 2. Torch.compile Performance Gains Over Eager Mode on Windows

Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide full graph mode quantization pipelines with enhanced computational efficiency. Refer to PT2E tutorial for details.
Enable AOTInductor and torch.export on Linux to simplify deployment workflows. Refer to AOTInductor tutorial for details.
Enable profiler on both Windows and Linux to facilitate model performance analysis. Refer to the PyTorch profiler tutorial for details.

Review the Getting Started on Intel GPU Guide for a tour of the environment setup and a quick start on Intel GPUs.

Future Work

Looking ahead, we will continue the Intel GPU upstream efforts in future PyTorch releases to:

Attain state-of-the-art PyTorch-native performance to showcase competitive GEMM computational efficiency for torch.compile, and enhance performance for LLM models through FlexAttention and lower precision data types.
Broaden feature compatibility by delivering distributed XCCL backend support for Intel® Data Center GPU Max Series.
Expand accelerator support across core PyTorch ecosystem components including torchao, torchtune, and torchtitan.

Follow along in the PyTorch Dev Discussion to learn more about Intel GPU & CPU enabling status and features. As we get further along, we will create tickets on GitHub to document our progress.

Summary

In this blog, we reviewed the Intel GPU upstream progress starting in PyTorch 2.4 and highlighted the new features of PyTorch 2.7 that accelerate AI workload performance across various Intel GPUs. These new features, especially SDPA on Windows, achieved up to 3x inference (Stable Diffusion, float16) gain over PyTorch 2.6 release on Intel Arc B580 Graphics and Intel Core Ultra 7 Processor 258V with Intel Arc Graphics 140V. Also, torch.compile on Windows delivers similar performance advantages over eager mode on Dynamo benchmarks as on Linux.

Acknowledgements

We want to thank the following PyTorch maintainers for their technical discussions and insights: Nikita Shulga, Jason Ansel, Andrey Talman, Alban Desmaison, and Bin Bao.

We also thank collaborators from PyTorch for their professional support and guidance.

Product and Performance Information

Measurement on Intel Core Ultra 7 258V: 2200 MHz, 8 Core(s), 8 Logical Processor(s) with Intel Arc 140V GPU (16GB), GPU memory 18.0 GB, using Intel Graphics Driver 32.0.101.6647 (WHQL Certified), Windows 11 Pro – 24H2. And Intel Core Ultra 5 245KF: 4200 MHz, 14 Core(s), 14 Logical Processor(s), Intel Arc B580 Graphics, dedicated GPU memory 12.0 GB, shared GPU memory 15.8 GB, using Intel Graphics Driver 32.0.101.6647 (WHQL Certified), Windows 11 Enterprise LTSC – 24H2. Test by Intel on Apr 8th, 2025.

Notices and Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.  See backup for configuration details.  No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

AI Disclaimer

AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at www.intel.com/AIPC. Results may vary.

Accelerate PyTorch 2.7 on Intel® GPUs

April 26, 2025

by Intel PyTorch Team PyTorch

Incremental improvements of Intel GPU support in PyTorch

Over time, we have expanded Intel GPU Support across Windows and Linux, including these products:
Simpler installation of torch-xpu PIP wheels and an effortless setup experience.
High ATen operation coverage with SYCL and oneDNN for smooth eager mode support with functionality and performance.
Notable speedups with torch.compile through default TorchInductor and Triton backend, proved by measurable performance gains with Hugging Face, TIMM, and TorchBench benchmarks.

Check out the detailed advancements in these related release blogs: PyTorch 2.4, PyTorch 2.5, and PyTorch 2.6.

What’s New in PyTorch 2.7

These are the features in PyTorch 2.7 that were added to help accelerate performance on Intel GPUs.

Improve scaled dot-product attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs.
With the new SDPA optimization for Intel GPUs on PyTorch 2.7, Stable Diffusion float16 inference achieved up to 3x gain over PyTorch 2.6 release on Intel® Arc B580 Graphics and Intel® Core Ultra 7 Processor 258V with Intel® Arc Graphics 140V on eager mode. See Figure 1 below.

Figure 1. PyTorch 2.7 Stable Diffusion Performance Gains Over PyTorch 2.6

Enable torch.compile on Windows 11 for Intel GPUs, delivering the performance advantages over eager mode as on Linux. With this, Intel GPUs became the first accelerator to support torch.compile on Windows. Refer to Windows tutorial for details.
Graph model (torch.compile) is enabled in Windows 11 for the first time across Intel GPUs, delivering the performance advantages over eager mode as on Linux by PyTorch 2.7. The latest performance data was measured on top of PyTorch Dynamo Benchmarking Suite using Intel® Arc B580 Graphics on Windows showcase torch.compile speedup ratio over eager mode as shown in Figure 2. Both training and inference achieved similar significant improvements.

Figure 2. Torch.compile Performance Gains Over Eager Mode on Windows

Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide full graph mode quantization pipelines with enhanced computational efficiency. Refer to PT2E tutorial for details.
Enable AOTInductor and torch.export on Linux to simplify deployment workflows. Refer to AOTInductor tutorial for details.
Enable profiler on both Windows and Linux to facilitate model performance analysis. Refer to the PyTorch profiler tutorial for details.

Review the Getting Started on Intel GPU Guide for a tour of the environment setup and a quick start on Intel GPUs.

Future Work

Looking ahead, we will continue the Intel GPU upstream efforts in future PyTorch releases to:

Attain state-of-the-art PyTorch-native performance to showcase competitive GEMM computational efficiency for torch.compile, and enhance performance for LLM models through FlexAttention and lower precision data types.
Broaden feature compatibility by delivering distributed XCCL backend support for Intel® Data Center GPU Max Series.
Expand accelerator support across core PyTorch ecosystem components including torchao, torchtune, and torchtitan.

Follow along in the PyTorch Dev Discussion to learn more about Intel GPU & CPU enabling status and features. As we get further along, we will create tickets on GitHub to document our progress.

Summary

Acknowledgements

We want to thank the following PyTorch maintainers for their technical discussions and insights: Nikita Shulga, Jason Ansel, Andrey Talman, Alban Desmaison, and Bin Bao.

We also thank collaborators from PyTorch for their professional support and guidance.

Product and Performance Information

Notices and Disclaimers

Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

AI Disclaimer

Accelerate PyTorch 2.7 on Intel® GPUs

April 25, 2025

by the Intel PyTorch Team PyTorch

Incremental improvements of Intel GPU support in PyTorch

Over time, we have expanded Intel GPU Support across Windows and Linux, including these products:
Simpler installation of torch-xpu PIP wheels and an effortless setup experience.
High ATen operation coverage with SYCL and oneDNN for smooth eager mode support with functionality and performance.
Notable speedups with torch.compile through default TorchInductor and Triton backend, proved by measurable performance gains with Hugging Face, TIMM, and TorchBench benchmarks.

Check out the detailed advancements in these related release blogs: PyTorch 2.4, PyTorch 2.5, and PyTorch 2.6.

What’s New in PyTorch 2.7

These are the features in PyTorch 2.7 that were added to help accelerate performance on Intel GPUs.

Improve scaled dot-product attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs.
With the new SDPA optimization for Intel GPUs on PyTorch 2.7, Stable Diffusion float16 inference achieved up to 3x gain over PyTorch 2.6 release on Intel® Arc™ B580 Graphics and Intel® Core™ Ultra 7 Processor 258V with Intel® Arc™ Graphics 140V on eager mode. See Figure 1 below.

Figure 1. PyTorch 2.7 Stable Diffusion Performance Gains Over PyTorch 2.6

Enable torch.compile on Windows 11 for Intel GPUs, delivering the performance advantages over eager mode as on Linux. With this, Intel GPUs became the first accelerator to support torch.compile on Windows. Refer to Windows tutorial for details.
Graph model (torch.compile) is enabled in Windows 11 for the first time across Intel GPUs, delivering the performance advantages over eager mode as on Linux by PyTorch 2.7. The latest performance data was measured on top of PyTorch Dynamo Benchmarking Suite using Intel® Arc™ B580 Graphics on Windows showcase torch.compile speedup ratio over eager mode as shown in Figure 2. Both training and inference achieved similar significant improvements.

Figure 2. Torch.compile Performance Gains Over Eager Mode on Windows

Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide full graph mode quantization pipelines with enhanced computational efficiency. Refer to PT2E tutorial for details.
Enable AOTInductor and torch.export on Linux to simplify deployment workflows. Refer to AOTInductor tutorial for details.
Enable profiler on both Windows and Linux to facilitate model performance analysis. Refer to the PyTorch profiler tutorial for details.

Review the Getting Started on Intel GPU Guide for a tour of the environment setup and a quick start on Intel GPUs.

Future Work

Looking ahead, we will continue the Intel GPU upstream efforts in future PyTorch releases to:

Attain state-of-the-art PyTorch-native performance to showcase competitive GEMM computational efficiency for torch.compile, and enhance performance for LLM models through FlexAttention and lower precision data types.
Broaden feature compatibility by delivering distributed XCCL backend support for Intel® Data Center GPU Max Series.
Expand accelerator support across core PyTorch ecosystem components including torchao, torchtune, and torchtitan.

Follow along in the PyTorch Dev Discussion to learn more about Intel GPU & CPU enabling status and features. As we get further along, we will create tickets on GitHub to document our progress.

Summary

Acknowledgments

We want to thank the following PyTorch maintainers for their technical discussions and insights: Nikita Shulga, Jason Ansel, Andrey Talman, Alban Desmaison, and Bin Bao.

We also thank collaborators from PyTorch for their professional support and guidance.

Product and Performance Information

Notices and Disclaimers

Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

AI Disclaimer

PyTorch 2.7 Release

April 24, 2025

by Intel PyTorch Team PyTorch

We are excited to announce the release of PyTorch® 2.7 (release notes)! This release features:

support for the NVIDIA Blackwell GPU architecture and pre-built wheels for CUDA 12.8 across Linux x86 and arm64 architectures.
torch.compile support for Torch Function Modes which enables users to override any *torch.** operation to implement custom user-defined behavior.
Mega Cache which allows users to have end-to-end portable caching for torch;
new features for FlexAttention – LLM first token processing, LLM throughput mode optimization and Flex Attention for Inference.

This release is composed of 3262 commits from 457 contributors since PyTorch 2.6. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.7. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Beta	Prototype
Torch.Compile support for Torch Function Modes	NVIDIA Blackwell Architecture Support
Mega Cache	PyTorch Native Context Parallel
	Enhancing Intel GPU Acceleration
	FlexAttention LLM first token processing on x86 CPUs
	FlexAttention LLM throughput mode optimization on x86 CPUs
	Foreach Map
	Flex Attention for Inference
	Prologue Fusion Support in Inductor

*To see a full list of public feature submissions click here.

BETA FEATURES

[Beta] Torch.Compile support for Torch Function Modes

This feature enables users to override any *torch.** operation to implement custom user-defined behavior. For example, ops can be rewritten to accommodate a specific backend. This is used in FlexAttention to re-write indexing ops.

See the tutorial for more information.

[Beta] Mega Cache

Mega Cache allows users to have end-to-end portable caching for torch. The intended use case is after compiling and executing a model, the user calls torch.compiler.save_cache_artifacts() which will return the compiler artifacts in a portable form. Later, potentially on a different machine, the user may call torch.compiler.load_cache_artifacts() with these artifacts to pre-populate the torch.compile caches in order to jump-start their cache.

See the tutorial for more information.

PROTOTYPE FEATURES

[Prototype] NVIDIA Blackwell Architecture Support

PyTorch 2.7 introduces support for NVIDIA’s new Blackwell GPU architecture and ships pre-built wheels for CUDA 12.8. For more details on CUDA 12.8 see CUDA Toolkit Release.

Core components and libraries including cuDNN, NCCL, and CUTLASS have been upgraded to ensure compatibility with Blackwell platforms.
PyTorch 2.7 includes Triton 3.3, which adds support for the Blackwell architecture with torch.compile compatibility.
To utilize these new features, install PyTorch with CUDA 12.8 using: pip install torch==2.7.0 –index-url https://download.pytorch.org/whl/cu128

More context can also be found here.

[Prototype] PyTorch Native Context Parallel

PyTorch Context Parallel API allows users to create a Python context so that every *torch.nn.functional.scaled_dot_product_attention() *call within will run with context parallelism. Currently, PyTorch Context Parallel supports 3 attention backends: 1. Flash attention; 2. Efficient attention; and 3. cuDNN attention.

As an example, this is used within TorchTitan as the Context Parallel solution for LLM training.

See tutorial here.

[Prototype] Enhancing Intel GPU Acceleration

This latest release introduces enhanced performance optimizations for Intel GPU architectures. These improvements accelerate workloads across various Intel GPUs through the following key enhancements:

Enable torch.compile on Windows 11 for Intel GPUs, delivering the performance advantages over eager mode as on Linux.
Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide a full graph mode quantization pipelines with enhanced computational efficiency.
Improve Scaled Dot-Product Attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs.
Enable AOTInuctor and torch.export on Linux to simplify deployment workflows.
Implement more Aten operators to enhance the continuity of operators execution on Intel GPU and increase the performance on Intel GPU in eager mode.
Enable profiler on both Windows and Linux to facilitate model performance analysis.
Expand the Intel GPUs support to Intel® Core Ultra Series 2 with Intel® Arc Graphics, and Intel® Arc B-Series graphics on both Windows and Linux.

For more information regarding Intel GPU support, please refer to Getting Started Guide.

[Prototype] FlexAttention LLM first token processing on x86 CPUs

FlexAttention x86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations — such as PageAttention, which is critical for LLM inference—via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific scaled_dot_product_attention operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile.

[Prototype] FlexAttention LLM throughput mode optimization

The performance of FlexAttention on x86 CPUs for LLM inference throughput scenarios has been further improved by adopting the new C++ micro-GEMM template ability. This addresses the performance bottlenecks for large batch size scenarios present in PyTorch 2.6. With this enhancement, users can transparently benefit from better performance and a smoother experience when using FlexAttention APIs and torch.compile for LLM throughput serving on x86 CPUs.

[Prototype] Foreach Map

This feature uses torch.compile to allow users to apply any pointwise or user-defined function (e.g. torch.add) to lists of tensors, akin to the existing *torch.foreach** ops. The main advantage over the existing *torch.foreach** ops is that any mix of scalars or lists of tensors can be supplied as arguments, and even user-defined python functions can be lifted to apply to lists of tensors. Torch.compile will automatically generate a horizontally fused kernel for optimal performance.

See tutorial here.

[Prototype] Flex Attention for Inference

In release 2.5.0, FlexAttention* torch.nn.attention.flex_attention* was introduced for ML researchers who’d like to customize their attention kernels without writing kernel code. This update introduces a decoding backend optimized for inference, supporting GQA and PagedAttention, along with feature updates including nested jagged tensor support, performance tuning guides and trainable biases support.

[Prototype] Prologue Fusion Support in Inductor

Prologue fusion optimizes matrix multiplication (matmul) operations by fusing operations that come before the matmul into the matmul kernel itself, improving performance by reducing global memory bandwidth.

PyTorch 2.7 Release

April 24, 2025

by Intel PyTorch Team PyTorch

We are excited to announce the release of PyTorch® 2.7 (release notes)! This release features:

support for the NVIDIA Blackwell GPU architecture and pre-built wheels for CUDA 12.8 across Linux x86 and arm64 architectures.
torch.compile support for Torch Function Modes which enables users to override any *torch.** operation to implement custom user-defined behavior.
Mega Cache which allows users to have end-to-end portable caching for torch;
new features for FlexAttention – LLM first token processing, LLM throughput mode optimization and Flex Attention for Inference.

Beta	Prototype
Torch.Compile support for Torch Function Modes	NVIDIA Blackwell Architecture Support
Mega Cache	PyTorch Native Context Parallel
	Enhancing Intel GPU Acceleration
	FlexAttention LLM first token processing on x86 CPUs
	FlexAttention LLM throughput mode optimization on x86 CPUs
	Foreach Map
	Flex Attention for Inference
	Prologue Fusion Support in Inductor

*To see a full list of public feature submissions click here.

BETA FEATURES

[Beta] Torch.Compile support for Torch Function Modes

See the tutorial for more information.

[Beta] Mega Cache

See the tutorial for more information.

PROTOTYPE FEATURES

[Prototype] NVIDIA Blackwell Architecture Support

PyTorch 2.7 introduces support for NVIDIA’s new Blackwell GPU architecture and ships pre-built wheels for CUDA 12.8. For more details on CUDA 12.8 see CUDA Toolkit Release.

Core components and libraries including cuDNN, NCCL, and CUTLASS have been upgraded to ensure compatibility with Blackwell platforms.
PyTorch 2.7 includes Triton 3.3, which adds support for the Blackwell architecture with torch.compile compatibility.
To utilize these new features, install PyTorch with CUDA 12.8 using: pip install torch==2.7.0 –index-url https://download.pytorch.org/whl/cu128

More context can also be found here.

[Prototype] PyTorch Native Context Parallel

As an example, this is used within TorchTitan as the Context Parallel solution for LLM training.

See tutorial here.

[Prototype] Enhancing Intel GPU Acceleration

Enable torch.compile on Windows 11 for Intel GPUs, delivering the performance advantages over eager mode as on Linux.
Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide a full graph mode quantization pipelines with enhanced computational efficiency.
Improve Scaled Dot-Product Attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs.
Enable AOTInuctor and torch.export on Linux to simplify deployment workflows.
Implement more Aten operators to enhance the continuity of operators execution on Intel GPU and increase the performance on Intel GPU in eager mode.
Enable profiler on both Windows and Linux to facilitate model performance analysis.
Expand the Intel GPUs support to Intel® Core Ultra Series 2 with Intel® Arc Graphics, and Intel® Arc B-Series graphics on both Windows and Linux.

For more information regarding Intel GPU support, please refer to Getting Started Guide.

[Prototype] FlexAttention LLM first token processing on x86 CPUs

[Prototype] FlexAttention LLM throughput mode optimization

[Prototype] Foreach Map

See tutorial here.

[Prototype] Flex Attention for Inference

[Prototype] Prologue Fusion Support in Inductor

PyTorch 2.7 Release

April 23, 2025

by Facebook PyTorch

We are excited to announce the release of PyTorch® 2.7 (release notes)! This release features:

support for the NVIDIA Blackwell GPU architecture and pre-built wheels for CUDA 12.8 across Linux x86 and arm64 architectures.
torch.compile support for Torch Function Modes which enables users to override any *torch.** operation to implement custom user-defined behavior.
Mega Cache which allows users to have end-to-end portable caching for torch;
new features for FlexAttention – LLM first token processing, LLM throughput mode optimization and Flex Attention for Inference.

Beta	Prototype
Torch.Compile support for Torch Function Modes	NVIDIA Blackwell Architecture Support
Mega Cache	PyTorch Native Context Parallel
	Enhancing Intel GPU Acceleration
	FlexAttention LLM first token processing on X86 CPUs
	FlexAttention LLM throughput mode optimization on X86 CPUs
	Foreach Map
	Flex Attention for Inference
	Prologue Fusion Support in Inductor

*To see a full list of public feature submissions click here.

BETA FEATURES

[Beta] Torch.Compile support for Torch Function Modes

See the tutorial for more information.

[Beta] Mega Cache

See the tutorial for more information.

PROTOTYPE FEATURES

[Prototype] NVIDIA Blackwell Architecture Support

PyTorch 2.7 introduces support for NVIDIA’s new Blackwell GPU architecture and ships pre-built wheels for CUDA 12.8. For more details on CUDA 12.8 see CUDA Toolkit Release.

Core components and libraries including cuDNN, NCCL, and CUTLASS have been upgraded to ensure compatibility with Blackwell platforms.
PyTorch 2.7 includes Triton 3.3, which adds support for the Blackwell architecture with torch.compile compatibility.
To utilize these new features, install PyTorch with CUDA 12.8 using: pip install torch==2.7.0 –index-url https://download.pytorch.org/whl/cu128

More context can also be found here.

[Prototype] PyTorch Native Context Parallel

As an example, this is used within TorchTitan as the Context Parallel solution for LLM training.

See tutorial here.

[Prototype] Enhancing Intel GPU Acceleration

Enable torch.compile on Windows 11 for Intel GPUs, delivering the performance advantages over eager mode as on Linux.
Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide a full graph mode quantization pipelines with enhanced computational efficiency.
Improve Scaled Dot-Product Attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs.
Enable AOTInuctor and torch.export on Linux to simplify deployment workflows.
Implement more Aten operators to enhance the continuity of operators execution on Intel GPU and increase the performance on Intel GPU in eager mode.
Enable profiler on both Windows and Linux to facilitate model performance analysis.
Expand the Intel GPUs support to Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics, and Intel® Arc™ B-Series graphics on both Windows and Linux.

For more information regarding Intel GPU support, please refer to Getting Started Guide.

[Prototype] FlexAttention LLM first token processing on X86 CPUs

FlexAttention X86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations — such as PageAttention, which is critical for LLM inference—via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific scaled_dot_product_attention operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile.

[Prototype] FlexAttention LLM throughput mode optimization

[Prototype] Foreach Map

See tutorial here.

[Prototype] Flex Attention for Inference

[Prototype] Prologue Fusion Support in Inductor

Accelerating Whisper on Arm with PyTorch and Hugging Face Transformers

April 8, 2025

by Pareena Verma, Arm PyTorch

Automatic speech recognition (ASR) has revolutionized how we interact with technology, clearing the way for applications like real-time audio transcription, voice assistants, and accessibility tools. OpenAI Whisper is a powerful model for ASR, capable of multilingual speech recognition and translation.

A new Arm Learning Path is now available that explains how to accelerate Whisper on Arm-based cloud instances using PyTorch and Hugging Face transformers.

Why Run Whisper on Arm?

Arm processors are popular in cloud infrastructure for their efficiency, performance, and cost-effectiveness. With major cloud providers such as AWS, Azure, and Google Cloud offering Arm-based instances, running machine learning workloads on this architecture is becoming increasingly attractive.

What You’ll Learn

The Arm Learning Path provides a structured approach to setting up and accelerating Whisper on Arm-based cloud instances. Here’s what you cover:

1. Set Up Your Environment

Before running Whisper, you must set up your development environment. The learning path walks you through setting up an Arm-based cloud instance and installing all dependencies, such as PyTorch, Transformers, and ffmpeg.

2. Run Whisper with PyTorch and Hugging Face Transformers

Once the environment is ready, you will use the Hugging Face transformer library with PyTorch to load and execute Whisper for speech-to-text conversion. The tutorial provides a step-by-step approach for processing audio files and generating audio transcripts.

3. Measure and Evaluate Performance

To ensure efficient execution, you learn how to measure transcription speeds and compare different optimization techniques. The guide provides insights into interpreting performance metrics and making informed decisions on your deployment.

Try it Yourself

Upon completion of this tutorial, you know how to:

Deploy Whisper on an Arm-based cloud instance.
Implement performance optimizations for efficient execution.
Evaluate transcription speeds and optimize further based on results.

Try the live demo today and see audio transcription in action on Arm: Whisper on Arm Demo.

Accelerating Whisper on Arm with PyTorch and Hugging Face Transformers

April 8, 2025

by Pareena Verma, Arm PyTorch

A new Arm Learning Path is now available that explains how to accelerate Whisper on Arm-based cloud instances using PyTorch and Hugging Face transformers.

Why Run Whisper on Arm?

What You’ll Learn

The Arm Learning Path provides a structured approach to setting up and accelerating Whisper on Arm-based cloud instances. Here’s what you cover:

1. Set Up Your Environment

2. Run Whisper with PyTorch and Hugging Face Transformers

3. Measure and Evaluate Performance

Try it Yourself

Upon completion of this tutorial, you know how to:

Deploy Whisper on an Arm-based cloud instance.
Implement performance optimizations for efficient execution.
Evaluate transcription speeds and optimize further based on results.

Try the live demo today and see audio transcription in action on Arm: Whisper on Arm Demo.

PyTorch Day France 2025: Call For Proposals Open

April 3, 2025

by Facebook PyTorch

We’re pleased to announce PyTorch Day France 2025, a dedicated gathering of the PyTorch community held 7 May 2025 in Paris, France. Proudly hosted by the PyTorch Foundation and co-located with GOSIM AI Paris 2025, this event will bring together developers, researchers, and practitioners driving innovation in open source AI and machine learning.

Whether you’re building cutting-edge models or contributing to the ecosystem, PyTorch Day France is your opportunity to connect, collaborate, and help shape the future of deep learning.

Why Attend?

Set in the vibrant atmosphere of STATION F, the world’s largest startup campus, PyTorch Day France will offer a full day of:

Insightful Technical Talks
Interactive Discussions
Engaging Poster Sessions

The event is designed to foster open exchange across the PyTorch ecosystem, providing a space to learn from peers, share practical insights, and explore the latest research and applications in AI.

Submit a Proposal

We are currently accepting proposals for talks. If you have a project, idea, or research story you’d like to share with the PyTorch community, we want to hear from you.

📩 Email your talk title and abstract to pytorchevents@linuxfoundation.org for consideration.

Registration

To register for PyTorch Day France, please visit the GOSIM AI Paris website, and use the code PYTORCHFRIEND to receive 25% off.

👉 https://paris2025.gosim.org/

We encourage early registration to secure your spot and ensure access to both PyTorch Day France and the broader GOSIM AI Paris programming.

Venue

STATION F
5 Parv. Alan Turing, 75013 Paris, France
A landmark of innovation and entrepreneurship in the heart of Paris.

Travel and Accommodations

Participants are responsible for their own travel and lodging. For those arriving internationally, Paris Charles de Gaulle Airport is approximately 38.4 km from STATION F. Additional information about accommodations and transportation may be available on the GOSIM AI Paris website.

Questions?

For any inquiries, please contact us at pytorchevents@linuxfoundation.org.

We look forward to welcoming the PyTorch community to Paris this May for a day of collaboration, learning, and open source AI innovation.

Background – what is an H200?

What is PyTorch Float8 rowwise?

Showcasing Float8 Rowwise Loss convergence vs BF16 at 1600 and 1920 GPU Scale:

Long Term Training stability with Float8 Rowwise

Determinism in TorchTitan

Net speedups at various Scales with Float8 rowwise:

How can I use Float8 Rowwise in my training?

Future Updates:

Incremental improvements of Intel GPU support in PyTorch

What’s New in PyTorch 2.7

Future Work

Summary

Acknowledgements

Product and Performance Information

Notices and Disclaimers

AI Disclaimer

Incremental improvements of Intel GPU support in PyTorch

What’s New in PyTorch 2.7

Future Work

Summary

Acknowledgements

Product and Performance Information

Notices and Disclaimers

AI Disclaimer

Incremental improvements of Intel GPU support in PyTorch

What’s New in PyTorch 2.7

Future Work

Summary

Acknowledgments

Product and Performance Information

Notices and Disclaimers

AI Disclaimer

BETA FEATURES

[Beta] Torch.Compile support for Torch Function Modes

[Beta] Mega Cache

PROTOTYPE FEATURES

[Prototype] NVIDIA Blackwell Architecture Support

[Prototype] PyTorch Native Context Parallel

[Prototype] Enhancing Intel GPU Acceleration

[Prototype] FlexAttention LLM first token processing on x86 CPUs

[Prototype] FlexAttention LLM throughput mode optimization

[Prototype] Foreach Map

[Prototype] Flex Attention for Inference

[Prototype] Prologue Fusion Support in Inductor

BETA FEATURES

[Beta] Torch.Compile support for Torch Function Modes

[Beta] Mega Cache

PROTOTYPE FEATURES

[Prototype] NVIDIA Blackwell Architecture Support

[Prototype] PyTorch Native Context Parallel

[Prototype] Enhancing Intel GPU Acceleration

[Prototype] FlexAttention LLM first token processing on x86 CPUs

[Prototype] FlexAttention LLM throughput mode optimization

[Prototype] Foreach Map

[Prototype] Flex Attention for Inference

[Prototype] Prologue Fusion Support in Inductor

BETA FEATURES

[Beta] Torch.Compile support for Torch Function Modes

[Beta] Mega Cache

PROTOTYPE FEATURES

[Prototype] NVIDIA Blackwell Architecture Support

[Prototype] PyTorch Native Context Parallel

[Prototype] Enhancing Intel GPU Acceleration

[Prototype] FlexAttention LLM first token processing on X86 CPUs

[Prototype] FlexAttention LLM throughput mode optimization

[Prototype] Foreach Map

[Prototype] Flex Attention for Inference

[Prototype] Prologue Fusion Support in Inductor

Why Attend?

Submit a Proposal

Registration

Venue

Travel and Accommodations

Questions?

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.