The Path to Achieve PyTorch Performance Boost on Windows CPU

PyTorch’s lower CPU performance on Windows compared to Linux has been a significant issue, and multiple factors contribute to the disparity. Our investigation pinpointed two primary causes: the inefficiency of the default Windows malloc memory allocator and the absence of SIMD vectorization optimizations on the Windows platform. In this article, we show how PyTorch CPU performance on Windows has improved over previous releases and where it stands as of PyTorch 2.4.1.

Memory Allocation Optimization in PyTorch 2.1.2 and later

In versions prior to PyTorch 2.1.2, PyTorch relied on the operating system’s default malloc function for memory allocation. The default malloc on Windows is less efficient than the malloc implementation on Linux, which led to longer memory allocation times and reduced performance. To address this, we replaced the default Windows malloc with mimalloc, a more efficient memory allocator developed by Microsoft. This change, included in PyTorch 2.1.2 and later, has significantly enhanced the CPU performance of PyTorch on Windows, as shown in Figure 1.1.


PyTorch CPU Performance Improvement on Windows with Memory Allocation Optimization

Figure 1.1: Relative throughput improvement achieved by upgrading from Windows PyTorch version 2.0.1 to 2.1.2 (higher is better).

The graph illustrates that with the release of PyTorch 2.1.2, there has been a notable enhancement in CPU performance on the Windows platform. The degree of improvement varies across different models, which can be attributed to the diverse mix of operations they perform and their corresponding memory access patterns. While the BERT model shows a modest performance gain, models like ResNet50 and MobileNet-v3 Large benefit from more pronounced improvements.
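The post does not describe the benchmark harness behind Figure 1.1, so the following is only a minimal sketch of how a relative throughput figure of this kind could be computed. The helper names (`measure_throughput`, `relative_improvement`), the warm-up count, and the stand-in workload are assumptions made for illustration; in practice the workload would be a model's inference call run under each PyTorch version.

```python
import time


def measure_throughput(workload, iterations=50, warmup=5):
    """Return iterations per second for a zero-argument callable."""
    # Warm-up runs are excluded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        workload()
    start = time.perf_counter()
    for _ in range(iterations):
        workload()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


def relative_improvement(baseline, candidate):
    """Ratio of candidate throughput to baseline throughput (1.0 = no change)."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate / baseline


if __name__ == "__main__":
    # Stand-in workload (hypothetical); replace with a model inference call.
    throughput = measure_throughput(lambda: sum(range(10_000)))
    print(f"{throughput:.1f} iterations/s")
```

Running the same workload under two PyTorch versions and passing the two throughputs to `relative_improvement` gives a "higher is better" ratio comparable in spirit to the one plotted.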

On a high-performance CPU, memory allocation becomes a performance bottleneck, which is why addressing it has led to such significant performance improvements.

As the graphs below show, the memory allocation optimization significantly improves PyTorch CPU performance on Windows. However, a noticeable gap remains when compared to performance on Linux. The absence of vectorization optimizations in the Windows variant of PyTorch CPU is a key factor in this remaining gap.


Windows vs Linux Performance on PyTorch 2.0.1

Figure 1.2: Relative performance of Windows vs Linux with PyTorch version 2.0.1 (higher is better).


Windows vs Linux Performance on PyTorch 2.1.2

Figure 1.3: Relative performance of Windows vs Linux with PyTorch version 2.1.2 (higher is better).

Vectorization Optimization in PyTorch 2.4.1 and later

Prior to PyTorch 2.4.1, the Windows build of PyTorch lacked SIMD vectorization optimizations, a feature that the Linux build leveraged for improved performance. This discrepancy was due to integration issues on Windows with the SLEEF library (SIMD Library for Evaluating Elementary Functions, vectorized libm and DFT), which is essential for efficient trigonometric calculations. Through a collaborative effort with engineers from ARM and Qualcomm, these challenges were resolved, enabling the integration of SIMD into PyTorch for Windows. The PyTorch 2.4.1 update has thus significantly enhanced PyTorch’s CPU performance on Windows, as shown in Figure 2.1.
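One way to see which vector instruction set a given PyTorch build reports is `torch.backends.cpu.get_cpu_capability()`. The sketch below is hedged: the wrapper name `cpu_capability` is hypothetical, the function is looked up defensively because it may be missing in older releases, and a value such as AVX2 or AVX512 only suggests, rather than proves, that vectorized kernels are in use for a particular operator.

```python
import importlib


def cpu_capability():
    """Return PyTorch's reported CPU capability string, or None if unavailable."""
    try:
        cpu_backend = importlib.import_module("torch.backends.cpu")
    except (ImportError, OSError):
        return None  # PyTorch is not installed or failed to load here.
    # Look the function up defensively; older releases may not provide it.
    getter = getattr(cpu_backend, "get_cpu_capability", None)
    if getter is None:
        return None
    return getter()


if __name__ == "__main__":
    print("CPU capability:", cpu_capability())
```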


PyTorch CPU Performance Improvement on Windows with Vectorization Optimization

Figure 2.1: Relative throughput improvement achieved by upgrading from PyTorch CPU version 2.1.2 to 2.4.1 (higher is better).

As shown in the graph below, PyTorch CPU performance on Windows has reached a level comparable to that on Linux.


Windows vs Linux Performance on PyTorch 2.4.1

Figure 2.2: Relative performance of Windows vs Linux with PyTorch version 2.4.1 (higher is better).

CONCLUSION

From PyTorch 2.0.1 to PyTorch 2.4.1, the CPU performance gap between Windows and Linux has been continuously narrowing. We compared the ratio of CPU performance on Windows to CPU performance on Linux across different versions, and the results are shown in the following graph.


Windows vs Linux Performance on Different Versions of PyTorch

Figure 3: Performance ratio of Windows to Linux with different versions of PyTorch (higher is better).

The graph shows that with PyTorch 2.4.1, CPU performance on Windows has nearly converged with that on Linux, and on some models it has even surpassed Linux. For example, for the DistilBERT and RoBERTa models, the Windows-to-Linux CPU performance ratio reaches 102%. However, certain models, including MobileNet-v3, still show a performance discrepancy. Intel engineers will continue to collaborate with Meta engineers to reduce the PyTorch CPU performance gap between Windows and Linux.
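The ratio discussed above can be read as Windows performance divided by Linux performance, expressed as a percentage. A minimal sketch follows, assuming performance is measured as throughput; the function name and the throughput numbers are hypothetical placeholders chosen only to reproduce a 102% ratio, not measurements from this post.

```python
def windows_to_linux_ratio(windows_throughput, linux_throughput):
    """Windows throughput as a percentage of Linux throughput."""
    if linux_throughput <= 0:
        raise ValueError("linux_throughput must be positive")
    return 100.0 * windows_throughput / linux_throughput


# Hypothetical throughputs (samples/s), for illustration only.
ratio = windows_to_linux_ratio(windows_throughput=51.0, linux_throughput=50.0)
print(f"{ratio:.0f}%")  # prints "102%"
```

A value above 100% means the Windows run was faster than the Linux run for that model; a value below 100% means a gap remains.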

HOW TO TAKE ADVANTAGE OF THE OPTIMIZATIONS

Install PyTorch CPU 2.4.1 or later on Windows from the official repository, and you may automatically see a performance boost from the memory allocation and vectorization optimizations.
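To confirm that an installed build is new enough, the version string can be compared against the two releases named in this post. This is a minimal sketch: the helper names and the returned dictionary keys are hypothetical, and the thresholds simply restate the versions given above for Windows CPU builds (2.1.2 for mimalloc, 2.4.1 for SIMD).

```python
import re

MIMALLOC_MIN = (2, 1, 2)  # memory allocation optimization, per this post
SIMD_MIN = (2, 4, 1)      # vectorization optimization, per this post


def parse_version(version):
    """Extract (major, minor, patch) from strings such as '2.4.1+cpu'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def windows_cpu_optimizations(version):
    """Map a PyTorch version string to the optimizations described above."""
    parsed = parse_version(version)
    return {
        "mimalloc": parsed >= MIMALLOC_MIN,
        "simd": parsed >= SIMD_MIN,
    }


if __name__ == "__main__":
    try:
        import torch
        print(torch.__version__, windows_cpu_optimizations(torch.__version__))
    except ImportError:
        print("PyTorch is not installed")
```

For example, `windows_cpu_optimizations("2.1.2")` reports the memory allocation optimization but not the vectorization one.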

ACKNOWLEDGMENTS

The results presented in this blog post were achieved through the collaborative effort of the Intel PyTorch team and Meta. We would like to express our sincere gratitude to Xu Han, Jiong Gong, Haozhe Zhu, Mingfei Ma, Chuanqi Wang, Guobing Chen and Eikan Wang. Their expertise and dedication have been instrumental in achieving the optimizations and performance improvements discussed here. Thanks to Jiachen Pu from the community for his participation in the issue discussion and for suggesting the use of mimalloc. We’d also like to express our gratitude to Microsoft for providing such an easily integrated and performant memory allocation library. Thanks to Pierre Blanchard and Nathan Sircombe from ARM and Alex Reinking from Qualcomm for their contributions in overcoming the compatibility issues with the SLEEF integration in PyTorch for Windows. Finally, we want to thank Jing Xu, Weizhuo Zhang and Zhaoqiong Zheng for their contributions to this blog.

Product and Performance Information

The configurations in the table were collected with svr-info. Tested by Intel on August 30, 2024.

| Specification | Configuration1 | Configuration2 |
| --- | --- | --- |
| Name | ThinkBook 14 G5+ IRH | ThinkBook 14 G5+ IRH |
| Time | Fri Aug 30 02:43:02 PM UTC 2024 | Fri Aug 30 02:43:02 PM UTC 2024 |
| System | LENOVO | LENOVO |
| Baseboard | LENOVO | LENOVO |
| Chassis | LENOVO | LENOVO |
| CPU Model | 13th Gen Intel(R) Core(TM) i7-13700H | 13th Gen Intel(R) Core(TM) i7-13700H |
| Microarchitecture | Unknown Intel | Unknown Intel |
| Sockets | 1 | 1 |
| Cores per Socket | 14 | 14 |
| Hyperthreading | Enabled | Enabled |
| CPUs | 20 | 20 |
| Intel Turbo Boost | Enabled | Enabled |
| Base Frequency | 2.4GHz | 2.4GHz |
| All-core Maximum Frequency | 4.7GHz | 4.7GHz |
| Maximum Frequency | 4.8GHz | 4.8GHz |
| NUMA Nodes | 1 | 1 |
| Prefetchers | L2 HW: Enabled, L2 Adj.: Enabled, DCU HW: Enabled, DCU IP: Enabled | L2 HW: Enabled, L2 Adj.: Enabled, DCU HW: Enabled, DCU IP: Enabled |
| PPINs | | |
| Accelerators | DLB, DSA, IAA, QAT | DLB, DSA, IAA, QAT |
| Installed Memory | 32GB (8x4GB LPDDR4 7400 MT/s [5200 MT/s]) | 32GB (8x4GB LPDDR4 7400 MT/s [5200 MT/s]) |
| Hugepagesize | 2048kb | 2048kb |
| Transparent Huge Pages | madvise | madvise |
| Automatic NUMA Balancing | Disabled | Disabled |
| NIC | “1. Raptor Lake PCH CNVi WiFi 2. Intel Corporation” | “1. Raptor Lake PCH CNVi WiFi 2. Intel Corporation” |
| Disk | Micron MTFDKBA512TFH 500G | Micron MTFDKBA512TFH 500G |
| BIOS | LBCN22WW | LBCN22WW |
| Microcode | 0x411c | 0x411c |
| OS | Windows 11 Desktop | Ubuntu 23.10 |
| Kernel | OS Build 19045.4412 | 6.5.0-27-generic |
| TDP | 200 watts | 200 watts |
| Power & Perf Policy | Normal Powersave (7) | Normal Powersave (7) |
| Frequency Governor | performance | performance |
| Frequency Driver | intel_pstate | intel_pstate |
| Max C-State | 9 | 9 |

Notices and Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
