Large Scale Training of Hugging Face Transformers on TPUs With PyTorch/XLA FSDP

AI is transforming many industries through advanced capabilities such as understanding and generating language, answering questions, and delivering accurate recommendations. These capabilities are fueled by the ever-increasing size and complexity of AI models, which require vast amounts of computing power to train.

To meet the growing demands of AI training at scale, last year we introduced Fully Sharded Data Parallel (FSDP) in PyTorch/XLA. FSDP is a model parallelism architecture that unlocks the ability to easily and efficiently scale AI models into hundreds of billions of parameters. With PyTorch/XLA FSDP, during distributed training, each device can store a specific model shard, and all-gather the full model weights when it is time to perform the forward pass. Nested FSDP further optimizes performance by only using a given layer’s full parameters during its forward pass.

We are excited to announce that PyTorch/XLA FSDP has landed in Hugging Face Transformers. Now, Hugging Face users can train PyTorch models with up to 20 times more parameters using the same amount of computing power as before.

We built PyTorch/XLA FSDP support directly into the Hugging Face Trainer class, so that any model using Trainer can leverage FSDP. And with the addition of automatic wrapping to PyTorch/XLA FSDP, nested FSDP wrapping is both flexible and simple to apply. These new features make it easy to train a wide range of Hugging Face models at large scales. In this guide, we demonstrate training GPT-2 models with up to 128B parameters on Google Cloud TPUs. PyTorch/XLA FSDP training on TPUs is highly efficient, achieving up to 45.1% model FLOPS utilization (MFU) for GPT-2:

Figure 1: Model FLOPS utilization for Hugging Face GPT-2 on Google Cloud TPU v4

Configuring PyTorch/XLA FSDP in the Hugging Face Trainer

First, follow your preferred method to create your TPU(s) and install PyTorch and PyTorch/XLA. You need versions >= 2.0 for PyTorch and PyTorch/XLA.

pip3 install https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch-2.0-cp38-cp38-linux_x86_64.whl --user

pip3 install https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-2.0-cp38-cp38-linux_x86_64.whl

Next, clone and install the Hugging Face Transformers repo. Install all necessary dependencies (e.g., datasets, evaluate, scikit-learn, accelerate).

cd $HOME
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout v4.31-release
pip3 install -e .
pip3 install datasets evaluate scikit-learn
pip3 install accelerate==0.21.0

In $HOME/transformers, create any model-specific configuration files you might need. Here is an example of a configuration file for a GPT-2 model with 2B parameters, which we later refer to as gpt2_config.json:

{
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 3072,
  "n_head": 24,
  "n_layer": 18,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "vocab_size": 50257
}
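As a rough sanity check on the "2B" label, the parameter count implied by this configuration can be estimated from its dimensions. The sketch below assumes the standard GPT-2 block layout (fused QKV projection, an MLP with inner width 4 * n_embd, two LayerNorms per block, and input/output embeddings that share weights); the function name gpt2_param_count is hypothetical and not part of Transformers.

```python
def gpt2_param_count(n_embd, n_layer, vocab_size, n_positions):
    """Estimate trainable parameters of a GPT-2 style decoder with tied embeddings."""
    d = n_embd
    # Attention: fused QKV projection (d -> 3d) and output projection (d -> d), with biases.
    attention = (3 * d * d + 3 * d) + (d * d + d)
    # MLP: d -> 4d -> d with biases (assumes the default inner width of 4 * n_embd).
    mlp = (4 * d * d + 4 * d) + (4 * d * d + d)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d)
    per_block = attention + mlp + layer_norms
    # Token and position embeddings; the output head reuses the token embedding.
    embeddings = vocab_size * d + n_positions * d
    final_layer_norm = 2 * d
    return n_layer * per_block + embeddings + final_layer_norm


n_params = gpt2_param_count(n_embd=3072, n_layer=18, vocab_size=50257, n_positions=1024)
print(f"{n_params:,} parameters ({n_params / 1e9:.2f}B)")
```

Under these assumptions the configuration comes out at roughly 2.2 billion parameters, which is in line with the "2B" label used in this guide.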

With PyTorch/XLA FSDP, it is possible to train model sizes much bigger than this on large accelerator slices. We have trained GPT-2 models as large as 128B parameters with these techniques; for expert tips on how to replicate this scale, see the appendix.

In $HOME/transformers, create your FSDP configuration file, a JSON file containing all of the configurable aspects of your XLA FSDP wrapping stored as a dictionary. Following the official Hugging Face Transformers XLA FSDP documentation, the following arguments are available to set:

  • xla (bool, *optional*, defaults to False): This is a boolean which determines whether or not you use XLA FSDP. Make sure to set this to true.
  • xla_fsdp_settings (dict, *optional*): This is a dictionary which stores all of the XLA FSDP wrapping parameters you want to set; note that you do not have to specify settings for parameters where you are using the default value. For a complete list of settings, see here.

For compute_dtype and buffer_dtype, enter these as strings which contain the corresponding torch data type, e.g. bfloat16.

  • fsdp_min_num_params (int, *optional*, defaults to 0): An integer which sets the minimum number of parameters for size-based auto wrapping. Every module with at least as many parameters as fsdp_min_num_params will be XLA FSDP wrapped.
  • fsdp_transformer_layer_cls_to_wrap (List[str], *optional*): A list of (case-sensitive) transformer layer class names to wrap. Note that this is mutually exclusive with fsdp_min_num_params. Example: ["GPT2Block", "GPT2MLP"].
  • xla_fsdp_grad_ckpt (bool, *optional*, defaults to False): This is a boolean which determines whether to use gradient checkpointing over each nested XLA FSDP wrapped layer. This setting can only be used when the xla flag is set to true, and an auto wrapping policy is specified through fsdp_min_num_params or fsdp_transformer_layer_cls_to_wrap.

Note: For transformer-based models, use fsdp_transformer_layer_cls_to_wrap instead of fsdp_min_num_params when performing automatic nested FSDP wrapping. Layers which share weights should not belong to separate FSDP wrapped units, and the input and output embedding layers in transformer-based models share weights.
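The constraints described above (the two auto-wrapping options are mutually exclusive, and gradient checkpointing needs an auto-wrapping policy) can be sketched as a small validation helper that assembles the configuration dictionary. This is only an illustration: build_xla_fsdp_config is a hypothetical function, not part of Transformers, and the Trainer performs its own checks.

```python
import json


def build_xla_fsdp_config(layer_cls_to_wrap=None, min_num_params=0,
                          grad_ckpt=False, xla_fsdp_settings=None):
    """Hypothetical helper: assemble an XLA FSDP config dict and check the documented rules."""
    if layer_cls_to_wrap and min_num_params > 0:
        raise ValueError(
            "fsdp_transformer_layer_cls_to_wrap and fsdp_min_num_params are mutually exclusive"
        )
    if grad_ckpt and not (layer_cls_to_wrap or min_num_params > 0):
        raise ValueError("xla_fsdp_grad_ckpt requires an auto wrapping policy")

    config = {"xla": True, "xla_fsdp_grad_ckpt": grad_ckpt}
    if layer_cls_to_wrap:
        config["fsdp_transformer_layer_cls_to_wrap"] = list(layer_cls_to_wrap)
    if min_num_params > 0:
        config["fsdp_min_num_params"] = min_num_params
    if xla_fsdp_settings:
        # Only non-default settings need to be listed here.
        config["xla_fsdp_settings"] = dict(xla_fsdp_settings)
    return config


config = build_xla_fsdp_config(
    layer_cls_to_wrap=["GPT2Block"],
    grad_ckpt=True,
    xla_fsdp_settings={"compute_dtype": "bfloat16"},
)
print(json.dumps(config, indent=2))
```

Writing the result with json.dump to a file such as fsdp_config.json would give a file that can be passed to the --fsdp_config flag.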

For this GPT-2 example, here is what the corresponding fsdp_config.json file looks like:

{
  "fsdp_transformer_layer_cls_to_wrap": [
    "GPT2Block"
  ],
  "xla": true,
  "xla_fsdp_settings": {
    "compute_dtype": "bfloat16",
    "shard_param_on_dim_0": true,
    "pin_layout_in_collective_ops": true
  },
  "xla_fsdp_grad_ckpt": true
}

Now, it’s time to train your model! First, ensure that you have your PyTorch/XLA runtime set up appropriately by setting

export PJRT_DEVICE=TPU

When running training, the key flags to pass are:

a) --fsdp "full_shard"
b) --fsdp_config fsdp_config.json

where you should replace fsdp_config.json with whatever you named your FSDP configuration file. Here is a sample command to train our example 2B GPT-2 model, where training is started by xla_spawn.py, a launcher script for distributed TPU training.

python3 -u examples/pytorch/xla_spawn.py --num_cores 4 examples/pytorch/language-modeling/run_clm.py \
  --num_train_epochs 1 \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  --do_train \
  --do_eval \
  --output_dir /tmp/test-clm \
  --overwrite_output_dir \
  --config_name gpt2_config.json \
  --cache_dir /tmp \
  --tokenizer_name gpt2 \
  --block_size 1024 \
  --optim adafactor \
  --adafactor true \
  --save_strategy no \
  --logging_strategy no \
  --fsdp "full_shard" \
  --fsdp_config fsdp_config.json

Measuring Model FLOPS Utilization (MFU) for GPT-2

Model FLOPS are the floating point operations required to perform a single forward and backward pass. Model FLOPS are hardware- and implementation-independent, and depend only on the underlying model. In each step, the number of FLOPS is computed via the following formulas:

tokens_per_batch = global_batch_size * seq_len
FLOPS_per_step = 6 * tokens_per_batch * num_params

where seq_len is the sequence length and num_params is the number of parameters in the model. Note that this estimate assumes d_model ≫ seq_len. If this assumption is violated, the self-attention FLOPS become significant and this expression underestimates the true FLOPS per step, and therefore the MFU.

Based on the step time and the hardware details (numbers of chips and the peak FLOPS per chip), we can compute Model FLOPS Utilization (MFU), which measures how effectively our implementation is using the underlying hardware. Achieving 100% MFU means that the hardware is being used perfectly by that model. We calculate MFU using the following formula:

model_FLOPS_utilization = FLOPS_per_step / step_time(s) / chip_count / FLOPS_per_chip

When training a GPT-2 model with 2B parameters with the XLA FSDP configuration file above on a Cloud TPU v4-8, we measure a step time of 4.191s. Using the above formula, we calculate 35.7% MFU on a v4-8. For further details on calculating MFU, refer to the PaLM paper.
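The two formulas can be put together in a few lines of Python. This is a minimal sketch, not the measurement script used for this post. It assumes that a v4-8 slice has 4 chips and that each TPU v4 chip peaks at 275 TFLOPS in bfloat16; neither figure is stated above, and the function names are illustrative.

```python
def flops_per_step(global_batch_size, seq_len, num_params):
    """Model FLOPS for one forward and backward pass over a global batch."""
    tokens_per_batch = global_batch_size * seq_len
    return 6 * tokens_per_batch * num_params


def model_flops_utilization(flops, step_time_s, chip_count, peak_flops_per_chip):
    """Fraction of the slice's peak FLOPS that the model achieves."""
    return flops / step_time_s / chip_count / peak_flops_per_chip


# Assumptions (not stated in the text): a v4-8 slice has 4 chips, and each
# TPU v4 chip peaks at 275 TFLOPS in bfloat16.
V4_8_CHIPS = 4
TPU_V4_PEAK_FLOPS = 275e12

# Global batch of 128 sequences (32 per device x 4 devices) of 1024 tokens each.
nominal = flops_per_step(global_batch_size=128, seq_len=1024, num_params=2e9)

# Table 1 reports 1.65 PFLOPS per step for this run; use that figure for the MFU.
mfu = model_flops_utilization(
    1.65e15,
    step_time_s=4.191,
    chip_count=V4_8_CHIPS,
    peak_flops_per_chip=TPU_V4_PEAK_FLOPS,
)
print(f"nominal FLOPS/step: {nominal:.3e}, MFU: {mfu:.1%}")
```

With these inputs the result lands close to the 35.7% reported for the v4-8 run. Using the nominal 2e9 parameter count instead of the 1.65 PFLOPS figure gives a slightly lower value, since the configured model is somewhat larger than 2B parameters.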

The table below presents MFU for GPT-2 models with sizes between 2B and 128B, with a sequence length of 1024.

TPU slice (NumCores)   v4-8      v4-64     v4-128    v4-128    v4-256      v4-512
Tokens per batch       131,072   524,288   524,288   524,288   1,048,576   1,048,576
Parameters             2B        16B       20B       32B       64B         128B
Step time (ms)         4,191     14,592    7,824     12,970    25,653      30,460
PFLOPS per step        1.65      50        62        101       404         809
MFU                    35.7%     38.8%     45.1%     44.4%     44.7%       37.7%

Table 1: GPT-2 model FLOPS utilization calculation details

Among these configurations, MFU peaks at 45.1% for the 20B parameter model on v4-128. This result compares favorably to, for example, 41.5% MFU for a 22B Megatron-like model.

There are two actionable insights from these experiments:

First, simply increasing the number of chips without increasing the batch size generally means lower FLOPS utilization, because more time is spent on sharing the model shards. FSDP uses all-reduce communication collectives which are not asynchronous, which means that chip-to-chip communication cannot be overlapped with computation. As the number of chips increases, the number of model shards that must be communicated increases, and so we should expect the portion of the step time spent on communication to increase with the number of chips.

Second, increasing the batch size generally means better FLOPS utilization. As the number of chips increases, the memory footprint of the model decreases, which often frees up high bandwidth memory (HBM) to scale up the global batch size. With a larger global batch size, the number of tokens processed in each step increases, and thus, so does the FLOPS per step. As long as the step time does not increase proportionally, we expect a larger global batch size to improve MFU.

Therefore, to maximize the MFU, we recommend training with the largest global batch size possible that can fit in the HBM of the TPU slice, using FSDP to reduce memory required for the model parameters.
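To see why adding chips frees up HBM, it helps to look at the parameter memory each chip holds when the weights are sharded evenly. The sketch below is a back-of-the-envelope estimate only: it assumes parameters are stored in 32-bit floats, and it ignores gradients, optimizer state, activations, and the full layers that are temporarily gathered during the forward and backward passes. The function name and the chip counts are illustrative.

```python
def param_bytes_per_chip(num_params, chip_count, bytes_per_param=4):
    """Parameter memory per chip when weights are sharded evenly (parameters only)."""
    return num_params * bytes_per_param / chip_count


GIB = 2**30

# Illustrative chip counts for a 20B-parameter model stored in fp32 (an assumption).
for chips in (4, 32, 256):
    per_chip = param_bytes_per_chip(num_params=20e9, chip_count=chips)
    print(f"{chips:>3} chips: {per_chip / GIB:6.2f} GiB of parameters per chip")
```

The per-chip share shrinks in proportion to the number of chips, and the memory that is freed can be spent on a larger global batch.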

Training Very Large Models (tested to 128B parameters)

When using PyTorch/XLA, tensors must be initialized on the CPU before being moved to the XLA device. This means one may encounter host-side out-of-memory errors if the model is sufficiently large, even though the model can fit in the device HBM after sharding. To avoid this, we must defer each submodule’s initialization until it is FSDP wrapped, which ensures that submodules are sharded as soon as their values are populated, avoiding host-side limitations.

Below, we explain how to modify a local copy of the Hugging Face transformers repository to train a GPT-2 model with up to 128B parameters using this technique.

First, using the commands below, install torchdistX, which is a library containing experimental PyTorch Distributed features. This is the engine behind deferred initialization, and allows you to create tensors that don’t require immediate storage and can be materialized later. You also need to install a specific PyTorch/XLA 2.0 version that takes advantage of this package; note that you must uninstall PyTorch and PyTorch/XLA first, if you installed them earlier.

pip3 install torch==2.0 --index-url https://download.pytorch.org/whl/test/cpu --user

pip3 install torch_xla[torchdistx] -f https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/experimental/torch_xla-2.0-cp38-cp38-linux_x86_64.whl

Next, apply the following changes to your local copy of Hugging Face Transformers:

In src/transformers/trainer.py, add the following function in _wrap_model on the line immediately prior to PyTorch/XLA FSDP wrapping:

from torchdistx import deferred_init

def _init_with_torchdistX(module):
    def check_fn(k):
        return not isinstance(k, FSDP)
    deferred_init.materialize_module(module, check_fn=check_fn)

The function materialize_module initializes a module's tensors only if check_fn returns True for it. Here, check_fn returns True for modules that have not already been FSDP wrapped, so submodules that are already wrapped are skipped.

Within _wrap_model, modify your FSDP wrapping to accept the additional argument param_init_fn=_init_with_torchdistX:

self.model = model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    auto_wrapper_callable=auto_wrapper_callable,
    param_init_fn=_init_with_torchdistX,
    **fsdp_kwargs,
)

In examples/pytorch/language-modeling/run_clm.py, add the following import statement at the beginning of the file:

from torchdistx import deferred_init

Edit the model initialization so that the model is wrapped with deferred_init.deferred_init by replacing the line

model = AutoModelForCausalLM.from_config(config)

with

model = deferred_init.deferred_init(AutoModelForCausalLM.from_config, config)

Note that this assumes you are supplying your own model configuration file. Otherwise, you should modify your model initialization statement accordingly.

You should also comment out these two lines which immediately follow the line above:

n_params = sum({p.data_ptr(): p.numel() for p in model.parameters()}.values())
logger.info(f"Training new model from scratch - Total size={n_params/2**20:.2f}M params")

They will cause an error if left unmodified, since the model tensors do not actually have storage when these lines are executed.

With these changes, you can now run GPT-2 models with as many as 128B parameters, provided the accelerator size is suitably large.

Next Steps & Acknowledgements

To learn more, the docs can be found here. We’d love to hear from you if you run into any issues with FSDP in PyTorch/XLA, or just want to tell us about how you are using it.

We are ecstatic about what’s ahead for PyTorch/XLA and invite the community to join us. PyTorch/XLA is developed fully in open source. So, please file issues, submit pull requests, and send RFCs to GitHub so that we can openly collaborate.

We’d like to thank Ronghang Hu and Ross Girshick at Meta AI and Lysandre Debut, Sourab Mangrulkar, Sylvain Gugger and Arthur Zucker for all the support and collaboration. We’d also like to thank Jiewen Tan, Liyang Lu, Will Cromar, Vaibhav Singh, and Chandra Devarakonda for their assistance in preparing this post.

Cheers!

The PyTorch/XLA Team at Google

Intel Joins the PyTorch Foundation as a Premier Member

The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Intel has joined as a premier member.

“The PyTorch Foundation is thrilled to welcome Intel as a premier member, marking a significant milestone in our mission to empower the global AI community. Intel’s extensive expertise and commitment to advancing cutting-edge technologies align perfectly with our vision of fostering open-source innovation,” said PyTorch Foundation Executive Director Ibrahim Haddad. “Together, we will accelerate the development and democratization of PyTorch, and use the collaboration to shape a vibrant future of AI for all.”

Intel has developed and released several PyTorch-based tools and libraries to enable developers to accelerate their AI workflows, and is actively working on optimizing PyTorch to leverage Intel hardware capabilities.

“At Intel, we believe in the power of collaboration and open-source innovation to propel the ecosystem towards an AI Everywhere future. Joining the Governing Board of the PyTorch Foundation is a testament to Intel’s commitment to advancing and democratizing AI,” said Wei Li, Vice President and General Manager of Artificial Intelligence and Analytics (AIA) at Intel. “By harnessing the collective expertise and resources within the deep learning community, we aim to accelerate the development of PyTorch and continue to drive breakthroughs in AI research and applications.”

Intel fosters industry collaboration, co-engineering, and open source contributions to accelerate software innovation and develop new technologies that bring benefits to the open source community. By working together with other member companies and under the guidance of the PyTorch Foundation, Intel remains committed to actively contributing to and advocating for the community.

As a premier member, Intel is granted one seat to the PyTorch Foundation Governing Board. The Board sets policy through our bylaws, mission and vision statements, describing the overarching scope of foundation initiatives, technical vision, and direction.

Wei Li

We’re happy to welcome Wei Li, Vice President and General Manager of Artificial Intelligence and Analytics (AIA) at Intel, to our board. Dr. Li leads a world-wide team of engineering “magicians” who make AI Everywhere a reality by supercharging machine performance and developer productivity. Wei and his team have been instrumental in Intel’s recent multi-billion-dollar AI revenue growth by delivering 10-100X software acceleration, across deep learning, statistical machine learning and big data analytics, to complement Intel’s AI-optimized hardware portfolio.

To learn more about how you can be a part of the PyTorch Foundation, visit our website.

Read more about Intel’s commitment to the PyTorch Community here.

About Intel

Intel (Nasdaq: INTC) is an industry leader, creating world-changing technology that enables global progress and enriches lives. Inspired by Moore’s Law, we continuously work to advance the design and manufacturing of semiconductors to help address our customers’ greatest challenges. By embedding intelligence in the cloud, network, edge and every kind of computing device, we unleash the potential of data to transform business and society for the better. To learn more about Intel’s innovations, go to newsroom.intel.com and intel.com.

© Intel Corporation. Intel, the Intel logo and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

About PyTorch Foundation

The PyTorch Foundation is a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem. The PyTorch Foundation is supported by its members and leading contributors to the PyTorch open source project. The Foundation leverages resources provided by members and contributors to enable community discussions and collaboration.

About The Linux Foundation

The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects are critical to the world’s infrastructure including Linux, Kubernetes, Node.js, ONAP, PyTorch, RISC-V, SPDX, OpenChain, and more. The Linux Foundation focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org. The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see its trademark usage page. Linux is a registered trademark of Linus Torvalds.

INT8 Quantization for x86 CPU in PyTorch

Overview

INT8 quantization is a powerful technique for speeding up deep learning inference on x86 CPU platforms. By reducing the precision of the model’s weights and activations from 32-bit floating-point (FP32) to 8-bit integer (INT8), INT8 quantization can significantly improve the inference speed and reduce memory requirements without sacrificing accuracy.

In this blog, we will discuss the recent progress on INT8 quantization for x86 CPU in PyTorch, focusing on the new x86 quantization backend. We will also briefly look at the new quantization path with PyTorch 2.0 Export (PT2E) and TorchInductor.
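To make the FP32-to-INT8 mapping concrete, here is a simplified sketch of per-tensor affine quantization in plain Python. It illustrates the general idea of a scale and zero point only; it is not how PyTorch, FBGEMM, or oneDNN implement quantization, and the function names are illustrative.

```python
def quantize_params(values, qmin=-128, qmax=127):
    """Pick a scale and zero point that map the value range onto [qmin, qmax]."""
    # The represented range must include 0.0 so that zero is exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax], clamping anything out of range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.2, -0.3, 0.0, 0.45, 2.0]
scale, zero_point = quantize_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max round-trip error = {max_error:.4f}")
```

Each value is stored in one byte instead of four, and the round-trip error stays within about half of the quantization step (scale) for values inside the calibrated range.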

X86 Quantization Backend

The current recommended way of quantization in PyTorch is FX. Before PyTorch 2.0, the default quantization backend (a.k.a. QEngine) on x86 CPUs was FBGEMM, which leveraged the FBGEMM performance library to achieve the performance speedup. In the PyTorch 2.0 release, a new quantization backend called X86 was introduced to replace FBGEMM. The x86 quantization backend offers improved INT8 inference performance when compared to the original FBGEMM backend by leveraging the strengths of both FBGEMM and the Intel® oneAPI Deep Neural Network Library (oneDNN) kernel libraries.

Performance Benefit from X86 Backend

To measure the performance benefits of the new X86 backend, we ran INT8 inference on 69 popular deep learning models (shown in Figures 1-3 below) using 4th Gen Intel® Xeon® Scalable processors. The results showed a 2.97X geomean performance speedup compared to FP32 inference performance, while the speedup was 1.43X with the FBGEMM backend. The charts below show the per-model performance speedup comparing the x86 backend and the FBGEMM backend.
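A geomean (geometric mean) speedup is the n-th root of the product of the per-model speedups, which keeps a few very large speedups from dominating the summary. A minimal sketch, using made-up speedup values purely for illustration (they are not the measured results):

```python
import math


def geomean(values):
    """Geometric mean: the n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-model speedups over FP32 (illustrative only, not measured data).
speedups = [1.5, 2.0, 4.0, 6.0]
print(f"geomean speedup: {geomean(speedups):.2f}x")
```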

Figure 1: Models with less than 2x performance boost with x86 backend [1]

Figure 2: Models with 2x-4x performance boost with x86 backend [1]

Figure 3: Models with larger than 4x performance boost with x86 backend [1]

Usage of x86 Backend

By default in 2.0, users on x86 platforms will use the x86 quantization backend and their PyTorch programs will remain unchanged when using the default backend. Alternatively, users can specify x86 as the quantization backend explicitly.
Below is an example code snippet of PyTorch static post-training quantization with x86 quantization backend.

import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

qconfig_mapping = get_default_qconfig_mapping()
# Or explicitly specify the qengine
# qengine = 'x86'
# torch.backends.quantized.engine = qengine
# qconfig_mapping = get_default_qconfig_mapping(qengine)

# MyModel is a placeholder for your own torch.nn.Module
model_fp32 = MyModel().eval()
x = torch.randn((1, 3, 224, 224), dtype=torch.float)
x = x.to(memory_format=torch.channels_last)

# Insert observers according to qconfig and backend config
# (example_inputs is a tuple of positional arguments)
prepared_model = prepare_fx(model_fp32, qconfig_mapping, example_inputs=(x,))

# Calibration code not shown

# Convert to quantized model
quantized_model = convert_fx(prepared_model)

Technical Details of x86 Backend

We devised heuristic dispatching rules according to the performance numbers from the models we benchmarked to decide whether to invoke oneDNN or FBGEMM performance library to execute the convolution or matrix multiplication operations. The rules are a combination of operation kinds, shapes, CPU architecture information, etc. Detailed logic is available here. For more design and technical discussion, please refer to the Request for Comments.

Next Steps With a New Quantization Path PyTorch 2.0 Export

Although still far from finalized, a new quantization path, PyTorch 2.0 Export (PT2E), is in early design and PoC stage. The new approach is slated to replace the FX quantization path in the future. It is built upon the capabilities of TorchDynamo Export, a feature introduced in the PyTorch 2.0 release for FX graph capturing. This graph is then quantized and lowered to different backends. TorchInductor, the new DL compiler of PyTorch, has shown promising results in terms of FP32 inference speedup on x86 CPU. We are working actively to enable it as one of the quantization backends of PT2E. We believe the new path will lead to further improvements in INT8 inference performance due to more flexibility of fusion at different levels.

Conclusion

The x86 backend introduced in PyTorch 2.0 release has demonstrated a remarkable improvement in INT8 inference speed on x86 CPU platforms. It offers a 1.43X speedup compared to the original FBGEMM backend while maintaining backward compatibility. This enhancement can benefit end users with minimal or no modifications to their programs. Furthermore, a new quantization path, PT2E, is currently in development and is expected to provide even more possibilities in the future.

Acknowledgement

Special thanks to Nikita Shulga, Vasiliy Kuznetsov, Supriya Rao, and Jongsoo Park. Together, we made one more step forward on the path of improving the PyTorch CPU ecosystem.

Configuration

[1] AWS EC2 r7iz.metal-16xl instance (Intel(R) Xeon(R) Gold 6455B, 32-core/64-thread, Turbo Boost On, Hyper-Threading On, Memory: 8x64GB, Storage: 192GB); OS: Ubuntu 22.04.1 LTS; Kernel: 5.15.0-1028-aws; Batch Size: 1; Core per Instance: 4; PyTorch 2.0 RC3; TorchVision 0.15.0+cpu, tested by Intel on 3/77/2023. May not reflect all publicly available security updates.

Hugging Face Joins the PyTorch Foundation as a Premier Member

The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Hugging Face has joined as a premier member.

Hugging Face has been a long time supporter and contributor to the PyTorch Ecosystem by providing powerful models and resources that accelerate research, development, and adoption of AI technologies, particularly in the field of natural language processing.

“Our mission has always been to democratize AI and make it accessible to everyone. We’re truly aligned with PyTorch’s objective of reducing the barrier of entry to practitioners. By joining the PyTorch Foundation, we can further amplify that impact and support this very important framework of the ecosystem that is PyTorch,” said Lysandre Debut, Head of Open Source at Hugging Face. “We believe the two ecosystems have significant overlap, and collaborating with the foundation will allow us to bridge the gap to provide the best software, the best tools to the machine learning community at large.”

Hugging Face’s Model Hub and open source libraries promote collaboration and knowledge sharing within the AI open source community, making Hugging Face a great match to the growing PyTorch Foundation. They continue to drive industry adoption and collaboration by creating user-friendly tools and resources and providing accessible and well-documented libraries.

“Hugging Face’s commitment to open source development and their exceptional contributions to the PyTorch ecosystem have truly impressed us. With their help, we will drive innovation, foster collaboration, and empower the global AI community to create transformative solutions for the AI community,” said PyTorch Foundation Executive Director Ibrahim Haddad. “We welcome Hugging Face to the PyTorch Foundation and look forward to the achievements that lie ahead.”

As a premier member, Hugging Face is granted one seat to the PyTorch Foundation Governing Board. The Board sets policy through our bylaws, mission and vision statements, describing the overarching scope of foundation initiatives, technical vision, and direction.

Lysandre Debut

We’re happy to welcome Lysandre Debut, Head of Open Source at Hugging Face, to our board. Lysandre has been at Hugging Face since the company’s pivot to open-source, and was the first engineer to focus entirely on the open-source mission. Now leading the open-source part of the organization, Lysandre remains technically involved by being a core maintainer of the Transformers library.

To learn more about how you can be a part of the PyTorch Foundation, visit our website.

About Hugging Face

Hugging Face is a community and company dedicated to lowering the barrier of entry to Machine Learning and Deep Learning. Strong advocates for open-source and open-science, their model Hub hosts more than 250,000 public models and 50,000 public datasets that are very simple to use. Transformers, Diffusers, PEFT, Accelerate, and Datasets are some of the open-source tools made available by Hugging Face.

About PyTorch Foundation

The PyTorch Foundation is a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem. The PyTorch Foundation is supported by its members and leading contributors to the PyTorch open source project. The Foundation leverages resources provided by members and contributors to enable community discussions and collaboration.

About The Linux Foundation

The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects are critical to the world’s infrastructure including Linux, Kubernetes, Node.js, ONAP, PyTorch, RISC-V, SPDX, OpenChain, and more. The Linux Foundation focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org. The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see its trademark usage page: www.linuxfoundation.org/trademark-usage. Linux is a registered trademark of Linus Torvalds.

AMD’s Journey to Openness and Performance

AMD has made progress in building a robust software stack that supports an open ecosystem of models, libraries, frameworks, and tools. As proven platforms gain momentum, a leadership software stack and an optimized ecosystem become essential for achieving application performance. PyTorch is a key part of AMD’s AI journey, and Victor Peng, President of AMD, and Soumith Chintala, founder of PyTorch, discussed the latest progress at the DC & AI Keynote on June 12.

Building a Powerful SW Stack with ROCm

Victor introduced ROCm, AMD’s SW stack for Instinct Data Center GPUs. It offers a comprehensive set of open-source libraries, runtime, compilers, and tools for developing, running, and fine-tuning AI models. The fifth generation ROCm incorporates optimizations for AI and high-performance computing workloads, including tailored kernels for low-latency memory systems, support for new data types, and integration with OpenAI Triton. With tools for porting AI software to AMD Instinct platforms, ROCm ensures quality and robustness, tested extensively and compliant with PyTorch and TensorFlow frameworks.

Collaboration with PyTorch

To highlight the partnership between AMD and PyTorch, Victor invited Soumith Chintala, the founder of PyTorch, to discuss the advancements and integration between the two. PyTorch, the industry’s leading AI framework, has a vibrant developer community and is known for its continuous innovation and incorporation of cutting-edge research. The latest version, PyTorch 2.0, integrates with hardware-agnostic software compilers like OpenAI Triton, enabling efficient training and deployment of AI models. With optimized techniques, PyTorch 2.0 enhances productivity and offers remarkable speed improvements. The collaboration between AMD and the PyTorch Foundation ensures seamless utilization of AMD GPUs, expanding AI accelerator accessibility worldwide and paving the way for future optimizations and broader hardware support.

Empowering the Developer Community

The partnership between AMD and PyTorch benefits the developer community by democratizing access to AI accelerators. Support for AMD GPUs in PyTorch allows developers to train and deploy models across various platforms, including CPUs like EPYC and Ryzen, GPUs like Instinct and Radeon, and embedded devices like Versal SoCs. By ensuring immediate compatibility of new models on AMD platforms, the collaboration streamlines the development process and empowers developers to leverage the full potential of AMD’s hardware. This increased accessibility and flexibility enable developers worldwide to push the boundaries of AI innovation.

Hugging Face and AI Model Innovation

Victor praised Hugging Face as the leading force behind open-source AI model innovation, empowering generative AI with transformative transformers. AMD’s optimized software enables a high-performing development stack, supporting groundbreaking AI advancements for customers and developers through scalable real-world deployments.

Conclusion

At the DC & AI Keynote, AMD demonstrated its dedication to openness, performance, and collaboration. The ROCm SW stack, PyTorch integration, and support for Hugging Face exemplify AMD’s commitment to empowering developers and researchers to achieve AI breakthroughs. By offering accessible, high-performing solutions, AMD fuels the future of AI as a leading GPU platform integrated with PyTorch.

To listen to the full keynote, visit the AMD YouTube channel.

To listen to Soumith Chintala’s section of the keynote

Performant Distributed checkpointing in Production with IBM

Params saved per minute

Last year, IBM Research began collaborating with us to onboard Fully Sharded Data Parallelism (FSDP) for their large foundation models. They became interested as FSDP is a PyTorch native offering for scaling their distributed training efforts on IBM Cloud.

We are pleased to share that, in collaboration with IBM, we have achieved substantial checkpointing speedups for large models (72x vs the original PyTorch 1.13 save speed), proven model and optimizer checkpoint scaling to 30B parameters, and enabled cloud-first training using FSDP + Distributed Checkpoint on S3 backends.

What is a Distributed Checkpoint?

Distributed checkpointing is the PyTorch native solution for saving and loading PyTorch models and optimizer states from multiple ranks, as well as supporting dynamically changing world sizes between reloads.

Checkpoint time vs model params

PyTorch Distributed Checkpoint (DCP) APIs were introduced in PyTorch 1.13, and are included as an official prototype feature in PyTorch 2.0.

Distributed checkpoint is different from torch.save() and torch.load() in a few significant ways:

  1. DCP produces multiple files per checkpoint, with at least one file per rank.
  2. DCP operates in place, meaning that the model should allocate its data first and the Distributed Checkpoint will then use that storage.

A major improvement from 1.13 to 2.0 includes adding sharded_state_dict support for checkpointing FSDP models. This allows checkpointing for larger sized models, as well as adding support for load-time resharding. Load time resharding enables saving in one cluster topology, and loading into another. This feature was highly requested as it allows training jobs to be run on one cluster, saved, and then continued on a different cluster with different world size.
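To make load-time resharding concrete, here is a minimal sketch in plain Python. It is not the DCP implementation: `shard` and `reshard` are hypothetical helpers, and a flat list stands in for one flattened parameter. It only illustrates the idea that shards saved under one world size can be redistributed to a different world size at load time.

```python
# Conceptual sketch of load-time resharding (not the DCP implementation).
# A flat list stands in for one flattened parameter tensor.

def shard(flat_param, world_size):
    """Split a flat parameter into `world_size` contiguous, near-equal shards."""
    base, extra = divmod(len(flat_param), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more
        shards.append(flat_param[start:start + size])
        start += size
    return shards

def reshard(saved_shards, new_world_size):
    """Reassemble shards saved under one world size and split them for another."""
    flat = [x for s in saved_shards for x in s]
    return shard(flat, new_world_size)

param = list(range(10))
saved = shard(param, 4)      # checkpoint written by a 4-rank job
loaded = reshard(saved, 2)   # same checkpoint read by a 2-rank job
```

For simplicity this sketch rebuilds the whole parameter before splitting it again; a production implementation would presumably read only the slices each new rank needs.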

Another major change is that we decouple the storage layer from the checkpoint planning layer and separate implementation from the interface for both layers. With this change, users can now specify how their state_dict should be chunked or transformed during the checkpoint planning phase. Additionally, the customizable storage layer can easily accommodate different backends.

More information on the Distributed Checkpoint package can be found here.

Performant Distributed checkpointing in Production with IBM

IBM at Think 2023 announced its watsonx.ai platform for development and deployment of foundation models for the enterprise. Built on Hybrid Cloud, the platform enables use cases across multiple modalities such as NLP, timeseries, weather, chemistry, tabular data, and cybersecurity, with model sizes from 100s of millions to 10s of billions of parameters. Model architectures range from vision transformers, to multi-modal RoBERTa-style feature extractors, to large-scale generative language models similar to T5, GPT and Llama.

As of today, IBM has now enabled checkpointing for T5-style architectures up to 11B parameters, and decoder architectures (GPT style) up to 30B.

IBM helped us identify that writing more than one file per rank limited the scaling power of DCP from both memory and performance standpoints. At their suggestion, we enhanced our FileSystemWriter to produce a single checkpoint file per rank to reduce read/write overhead.

With this option as the new default, DCP now creates a single file per rank during checkpoint saving, which would then be sliced when reading parameters at load time.
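The single-file-per-rank idea can be sketched as follows. This is an illustration only: the file layout, `write_rank_file`, and `read_param` are assumptions made for the example and do not reflect DCP's actual on-disk format. Each rank writes all of its parameters into one file and records byte offsets, so a reader can later slice out just the parameter it needs.

```python
# Conceptual sketch: one checkpoint file per rank plus an offset index,
# so individual parameters can be sliced out at load time.
# The layout is illustrative and is not DCP's on-disk format.
import os
import struct
import tempfile

def write_rank_file(path, params):
    """Write all of a rank's parameters into one file; return name -> (offset, count)."""
    index, offset = {}, 0
    with open(path, "wb") as f:
        for name, values in params.items():
            data = struct.pack(f"<{len(values)}d", *values)  # little-endian float64
            f.write(data)
            index[name] = (offset, len(values))
            offset += len(data)
    return index

def read_param(path, index, name):
    """Seek to one parameter's byte range and decode only that slice."""
    offset, count = index[name]
    with open(path, "rb") as f:
        f.seek(offset)
        return list(struct.unpack(f"<{count}d", f.read(8 * count)))

rank0_params = {"layer.weight": [0.5, -1.0, 2.0], "layer.bias": [0.25]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rank0.ckpt")
    index = write_rank_file(path, rank0_params)
    bias = read_param(path, index, "layer.bias")
    weight = read_param(path, index, "layer.weight")
```

Writing one larger file per rank replaces many small writes with a single sequential one, which is the kind of read/write overhead reduction the text describes.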

By combining sharded_state_dict support with the single-file-per-rank writer, distributed checkpoint was able to accelerate checkpoint saving time by over 72x vs the original PyTorch 1.13 save speed, and enable rapid checkpointing for model sizes over 15B, which would previously simply time out.

“Looking back, it’s really astounding the speedups we’ve seen, handling training for many of these models. We went from taking almost half an hour to write a single 11B checkpoint in PyTorch 1.13, to being able to handle a 30B parameter model, with optimizer and dataloader state – so that’s over eight times the raw data – in just over 3 minutes. That’s done wonders for both the stability and efficiency of our jobs, as we scale up training to hundreds of gpus.” – Davis Wertheimer, IBM Research

IBM’s adoption has also helped us validate and improve our solutions in a real-world, large-scale training environment. As an example, IBM discovered that DCP was working well for them on a single node with multiple GPUs, but errored out when used on multiple nodes.

Upon investigating the issue, we realized that we were assuming writes went to an NFS-like shared file system, which provides strong read-after-write consistency. Object stores with file system APIs such as S3FS provide eventual consistency semantics, which caused distributed checkpointing to fail in such a setting. Working together with IBM, we identified this issue, fixed it with a one-line code change, and enabled object storage backends for DCP. Such storage is typically an order of magnitude cheaper than shared file systems, which enables finer-grained checkpointing.

Looking for Collaboration

If you are interested in trying Distributed Checkpoint, feel free to reach out to us!

If you run into any issue when trying it, you can open an issue at our GitHub repo.

Acknowledgements

This project would not have been possible without the assistance from many collaborators. We would like to thank Yanli Zhao, Andrew Gu, Rohan Varma for their support of FSDP. Thanks to Pritam Damania, Junjie Zhao, and Wanchao Liang for their support of ShardedTensor.

IBM Joins the PyTorch Foundation as a Premier Member

The PyTorch Foundation, part of The Linux Foundation, is pleased to announce that IBM has joined as a premier member.

The foundation serves as a neutral space for the deep learning community to collaborate on the open source PyTorch framework and ecosystem. With its extensive industry expertise and leadership in open source and AI, IBM is committed to actively contributing to the PyTorch community.

IBM offers a comprehensive portfolio of enterprise AI solutions and recently released watsonx, its next-generation data and AI platform. IBM’s watsonx platform leverages PyTorch to offer an enterprise-grade software stack for end-to-end training and fine-tuning of AI foundation models.

“By joining the PyTorch Foundation, we aim to contribute our expertise and resources to further advance PyTorch’s capabilities and make AI more accessible in hybrid cloud environments with flexible hardware options,” said Priya Nagpurkar, Vice President, Hybrid Cloud Platform and Developer Productivity, IBM Research. “We intend for our collaboration with PyTorch to bring the power of foundation models and generative AI to enterprises using the watsonx platform to drive business transformation.”

IBM and PyTorch have already collaborated on two projects. The first enables foundation models with billions of parameters to train efficiently on standard cloud networking infrastructure, such as Ethernet networking. Together, IBM and PyTorch have also worked on ways to make checkpointing for AI training considerably more cost-effective, by fixing the distributed checkpointing within PyTorch to support certain types of object storage.

“We’re happy to welcome IBM as a premier member. IBM’s expertise and dedication to advancing the field of artificial intelligence align perfectly with the mission of the PyTorch community,” said PyTorch Foundation Executive Director Ibrahim Haddad. “Their commitment to open collaboration and innovation will strengthen our collective efforts to empower developers and researchers worldwide.”

As a premier member, IBM is granted one seat to the PyTorch Foundation Governing Board. The Board sets policy through our bylaws, mission and vision statements, describing the overarching scope of foundation initiatives, technical vision, and direction.

We’re happy to welcome Raghu Ganti, Principal Research Scientist at IBM Research, to our board. Raghu co-leads IBM Research’s foundation model training and validation platform, built on Red Hat OpenShift. His team primarily contributes to the PyTorch training components, with the mission of democratizing training and validation of foundation models.

To learn more about how you can be a part of the PyTorch Foundation, visit our website.

Announcing CPP-based S3 IO DataPipes

Training large deep learning models requires large datasets. Amazon Simple Storage Service (Amazon S3) is a scalable cloud object store service used for storing large training datasets. Machine learning (ML) practitioners need an efficient data pipe that can download data from Amazon S3, transform the data, and feed the data to GPUs for training models with high throughput and low latency.

In this post, we introduce the new S3 IO DataPipes for PyTorch, S3FileLister and S3FileLoader. For memory efficiency and fast runs, the new DataPipes use the C++ extension to access Amazon S3. Benchmarking shows that S3FileLoader is 59.8% faster than FSSpecFileOpener for downloading a natural language processing (NLP) dataset from Amazon S3. You can build IterDataPipe training pipelines with the new DataPipes. We also demonstrate that the new DataPipe can reduce overall Bert and ResNet50 training time by 7%. The new DataPipes have been upstreamed to the open-source TorchData 0.4.0 with PyTorch 1.12.0.

Overview

Amazon S3 is a scalable cloud storage service with no limit on data volume. Loading data from Amazon S3 and feeding the data to high-performance GPUs such as NVIDIA A100 can be challenging. It requires an efficient data pipeline that can meet the data processing speed of GPUs. To help with this, we released a new high performance tool for PyTorch: S3 IO DataPipes. DataPipes are subclassed from torchdata.datapipes.iter.IterDataPipe, so they can interact with the IterableDataPipe interface. Developers can quickly build their DataPipe DAGs to access, transform, and manipulate data with shuffle, sharding, and batch features.

The new DataPipes are designed to be file format agnostic, and Amazon S3 data is downloaded as binary large objects (BLOBs). They can be used as composable building blocks to assemble a DataPipe graph that can load tabular, NLP, and computer vision (CV) data into your training pipelines.

Under the hood, the new S3 IO DataPipes employ a C++ S3 handler with the AWS C++ SDK. In general, a C++ implementation is more memory efficient and has better CPU core usage (no Global Interpreter Lock) in threading compared to Python. The new C++ S3 IO DataPipes are recommended for high throughput, low latency data loading in training large deep learning models.

The new S3 IO DataPipes provide two first-class citizen APIs:

  • S3FileLister – Iterable that lists S3 file URLs within the given S3 prefixes. The functional name for this API is list_files_by_s3.
  • S3FileLoader – Iterable that loads S3 files from the given S3 prefixes. The functional name for this API is load_files_by_s3.

Usage

In this section, we provide instructions for using the new S3 IO DataPipes. We also provide a code snippet for load_files_by_s3().

Build from source

The new S3 IO DataPipes use the C++ extension. It is built into the torchdata package by default. However, if the new DataPipes are not available within the environment, for example Windows on Conda, you need to build from the source. For more information, refer to Iterable Datapipes.

Configuration

Amazon S3 supports global buckets. However, a bucket is created within a Region. You can pass a Region to the DataPipes by using __init__(). Alternatively, you can either export AWS_REGION=us-west-2 into your shell or set an environment variable with os.environ['AWS_REGION'] = 'us-east-1' in your code.

To read objects in a bucket that aren’t publicly accessible, you must provide AWS credentials through one of the following methods:

Example code

The following code snippet provides a typical usage of load_files_by_s3():

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# Placeholder settings; tune these for your workload.
batch_size = 32
num_workers = 4
epochs = 1

# Wrap one or more S3 prefixes and load every object under them.
s3_shard_urls = IterableWrapper(["s3://bucket/prefix/"])
s3_shards = s3_shard_urls.load_files_by_s3()

# Text data: yield one line at a time, without the source path.
training_data = s3_shards.readlines(return_path=False)

data_loader = DataLoader(
    training_data,
    batch_size=batch_size,
    num_workers=num_workers,
)

# Training loop
for epoch in range(epochs):
    # Training step
    for batch_data in data_loader:
        # forward pass, backward pass, model update
        pass


Benchmark

In this section, we demonstrate how the new DataPipe can reduce overall Bert and ResNet50 training time.

Isolated DataLoader performance evaluation against FSSpec

FSSpecFileOpener is another PyTorch S3 DataPipe. It uses botocore and aiohttp/asyncio to access S3 data. The following is the performance test setup and result (quoted from Performance Comparison between native AWSSDK and FSSpec (boto3) based DataPipes).

The S3 data in the test is a sharded text dataset. Each shard has about 100,000 lines and each line is around 1.6 KB, making each shard about 156 MB. The measurements in this benchmark are averaged over 1,000 batches. No shuffling, sampling, or transforms were performed.
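As a quick check of the shard size quoted above, the arithmetic can be reproduced in a few lines. The only assumption is that 1 MB is counted as 1024 KB, which is what makes the figure come out at about 156 MB.

```python
# Worked arithmetic for the benchmark shard size quoted above.
lines_per_shard = 100_000
kb_per_line = 1.6                  # approximate, as stated in the text

shard_kb = lines_per_shard * kb_per_line
shard_mb = shard_kb / 1024         # assumption: 1 MB = 1024 KB
print(f"{shard_mb:.2f} MB per shard")
```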

The following chart reports the throughput comparison for various batch sizes with num_workers=0, where the data loader runs in the main process. S3FileLoader has higher queries per second (QPS): it is 90% higher than fsspec at batch size 512.

Batch Sizes 1

The following chart reports the results for num_workers=4, where data loading runs in four worker processes. S3FileLoader is 59.8% higher than fsspec at batch size 512.

Batch Sizes 2

Training ResNet50 Model against Boto3

For the following chart, we trained a ResNet50 model on a cluster of 4 p3.16xlarge instances with a total 32 GPUs. The training dataset is ImageNet with 1.2 million images organized into 1,000-image shards. The training batch size is 64. The training time is measured in seconds. For eight epochs, S3FileLoader is 7.5% faster than Boto3.

Boto3

Training a Bert model against Boto3

For the following chart, we trained a Bert model on a cluster of 4 p3.16xlarge instances with a total 32 GPUs. The training corpus has 1474 files. Each file has around 150,000 samples. To run a shorter epoch, we use 0.05% (approximately 75 samples) per file. The batch size is 2,048. The training time is measured in seconds. For one epoch, S3FileLoader is 7% faster than Boto3.

Boto3 2
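The size of the shortened Bert epoch follows from the numbers given in the setup. The sketch below reproduces that arithmetic. Treating 2,048 as the global batch size and counting a final partial batch as a full step are assumptions made for the example, since the text does not specify either.

```python
import math

num_files = 1474
samples_per_file = 150_000         # approximate, as stated in the text
sample_fraction = 0.0005           # 0.05%
batch_size = 2048                  # assumed to be the global batch size

samples_used_per_file = round(samples_per_file * sample_fraction)
samples_per_epoch = num_files * samples_used_per_file
# Assumes the last, partially filled batch still counts as a step.
batches_per_epoch = math.ceil(samples_per_epoch / batch_size)
```

Under these assumptions, one epoch covers roughly 110,000 samples, which explains why a single epoch is short enough to benchmark end to end.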

Comparison against the original PyTorch S3 plugin

The new PyTorch S3 DataPipes perform substantially better than the original PyTorch S3 plugin. We have tuned the internal buffer size for S3FileLoader. The loading time is measured in seconds.

For the 10 sharded charades files (approximately 1.5 GiB each), S3FileLoader was 3.5 times faster in our experiments.

Best practices

Training large deep learning models may require a massive compute cluster with tens or even hundreds of nodes. Each node in the cluster may generate a large number of data loading requests that hit a specific S3 shard. To avoid throttling, we recommend sharding training data across S3 buckets and S3 folders.

Best Practices

To achieve good performance, it helps to have file sizes that are big enough to parallelize across a given file, but not so big that we hit the limits of throughput on that object on Amazon S3 depending on the training job. The optimal size can be between 50–200 MB.
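A small helper can turn the size guidance above into a shard count. This is a sketch under stated assumptions: `plan_shards` is a hypothetical function, and the 128 MiB default target is simply one value inside the suggested 50–200 MB range, not a recommendation from the benchmark.

```python
import math

def plan_shards(dataset_bytes, target_shard_bytes=128 * 1024**2):
    """Return (num_shards, bytes_per_shard) for a roughly even split."""
    num_shards = max(1, math.ceil(dataset_bytes / target_shard_bytes))
    return num_shards, math.ceil(dataset_bytes / num_shards)

# Example: a 10 GiB dataset split into shards of about 128 MiB each.
num_shards, shard_bytes = plan_shards(10 * 1024**3)
```

The right target depends on the training job, so it is worth measuring throughput at a few shard sizes before settling on one.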

Conclusion and next steps

In this post, we introduced you to the new PyTorch IO DataPipes. The new DataPipes use aws-sdk-cpp and show better performance against Boto3-based data loaders.

For next steps, we plan to improve on usability, performance, and functionality by focusing on the following features:

  • S3 authorization with IAM roles – Currently, the S3 DataPipes support explicit access credentials, instance profiles, and S3 bucket policies. However, there are use cases where IAM roles are preferred.
  • Double buffering – We plan to offer double buffering to support multi-worker downloading.
  • Local caching – We plan on making model training able to traverse the training dataset for multiple passes. Local caching after the first epoch can cut out time of flight delays from Amazon S3, which can substantially accelerate data retrieval time for subsequent epochs.
  • Customizable configuration – We plan to expose more parameters such as internal buffer size, multi-part chunk size, and executor count and allow users to further tune data loading efficiency.
  • Amazon S3 upload – We plan to expand the S3 DataPipes to support upload for checkpointing.
  • Merge with fsspec – fsspec is used in other systems such as torch.save(). We can integrate the new S3 DataPipes with fsspec so they can have more use cases.

Acknowledgement

We would like to thank Vijay Rajakumar and Kiuk Chung from Amazon for providing their guidance for S3 Common RunTime and PyTorch DataLoader. We also want to thank Erjia Guan, Kevin Tse, Vitaly Fedyunin, Mark Saroufim, Hamid Shojanazeri, Matthias Reso, and Geeta Chauhan from Meta AI/ML, and Joe Evans from AWS for reviewing the blog and the GitHub PRs.

How to Accelerate PyTorch Geometric on Intel® CPUs

Overview

The Intel PyTorch team has been collaborating with the PyTorch Geometric (PyG) community to provide CPU performance optimizations for Graph Neural Network (GNN) and PyG workloads. In the PyTorch 2.0 release, several critical optimizations were introduced to improve GNN training and inference performance on CPU. Developers and researchers can now take advantage of Intel’s AI/ML Framework optimizations for significantly faster model training and inference, which benefits GNN workflows that use PyG directly.

In this blog, we will perform a deep dive on how to optimize PyG performance for both training and inference while using the PyTorch 2.0 flagship torch.compile feature to speed up PyG models.

Message Passing Paradigm

Message passing refers to the process of nodes exchanging information with their respective neighbors by sending messages to one another. In PyG, the process of message passing can be generalized into three steps:

  1. Gather: Collect edge-level information of adjacent nodes and edges.
  2. Apply: Update the collected information with user-defined functions (UDFs).
  3. Scatter: Aggregate to node-level information, e.g., via a particular reduce function such as sum, mean, or max.

Figure 1: The message passing paradigm (Source: Matthias Fey)
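The three steps listed above can be sketched on a tiny graph. The example uses plain Python lists and dictionaries rather than PyTorch tensors so that the data movement is explicit. The `message_passing` function, the doubling UDF, and the choice to give nodes without incoming edges a value of 0.0 are all assumptions made for illustration, not PyG's API.

```python
# Gather / apply / scatter on a tiny directed graph, in plain Python.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]      # (source, destination)
features = {0: 1.0, 1: 2.0, 2: 3.0}           # one scalar feature per node

def message_passing(edges, features, apply_fn, reduce_fn):
    # 1. Gather: collect the source-node feature for every edge.
    gathered = [(dst, features[src]) for src, dst in edges]
    # 2. Apply: transform each edge-level message with a user-defined function.
    messages = [(dst, apply_fn(value)) for dst, value in gathered]
    # 3. Scatter: group messages by destination node and reduce them.
    inbox = {node: [] for node in features}
    for dst, msg in messages:
        inbox[dst].append(msg)
    # Assumption: nodes with no incoming messages receive 0.0.
    return {node: reduce_fn(msgs) if msgs else 0.0 for node, msgs in inbox.items()}

out = message_passing(edges, features, apply_fn=lambda x: 2 * x, reduce_fn=sum)
```

Swapping `reduce_fn` for `max` or a mean function changes only the scatter step, which mirrors how the reduce type is a parameter of the aggregation.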

Message passing performance is highly related to the storage format of the adjacency matrix of the graph, which records how pairs of nodes are connected. Two methods for the storage format are:

  • Adjacency matrix in COO (Coordinate Format): The graph data is physically stored in a two-dimensional tensor shape of [2, num_edges], which maps each connection of source and destination nodes. The performance hotspot is scatter-reduce.
  • Adjacency matrix in CSR (Compressed Sparse Row): Similar format to COO, but compressed on the row indices. This format allows for more efficient row access and faster sparse matrix-matrix multiplication (SpMM). The performance hotspot is sparse matrix related reduction ops.
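To show how the two formats relate, the following sketch converts a small COO edge list into CSR arrays using only the standard library. The function name `coo_to_csr` and the variable names `rowptr` and `col` are illustrative choices, not a library API.

```python
def coo_to_csr(rows, cols, num_nodes):
    """Convert COO (row, col) pairs into CSR (rowptr, col) arrays."""
    # Order edges by row (then column) so each row's entries are contiguous.
    order = sorted(range(len(rows)), key=lambda e: (rows[e], cols[e]))
    rowptr = [0] * (num_nodes + 1)
    for r in rows:
        rowptr[r + 1] += 1                # count entries per row
    for i in range(num_nodes):
        rowptr[i + 1] += rowptr[i]        # prefix sum gives row start offsets
    return rowptr, [cols[e] for e in order]

# COO edge index of shape [2, num_edges]: first list is rows, second is columns.
coo = [[2, 0, 1, 0], [0, 1, 2, 2]]
rowptr, col = coo_to_csr(coo[0], coo[1], num_nodes=3)
```

In CSR, the entries of row `i` are `col[rowptr[i]:rowptr[i + 1]]`, which is why row access is cheap: no search over the edge list is needed.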

Scatter-Reduce

The pattern of scatter-reduce is parallel in nature, which updates values of a self tensor using values from a src tensor at the entries specified by index. Ideally, parallelizing on the outer dimension would be most performant. However, direct parallelization leads to write conflicts, as different threads might try to update the same entry simultaneously.

Figure 2: Scatter-reduce and its optimization scheme (Source: Mingfei Ma)

To optimize this kernel, we use sorting followed by a reduction:

  • Sorting: Sort the index tensor in ascending order with parallel radix sort, such that indices pointing to the same entry in the self tensor are managed in the same thread.
  • Reduction: Paralleled on the outer dimension of self, and do vectorized reduction for each indexed src entry.
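The sort-then-reduce scheme can be illustrated with a sequential sketch. It uses Python's built-in sort in place of a parallel radix sort and a plain loop in place of vectorized, multi-threaded reduction, so it shows only the structure of the approach. `scatter_sum_sorted` is a hypothetical name.

```python
def scatter_sum_sorted(index, src, out_size):
    """Scatter-sum that sorts by index first, then reduces one segment per output row."""
    # Sorting: group positions of `src` that target the same output entry.
    order = sorted(range(len(index)), key=lambda i: index[i])
    # Segment boundaries: start offset of each output row in the sorted order.
    starts = [0] * (out_size + 1)
    for i in index:
        starts[i + 1] += 1
    for row in range(out_size):
        starts[row + 1] += starts[row]
    # Reduction: each output row owns a disjoint segment, so rows could be
    # handed to different threads without write conflicts.
    out = [0.0] * out_size
    for row in range(out_size):
        for pos in order[starts[row]:starts[row + 1]]:
            out[row] += src[pos]
    return out

result = scatter_sum_sorted(index=[2, 0, 2, 1, 0],
                            src=[1.0, 2.0, 3.0, 4.0, 5.0],
                            out_size=3)
```

Because every output entry is written by exactly one segment, the outer loop over rows is the natural place to parallelize.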

For its backward path during the training process (i.e., gather), sorting is not needed because its memory access pattern will not lead to any write conflicts.

SpMM-Reduce

Sparse matrix-matrix reduction (SpMM-reduce) is a fundamental operator in GNNs. It combines A, a sparse adjacency matrix in CSR format, with B, a dense feature matrix, and the reduction type can be sum, mean, or max.

Figure 3: SpMM optimization scheme (Source: Mingfei Ma)
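As a reference for what SpMM-reduce computes, here is a straightforward, unoptimized sketch in plain Python. For each row of the CSR matrix A it multiplies every nonzero by the matching row of the dense matrix B and reduces the products per feature column. The function name `spmm_reduce` and the choice to return zeros for rows without nonzeros are assumptions made for the example.

```python
def spmm_reduce(rowptr, col, values, dense, reduce="sum"):
    """For CSR matrix A and dense B, reduce A[i, k] * B[k, :] over the nonzeros of row i."""
    num_feats = len(dense[0])
    out = []
    for i in range(len(rowptr) - 1):
        products = [
            [values[e] * dense[col[e]][f] for f in range(num_feats)]
            for e in range(rowptr[i], rowptr[i + 1])
        ]
        if not products:
            out.append([0.0] * num_feats)   # assumption: empty rows yield zeros
            continue
        columns = list(zip(*products))      # regroup products by feature column
        if reduce == "sum":
            out.append([sum(c) for c in columns])
        elif reduce == "mean":
            out.append([sum(c) / len(c) for c in columns])
        elif reduce == "max":
            out.append([max(c) for c in columns])
        else:
            raise ValueError(f"unsupported reduce: {reduce}")
    return out

# A is a 3x3 adjacency matrix with unit weights; B has two features per node.
rowptr, col, values = [0, 2, 3, 3], [1, 2, 2], [1.0, 1.0, 1.0]
B = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
summed = spmm_reduce(rowptr, col, values, B, "sum")
maxed = spmm_reduce(rowptr, col, values, B, "max")
```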

The biggest challenge when optimizing this kernel is how to balance thread payload when parallelizing along rows of the sparse matrix A. Each row in A corresponds to a node, and its number of connections may vary vastly from one to another; this results in thread payload imbalance. One technique to address such issues is to do payload scanning before thread partition. Aside from that, other techniques such as vectorization, unrolling, and blocking are also used to further exploit CPU performance.
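One plausible reading of payload scanning is to partition rows by cumulative nonzero count instead of by row count. The sketch below follows that reading and is not taken from the PyTorch kernel: `partition_rows_by_nnz` is a hypothetical helper that searches the CSR row pointer, which is already a prefix sum of per-row nonzeros, for evenly spaced work targets.

```python
from bisect import bisect_left

def partition_rows_by_nnz(rowptr, num_threads):
    """Split rows into contiguous chunks with roughly equal nonzero counts.

    `rowptr` is already the prefix sum (scan) of per-row nonzero counts, so
    each thread boundary is found by searching it for an even share of the work.
    """
    num_rows, total_nnz = len(rowptr) - 1, rowptr[-1]
    bounds = [0]
    for t in range(1, num_threads):
        target = total_nnz * t / num_threads
        # First row boundary whose cumulative nonzero count reaches the target.
        bounds.append(max(bounds[-1], min(num_rows, bisect_left(rowptr, target))))
    bounds.append(num_rows)
    return [(bounds[t], bounds[t + 1]) for t in range(num_threads)]

# Row 0 is a hub with 8 nonzeros; rows 1-4 have one each (12 in total).
rowptr = [0, 8, 9, 10, 11, 12]
chunks = partition_rows_by_nnz(rowptr, num_threads=2)
```

In this example, an even split by row count would give one thread 10 nonzeros and the other 2, whereas splitting on the scan gives 8 and 4, with the hub row isolated on its own thread.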

These optimizations are done via torch.sparse.mm using the reduce flags of amax, amin, mean, sum.

Performance Gains: Up to 4.1x Speedup

We collected benchmark performance for both inference and training in pytorch_geometric/benchmark and in the Open Graph Benchmark (OGB) to demonstrate the performance improvement from the above-mentioned methods on Intel® Xeon® Platinum 8380 Processor.

Model – Dataset                        Option               Speedup ratio
GCN-Reddit (inference)                 512-2-64-dense       1.22x
GCN-Reddit (inference)                 1024-3-128-dense     1.25x
GCN-Reddit (inference)                 512-2-64-sparse      1.31x
GCN-Reddit (inference)                 1024-3-128-sparse    1.68x
GraphSage-ogbn-products (inference)    1024-3-128-dense     1.15x
GraphSage-ogbn-products (inference)    512-2-64-sparse      1.20x
GraphSage-ogbn-products (inference)    1024-3-128-sparse    1.33x
GraphSage-ogbn-products (inference)    full-batch-sparse    4.07x
GCN-PROTEINS (training)                3-32                 1.67x
GCN-REDDIT-BINARY (training)           3-32                 1.67x
GCN-Reddit (training)                  512-2-64-dense       1.20x
GCN-Reddit (training)                  1024-3-128-dense     1.12x

Table 1: Performance Speedup on PyG Benchmark¹

From the benchmark results, we can see that our optimizations in PyTorch and PyG achieved 1.1x-4.1x speed-up for inference and training.

torch.compile for PyG

The PyTorch 2.0 flagship feature torch.compile is fully compatible with the PyG 2.3 release, bringing additional speedup in PyG model inference and training over imperative mode, thanks to the TorchInductor C++/OpenMP backend for CPUs. In particular, a 3.0x – 5.4x performance speedup is measured on basic GNN models with the Intel Xeon Platinum 8380 Processor on model training².

Figure 4: Performance Speedup with Torch Compile

torch.compile can fuse the multiple stages of message passing into a single kernel, which provides significant speedup due to the saved memory bandwidth. Refer to this PyTorch Geometric tutorial for additional support.

Please note that torch.compile within PyG is in beta mode and under active development. Currently, some features do not yet work together seamlessly such as torch.compile(model, dynamic=True), but fixes are on the way from Intel.

Conclusion & Future Work

In this blog, we introduced the GNN performance optimizations included in PyTorch 2.0 on CPU. We are closely collaborating with the PyG community for future optimization work, which will focus on in-depth optimizations from torch.compile, sparse optimization, and distributed training.

Acknowledgement

The results presented in this blog are a joint effort of the Intel PyTorch team and Kumo. Special thanks to Matthias Fey (Kumo), Pearu Peterson (Quansight) and Christian Puhrsch (Meta) who spent precious time and gave substantial assistance! Together, we made one more step forward on the path of improving the PyTorch CPU ecosystem.

Footnotes

Product and Performance Information

¹ Platinum 8380: 1-node, 2x Intel Xeon Platinum 8380 processor with 256GB (16 slots/ 16GB/3200) total DDR4 memory, uCode 0xd000389, HT on, Turbo on, Ubuntu 20.04.5 LTS, 5.4.0-146-generic, INTEL SSDPE2KE016T8 1.5T; GCN + Reddit FP32 inference, GCN+Reddit FP32 training, GraphSAGE + ogbn-products FP32 inference, GCN-PROTEINS, GCN-REDDIT-BINARY FP32 training; Software: PyTorch 2.1.0.dev20230302+cpu, pytorch_geometric 2.3.0, torch-scatter 2.1.0, torch-sparse 0.6.16, test by Intel on 3/02/2023.

² Platinum 8380: 1-node, 2x Intel Xeon Platinum 8380 processor with 256GB (16 slots/ 16GB/3200) total DDR4 memory, uCode 0xd000389, HT on, Turbo on, Ubuntu 20.04.5 LTS, 5.4.0-146-generic, INTEL SSDPE2KE016T8 1.5T; GCN, GraphSAGE, GIN and EdgeCNN, FP32; Software: PyTorch 2.1.0.dev20230411+cpu, pytorch_geometric 2.4.0, torch-scatter 2.1.1+pt20cpu, torch-sparse 0.6.17+pt20cpu, test by Intel on 4/11/2023.

³ Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Optimizing LibTorch-based inference engine memory usage and thread-pooling

Outline

In this blog post, we show how to optimize a LibTorch-based inference engine to maximize throughput by reducing memory usage and optimizing the thread-pooling strategy. We apply these optimizations to Pattern Recognition engines for audio data, for example, music and speech recognition or acoustic fingerprinting. The optimizations discussed in this blog post allow for a 50% reduction in memory usage and a 37.5% reduction in end-to-end inference latency. These optimizations are also applicable to computer vision and natural language processing.

Audio Recognition Inferencing

Audio Recognition (AR) engines can be used to recognize and identify sound patterns. Examples include identifying the type and species of a bird from audio recordings, distinguishing music from the singer’s voice, or detecting an abnormal sound indicating a breach in a building. To identify sounds of interest, AR engines process audio through 4 stages:

  1. File Validation: The AR engine validates the input audio file.
  2. Feature Extraction: Features are extracted from each segment within the audio file.
  3. Inference: LibTorch performs inference using CPUs or accelerators. In our case, we use Intel processors on an Amazon Elastic Compute Cloud (EC2) instance.
  4. Post-processing: A post-processing model decodes the results and calculates scores that are used to convert inference output into tags or transcripts.

Of these 4 steps, inference is the most computationally intensive and can take up to 50% of the pipeline processing time depending on the model complexity. This means that any optimization at this stage has a significant impact on the overall pipeline. 

Optimizing the Audio Recognition engine with concurrency…is not so simple

Our objective for this processing pipeline is to convert audio segments into tags or transcripts. The input data is an audio file composed of several short sound segments (S1 to S6 in Figure 1). The output data corresponds to tags or transcripts ordered by timestamps.

Figure 1: Example audio file with segment boundaries

Each segment can be processed independently and in an out-of-order fashion. This offers the opportunity to process segments concurrently and in parallel to optimize the overall inference throughput as well as maximize the usage of the resources.
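The pattern of processing segments out of order and then restoring timestamp order can be sketched with Python's standard thread pool. This only illustrates the control flow: the engine described in this post is a C++ application, and `recognize` is a hypothetical stand-in for the feature extraction, inference, and post-processing stages.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def recognize(segment):
    """Stand-in for feature extraction, inference, and post-processing."""
    start_time, audio = segment
    return start_time, f"tag-for-{audio}"

# (start timestamp in seconds, placeholder payload) for segments S1..S6.
segments = [(0.0, "S1"), (2.5, "S2"), (4.0, "S3"),
            (7.5, "S4"), (9.0, "S5"), (12.0, "S6")]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(recognize, seg) for seg in segments]
    # Results arrive in completion order, which may differ from input order.
    results = [f.result() for f in as_completed(futures)]

# Restore timestamp order for the final output.
ordered_tags = [tag for _, tag in sorted(results)]
```

Keeping the timestamp alongside each result is what allows the segments to finish in any order without affecting the final output.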

Parallelization on an instance can be achieved through multi-threading (pThreads, std::threads, OpenMP) or multi-processing. The advantage of multi-threading over multi-processing is the ability to use shared memory. It enables developers to minimize data duplication by sharing data across threads, which in our case means the AR models (Figure 2). Furthermore, a reduction in memory allows us to run more pipelines in parallel by increasing the number of engine threads in order to utilize all vCPUs on our Amazon EC2 instance (c5.4xlarge in our case, which offers 16 vCPUs). In theory, we expect to see higher hardware utilization and higher throughput for our AR engine as a result.

Figure 2: Multi-threaded AR Engine

But we found these assumptions to be wrong. Indeed, we found that increasing the number of threads of the application led to an increase of the end-to-end latency for each audio segment and to a decrease of the engine throughput. For example, increasing the concurrency from 1 to 5 threads led to an increase of the latency by 4x, which had a proportional effect on decreasing the throughput. In fact, metrics showed that within the pipeline, the latency of the inference stage alone was 3x higher than its single-thread baseline.

Using a profiler, we found that the CPU Spin Time increased, potentially due to CPU oversubscription which impacts system and application performance. Given our control over the application’s multi-thread implementation, we chose to dive deeper into the stack and identify potential conflicts with LibTorch’s default settings.

Diving deeper on LibTorch’s multi-threading and its impact on concurrency

LibTorch’s parallel implementations on CPU for inference are based on global thread pools. Examples are inter-op and intra-op parallelism, which can be chosen depending on the model’s properties. In both cases, it is possible to set the number of threads in each thread pool to optimize latency and throughput.

To test whether LibTorch’s default parallelism settings had an adverse effect on our inference latency, we ran an experiment on a 16-vCPU machine with a 35-minute audio file, keeping the number of LibTorch inter-op threads constant at 1 (because our models didn’t utilize the inter-op thread pool). We collected the data shown in Figures 3 and 4.

Figure 3: CPU Utilization for different number of engine threads

Figure 4: Processing times for different number of engine threads

Execution time in Figure 4 is the end-to-end processing time for processing all the segments of the given audio file. We have 4 different configurations of LibTorch intra-threads which are 1, 4, 8, 16 and we change the number of engine threads from 1 to 16 for each intra-thread LibTorch configuration. As we see in Figure 3, CPU utilization increases with an increase in the number of engine threads for all LibTorch intra-thread configurations. But as we see in Figure 4, an increase in CPU utilization doesn’t translate into lower execution time. We found out that in all but one case, as the number of engine threads shot up, so did execution time. The one exception was the case where the intra-thread pool size was 1.

Resolving the global thread pool issue

Using too many threads with a global thread pool caused oversubscription, which in turn degraded performance. Without disabling the LibTorch global thread pools, it was difficult to match the performance of the multi-process engine.

Disabling the LibTorch global thread pool is as simple as setting the intra-op/inter-op parallelism threads to 1, as shown here:

at::set_num_threads(1);          // Disables the intra-op thread pool.
at::set_num_interop_threads(1);  // Disables the inter-op thread pool.

As shown in Figure 4, the lowest processing time was measured when the LibTorch global thread pool was disabled.
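With the library-level pools disabled, all parallelism comes from the engine’s own threads. The following is a minimal sketch of that pattern, not our actual engine: `run_engine` and `process_segment` are hypothetical names, and the sketch uses only the C++ standard library so the per-segment work stays a placeholder. It assumes each call to `process_segment` runs single-threaded, which is what setting both LibTorch thread counts to 1 is meant to guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical engine loop (illustrative only): `num_engine_threads` workers
// pull segment indices from a shared atomic counter until none are left.
// Returns the number of segments that were processed.
inline std::size_t run_engine(std::size_t num_segments,
                              unsigned num_engine_threads,
                              const std::function<void(std::size_t)>& process_segment) {
    assert(num_engine_threads > 0);
    std::atomic<std::size_t> next{0};  // next segment index to hand out
    std::atomic<std::size_t> done{0};  // segments completed so far

    auto worker = [&]() {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= num_segments) break;  // no work left for this worker
            process_segment(i);            // placeholder for one model inference
            done.fetch_add(1);
        }
    };

    std::vector<std::thread> workers;
    workers.reserve(num_engine_threads);
    for (unsigned t = 0; t < num_engine_threads; ++t) workers.emplace_back(worker);
    for (auto& w : workers) w.join();  // wait for every engine thread to finish
    return done.load();
}
```

In a real engine, `process_segment` would wrap the model’s forward call, and the two `at::set_num_*` calls would typically be made once at startup, before the workers are spawned. Capping `num_engine_threads` at the number of available vCPUs keeps the total thread count within what the machine can run concurrently.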

This solution improved AR engine throughput in several cases. However, when evaluating long inputs (audio files longer than two hours in a load test), we found that the engine’s memory footprint gradually increased.

Optimizing memory usage

We ran a load test on the system with two-hour-long audio files and found that the observed memory increase was the result of memory fragmentation within multi-threaded LibTorch inference. We resolved this using jemalloc, a general-purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support. With jemalloc, our peak memory usage decreased by an average of 34% and our average memory usage decreased by 53%.

Figure 5: Memory usage over time using the same input file with and without jemalloc

Summary

To optimize the performance of multi-threaded LibTorch-based inference engines, we recommend verifying that there is no oversubscription problem in LibTorch. In our case, all threads in the multi-threaded engine were sharing the LibTorch global thread pool, which caused oversubscription. We remedied this by disabling the inter-op and intra-op global thread pools, setting the thread count of each to 1. To reduce the memory footprint of a multi-threaded engine, we recommend using jemalloc as the memory allocator instead of the default malloc implementation.
