INT8 Quantization for x86 CPU in PyTorch

Overview

INT8 quantization is a powerful technique for speeding up deep learning inference on x86 CPU platforms. By reducing the precision of the model’s weights and activations from 32-bit floating-point (FP32) to 8-bit integer (INT8), INT8 quantization can significantly improve the inference speed and reduce memory requirements without sacrificing accuracy.

In this blog, we will discuss the recent progress on INT8 quantization for x86 CPU in PyTorch, focusing on the new x86 quantization backend. We will also briefly look at the new quantization path with PyTorch 2.0 Export (PT2E) and TorchInductor.

X86 Quantization Backend

The currently recommended way of quantization in PyTorch is FX graph mode quantization. Before PyTorch 2.0, the default quantization backend (a.k.a. QEngine) on x86 CPUs was FBGEMM, which leveraged the FBGEMM performance library to achieve the performance speedup. In the PyTorch 2.0 release, a new quantization backend called X86 was introduced to replace FBGEMM. The x86 quantization backend offers improved INT8 inference performance compared to the original FBGEMM backend by leveraging the strengths of both the FBGEMM and Intel® oneAPI Deep Neural Network Library (oneDNN) kernel libraries.

Performance Benefit from X86 Backend

To measure the performance benefits of the new X86 backend, we ran INT8 inference on 69 popular deep learning models (shown in Figures 1-3 below) using 4th Gen Intel® Xeon® Scalable processors. The results showed a 2.97X geomean performance speedup compared to FP32 inference performance, while the speedup was 1.43X with the FBGEMM backend. The charts below show the per-model performance speedup comparing the x86 backend and the FBGEMM backend.

Figure 1: Models with less than 2x performance boost with x86 backend1

Figure 2: Models with 2x-4x performance boost with x86 backend1

Figure 3: Models with larger than 4x performance boost with x86 backend1

Usage of x86 Backend

By default in 2.0, users on x86 platforms will use the x86 quantization backend and their PyTorch programs will remain unchanged when using the default backend. Alternatively, users can specify x86 as the quantization backend explicitly.
Below is an example code snippet of PyTorch static post-training quantization with x86 quantization backend.

import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

qconfig_mapping = get_default_qconfig_mapping()
# Or explicitly specify the qengine
# qengine = 'x86'
# torch.backends.quantized.engine = qengine
# qconfig_mapping = get_default_qconfig_mapping(qengine)

model_fp32 = MyModel().eval()
x = torch.randn((1, 3, 224, 224), dtype=torch.float)
x = x.to(memory_format=torch.channels_last)

# Insert observers according to qconfig and backend config
prepared_model = prepare_fx(model_fp32, qconfig_mapping, example_inputs=x)

# Calibration code not shown

# Convert to quantized model
quantized_model = convert_fx(prepared_model)

Technical Details of x86 Backend

We devised heuristic dispatching rules according to the performance numbers from the models we benchmarked to decide whether to invoke oneDNN or FBGEMM performance library to execute the convolution or matrix multiplication operations. The rules are a combination of operation kinds, shapes, CPU architecture information, etc. Detailed logic is available here. For more design and technical discussion, please refer to the Request for Comments.

Next Steps With a New Quantization Path PyTorch 2.0 Export

Although still far from finalized, a new quantization path, PyTorch 2.0 Export (PT2E), is in early design and PoC stage. The new approach is slated to replace the FX quantization path in the future. It is built upon the capabilities of TorchDynamo Export, a feature introduced in the PyTorch 2.0 release for FX graph capturing. This graph is then quantized and lowered to different backends. TorchInductor, the new DL compiler of PyTorch, has shown promising results in terms of FP32 inference speedup on x86 CPU. We are working actively to enable it as one of the quantization backends of PT2E. We believe the new path will lead to further improvements in INT8 inference performance due to more flexibility of fusion at different levels.

Conclusion

The x86 backend introduced in the PyTorch 2.0 release has demonstrated a remarkable improvement in INT8 inference speed on x86 CPU platforms. It delivers a 2.97X geomean speedup over FP32 inference, compared to 1.43X with the original FBGEMM backend, while maintaining backward compatibility. This enhancement can benefit end users with minimal or no modifications to their programs. Furthermore, a new quantization path, PT2E, is currently in development and is expected to provide even more possibilities in the future.

Acknowledgement

Special thanks to Nikita Shulga, Vasiliy Kuznetsov, Supriya Rao, and Jongsoo Park. Together, we made one more step forward on the path of improving the PyTorch CPU ecosystem.

Configuration

1 AWS EC2 r7iz.metal-16xl instance (Intel(R) Xeon(R) Gold 6455B, 32-core/64-thread, Turbo Boost On, Hyper-Threading On, Memory: 8x64GB, Storage: 192GB); OS: Ubuntu 22.04.1 LTS; Kernel: 5.15.0-1028-aws; Batch Size: 1; Cores per Instance: 4; PyTorch 2.0 RC3; TorchVision 0.15.0+cpu; tested by Intel on 3/77/2023. May not reflect all publicly available security updates.

Read More

Hugging Face Joins the PyTorch Foundation as a Premier Member

The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Hugging Face has joined as a premier member.

Hugging Face has been a long time supporter and contributor to the PyTorch Ecosystem by providing powerful models and resources that accelerate research, development, and adoption of AI technologies, particularly in the field of natural language processing.

“Our mission has always been to democratize AI and make it accessible to everyone. We’re truly aligned with PyTorch’s objective of reducing the barrier of entry to practitioners. By joining the PyTorch Foundation, we can further amplify that impact and support this very important framework of the ecosystem that is PyTorch,” said Lysandre Debut, Head of Open Source at Hugging Face. “We believe the two ecosystems have significant overlap, and collaborating with the foundation will allow us to bridge the gap to provide the best software, the best tools to the machine learning community at large.”

Hugging Face’s Model Hub and open source libraries promote collaboration and knowledge sharing within the AI open source community, making Hugging Face a great match to the growing PyTorch Foundation. They continue to drive industry adoption and collaboration by creating user-friendly tools and resources and providing accessible and well-documented libraries.

“Hugging Face’s commitment to open source development and their exceptional contributions to the PyTorch ecosystem have truly impressed us. With their help, we will drive innovation, foster collaboration, and empower the global AI community to create transformative solutions for the AI community,” said PyTorch Foundation Executive Director Ibrahim Haddad. “We welcome Hugging Face to the PyTorch Foundation and look forward to the achievements that lie ahead.”

As a premier member, Hugging Face is granted one seat to the PyTorch Foundation Governing Board. The Board sets policy through our bylaws, mission and vision statements, describing the overarching scope of foundation initiatives, technical vision, and direction.

We’re happy to welcome Lysandre Debut, Head of Open Source at Hugging Face to our board. Lysandre has been at Hugging Face since the company’s pivot to open-source, and was the first engineer to focus entirely on the open-source mission. Now leading the open-source part of the organization, Lysandre remains technically involved by being a core maintainer of the Transformers library.

To learn more about how you can be a part of the PyTorch Foundation, visit our website.

About Hugging Face

Hugging Face is a community and company dedicated to lowering the barrier of entry to Machine Learning and Deep Learning. Strong advocates for open-source and open-science, their model Hub hosts more than 250,000 public models and 50,000 public datasets that are very simple to use. Transformers, Diffusers, PEFT, Accelerate, and Datasets are some of the open-source tools made available by Hugging Face.

About PyTorch Foundation

The PyTorch Foundation is a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem. The PyTorch Foundation is supported by its members and leading contributors to the PyTorch open source project. The Foundation leverages resources provided by members and contributors to enable community discussions and collaboration.

About The Linux Foundation

The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects are critical to the world’s infrastructure including Linux, Kubernetes, Node.js, ONAP, PyTorch, RISC-V, SPDX, OpenChain, and more. The Linux Foundation focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org. The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see its trademark usage page: www.linuxfoundation.org/trademark-usage. Linux is a registered trademark of Linus Torvalds.

Read More

AMD’s Journey to Openness and Performance

AMD has made progress in building a robust software stack that supports an open ecosystem of models, libraries, frameworks, and tools. With proven platforms gaining momentum, a leading software stack and an optimized ecosystem are essential for achieving application performance. PyTorch is a key part of AMD’s AI journey, and AMD President Victor Peng and PyTorch founder Soumith Chintala discussed the latest progress at the DC & AI Keynote on June 12.

Building a Powerful SW Stack with ROCm

Victor introduced ROCm, AMD’s SW stack for Instinct Data Center GPUs. It offers a comprehensive set of open-source libraries, runtime, compilers, and tools for developing, running, and fine-tuning AI models. The fifth generation ROCm incorporates optimizations for AI and high-performance computing workloads, including tailored kernels for low-latency memory systems, support for new data types, and integration with OpenAI Triton. With tools for porting AI software to AMD Instinct platforms, ROCm ensures quality and robustness, tested extensively and compliant with PyTorch and TensorFlow frameworks.

Collaboration with PyTorch

To highlight the AMD and PyTorch partnership, Victor hosted a discussion with Soumith Chintala, the founder of PyTorch. PyTorch, the industry’s leading AI framework, boasts a vibrant developer community and is known for its continuous innovation and incorporation of cutting-edge research. The latest version, PyTorch 2.0, integrates with hardware-agnostic software compilers like OpenAI Triton, enabling efficient training and deployment of AI models. With optimized techniques, PyTorch 2.0 enhances productivity and offers remarkable speed improvements. The collaboration between AMD and the PyTorch Foundation ensures seamless utilization of AMD GPUs, expanding AI accelerator accessibility worldwide and paving the way for future optimizations and broader hardware support.

Empowering the Developer Community

The partnership between AMD and PyTorch benefits the developer community by democratizing access to AI accelerators. Support for AMD GPUs in PyTorch allows developers to train and deploy models across various platforms, including CPUs like EPYC and Ryzen, GPUs like Instinct and Radeon, and embedded devices like Versal SoCs. By ensuring immediate compatibility of new models on AMD platforms, the collaboration streamlines the development process and empowers developers to leverage the full potential of AMD’s hardware. This increased accessibility and flexibility enable developers worldwide to push the boundaries of AI innovation.

Hugging Face and AI Model Innovation

Victor praised Hugging Face as the leading force behind open-source AI model innovation, empowering generative AI with transformative transformers. AMD’s optimized software enables a high-performing development stack, supporting groundbreaking AI advancements for customers and developers through scalable real-world deployments.

Conclusion

At the DC & AI Keynote, AMD demonstrated its dedication to openness, performance, and collaboration. The ROCm SW stack, PyTorch integration, and support for Hugging Face exemplify AMD’s commitment to empowering developers and researchers to achieve AI breakthroughs. By offering accessible, high-performing solutions, AMD fuels the future of AI as a leading GPU platform integrated with PyTorch.

To listen to the full keynote, visit the AMD YouTube channel

To listen to Soumith Chintala’s section of the keynote

Read More

Performant Distributed checkpointing in Production with IBM

Figure: Params saved per minute

Last year, IBM Research began collaborating with us to onboard Fully Sharded Data Parallelism (FSDP) for their large foundation models. They became interested as FSDP is a PyTorch native offering for scaling their distributed training efforts on IBM Cloud.

We are pleased to share that, in collaboration with IBM, we have achieved substantial checkpointing speedups for large models (72x vs the original PyTorch 1.13 save speed), proven model and optimizer checkpoint scaling to 30B parameters, and enabled cloud first training using FSDP + Distributed Checkpoint on S3 backends.

What is a Distributed Checkpoint?

Distributed checkpointing is the PyTorch native solution for saving and loading PyTorch models and optimizer states from multiple ranks, as well as supporting dynamically changing world sizes between reloads.

Figure: Checkpoint time vs. model params

PyTorch Distributed Checkpoint (DCP) APIs were introduced in PyTorch 1.13, and are included as an official prototype feature in PyTorch 2.0.

Distributed checkpoint is different from torch.save() and torch.load() in a few significant ways:

  1. DCP produces multiple files per checkpoint, with at least one file per rank.
  2. DCP operates in place, meaning that the model should allocate its data first, and DCP then uses that storage.

A major improvement from 1.13 to 2.0 includes adding sharded_state_dict support for checkpointing FSDP models. This allows checkpointing for larger sized models, as well as adding support for load-time resharding. Load time resharding enables saving in one cluster topology, and loading into another. This feature was highly requested as it allows training jobs to be run on one cluster, saved, and then continued on a different cluster with different world size.

Another major change is that we decouple the storage layer from the checkpoint planning layer and separate implementation from the interface for both layers. With this change, users can now specify how their state_dict should be chunked or transformed during the checkpoint planning phase. Additionally, the customizable storage layer can easily accommodate different backends.
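
As an illustration of this flow, below is a minimal sketch of saving and loading an FSDP model with DCP using a sharded state_dict and the default file-system storage layer. It assumes a process group and an FSDP-wrapped model are already set up; the checkpoint path is hypothetical, and the 2.0-era API names (save_state_dict/load_state_dict, FileSystemWriter/FileSystemReader) are used here.

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemReader, FileSystemWriter
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

CKPT_DIR = "/tmp/dcp_checkpoint"  # hypothetical path

# Save: every rank writes its own shard of the sharded state_dict.
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {"model": model.state_dict()}
    dcp.save_state_dict(state_dict, storage_writer=FileSystemWriter(CKPT_DIR))

# Load: read the checkpoint back in place, possibly with a different world size.
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {"model": model.state_dict()}
    dcp.load_state_dict(state_dict, storage_reader=FileSystemReader(CKPT_DIR))
    model.load_state_dict(state_dict["model"])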

More information on the Distributed Checkpoint package can be found here.

Performant Distributed checkpointing in Production with IBM

IBM at Think 2023 announced its watsonx.ai platform for development and deployment of foundation models for the enterprise. Built on Hybrid Cloud, the platform enables use cases across multiple modalities such as NLP, timeseries, weather, chemistry, tabular data, and cybersecurity, with model sizes from 100s of millions to 10s of billions of parameters. Model architectures range from vision transformers, to multi-modal RoBERTa-style feature extractors, to large-scale generative language models similar to T5, GPT and Llama.

As of today, IBM has enabled checkpointing for T5-style architectures up to 11B parameters, and decoder architectures (GPT style) up to 30B.

IBM helped us identify that writing many separate files per rank limits the scaling power of DCP from both memory and performance standpoints. With their suggestion, we enhanced our FileSystemWriter to produce a single checkpoint file per rank to reduce read/write overhead.

With this option as the new default, DCP now creates a single file per rank during checkpoint saving, which would then be sliced when reading parameters at load time.

By combining sharded_state_dict support with the single-file-per-rank writer, distributed checkpoint was able to accelerate checkpoint saving time over 72x vs the original PyTorch 1.13 save speed, and enable rapid checkpointing for model sizes over 15B which would previously simply time out.

“Looking back, it’s really astounding the speedups we’ve seen, handling training for many of these models. We went from taking almost half an hour to write a single 11B checkpoint in PyTorch 1.13, to being able to handle a 30B parameter model, with optimizer and dataloader state – so that’s over eight times the raw data – in just over 3 minutes. That’s done wonders for both the stability and efficiency of our jobs, as we scale up training to hundreds of gpus.” – Davis Wertheimer, IBM Research

IBM’s adoption has also helped us validate and improve our solutions in a real world, large-scale training environment. As an example, IBM discovered that DCP worked well for them on a single node with multiple GPUs, but failed when used across multiple nodes.

Upon investigating the issue, we realized that we had assumed writes to an NFS-like shared file system, which provides strong read-after-write consistency. Object stores with file system APIs such as S3FS provide eventual consistency semantics, causing distributed checkpointing in such a setting to fail. Working together with IBM, we identified this issue and fixed it with a one-line code change, enabling object storage backends for DCP! Such storage approaches are typically an order of magnitude cheaper than shared file systems, thus enabling finer grained checkpointing.

Looking for Collaboration

If you are interested in trying Distributed Checkpoint, feel free to reach out to us!

If you run into any issues when trying it, you can open an issue at our GitHub repo.

Acknowledgements

This project would not have been possible without the assistance from many collaborators. We would like to thank Yanli Zhao, Andrew Gu, Rohan Varma for their support of FSDP. Thanks to Pritam Damania, Junjie Zhao, and Wanchao Liang for their support of ShardedTensor.

Read More

IBM Joins the PyTorch Foundation as a Premier Member

The PyTorch Foundation, part of The Linux Foundation, is pleased to announce that IBM has joined as a premier member.

The foundation serves as a neutral space for the deep learning community to collaborate on the open source PyTorch framework and ecosystem. With its extensive industry expertise and leadership in open source and AI, IBM is committed to actively contributing to the PyTorch community.

IBM offers a comprehensive portfolio of enterprise AI solutions and recently released watsonx, its next-generation data and AI platform. IBM’s watsonx platform leverages PyTorch to offer an enterprise-grade software stack for end-to-end training and fine-tuning of AI foundation models.

“By joining the PyTorch Foundation, we aim to contribute our expertise and resources to further advance PyTorch’s capabilities and make AI more accessible in hybrid cloud environments with flexible hardware options,” said Priya Nagpurkar, Vice President, Hybrid Cloud Platform and Developer Productivity, IBM Research. “We intend for our collaboration with PyTorch to bring the power of foundation models and generative AI to enterprises using the watsonx platform to drive business transformation.”

IBM and PyTorch have already collaborated on two projects. The first enables foundation models with billions of parameters to train efficiently on standard cloud networking infrastructure, such as Ethernet networking. Together, IBM and PyTorch have also worked on ways to make checkpointing for AI training considerably more cost-effective, by fixing the distributed checkpointing within PyTorch to support certain types of object storage.

“We’re happy to welcome IBM as a premier member. IBM’s expertise and dedication to advancing the field of artificial intelligence align perfectly with the mission of the PyTorch community,” said PyTorch Foundation Executive Director Ibrahim Haddad. “Their commitment to open collaboration and innovation will strengthen our collective efforts to empower developers and researchers worldwide.”

As a premier member, IBM is granted one seat to the PyTorch Foundation Governing Board. The Board sets policy through our bylaws, mission and vision statements, describing the overarching scope of foundation initiatives, technical vision, and direction.

We’re happy to welcome Raghu Ganti, Principal Research Scientist at IBM Research, to our board. Raghu co-leads IBM Research’s foundation model training and validation platform, built on Red Hat OpenShift. His team primarily contributes to the PyTorch training components, with the mission of democratizing training and validation of foundation models.

To learn more about how you can be a part of the PyTorch Foundation, visit our website.

Read More

Announcing CPP-based S3 IO DataPipes

Training large deep learning models requires large datasets. Amazon Simple Storage Service (Amazon S3) is a scalable cloud object store service used for storing large training datasets. Machine learning (ML) practitioners need an efficient data pipe that can download data from Amazon S3, transform the data, and feed the data to GPUs for training models with high throughput and low latency.

In this post, we introduce the new S3 IO DataPipes for PyTorch, S3FileLister and S3FileLoader. For memory efficiency and fast runs, the new DataPipes use the C++ extension to access Amazon S3. Benchmarking shows that S3FileLoader is 59.8% faster than FSSpecFileOpener for downloading a natural language processing (NLP) dataset from Amazon S3. You can build IterDataPipe training pipelines with the new DataPipes. We also demonstrate that the new DataPipe can reduce overall Bert and ResNet50 training time by 7%. The new DataPipes have been upstreamed to the open-source TorchData 0.4.0 with PyTorch 1.12.0.

Overview

Amazon S3 is a scalable cloud storage service with no limit on data volume. Loading data from Amazon S3 and feeding the data to high-performance GPUs such as NVIDIA A100 can be challenging. It requires an efficient data pipeline that can meet the data processing speed of GPUs. To help with this, we released a new high performance tool for PyTorch: S3 IO DataPipes. DataPipes are subclassed from torchdata.datapipes.iter.IterDataPipe, so they can interact with the IterableDataPipe interface. Developers can quickly build their DataPipe DAGs to access, transform, and manipulate data with shuffle, sharding, and batch features.

The new DataPipes are designed to be file-format agnostic; Amazon S3 data is downloaded as binary large objects (BLOBs). They can be used as composable building blocks to assemble a DataPipe graph that can load tabular, NLP, and computer vision (CV) data into your training pipelines.

Under the hood, the new S3 IO DataPipes employ a C++ S3 handler with the AWS C++ SDK. In general, a C++ implementation is more memory efficient and has better CPU core usage (no Global Interpreter Lock) in threading compared to Python. The new C++ S3 IO DataPipes are recommended for high throughput, low latency data loading in training large deep learning models.

The new S3 IO DataPipes provide two first-class citizen APIs:

  • S3FileLister – Iterable that lists S3 file URLs within the given S3 prefixes. The functional name for this API is list_files_by_s3.
  • S3FileLoader – Iterable that loads S3 files from the given S3 prefixes. The functional name for this API is load_files_by_s3. (A short sketch combining both APIs follows this list.)
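
Below is a minimal, hedged sketch of chaining the two DataPipes; the bucket and prefix are placeholders, and it assumes S3FileLoader yields (URL, stream) pairs of raw bytes as described above.

from torchdata.datapipes.iter import IterableWrapper

# Hypothetical bucket/prefix; replace with your own S3 location.
s3_prefixes = IterableWrapper(["s3://bucket/prefix/"])
file_urls = s3_prefixes.list_files_by_s3()     # S3FileLister: yields S3 file URLs
file_blobs = file_urls.load_files_by_s3()      # S3FileLoader: yields (url, stream) pairs

for url, stream in file_blobs:
    blob = stream.read()   # raw bytes (BLOB); decode according to your file format
    break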

Usage

In this section, we provide instructions for using the new S3 IO DataPipes. We also provide a code snippet for load_files_by_s3().

Build from source

The new S3 IO DataPipes use the C++ extension. It is built into the torchdata package by default. However, if the new DataPipes are not available within the environment, for example Windows on Conda, you need to build from the source. For more information, refer to Iterable Datapipes.

Configuration

Amazon S3 supports global buckets. However, a bucket is created within a Region. You can pass a Region to the DataPipes by using __init__(). Alternatively, you can either export AWS_REGION=us-west-2 into your shell or set an environment variable with os.environ['AWS_REGION'] = 'us-east-1' in your code.

To read objects in a bucket that aren’t publicly accessible, you must provide AWS credentials through one of the following methods:

Example code

The following code snippet provides a typical usage of load_files_by_s3():

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

s3_shard_urls = IterableWrapper(["s3://bucket/prefix/",])
s3_shards = s3_shard_urls.load_files_by_s3()

# text data
training_data = s3_shards.readlines(return_path=False)

data_loader = DataLoader(
    training_data,
    batch_size=batch_size,
    num_workers=num_workers,
)

# training loop
for epoch in range(epochs):
    # training step
    for batch_data in data_loader:
        # forward pass, backward pass, model update
        ...

Benchmark

In this section, we demonstrate how the new DataPipe can reduce overall Bert and ResNet50 training time.

Isolated DataLoader performance evaluation against FSSpec

FSSpecFileOpener is another PyTorch S3 DataPipe. It uses botocore and aiohttp/asyncio to access S3 data. The following is the performance test setup and result (quoted from Performance Comparison between native AWSSDK and FSSpec (boto3) based DataPipes).

The S3 data in the test is a sharded text dataset. Each shard has about 100,000 lines and each line is around 1.6 KB, making each shard about 156 MB. The measurements in this benchmark are averaged over 1,000 batches. No shuffling, sampling, or transforms were performed.

The following chart reports the throughput comparison for various batch sizes with num_workers=0 (the data loader runs in the main process). S3FileLoader delivers higher queries per second (QPS): 90% higher than fsspec at batch size 512.

The following chart reports the results with num_workers=4 (data loading runs in worker processes). S3FileLoader’s QPS is 59.8% higher than fsspec’s at batch size 512.

Training ResNet50 Model against Boto3

For the following chart, we trained a ResNet50 model on a cluster of 4 p3.16xlarge instances with a total 32 GPUs. The training dataset is ImageNet with 1.2 million images organized into 1,000-image shards. The training batch size is 64. The training time is measured in seconds. For eight epochs, S3FileLoader is 7.5% faster than Boto3.

Training a Bert model against Boto3

For the following chart, we trained a Bert model on a cluster of 4 p3.16xlarge instances with a total of 32 GPUs. The training corpus has 1,474 files. Each file has around 150,000 samples. To run a shorter epoch, we use 0.05% (approximately 75 samples) per file. The batch size is 2,048. The training time is measured in seconds. For one epoch, S3FileLoader is 7% faster than Boto3.

Comparison against the original PyTorch S3 plugin

The new PyTorch S3 DataPipes perform substantially better than the original PyTorch S3 plugin. We have tuned the internal buffer size for S3FileLoader. The loading time is measured in seconds.

For the 10 sharded Charades files (approximately 1.5 GiB each), S3FileLoader was 3.5 times faster in our experiments.

Best practices

Training large deep learning models may require a massive compute cluster with tens or even hundreds of nodes. Each node in the cluster may generate a large number of data loading requests that hit a specific S3 shard. To avoid throttling, we recommend sharding training data across S3 buckets and S3 folders.

To achieve good performance, it helps to have file sizes that are big enough to parallelize within a given file, but not so big that we hit per-object throughput limits on Amazon S3, depending on the training job. The optimal size is typically between 50 MB and 200 MB.

Conclusion and next steps

In this post, we introduced you to the new PyTorch S3 IO DataPipes. The new DataPipes use aws-sdk-cpp and show better performance than Boto3-based data loaders.

For next steps, we plan to improve on usability, performance, and functionality by focusing on the following features:

  • S3 authorization with IAM roles – Currently, the S3 DataPipes support explicit access credentials, instance profiles, and S3 bucket policies. However, there are use cases where IAM roles are preferred.
  • Double buffering – We plan to offer double buffering to support multi-worker downloading.
  • Local caching – We plan on making model training able to traverse the training dataset for multiple passes. Local caching after the first epoch can cut out time of flight delays from Amazon S3, which can substantially accelerate data retrieval time for subsequent epochs.
  • Customizable configuration – We plan to expose more parameters such as internal buffer size, multi-part chunk size, and executor count and allow users to further tune data loading efficiency.
  • Amazon S3 upload – We plan to expand the S3 DataPipes to support upload for checkpointing.
  • Merge with fsspec – fsspec is used in other systems such as torch.save(). We can integrate the new S3 DataPipes with fsspec so they can have more use cases.

Acknowledgement

We would like to thank Vijay Rajakumar and Kiuk Chung from Amazon for providing their guidance for S3 Common RunTime and PyTorch DataLoader. We also want to thank Erjia Guan, Kevin Tse, Vitaly Fedyunin, Mark Saroufim, Hamid Shojanazeri, Matthias Reso, and Geeta Chauhan from Meta AI/ML, and Joe Evans from AWS for reviewing the blog and the GitHub PRs.

References

Read More

How to Accelerate PyTorch Geometric on Intel® CPUs

Overview

The Intel PyTorch team has been collaborating with the PyTorch Geometric (PyG) community to provide CPU performance optimizations for Graph Neural Network (GNN) and PyG workloads. In the PyTorch 2.0 release, several critical optimizations were introduced to improve GNN training and inference performance on CPU. Developers and researchers can now take advantage of Intel’s AI/ML framework optimizations for significantly faster model training and inference in GNN workflows that use PyG directly.

In this blog, we will perform a deep dive on how to optimize PyG performance for both training and inference while using the PyTorch 2.0 flagship torch.compile feature to speed up PyG models.

Message Passing Paradigm

Message passing refers to the process of nodes exchanging information with their respective neighbors by sending messages to one another. In PyG, the process of message passing can be generalized into three steps:

  1. Gather: Collect edge-level information of adjacent nodes and edges.
  2. Apply: Update the collected information with user-defined functions (UDFs).
  3. Scatter: Aggregate to node-level information, e.g., via a particular reduce function such as sum, mean, or max.

Figure 1: The message passing paradigm (Source: Matthias Fey)

Message passing performance is highly related to the storage format of the adjacency matrix of the graph, which records how pairs of nodes are connected. Two methods for the storage format are:

  • Adjacency matrix in COO (Coordinate Format): The graph data is physically stored in a two-dimensional tensor shape of [2, num_edges], which maps each connection of source and destination nodes. The performance hotspot is scatter-reduce.
  • Adjacency matrix in CSR (Compressed Sparse Row): Similar format to COO, but compressed on the row indices. This format allows for more efficient row access and faster sparse matrix-matrix multiplication (SpMM). The performance hotspot is sparse matrix related reduction ops.

Scatter-Reduce

Scatter-reduce updates values of a self tensor using values from a src tensor at the entries specified by index; the pattern is parallel in nature. Ideally, parallelizing on the outer dimension would be most performant. However, direct parallelization leads to write conflicts, as different threads might try to update the same entry simultaneously.

Figure 2: Scatter-reduce and its optimization scheme (Source: Mingfei Ma)

To optimize this kernel, we use sorting followed by a reduction:

  • Sorting: Sort the index tensor in ascending order with a parallel radix sort, such that indices pointing to the same entry in the self tensor are handled by the same thread.
  • Reduction: Parallelize on the outer dimension of self, and do a vectorized reduction for each indexed src entry.

For its backward path during the training process (i.e., gather), sorting is not needed because its memory access pattern will not lead to any write conflicts.
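
For reference, the snippet below is a minimal sketch of the scatter-reduce pattern using the stock torch.Tensor.scatter_reduce_ op (not the optimized kernel described above); the shapes and index values are made up for illustration.

import torch

num_nodes, feat_dim = 5, 4
src = torch.randn(8, feat_dim)                   # edge-level messages
index = torch.tensor([0, 0, 1, 2, 2, 2, 3, 4])   # destination node of each message
out = torch.zeros(num_nodes, feat_dim)           # node-level output (the self tensor)

# Reduce all messages that point to the same node; several messages may target
# the same output row, which is where write conflicts arise under parallelism.
out.scatter_reduce_(0, index.unsqueeze(-1).expand_as(src), src, reduce="sum")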

SpMM-Reduce

Sparse matrix-matrix multiplication with reduction (SpMM-reduce) is a fundamental operator in GNNs: C = A @ B, where A is a sparse adjacency matrix in CSR format, B is a dense feature matrix, and the reduction type can be sum, mean, or max.

Figure 3: SpMM optimization scheme (Source: Mingfei Ma)

The biggest challenge when optimizing this kernel is how to balance the thread payload when parallelizing along rows of the sparse matrix A. Each row in A corresponds to a node, and its number of connections may vary vastly from one to another; this results in thread payload imbalance. One technique to address such issues is to do payload scanning before thread partitioning. Aside from that, other techniques such as vectorization, unrolling, and blocking are also introduced to further exploit CPU performance.

These optimizations are done via torch.sparse.mm using the reduce flags of amax, amin, mean, sum.
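
As a usage sketch (not the internal kernel), the reduce-enabled overload of torch.sparse.mm can be called as below; the tiny adjacency matrix is made up, and the reduce argument is the flag mentioned above.

import torch

# Toy sparse adjacency matrix A in CSR format and dense feature matrix B.
A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]]).to_sparse_csr()
B = torch.randn(3, 8)

# Fused SpMM with reduction over each row's neighbors.
out_sum = torch.sparse.mm(A, B, reduce="sum")
out_mean = torch.sparse.mm(A, B, reduce="mean")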

Performance Gains: Up to 4.1x Speedup

We collected benchmark performance for both inference and training in pytorch_geometric/benchmark and in the Open Graph Benchmark (OGB) to demonstrate the performance improvement from the above-mentioned methods on Intel® Xeon® Platinum 8380 Processor.

Model – Dataset                       Option               Speedup ratio
GCN-Reddit (inference)                512-2-64-dense       1.22x
GCN-Reddit (inference)                1024-3-128-dense     1.25x
GCN-Reddit (inference)                512-2-64-sparse      1.31x
GCN-Reddit (inference)                1024-3-128-sparse    1.68x
GraphSage-ogbn-products (inference)   1024-3-128-dense     1.15x
GraphSage-ogbn-products (inference)   512-2-64-sparse      1.20x
GraphSage-ogbn-products (inference)   1024-3-128-sparse    1.33x
GraphSage-ogbn-products (inference)   full-batch-sparse    4.07x
GCN-PROTEINS (training)               3-32                 1.67x
GCN-REDDIT-BINARY (training)          3-32                 1.67x
GCN-Reddit (training)                 512-2-64-dense       1.20x
GCN-Reddit (training)                 1024-3-128-dense     1.12x

Table 1: Performance Speedup on PyG Benchmark1

From the benchmark results, we can see that our optimizations in PyTorch and PyG achieved 1.1x-4.1x speed-up for inference and training.

torch.compile for PyG

The PyTorch 2.0 flagship feature torch.compile is fully compatible with the PyG 2.3 release, bringing additional speed-up in PyG model inference/training over imperative mode, thanks to TorchInductor’s C++/OpenMP backend for CPUs. In particular, a 3.0x – 5.4x performance speed-up is measured on basic GNN models with the Intel Xeon Platinum 8380 processor on model training2.

Figure 4: Performance Speedup with Torch Compile

torch.compile can fuse the multiple stages of message passing into a single kernel, which provides a significant speedup due to the saved memory bandwidth. Refer to this PyTorch Geometric tutorial for additional support.
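
As a minimal sketch of the workflow (the model, sizes, and data are illustrative, not from the benchmark), a PyG model can be wrapped with torch.compile and then used as usual:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

model = GCN(16, 32, 7)
compiled_model = torch.compile(model)   # TorchInductor C++/OpenMP backend on CPU

x = torch.randn(100, 16)                         # 100 nodes, 16 features each
edge_index = torch.randint(0, 100, (2, 500))     # 500 random edges
out = compiled_model(x, edge_index)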

Please note that torch.compile within PyG is in beta mode and under active development. Currently, some features do not yet work together seamlessly such as torch.compile(model, dynamic=True), but fixes are on the way from Intel.

Conclusion & Future Work

In this blog, we introduced the GNN performance optimizations included in PyTorch 2.0 on CPU. We are closely collaborating with the PyG community for future optimization work, which will focus on in-depth optimizations from torch.compile, sparse optimization, and distributed training.

Acknowledgement

The results presented in this blog are a joint effort of the Intel PyTorch team and Kumo. Special thanks to Matthias Fey (Kumo), Pearu Peterson (Quansight) and Christian Puhrsch (Meta) who spent precious time and gave substantial assistance! Together, we made one more step forward on the path of improving the PyTorch CPU ecosystem.

References

Footnotes

Product and Performance Information

1Platinum 8380: 1-node, 2x Intel Xeon Platinum 8380 processor with 256GB (16 slots/ 16GB/3200) total DDR4 memory, uCode 0xd000389, HT on, Turbo on, Ubuntu 20.04.5 LTS, 5.4.0-146-generic, INTEL SSDPE2KE016T8 1.5T; GCN + Reddit FP32 inference, GCN+Reddit FP32 training, GraphSAGE + ogbn-products FP32 inference, GCN-PROTEINS, GCN-REDDIT-BINARY FP32 training; Software: PyTorch 2.1.0.dev20230302+cpu, pytorch_geometric 2.3.0, torch-scatter 2.1.0, torch-sparse 0.6.16, tested by Intel on 3/02/2023.

2Platinum 8380: 1-node, 2x Intel Xeon Platinum 8380 processor with 256GB (16 slots/ 16GB/3200) total DDR4 memory, uCode 0xd000389, HT on, Turbo on, Ubuntu 20.04.5 LTS, 5.4.0-146-generic, INTEL SSDPE2KE016T8 1.5T; GCN, GraphSAGE, GIN and EdgeCNN, FP32; Software: PyTorch 2.1.0.dev20230411+cpu, pytorch_geometric 2.4.0, torch-scatter 2.1.1+pt20cpu, torch-sparse 0.6.17+pt20cpu, tested by Intel on 4/11/2023.

3Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Read More

Optimizing LibTorch-based inference engine memory usage and thread-pooling

Outline

In this blog post we show how to optimize a LibTorch-based inference engine to maximize throughput by reducing memory usage and optimizing the thread-pooling strategy. We apply these optimizations to pattern recognition engines for audio data, for example, music and speech recognition or acoustic fingerprinting. The optimizations discussed in this blog post reduce memory usage by 50% and end-to-end inference latency by 37.5%. These optimizations are also applicable to computer vision and natural language processing.

Audio Recognition Inferencing

Audio Recognition (AR) engines can be used to recognize and identify sound patterns. Examples include identifying the type and species of a bird from audio recordings, distinguishing music from a singer’s voice, or detecting an abnormal sound indicating a breach in a building. To identify sounds of interest, AR engines process audio through 4 stages:

  1. File Validation: The AR engine validates the input audio file.
  2. Feature Extraction: Features are extracted from each segment within the audio file.
  3. Inference: LibTorch performs inference using CPUs or accelerators; in our case, Intel processors on an Amazon Elastic Compute Cloud (EC2) instance.
  4. Post-processing: A post-processing model decodes the results and calculates scores that are used to convert inference output into tags or transcripts.

Of these 4 steps, inference is the most computationally intensive and can take up to 50% of the pipeline processing time depending on the model complexity. This means that any optimization at this stage has a significant impact on the overall pipeline. 

Optimizing the Audio Recognition engine with concurrency…is not so simple

Our objective for this pipeline is to turn audio segments into tags or transcripts. The input data is an audio file composed of several short sound segments (S1 to S6 in Figure 1). The output data corresponds to tags or transcripts ordered by timestamps.

Figure 1: Example audio file with segment boundaries

Each segment can be processed independently and in an out-of-order fashion. This offers the opportunity to process segments concurrently and in parallel to optimize the overall inference throughput as well as maximize the usage of the resources.

Parallelization on an instance can be achieved through multi-threading (pThreads, std::threads, OpenMP) or multi-processing. The advantage of multi-threading over multi-processing is the ability to use shared memory. It enables developers to minimize data duplication by sharing data across threads: the AR models in our case (Figure 2). Furthermore, the reduction in memory allows us to run more pipelines in parallel by increasing the number of engine threads, in order to utilize all vCPUs on our Amazon EC2 instance (a c5.4xlarge in our case, which offers 16 vCPUs). In theory, we expect to see higher hardware utilization and higher throughput for our AR engine as a result.

Figure 2: Multi-threaded AR Engine

But we found these assumptions to be wrong. Indeed, we found that increasing the number of threads of the application led to an increase of the end-to-end latency for each audio segment and to a decrease of the engine throughput. For example, increasing the concurrency from 1 to 5 threads led to a 4x increase in latency, which had a proportional effect on decreasing the throughput. In fact, metrics showed that within the pipeline, the latency of the inference stage alone was 3x higher than its single-thread baseline.

Using a profiler, we found that the CPU Spin Time increased, potentially due to CPU oversubscription which impacts system and application performance. Given our control over the application’s multi-thread implementation, we chose to dive deeper into the stack and identify potential conflicts with LibTorch’s default settings.

Diving deeper on LibTorch’s multi-threading and its impact on concurrency

LibTorch’s parallel implementations on CPU for inference are based on global thread pools. Examples of implementations are inter-op and intra-op parallelism, which can be chosen depending on the model’s properties. In both cases, it is possible to set the number of threads in each thread pool to optimize the latency and throughput.

To test whether LibTorch’s default parallel implementation settings had a counter effect on our inference latency, we ran an experiment on a 16 vCPU machine with a 35-minute audio file, keeping the LibTorch inter-op threads constant at 1 (because our models didn’t utilize the inter-op thread pool). We collected the data shown in Figures 3 and 4.

Figure 3: CPU Utilization for different number of engine threads

Figure 4: Processing times for different number of engine threads

Execution time in Figure 4 is the end-to-end processing time for processing all the segments of the given audio file. We have 4 different LibTorch intra-op thread configurations (1, 4, 8, and 16), and we vary the number of engine threads from 1 to 16 for each configuration. As we see in Figure 3, CPU utilization increases with the number of engine threads for all LibTorch intra-op thread configurations. But as we see in Figure 4, an increase in CPU utilization doesn’t translate into lower execution time. We found out that in all but one case, as the number of engine threads shot up, so did execution time. The one exception was the case where the intra-op thread pool size was 1.

Resolving the global thread pool issue

Using too many threads with a global thread pool led to performance degradation and caused an over-subscription problem. Without disabling LibTorch global thread pools, it was difficult to match the performance of the multi-process engine.

Disabling the LibTorch global thread pool is as simple as setting the intra-op/inter-op parallelism threads to 1, as shown here:

at::set_num_threads(1);          // Disables the intra-op thread pool.
at::set_num_interop_threads(1);  // Disables the inter-op thread pool.

As shown in Figure 4, the lowest processing time was measured when the LibTorch global thread pool was disabled.

This solution improved AR engine throughput in several cases. However, when evaluating long datasets (audio files longer than 2 hours in load test), we found that the memory footprint of the engine gradually started to increase.

Optimizing memory usage

We ran a load test on the system with two-hour-long audio files and found that the observed memory increase was the result of memory fragmentation within multi-threaded LibTorch inference. We resolved this using jemalloc, a general-purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support. With jemalloc, our peak memory usage decreased by an average of 34% and average memory usage decreased by 53%.

Figure 5: Memory usage over time using the same input file with and without jemalloc

Summary

To optimize the performance of multi-threaded LibTorch-based inference engines, we recommend verifying that there is no oversubscription problem in LibTorch. In our case, all threads in the multi-threaded engine were sharing the LibTorch global thread pool, which caused an oversubscription problem. This was remedied by disabling the global thread pools: we set the inter-op and intra-op thread pool sizes to 1. To optimize the memory of a multi-threaded engine, we recommend using jemalloc as the memory allocator rather than the default malloc.

Read More

The Path to Achieve Ultra-Low Inference Latency With LLaMA 65B

Background & State of the Art

In the natural language processing (NLP) space, language models are designed to generate a token (e.g. word) using a sequence of past input tokens. Large Language Models (LLMs) are the latest deep learning innovation in this space built to generate text in a human-like fashion. These models generally use transformers to improve their attention over a large sequence of input tokens.

LLaMA, open sourced by Meta AI, is a powerful foundation LLM trained on over 1T tokens. LLaMA is competitive with many best-in-class models such as GPT-3, Chinchilla, and PaLM. LLaMA (13B) outperforms GPT-3 (175B), highlighting its ability to extract more compute from each model parameter.

In this blog post, we use LLaMA as an example model to demonstrate the capabilities of PyTorch/XLA for LLM inference. We discuss how the computation techniques and optimizations discussed here improve inference latency by 6.4x on 65B parameter LLaMA models powered by Google Cloud TPU v4 (v4-16).

Model Overview

We demonstrate the performance capabilities of PyTorch/XLA on LLaMA, the latest LLM from Meta. We showcase performance optimizations on a series of common LLaMA configurations. Note that a 175B parameter configuration is not publicly available; for the 175B parameter model mentioned below, we apply the OPT 175B model configuration to the LLaMA code base. Unless stated otherwise, in all configurations, we use max_seq_len=256 and dtype=bfloat16 for weights and activations.

Table 1: Model Configurations Explored in this article

LLaMA Model Hyper Parameters

# Parameters   Dimensions   N Heads   N Layers   Max Seq Len
7B             4,096        32        32         256
33B            6,656        52        60         256
65B            8,192        64        80         256
175B           12,288       96        96         256

Performance Challenges of LLMs

LLMs have a few properties that make them challenging for compiler optimizations. (a) LLMs use autoregressive decoding to generate the next token based on the previous ones; this means prompt tensors and caches have dynamic shapes. (b) LLMs must work with variable input prompt lengths without triggering recompilation due to input tensor shape changes; input tensors must be properly bucketized and padded to avoid recompilation. (c) LLMs often require more memory than a single TPU (or GPU) device can support. A model-sharding scheme is required to fit the model across a distributed compute architecture. For instance, a LLaMA model with 65B parameters can fit on a v4-16 Cloud TPU, which is comparable to 8 A100 GPUs. (d) Running LLMs in production can be expensive; one way to improve performance per total cost of ownership (Perf/TCO) is via quantization; quantization can potentially reduce hardware requirements.

Inference Tech Stack in PyTorch/XLA

Our goal is to offer the AI community a high performance inference stack. PyTorch/XLA integrates with TorchDynamo, PjRt, OpenXLA, and various model parallelism schemes. TorchDynamo eliminates tracing overhead at runtime, PjRt enables efficient host-device communication; PyTorch/XLA traceable collectives enable model and data parallelism on LLaMA via TorchDynamo. To try our results, please use our custom torch, torch-xla wheels to reproduce our LLaMA inference solution. PyTorch/XLA 2.1 will support the features discussed in this post by default.

Parallel Computing

FairScale Sharding

LLaMA uses FairScale model sharding API (fairscale.nn.model_parallel.layers). We built an equivalent representation of this API using PyTorch/XLA communication collective (CC) ops such as all-reduce to communicate program state (e.g. activations) between accelerators. TorchDynamo does not fully support capturing CC ops currently (a.k.a. traceable collectives). Without this support, a TorchDynamo FX graph would be cut at every device communication, meaning at every model layer. Graph cuts lead to performance loss as the underlying XLA compiler loses full graph optimization opportunities. To resolve this, we offer PyTorch/XLA traceable collectives by integrating the dispatcher collectives into our existing CC APIs. The difference is we don’t need to insert c10d.wait() ops after collectives, given the lazy execution nature of PyTorch/XLA. With support for traceable collectives, PyTorch/XLA allows singular FX graph generation in TorchDynamo.

Autoregressive Decoding on PyTorch/XLA

LLMs need autoregressive decoding to feed the previously generated token back as a prompt to predict the next one. Autoregressive decoding leads to unbounded dynamic shape problems, which in turn cause recompilation for every prompt. We optimized the LLaMA autoregressive decoder to operate with fixed shapes, updating the KV-cache, output sequences, and attention masks in place during every token generation. With a combination of padding, masking, and index ops, we avoided excessive graph recompilation, thereby achieving efficient autoregressive decoding.

KV-Cache Optimization

LLaMA implements autoregressive decoding with KV-cache. For every generated token, the KV-cache stores the attention key/value activations of each Transformer layer. Thus, upon decoding a new token, the key/values of prior tokens no longer need recomputation.

In LLaMA, the KV-cache tensor slices are updated in-place; this leads to recompilation events every time a token is generated. To address this issue, we use index tensors and tensor.index_copy() ops to replace the in-place slice updates. Attention masks and output sequences also benefit from the same optimization.
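
As a schematic sketch of this optimization (not the actual LLaMA/PyTorch-XLA code; shapes and the helper name are illustrative), the out-of-place index_copy can replace a slice assignment like k_cache[pos] = new_k:

import torch

max_seq_len, n_heads, head_dim = 256, 8, 64
k_cache = torch.zeros(max_seq_len, n_heads, head_dim)

def update_k_cache(k_cache, new_k, pos):
    # pos is a 1-element index tensor for the current decoding step.
    # index_copy returns a new tensor with row `pos` replaced, avoiding the
    # in-place slice update that triggers recompilation at every step.
    return k_cache.index_copy(0, pos, new_k.unsqueeze(0))

new_k = torch.randn(n_heads, head_dim)
k_cache = update_k_cache(k_cache, new_k, pos=torch.tensor([5]))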

Input Prompt Optimization

Variable length input prompts are common in LLM applications. This property causes input tensor shape dynamism and in turn recompilation events. When processing a prompt to fill the KV-cache, we either (a) process the input prompt token-by-token, or (b) process the whole prompt in one iteration. The pros and cons of each method are:

  1. Pre-compile 1 graph and process a prompt token-by-token
    • Practical: 1 graph is compiled during warm-up
    • Slow: O(L) to process an input prompt length L – a disadvantage for long prompts
  2. Pre-compile all graphs with input lengths ranging from 1 to max_seq_len (e.g. 2,048)
    • Impractical: pre-compile and cache max_seq_len graphs during warm-up time
    • Fast: 1 graph execution to process the full prompt

We introduce prompt length bucketization, an optimization to strike a balance between the two alternatives. We define a set of ascending bucket sizes, (b_0, b_1, b_2, …, b_(B-1)), and then pre-compile program graphs with input sizes according to these bucket values, (G_0, G_1, G_2, …, G_(B-1)); B is the number of buckets. For a given input prompt, we round up the prompt length to the closest bucket value b_n, pad the sequence, and use G_n to process the prompt in one iteration. The computation on the padding tokens is discarded. For prompts larger than the largest bucket size, we process them section-by-section.

The optimal bucket sizes should be determined by prompt length distribution in a target application. Here, we adopt bucket lengths: 128, 256, 384, 512. Any input prompt with up to 2,047 tokens requires up to 4 graph executions. For example, a 1,500 input prompt with generation length of 256 requires 260 graph executions – 4 to process the input, and 256 to generate the output.
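
A minimal sketch of the rounding-and-padding step follows; the helper name and pad_id are made up, and prompts longer than the largest bucket would be split into sections before calling it.

BUCKETS = [128, 256, 384, 512]   # ascending bucket sizes adopted in this post

def bucketize_prompt(prompt_tokens, pad_id=0):
    """Round the prompt length up to the nearest bucket and right-pad it.

    Each bucket length has a pre-compiled graph, so padding to a bucket
    avoids recompilation; computation on the pad tokens is discarded.
    """
    length = len(prompt_tokens)
    for bucket in BUCKETS:
        if length <= bucket:
            return prompt_tokens + [pad_id] * (bucket - length), bucket
    raise ValueError("split prompts longer than BUCKETS[-1] into sections first")

padded, bucket = bucketize_prompt(list(range(200)))   # padded to bucket length 256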

Quantization

Quantization reduces the number of bits necessary to represent a value; it reduces the bandwidth to communicate data across multiple accelerator nodes (via collectives) and lowers the hardware requirements to serve a specific model size.

Normally, with BF16 weights, a 175B parameter model would consume about 351GB of memory, and therefore require a v4-32 instance to accommodate the model. By quantizing the weights to INT8, we reduced the model size by roughly 50%, allowing it to run on a smaller v4-16 instance. Because LLaMA shards model activations, quantization offers negligible communication gain.

In our experiments, we quantized the linear layers. Since LLaMA model checkpoints are not publicly available, and our goal is to evaluate performance, the quantized model is initialized with random weights. Recent literature such as AWQ and Integer or Floating Point? offers insights into the performance properties of LLaMA under various low-bit quantization schemes.

Effect of Batch Size on Quantization Performance

TPU v4 is programmed to run matmul on the Matrix Multiply Unit (MXU) when the model batch size (BS) > 1. For BS = 1, matmul runs on the Vector Processor Unit (VPU). Since MXU is more efficient than VPU, INT8 quantization gains performance at BS>1. See Performance Analysis section for details.

Op Support

Occasionally, new models introduce new mathematical operations that require PyTorch/XLA to extend its supported op set for compilation. For LLaMA, we added support for multinomial.

Methodology

LLaMA works on PyTorch/XLA out of the box on LazyTensorCore. We use this configuration as a baseline for our follow up analysis. All experiments assume 256-long input prompts. In the absence of a publicly available model checkpoint, we used random tensor initialization for this inference stack optimization effort. A model checkpoint is not expected to change latency results discussed here.

Model Sizing

Assuming N is the number of parameters, dimensions is the hidden size, n_layers is the number of layers, n_heads is the number of attention heads, the equation below can be used to approximate the model size. See the Model Overview section for details.

N = (dimensions)^2 * n_layers * 12

n_heads doesn’t affect N, but the following equation holds for the open sourced model configs.

dim = 128 * n_heads
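
As a quick sanity check on these two formulas (plain Python, values from Table 1), the 65B configuration works out as follows:

def approx_param_count(dim, n_layers):
    # N ≈ 12 * dim^2 * n_layers, per the formula above
    return 12 * dim * dim * n_layers

print(approx_param_count(8192, 80) / 1e9)   # ~64.4, i.e. the "65B" configuration
print(128 * 64)                             # dim = 128 * n_heads -> 8192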

Cache Sizing

Both model parameters and the cache layers in the Attention block contribute to memory consumption. Since the default LLaMA model uses BF16 weights, the memory consumption calculation in this section is based on BF16 weights.

The size of the cache layer is calculated by cache_size = max_batch_size * max_seq_len * dimensions. max_batch_size = 1 and max_seq_len = 256 are used as an example configuration in the following calculations. There are 2 cache layers in each Attention block. So, the total LLaMA cache size (in Bytes) is total_cache_size = n_layers * 2 * cache_size * (2 bytes).

TPU v4 Hardware Sizing

Each TPU v4 chip has 32GB of available High-Bandwidth Memory (HBM). Table 2 has the details on memory consumption and the number of required TPU chips to hold a LLaMA model.

Table 2: LLaMA TPU v4 HBM requirements (i.e. TPU v4 chip requirements)

# Parameters | Parameter (MB) | Cache (MB) | Total (GB) | Min # of TPU v4 Chips
7B           | 14,000         | 134        | 14.128     | 1
33B          | 66,000         | 408        | 66.41      | 3
65B          | 130,000        | 671        | 130.67     | 5
175B         | 350,000        | 1,208      | 351.21     | 11
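The sketch below strings the sizing formulas together for the 7B case (BF16 weights, max_batch_size = 1, max_seq_len = 256, and an assumed dimensions = 4096 with 32 layers) and approximately reproduces the first row of Table 2.

import math

HBM_PER_CHIP_GB = 32      # TPU v4 HBM per chip
BYTES_PER_PARAM = 2       # BF16 weights

def hbm_requirements(n_params, dimensions, n_layers,
                     max_batch_size=1, max_seq_len=256):
    param_mb = n_params * BYTES_PER_PARAM / 1e6
    # 2 cache layers per Attention block, 2 bytes (BF16) per element
    cache_mb = n_layers * 2 * (max_batch_size * max_seq_len * dimensions) * 2 / 1e6
    total_gb = (param_mb + cache_mb) / 1e3
    min_chips = math.ceil(total_gb / HBM_PER_CHIP_GB)
    return param_mb, cache_mb, total_gb, min_chips

# 7B model: ~14,000 MB of parameters, ~134 MB of cache, ~14.1 GB total, 1 chip
print(hbm_requirements(7e9, dimensions=4096, n_layers=32))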

Metrics

Below are useful metrics to measure inference speed. Assume T is the total time, B is the batch size, and L is the decoded sequence length.

Latency Definition

Latency is the time it takes to get the decoded result at target length L, regardless of the batch size B. Latency represents how long the user should wait to get the response from the generation model.

Latency = T (s)

Per-token latency

One step of autoregressive decoding generates a token for each sample in the batch. Per-token latency is the average time for that one step.

Per-token latency = T / L (s/token)

Throughput

Throughput measures how many tokens are generated per unit time. While it is not a useful metric for evaluating online serving, it is useful for measuring the speed of batch processing.

Throughput = B * L / T (tokens/s)

To minimize confusion and misinterpretation, it’s better to avoid metrics like T / (B * L), which mixes latency and throughput.
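As a concrete example of these definitions (with made-up numbers):

T = 3.2    # total decode time in seconds (illustrative)
B = 4      # batch size
L = 256    # decoded sequence length (tokens per sample)

latency = T                   # 3.2 s until all B sequences of length L are decoded
per_token_latency = T / L     # 0.0125 s/token = 12.5 ms/token
throughput = B * L / T        # 320 tokens/s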

Results

Figure 1 shows latency-per-token results for the LLaMA 7B to 175B models. In each case, the model is run on a range of TPU v4 configurations. For instance, LLaMA 7B delivers 4.7 ms/token and 3.8 ms/token on v4-8 and v4-16, respectively. For further comparisons, see the Hugging Face LLM performance leaderboard.

In the absence of the features discussed in this blog post, LLaMA 65B running on v4-32 delivers 120 ms/token instead of the 14.5 ms/token obtained here, an 8.3x speedup. As discussed earlier, developers are encouraged to try our custom torch and torch-xla wheels, which allow reproducing the LLaMA inference results shared here.

Figure 1: LLaMA Inference Performance on TPU v4 hardware

PyTorch/XLA:GPU performance is better than PyTorch:GPU eager mode and similar to PyTorch Inductor. PyTorch/XLA:TPU performance is superior to PyTorch/XLA:GPU. In the near future, XLA:GPU will deliver optimizations that bring parity with XLA:TPU. The single-A100 configuration only fits LLaMA 7B, and the 8-A100 configuration doesn't fit LLaMA 175B.

Figure 2: LLaMA Inference Performance on GPU A100 hardware

As the batch size increases, we observe a sublinear increase in per-token latency, highlighting the tradeoff between hardware utilization and latency.

Figure 3: LLaMA Inference Performance across different batch sizes

Our studies suggest the impact of the maximum input sequence length (max_seq_len) on inference latency is relatively minimal. We attribute this to the sequential and iterative nature of token generation. The small differences in performance may be due to changes in KV cache access latency as the storage size increases.

Figure 4: LLaMA Inference Performance across different prompt lengths

LLMs are often memory-bound applications; thus, by quantizing model parameters we can load and execute a larger tensor on the MXUs per unit time (i.e. HBM ⇒ CMEM and CMEM ⇒ MXU data movement). Figure 5 shows that INT8 weight-only quantization offers a 1.6x-1.9x speedup, allowing a larger model to run on given hardware.

When BS=1, INT8 tensors are dispatched to the VPU, which is smaller than the MXU (see the TPU v4 paper); otherwise, the MXU is used. As a result, when BS=1, quantization's memory-bandwidth gains are offset by the lack of MXU utilization. When BS>1, however, the memory gains deliver superior latency on the quantized model. For example, for the 175B-parameter LLaMA, v4-16 with quantization and v4-32 without quantization deliver similar performance. Note that we do not provide FP8 comparisons because PyTorch does not yet offer this data type.

Figure 5: LLaMA Inference Performance vs. weight-only quantization. The missing blue bars suggest the model size doesn’t fit in the specified TPU hardware.

Figure 6 demonstrates the steady performance advantage of PyTorch/XLA as the input prompt length grows from 10 tokens to 1,500 tokens. This strong scaling behavior suggests minimal PyTorch/XLA recompilation events, enabling a wide range of real-world applications. In this experiment, the maximum sequence length is 2,048 and the maximum generation length is 256.

Figure 6: LLaMA Inference Performance vs. Input Prompt Length

Final Thoughts

We are ecstatic about what’s ahead for PyTorch/XLA and invite the community to join us. PyTorch/XLA is developed fully in open source. So, please file issues, submit pull requests, and send RFCs to GitHub so that we can openly collaborate. You can also try out PyTorch/XLA for yourself on various XLA devices including TPUs and GPUs.

Cheers,
The PyTorch/XLA Team at Google
#PoweredByPyTorch

Read More

Optimized PyTorch 2.0 Inference with AWS Graviton processors

Optimized PyTorch 2.0 Inference with AWS Graviton processors

New generations of CPUs offer significant performance improvement in machine learning (ML) inference due to specialized built-in instructions. Combined with their flexibility, high speed of development, and low operating cost, these general-purpose processors offer an alternative ML inference solution to other existing hardware solutions.

AWS, Arm, Meta, and others helped optimize the performance of PyTorch 2.0 inference for Arm-based processors. As a result, we are delighted to announce that Arm-based AWS Graviton instance inference performance for PyTorch 2.0 is up to 3.5 times the speed for ResNet-50 compared to the previous PyTorch release, and up to 1.4 times the speed for BERT, making Graviton-based instances the fastest compute optimized instances on AWS for these models (see the following graph).

Image 1: Relative speed improvement achieved by upgrading from PyTorch version 1.13 to 2.0 (higher is better). The performance is measured on c7g.4xlarge instances.

As shown in the next graph, we measured up to 50% cost savings for PyTorch inference with Graviton3-based c7g instances across Torch Hub ResNet-50 and multiple Hugging Face models, compared to comparable x86-based compute-optimized Amazon EC2 instances. For that graph, we first measured the cost per million inferences for the five instance types. Then, we normalized the cost-per-million-inferences results to the c5.4xlarge instance, which is the baseline measure of “1” on the Y-axis of the chart.

Image 2: Relative cost of PyTorch inference running on different AWS instances (lower is better).
Source: AWS ML Blog on Graviton PyTorch2.0 inference performance.

Similar to the preceding inference cost comparison graph, the following graph shows the model p90 latency for the same five instance types. We normalized the latency results to the c5.4xlarge instance, which is the baseline measure of “1” on the Y-axis of the chart. The c7g.4xlarge (AWS Graviton3) model inference latency is up to 50% better than the latencies measured on c5.4xlarge, c6i.4xlarge, and c6a.4xlarge.

Image 3: Relative latency (p90) of PyTorch inference running on different AWS instances (lower is better).
Source: AWS ML Blog on Graviton PyTorch2.0 inference performance.

Optimization details

PyTorch supports Compute Library for the Arm® Architecture (ACL) GEMM kernels via the oneDNN backend (previously called “MKL-DNN”) for AArch64 platforms. The optimizations are primarily for PyTorch ATen CPU BLAS, ACL kernels for fp32 and bfloat16, and oneDNN primitive caching. There are no frontend API changes, so no changes are required at the application level to get these optimizations working on Graviton3-based instances.

PyTorch level optimizations

We extended the ATen CPU BLAS interface to accelerate more operators and tensor configurations via the oneDNN backend for the aarch64 platform. The following diagram highlights (in orange) the optimized components that improved PyTorch inference performance on the aarch64 platform.

Image 4: PyTorch software stack highlighting (in orange) the components optimized for inference performance improvement on AArch64 platform

ACL kernels and BFloat16 FPmath mode

The ACL library provides Neon and SVE optimized GEMM kernels for both fp32 and bfloat16 formats. These kernels improve SIMD hardware utilization and reduce end-to-end inference latencies. The bfloat16 support in Graviton3 allows efficient deployment of models trained using bfloat16, fp32, and Automatic Mixed Precision (AMP). Standard fp32 models use bfloat16 kernels via the oneDNN FPmath mode without model quantization, providing up to two times faster performance compared to fp32 model inference without bfloat16 FPmath support. For more details on ACL GEMM kernel support, refer to the Arm Compute Library GitHub repo.

Primitive Caching

The following call sequence diagram shows how ACL operators are integrated into the oneDNN backend. As shown in the diagram, ACL objects are handled as oneDNN resources instead of primitive objects because they are stateful and mutable. Since the ACL objects are handled as resource objects, they are not cacheable with the default primitive caching feature supported in oneDNN. We therefore implemented primitive caching at the ideep operator level for the “convolution”, “matmul” and “inner product” operators to avoid redundant GEMM kernel initialization and tensor allocation overhead.

Image 5: Call sequence diagram showing how the Compute Library for the Arm® Architecture (ACL) GEMM kernels are integrated into oneDNN backend

How to take advantage of the optimizations

Install the PyTorch 2.0 wheel from the official repo and set environment variables to enable the additional optimizations.

# Install Python
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Upgrade pip3 to the latest version
python3 -m pip install --upgrade pip

# Install PyTorch and extensions
python3 -m pip install torch
python3 -m pip install torchvision torchaudio torchtext

# Turn on Graviton3 optimization
export DNNL_DEFAULT_FPMATH_MODE=BF16
export LRU_CACHE_CAPACITY=1024

Running an inference

You can use PyTorch torchbench to measure the CPU inference performance improvements, or to compare different instance types.

# Pre-requisite:
# pip install PyTorch2.0 wheels and set the above mentioned environment variables

# Clone PyTorch benchmark repo
git clone https://github.com/pytorch/benchmark.git

# Setup ResNet-50 benchmark
cd benchmark
python3 install.py resnet50

# Install the dependent wheels
python3 -m pip install numba

# Run ResNet-50 inference in jit mode. On successful completion of the inference runs,
# the script prints the inference latency and accuracy results
python3 run.py resnet50 -d cpu -m jit -t eval --use_cosine_similarity

Performance Analysis

Now, we will analyze the inference performance of ResNet-50 on a Graviton3-based c7g instance using the PyTorch profiler. We run the code below with both PyTorch 1.13 and PyTorch 2.0, running the inference for a few iterations as a warmup before measuring performance.

# Set the following environment variables in the shell before launching Python
# to turn on the Graviton3 optimizations:
#   export DNNL_DEFAULT_FPMATH_MODE=BF16
#   export LRU_CACHE_CAPACITY=1024

import torch
from torchvision import models

sample_input = [torch.rand(1, 3, 224, 224)]
eager_model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model = torch.jit.script(eager_model, example_inputs=[sample_input, ])

model = model.eval()
model = torch.jit.optimize_for_inference(model)

with torch.no_grad():
    # warmup runs
    for i in range(10):
        model(*sample_input)
    # profile after warmup
    prof = torch.profiler.profile(
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs'),
        record_shapes=True, with_stack=True)
    prof.start()
    model(*sample_input)
    prof.stop()

We use TensorBoard to view the profiler results and analyze model performance.

Install PyTorch Profiler Tensorboard plugin as follows

pip install torch_tb_profiler

Launch the tensorboard using

tensorboard --logdir=./logs

Open the following URL in the browser to view the profiler output. The profiler supports ‘Overview’, ‘Operator’, ‘Trace’ and ‘Module’ views to get insight into the inference execution.

http://localhost:6006/#pytorch_profiler

The following diagram is the profiler ‘Trace’ view which shows the call stack along with the execution time of each function. In the profiler, we selected the forward() function to get the overall inference time. As shown in the diagram, the inference time for the ResNet-50 model on Graviton3-based c7g instance is around 3 times faster in PyTorch 2.0 compared to PyTorch 1.13.

Image 6: Profiler Trace view: Forward pass wall duration on PyTorch 1.13 and PyTorch 2.0

The next diagram is the ‘Operator’ view which shows the list of PyTorch operators and their execution time. Similar to the preceding Trace view, the Operator view shows that the operator host duration for the ResNet-50 model on Graviton3-based c7g instance is around 3 times faster in PyTorch 2.0 compared to PyTorch 1.13.

Image 7: Profiler Operator view: Forward operator Host duration on PyTorch 1.13 and PyTorch 2.0

Benchmarking Hugging Face models

You can use the Amazon SageMaker Inference Recommender utility to automate performance benchmarking across different instances. With Inference Recommender, you can find the real-time inference endpoint that delivers the best performance at the lowest cost for a given ML model. We collected the preceding data using the Inference Recommender notebooks by deploying the models on production endpoints. For more details on Inference Recommender, refer to the amazon-sagemaker-examples GitHub repo. We benchmarked the following models for this post: ResNet50 image classification, DistilBERT sentiment analysis, RoBERTa fill mask, and RoBERTa sentiment analysis.

Conclusion

For PyTorch 2.0, the Graviton3-based C7g instance is the most cost-effective compute optimized Amazon EC2 instance for inference. These instances are available on SageMaker and Amazon EC2. The AWS Graviton Technical Guide provides the list of optimized libraries and best practices that will help you achieve cost benefit with Graviton instances across different workloads.

If you find use cases where similar performance gains are not observed on Graviton, please open an issue on the aws-graviton-getting-started GitHub repo to let us know. We will continue to add more performance improvements to make AWS Graviton-based instances the most cost-effective and efficient general-purpose processor for inference using PyTorch.

Acknowledgments

We would like to thank Ali Saidi (Sr. Principal Engineer) and Csaba Csoma (Sr. Manager, Software Development) from AWS, Ashok Bhat (Sr. Product Manager), Nathan Sircombe (Sr. Engineering Manager) and Milos Puzovic (Principal Software Engineer) from Arm for their support during the Graviton PyTorch inference optimization work. We would also like to thank Geeta Chauhan (Engineering Leader, Applied AI) from Meta for her guidance on this blog.

About the authors

Sunita Nadampalli is a ML Engineer and Software Development Manager at AWS.

Ankith Gunapal is an AI Partner Engineer at Meta (PyTorch).

Read More