Accelerating Whisper on Arm with PyTorch and Hugging Face Transformers

Automatic speech recognition (ASR) has revolutionized how we interact with technology, clearing the way for applications like real-time audio transcription, voice assistants, and accessibility tools. OpenAI Whisper is a powerful model for ASR, capable of multilingual speech recognition and translation.

A new Arm Learning Path is now available that explains how to accelerate Whisper on Arm-based cloud instances using PyTorch and Hugging Face Transformers.

Why Run Whisper on Arm?

Arm processors are popular in cloud infrastructure for their efficiency, performance, and cost-effectiveness. With major cloud providers such as AWS, Azure, and Google Cloud offering Arm-based instances, running machine learning workloads on this architecture is becoming increasingly attractive.

What You’ll Learn

The Arm Learning Path provides a structured approach to setting up and accelerating Whisper on Arm-based cloud instances. Here’s what you’ll cover:

1. Set Up Your Environment

Before running Whisper, you must set up your development environment. The learning path walks you through setting up an Arm-based cloud instance and installing all dependencies, such as PyTorch, Transformers, and ffmpeg.

2. Run Whisper with PyTorch and Hugging Face Transformers

Once the environment is ready, you will use the Hugging Face Transformers library with PyTorch to load and run Whisper for speech-to-text conversion. The tutorial provides a step-by-step approach to processing audio files and generating transcripts.
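
As a flavor of what this step looks like, here is a minimal sketch using the Transformers pipeline API. The model size and audio file name are illustrative placeholders; the Learning Path walks through the exact steps and optimizations.

from transformers import pipeline

# Load Whisper through the automatic-speech-recognition pipeline.
# "openai/whisper-small" is an illustrative choice; larger variants trade speed for accuracy.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,  # split long audio into 30-second chunks
)

# "sample.wav" stands in for your own audio file (decoded via ffmpeg).
result = asr("sample.wav")
print(result["text"])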

3. Measure and Evaluate Performance

To ensure efficient execution, you will learn how to measure transcription speed and compare different optimization techniques. The guide provides insights into interpreting performance metrics and making informed decisions about your deployment.
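
A simple way to measure transcription speed, assuming the asr pipeline from the previous sketch, is to time the call and relate it to the audio duration (the real-time factor); the audio length below is an illustrative value.

import time

start = time.perf_counter()
result = asr("sample.wav")   # pipeline from the previous sketch
elapsed = time.perf_counter() - start

audio_seconds = 30.0         # illustrative duration of sample.wav
print(f"Transcribed {audio_seconds:.0f}s of audio in {elapsed:.2f}s "
      f"(real-time factor: {elapsed / audio_seconds:.2f})")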

Try it Yourself

Upon completing this tutorial, you will know how to:

  • Deploy Whisper on an Arm-based cloud instance.
  • Implement performance optimizations for efficient execution.
  • Evaluate transcription speeds and optimize further based on results.

Try the live demo today and see audio transcription in action on Arm: Whisper on Arm Demo.

Read More


PyTorch Day France 2025: Call For Proposals Open

PyTorch Day France 2025: Call For Proposals Open

We’re pleased to announce PyTorch Day France 2025, a dedicated gathering of the PyTorch community held 7 May 2025 in Paris, France. Proudly hosted by the PyTorch Foundation and co-located with GOSIM AI Paris 2025, this event will bring together developers, researchers, and practitioners driving innovation in open source AI and machine learning.

Whether you’re building cutting-edge models or contributing to the ecosystem, PyTorch Day France is your opportunity to connect, collaborate, and help shape the future of deep learning.

Why Attend?

Set in the vibrant atmosphere of STATION F, the world’s largest startup campus, PyTorch Day France will offer a full day of:

  • Insightful Technical Talks
  • Interactive Discussions
  • Engaging Poster Sessions

The event is designed to foster open exchange across the PyTorch ecosystem, providing a space to learn from peers, share practical insights, and explore the latest research and applications in AI.

Submit a Proposal

We are currently accepting proposals for talks. If you have a project, idea, or research story you’d like to share with the PyTorch community, we want to hear from you.

📩 Email your talk title and abstract to pytorchevents@linuxfoundation.org for consideration.

Registration

To register for PyTorch Day France, please visit the GOSIM AI Paris website, and use the code PYTORCHFRIEND to receive 25% off.

👉 https://paris2025.gosim.org/

We encourage early registration to secure your spot and ensure access to both PyTorch Day France and the broader GOSIM AI Paris programming.

Venue

STATION F
5 Parv. Alan Turing, 75013 Paris, France
A landmark of innovation and entrepreneurship in the heart of Paris.

Travel and Accommodations

Participants are responsible for their own travel and lodging. For those arriving internationally, Paris Charles de Gaulle Airport is approximately 38.4 km from STATION F. Additional information about accommodations and transportation may be available on the GOSIM AI Paris website.

Questions?

For any inquiries, please contact us at pytorchevents@linuxfoundation.org.

We look forward to welcoming the PyTorch community to Paris this May for a day of collaboration, learning, and open source AI innovation.

Read More

OpenReg: A Self-Contained PyTorch Out-of-Tree Backend Implementation Using “PrivateUse1” Mechanism

OpenReg: A Self-Contained PyTorch Out-of-Tree Backend Implementation Using “PrivateUse1” Mechanism

OpenReg is a self-contained demonstration of a PyTorch out-of-tree backend implementation utilizing the core framework’s “PrivateUse1” mechanism. This implementation serves two primary purposes:

  1. Reference Implementation: Provides a practical template for third-party device vendors integrating with PyTorch through PrivateUse1.
  2. CI Testing Infrastructure: Enables device-agnostic testing capabilities for continuous integration pipelines.

Usage

Module Installation

cd {project}/test/cpp_extensions/open_registration_extension
python setup.py install

Use Case

import torch
import pytorch_openreg

if __name__ == "__main__":
    print(torch.ones(1, 2, device='openreg'))

Architectural Overview

Process Management

OpenReg implements virtual device isolation by spawning N independent subprocesses, each maintaining dedicated request/response queues for inter-process communication. The parent process driver encapsulates device operations into command packets that are:

  1. Dispatched to target devices via request queues
  2. Processed asynchronously with results returned through response queues

Parent-Subprocess Communication Flow

Figure: Parent-Subprocess Communication Flow
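
A minimal sketch of this queue-based driver/executor pattern, using Python's multiprocessing module, is shown below. The class and command names are illustrative rather than OpenReg's actual ones.

import multiprocessing as mp

def executor(request_q: mp.Queue, response_q: mp.Queue) -> None:
    # Subprocess loop: receive command packets, execute them, return results.
    while True:
        cmd, args = request_q.get()
        if cmd == "shutdown":
            break
        elif cmd == "malloc":
            response_q.put(("ok", f"allocated {args} bytes"))
        elif cmd == "run_op":
            response_q.put(("ok", f"ran {args}"))

if __name__ == "__main__":
    req, resp = mp.Queue(), mp.Queue()
    proc = mp.Process(target=executor, args=(req, resp))
    proc.start()
    req.put(("malloc", 1024))  # parent driver dispatches a command packet
    print(resp.get())          # and waits for the result on the response queue
    req.put(("shutdown", None))
    proc.join()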

Memory Management

Device memory allocations occur within individual subprocesses to ensure:

  1. Strict memory isolation between devices
  2. Realistic simulation of physical device constraints

Component Breakdown

_aten_impl.py

This module handles dual responsibilities:

  1. Hook Registration:
    • Utilizes _IMPL_REGISTRY to bind C++ backend hooks (e.g., getDevice, getStream) to device driver implementations
  2. Fallback Mechanism:
    • Defines a new torch.Library that registers a fallback, which is invoked whenever a backend kernel for PrivateUse1 is called. The fallback contains the logic to handle all kinds of native functions: it computes the output metadata, allocates the output, and only then calls into the device daemon to perform the computation (a minimal sketch of this registration follows this list).
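
A minimal sketch of this registration pattern is shown below; the fallback body is a placeholder, since OpenReg's real implementation computes output metadata, allocates outputs, and forwards the work to the device daemon.

import torch

def _kernel_fallback(op, *args, **kwargs):
    # Placeholder: compute output metadata, allocate outputs, and forward
    # the computation to the device daemon here.
    raise NotImplementedError(f"no PrivateUse1 kernel for {op}")

# Register the fallback for every operator dispatched to the PrivateUse1 key.
_lib = torch.library.Library("_", "IMPL")
_lib.fallback(_kernel_fallback, dispatch_key="PrivateUse1")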

_device_daemon.py

Core Subsystems

  1. Allocators:
    • HostAllocator: Manages pinned memory in parent process
    • DeviceAllocator: Handles device memory with tensor reconstruction capabilities
  2. Driver (Parent Process):
    • Maintains device context (active device/streams)
    • Implements device control operations:
      • setDevice/getDevice
      • deviceCount
      • exchangeStream
    • Orchestrates command execution through queue-based IPC
  3. Executor (Subprocess):
    • Processes command types:
      • Memory operations (malloc/free)
      • Tensor computations (run_op)
      • Data transfers (send_data/recv_data)
      • Stream/event management (primarily no-op due to CPU sync nature)

_meta_parser.py

Key Features:

  • Implements serialization utilities for cross-process object transfer
  • OpenRegTensorMeta class encapsulates complete tensor metadata for:
    • Output tensor reconstruction
    • Device-side computation preparation
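
As a rough illustration of the kind of metadata such a class needs to carry across the process boundary, here is a small sketch; the class and field names are illustrative, not OpenReg's actual ones.

from dataclasses import dataclass
import torch

@dataclass
class TensorMeta:
    # Just enough metadata to rebuild an output tensor on the other side
    # of the process boundary.
    shape: tuple
    stride: tuple
    dtype: torch.dtype

    @classmethod
    def from_tensor(cls, t: torch.Tensor) -> "TensorMeta":
        return cls(tuple(t.shape), tuple(t.stride()), t.dtype)

    def materialize(self) -> torch.Tensor:
        return torch.empty_strided(self.shape, self.stride, dtype=self.dtype)

meta = TensorMeta.from_tensor(torch.randn(2, 3))
print(meta.materialize().shape)  # torch.Size([2, 3])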

Design Considerations

Execution Characteristics

  • Synchronous Computation: CPU operator execution necessitates synchronous processing
  • Stream/Event Semantics: Implemented as no-ops due to synchronous execution model
  • Memory Isolation: Strict per-device memory boundaries enforced through subprocess allocation

This architecture enables realistic simulation of device integration while maintaining PyTorch compatibility through standard backend interfaces.

Read More

SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine

SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine

sglang logo

We’re thrilled to announce that the SGLang project has been integrated into the PyTorch ecosystem! This integration ensures that SGLang aligns with PyTorch’s standards and practices, providing developers with a reliable and community-supported framework for fast and flexible serving of LLMs.

To view the PyTorch Ecosystem, see the PyTorch Landscape and learn more about how projects can join the PyTorch Ecosystem.

About SGLang

SGLang is a fast serving engine for large language models and vision-language models. It makes interaction with models faster and more controllable by co-designing the backend runtime and the frontend language.

The core features include:

  • Fast Backend Runtime: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, and quantization (FP8/INT4/AWQ/GPTQ).
  • Flexible Frontend Language: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
  • Extensive Model Support: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
  • Active Community: SGLang is open source and backed by an active community with industry adoption.

SGLang is known for its speed: it can often significantly outperform other state-of-the-art frameworks in serving throughput and latency. You can learn more about the underlying techniques from the past release blog posts: v0.2 blog, v0.3 blog, v0.4 blog.

SGLang has been widely adopted by leading industry companies and frontier research labs. For example, xAI uses SGLang to serve its flagship model, Grok 3, which is currently the best model according to the Chatbot Arena leaderboard. Microsoft Azure uses SGLang to serve DeepSeek R1 on AMD GPUs, which is currently the best open source model.

Serving DeepSeek Models

You can easily launch a Docker container to serve a DeepSeek model with the following command:

# Pull the latest image
docker pull lmsysorg/sglang:latest

# Launch a server
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000

Then you can query the server with the OpenAI-compatible API:

import openai
client = openai.Client(base_url=f"http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

# Print the model's reply
print(response.choices[0].message.content)

The server launch command above works for 8xH200. You can find detailed instructions for other hardware (MI300X, H100, A100, H20, L40S) at https://docs.sglang.ai/references/deepseek.html.

SGLang integrates DeepSeek-specific optimizations, such as MLA throughput optimizations, MLA-optimized kernels, data-parallel attention, multi-token prediction, and DeepGemm, making it the top choice for serving DeepSeek models by dozens of companies, including AMD, NVIDIA, and many cloud providers. The team is actively working on integrating more optimizations following the 2025 H1 roadmap below.

Serving Llama Models

Similarly, you can launch the server for a Llama 3.1 text model with:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct

Or a Llama 3.2 multimodal model with:

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct  --chat-template=llama_3_vision

Roadmap

This year, the SGLang team will continue to push the boundaries of system efficiency. You can find the roadmap for 2025 H1 here. The focus areas are:

  • Throughput-oriented large-scale deployment similar to the DeepSeek inference system
  • Long context optimizations
  • Low latency speculative decoding
  • Reinforcement learning training framework integration
  • Kernel optimizations

Community

SGLang has been deployed to large-scale production, generating trillions of tokens every day. It has an active community with over three hundred contributors on GitHub. It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, iFlytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.

logos

Conclusion

We’re excited to welcome SGLang to the PyTorch ecosystem. SGLang accelerates the serving of large language and vision language models. It’s widely adopted by industry, powering the large-scale online serving of frontier models like Grok and DeepSeek.

We invite you to explore the SGLang GitHub repo, join the community on Slack, and reach out to contact@sglang.ai for inquiries or collaboration opportunities. Together, we can make powerful AI models accessible to everyone.

Read More

PyTorch Day China 2025 Call for Proposals Open

PyTorch Day China 2025 Call for Proposals Open

We’re excited to announce the first-ever PyTorch Day China! This new event, hosted by the PyTorch Foundation, will take place on June 7 in Beijing, China, bringing together AI practitioners, researchers, and industry professionals to explore the latest advancements in open source AI and machine learning. Co-located with the BAAI Conference, PyTorch Day China is a chance to connect with the community, share knowledge, and help shape the future of deep learning.

Why Submit a Proposal?

PyTorch Day China offers a platform for AI practitioners and researchers to showcase their work, exchange ideas, and connect with others in the community. If you’re working on innovative applications, tools, or research in the PyTorch ecosystem, we encourage you to share your expertise.

Topics for Submission:

  • AI Applications and Use Cases
  • Core PyTorch Framework
  • DL Compilers and Kernel Authoring
  • Edge AI and On-Device
  • Ethical AI, Governance, and Regulation
  • Generative AI and Large Language Models (LLMs) with PyTorch
  • Open Source Collaboration, Education, and Community Building
  • Optimization for Training and Inference
  • PyTorch on Accelerator Hardware
  • PyTorch Ecosystem and Tools
  • PyTorch in Research and Academia
  • Performance Measurement and Benchmarking
  • Scaling Training and Inference

The submission deadline is April 13. Submit and learn more here: https://www.lfasiallc.com/pytorch-day-china/call-for-proposals-cfp/

Why Attend?

PyTorch Day China will feature technical talks, discussions, and poster sessions that highlight real-world applications and developments in AI and machine learning. Attendees will have the opportunity to learn from experts, contribute to the open source community, and engage with fellow PyTorch users. Registration information will be available in April.

Event Details

  • Date: June 7, 2025
  • Location: Zhongguancun Exhibition Center, Beijing, China
  • Address: 索家坟, Hai Dian Qu, Bei Jing Shi, China, 100080
  • Co-located with: BAAI Conference

Travel Information

The venue, Zhongguancun Exhibition Center, is approximately 39 km from Beijing International Airport. More details on travel and accommodation will be available on the BAAI Conference website and updated here as they become available.

Have Questions?

For inquiries, please contact pytorchevents@linuxfoundation.org.

Submit your proposal by April 13 and join the conversation shaping the future of PyTorch.

Read More

PyTorch at GTC 2025

PyTorch at GTC 2025

GTC is coming back to San Jose on March 17–21, 2025. Join PyTorch Foundation members Arm, AWS, Google Cloud, IBM, Lightning AI, Meta, Microsoft Azure, Snowflake, and thousands of developers as we celebrate PyTorch. Together, learn how AI and accelerated computing are helping humanity solve our most complex challenges.

Join in person with discounted GTC registration for PyTorch Foundation or watch online with free registration.

book cover

Scaling Open Source AI: From Foundation Models to Ecosystem Success

Hear from PyTorch Foundation Executive Director Matt White and panelists from UC Berkeley, Meta, NVIDIA, and Sequoia Capital about how open source is transforming AI development. The session brings together experts from industry, academia, and venture capital to discuss the technical and business aspects of collaborative open source AI development. They’ll examine how open source projects like PyTorch, vLLM, Ray, and NVIDIA’s NeMo are accelerating AI innovation while creating new opportunities for businesses and researchers, and they’ll share real-world experiences from PyTorch’s development, Berkeley’s research initiatives, and successful AI startups. Take away valuable insights into the technical and business aspects of open source AI. – Monday, Mar 17, 10:00 AM – 11:00 AM PDT

PyTorch @ GTC

The Performance of CUDA with the Flexibility of PyTorch
Mark Saroufim, Software Engineer, Meta Platforms

This talk explores how PyTorch users are also becoming CUDA developers. We’ll start with motivating examples from eager, the launch of torch.compile and the more recent trend of kernel zoos. We will share details on how we went about integrating low bit matmuls in torchao and the torch.compile CUTLASS backend. We’ll also discuss details on how you can define, build and package your own custom ops in PyTorch so you get the raw performance of CUDA while maintaining the flexibility of PyTorch.

Make My PyTorch Model Fast, and Show Me How You Did It
Thomas Viehmann, Principal Research Engineer, Lightning AI
Luca Antiga, CTO, Lightning AI

PyTorch is popular in deep learning and LLMs for its richness and ease of expression. To make the most of compute resources, PyTorch models benefit from nontrivial optimizations, but this means losing some of their ease and understandability. Learn how with Thunder, a PyTorch-to-Python compiler focused on usability, understandability, and extensibility, you can optimize and transform (i.e., distribute across many machines) models while:

  • leaving the PyTorch code unchanged
  • targeting a variety of models without needing to adapt to each of them
  • understanding each transformation step because the results are presented as simple Python code
  • accessing powerful extension code for your own optimizations with just one or a few lines of code

We’ll show how the combination of Thunder transforms and the NVIDIA stack (NVFuser, cuDNN, Apex) delivers optimized performance in training and inference on a variety of models.

FlexAttention: The Flexibility of PyTorch With the Performance of FlashAttention
Driss Guessous, Machine Learning Engineer, Meta Platforms

Introducing FlexAttention: a novel PyTorch API that enables custom, user-defined attention mechanisms with performance comparable to state-of-the-art solutions. By leveraging the PyTorch compiler stack, FlexAttention supports dynamic modifications to attention scores within SDPA, achieving both runtime and memory efficiency through kernel fusion with the FlashAttention algorithm. Our benchmarks on A100 GPUs show FlexAttention achieves 90% of FlashAttention2’s performance in forward passes and 85% in backward passes. On H100 GPUs, FlexAttention’s forward performance averages 85% of FlashAttention3 and is ~25% faster than FlashAttention2, while backward performance averages 76% of FlashAttention3 and is ~3% faster than FlashAttention2. Explore how FlexAttention balances near-state-of-the-art performance with unparalleled flexibility, empowering researchers to rapidly iterate on attention mechanisms without sacrificing efficiency.

Keep Your GPUs Going Brrr: Crushing Whitespace in Model Training
Syed Ahmed, Senior Software Engineer, NVIDIA
Alban Desmaison, Research Engineer, Meta
Aidyn Aitzhan, Senior Software Engineer, NVIDIA

Substantial progress has recently been made on the compute-intensive portions of model training, such as high-performing attention variants. While invaluable, this progress exposes previously hidden bottlenecks in model training, such as redundant copies during collectives and data loading time. We’ll present recent improvements in PyTorch achieved through Meta/NVIDIA collaboration to tackle these newly exposed bottlenecks and how practitioners can leverage them.

Accelerated Python: The Community and Ecosystem
Andy Terrel, CUDA Python Product Lead, NVIDIA
Jeremy Tanner, Open Source Programs, NVIDIA
Anshuman Bhat, CUDA Product Management, NVIDIA

Python is everywhere. Simulation, data science, and Gen AI all depend on it. Unfortunately, the dizzying array of tools leaves a newcomer baffled at where to start. We’ll take you on a guided tour of the vibrant community and ecosystem surrounding accelerated Python programming. Explore a variety of tools, libraries, and frameworks that enable efficient computation and performance optimization in Python, including CUDA Python, RAPIDS, Warp, and Legate. We’ll also discuss integration points with PyData, PyTorch, and JAX communities. Learn about collaborative efforts within the community, including open source projects and contributions that drive innovation in accelerated computing. We’ll discuss best practices for leveraging these frameworks to enhance productivity in developing AI-driven applications and conducting large-scale data analyses.

Supercharge large scale AI with Google Cloud AI hypercomputer (Presented by Google Cloud)
Deepak Patil, Product Manager, Google Cloud
Rajesh Anantharaman, Product Management Lead, ML Software, Google Cloud

Unlock the potential of your large-scale AI workloads with Google Cloud AI Hypercomputer – a supercomputing architecture designed for maximum performance and efficiency. In this session, we will deep dive into PyTorch and JAX stacks on Google Cloud on NVIDIA GPUs, and showcase capabilities for high performance foundation model building on Google Cloud.

Peering Into the Future: What AI and Graph Networks Can Mean for the Future of Financial Analysis
Siddharth Samsi, Sr. Solutions Architect, NVIDIA
Sudeep Kesh, Chief Innovation Officer, S&P Global

Artificial Intelligence, agentic systems, and graph neural networks (GNNs) are providing the new frontier to assess, monitor, and estimate opportunities and risks across work portfolios within financial services. Although many of these technologies are still developing, organizations are eager to understand their potential. See how S&P Global and NVIDIA are working together to find practical ways to learn and integrate such capabilities, ranging from forecasting corporate debt issuance to understanding capital markets at a deeper level. We’ll show a graph representation of market data using the PyTorch-Geometric library and a dataset of issuances spanning three decades and across financial and non-financial industries. Technical developments include generation of a bipartite graph and link-prediction GNN forecasting. We’ll address data preprocessing, pipelines, model training, and how these technologies can broaden capabilities in an increasingly complex world.

Unlock Deep Learning Performance on Blackwell With cuDNN
Yang Xu (Enterprise Products), DL Software Engineering Manager, NVIDIA

Since its launch, cuDNN, a library for GPU-accelerating deep learning (DL) primitives, has been powering many AI applications in domains such as conversational AI, recommender systems, and speech recognition, among others. CuDNN remains a core library for DL primitives in popular frameworks such as PyTorch, JAX, Tensorflow, and many more while covering training, fine-tuning, and inference use cases. Even in the rapidly evolving space of Gen AI — be it Llama, Gemma, or mixture-of-experts variants requiring complex DL primitives such as flash attention variants — cuDNN is powering them all. Learn about new/updated APIs of cuDNN pertaining to Blackwell’s microscaling format, and how to program against those APIs. We’ll deep dive into leveraging its graph APIs to build some fusion patterns, such as matmul fusion patterns and fused flash attention from state-of-the-art models. Understand how new CUDA graph support in cuDNN, not to be mistaken with the cuDNN graph API, could be exploited to avoid rebuilding CUDA graphs, offering an alternative to CUDA graph capture with real-world framework usage.

Train and Serve AI Systems Fast With the Lightning AI Open-Source Stack (Presented by Lightning AI)
Luca Antiga, CTO, Lightning AI

See how the Lightning stack can cover the full life cycle, from data preparation to deployment, with practical examples and particular focus on distributed training and high-performance inference. We’ll show examples that focus on new features like support for multi-dimensional parallelism through DTensors, as well as quantization through torchao.

Connect With Experts (Interactive Sessions)

Meet the Experts From Deep Learning Framework Teams
Eddie Yan, Technical Lead of PyTorch, NVIDIA
Masaki Kozuki, Senior Software Engineer in PyTorch, NVIDIA
Patrick Wang (Enterprise Products), Software Engineer in PyTorch, NVIDIA
Mike Ruberry, Distinguished Engineer in Deep Learning Frameworks, NVIDIA
Rishi Puri, Sr. Deep Learning Engineer and Lead for PyTorch Geometric, NVIDIA

Training Labs

Kernel Optimization for AI and Beyond: Unlocking the Power of Nsight Compute
Felix Schmitt, Sr. System Software Engineer, NVIDIA
Peter Labus, Senior System Software Engineer, NVIDIA

Learn how to unlock the full potential of NVIDIA GPUs with the powerful profiling and analysis capabilities of Nsight Compute. AI workloads are rapidly increasing the demand for GPU computing, and ensuring that they efficiently utilize all available GPU resources is essential. Nsight Compute is the most powerful tool for understanding kernel execution behavior and performance. Learn how to configure and launch profiles customized for your needs, including advice on profiling accelerated Python applications, AI frameworks like PyTorch, and optimizing Tensor Core utilization essential to modern AI performance. Learn how to debug your kernel and use the expert system built into Nsight Compute, known as “Guided Analysis,” that automatically detects common issues and directs you to the most relevant performance data all the way down to the source code level.

Make Retrieval Better: Fine-Tuning an Embedding Model for Domain-Specific RAG
Gabriel Moreira, Sr. Research Scientist, NVIDIA
Ronay Ak, Sr. Data Scientist, NVIDIA

LLMs power AI applications like conversational chatbots and content generators, but are constrained by their training data. This might lead to hallucinations in content generation, which requires up-to-date or domain-specific information. Retrieval augmented generation (RAG) addresses this issue by enabling LLMs to access external context without modifying model parameters. Embedding or dense retrieval models are a key component of a RAG pipeline for retrieving relevant context to the LLM. However, an embedding model’s effectiveness to capture the unique characteristics of the custom data hinges on the quality and domain relevance of its training data. Fine-tuning embedding models is gaining interest to provide more accurate and relevant responses tailored to users’ specific domain.

In this lab, you’ll learn to generate a synthetic dataset with question-context pairs from a domain-specific corpus, and process the data for fine-tuning. Then, fine-tune a text embedding model using synthetic data and evaluate it.

Poster Presentations

Single-View X-Ray 3D Reconstruction Using Neural Back Projection and Frustum Resampling
Tran Minh Quan, Developer Technologist, NVIDIA

Enable Novel Applications in the New AI Area in Medicine: Accelerated Feature Computation for Pathology Slides
Nils Bruenggel, Principal Software Engineer, Roche Diagnostics Int. AG

Read More

Introducing the New PyTorch Landscape: Your Guide to the PyTorch Ecosystem

Introducing the New PyTorch Landscape: Your Guide to the PyTorch Ecosystem

We’re excited to reveal our brand new PyTorch Landscape. The PyTorch Landscape helps researchers, developers, and organizations easily locate useful, curated, community-built tools that augment the PyTorch core framework.

landscape banner

What the Landscape Offers

The Landscape visually organizes projects into three categories—Modeling, Training, and Optimizations—making finding relevant frameworks, libraries, and projects easy. Users can quickly locate curated, valuable tools for a variety of use cases that complement the PyTorch framework. Each tool that is part of the Landscape has been reviewed and vetted by PyTorch project experts. The projects in the Landscape are considered to be mature and healthy and provide valuable capabilities that complement the PyTorch framework in their respective domains.

Explore the AI Landscape

The Explore page presents platforms, tools, and libraries, each with a logo, description, and links to GitHub and further details. This categorized, visual approach simplifies discovery and provides quick access to essential technologies.

Guide Page: A Closer Look

For deeper insights, the Guide page expands on each project, highlighting methodologies and trends shaping AI development, from adversarial robustness to self-supervised learning. There are also project statistics provided for each project, including metrics such as number of stars, contributors, commit history, languages used, license, and other valuable metrics that provide an in-depth understanding of the project and how it may be used.

Tracking AI’s Growth: The Stats Page

The Stats page provides insights into AI development trends, tracking repository activity, programming languages, and industry funding data.

  • Repositories: 117 repositories, 20.5k contributors, and 797.2k stars across 815MB of source code.
  • Development Trends: Weekly commit activity over the last year.
  • Licensing Breakdown: Repositories are categorized by license type.
  • Funding & Acquisitions: Insights into investment trends, including funding rounds and acquisitions.

Why Use the PyTorch Landscape?

Finding useful and high quality open source projects that complement the PyTorch core system can be overwhelming. The PyTorch Landscape offers a clear, accessible way to explore the ecosystem of community-built tools, whether you’re researching, building models, or making strategic decisions.

Stay ahead with the PyTorch Landscape — your guide to the PyTorch Ecosystem.

Want to Contribute a Project to the PyTorch Landscape?

Have you built a useful open source tool that you would like to share with the PyTorch community? Then help us grow the Ecosystem by contributing your tool! You can find the instructions to apply here. We welcome all contributions from the community!

Read More

Scaling Recommendation Systems Training to Thousands of GPUs with 2D Sparse Parallelism

Scaling Recommendation Systems Training to Thousands of GPUs with 2D Sparse Parallelism

At Meta, recommendation systems are the cornerstone of delivering relevant and personalized ads to billions of users globally. Through technologies like PyTorch’s TorchRec, we’ve successfully developed solutions that enable model training across hundreds of GPUs. While these systems have served us well, recent research on scaling laws has revealed a compelling opportunity: we can achieve significantly better model performance by training dramatically larger neural networks.

However, this insight presents us with a new challenge. Our current training infrastructure, though highly optimized for hundreds of GPUs, cannot efficiently scale to the thousands of GPUs needed to train these larger models. The leap from hundreds to thousands of GPUs introduces complex technical challenges, particularly around handling sparse operations in recommendation models. These challenges require fundamentally new approaches to distributed training, which we address with a novel parallelization strategy.

To address these issues, we introduced 2D embedding parallel, a novel parallelism strategy that overcomes the sparse scaling challenges inherent in training large recommendation models across thousands of GPUs. This is available today in TorchRec through the DMPCollection API. This approach combines two complementary parallelization techniques: data parallelism for the sparse components of the model, and model parallelism for the embedding tables, leveraging TorchRec’s robust sharding capabilities. By strategically integrating these techniques, we’ve created a solution that scales to thousands of GPUs and now powers Meta’s largest recommendation model training runs.

What are the sparse scaling challenges?

We identified three key challenges that prevented us from naively scaling our model to thousands of GPUs:

  • Imbalance and straggler issues: with more GPUs, it is harder to achieve balanced sharding; some ranks can have a much heavier embedding computation workload, which can slow down the entire training run.
  • Communication across nodes: as training jobs utilize more GPUs, the all-to-all communication bandwidth can drop under certain network topologies, which can increase communication latency significantly.
  • Memory overhead: the memory used by input features is often negligible; however, as we use thousands of GPUs, we can introduce larger input features and the memory requirements can become significant.

With 2D embedding parallelism, we can describe our new parallelism scheme as follows. In this example, we have 2 model replicas (Replica 1: GPU1/GPU3, Replica 2: GPU2/GPU4):

Flow diagram

Figure 1: Layout illustration of 2D Sparse Parallelism

2D sparse parallelism addresses these challenges. Instead of sharding tables across all ranks, we first evenly divide all ranks into several parallel groups:

  1. Within each group, we use model parallel for the embedding tables, such as column-wise/row-wise sharding. At scale, for our largest tables, we have also developed a grid sharding, which shards embedding tables on the row and column dimension.
  2. Across groups, we do data parallel, such that each rank in a group has its corresponding replica rank in the other groups (replica rank means storing the same embedding table shards).
    1. After each group has completed its own backward pass, we all-reduce the embedding table weights across the replicas to keep them synchronized (a minimal sketch of this step follows this list).
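
The weight synchronization step can be pictured with the following sketch. The embedding weights and the replica process group are assumed to come from the surrounding training setup, and DMPCollection performs this synchronization for you; the snippet only illustrates the idea.

import torch
import torch.distributed as dist

def sync_embedding_weights(embedding_weights, replica_pg):
    # Average the embedding table shards across the replica process group so
    # that corresponding shards stay identical between model replicas.
    world = dist.get_world_size(group=replica_pg)
    with torch.no_grad():
        for w in embedding_weights:
            dist.all_reduce(w, op=dist.ReduceOp.SUM, group=replica_pg)
            w.div_(world)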

Our production solution

TorchRec is our library for building the sparse part of recommendation models in native PyTorch. The traditional API is DistributedModelParallel, which applies model parallelism to the embedding tables. We introduce a new API alongside it, known as DMPCollection, which serves as the main entry point for enabling 2D parallelism on TorchRec models. We designed it to be as easy a change as applying FSDP/DDP.

To understand what DMPCollection does, we have to understand what DistributedModelParallel (DMP) does first:

  1. Create embedding tables, known as EmbeddingBagCollection and EmbeddingCollections.
  2. Generate a sharding plan with respect to GPU topology, embedding tables, memory available, input data, and more.
  3. Wrap model with DMP and the associated sharding plan passed in.
  4. DMP initializes and shards the embedding tables in accordance with the sharding plan.
  5. On a train step, DMP takes an input batch, communicates it to the appropriate GPUs containing the embedding table shard of interest, looks up the value, and returns it back to the GPU that requested it. This is all done on the global process group, with some exceptions for special sharding (such as table row wise sharding)

DistributedModelParallel was built for model parallelism, with many parts working under the assumption that sharding happens across the global world size. We needed to change these parts in a way that lets us introduce additional dimensions of parallelism without losing the optimizations and feature set of TorchRec.

DMPCollection changes a few key parts to enable 2D parallelism in an extensible way:

  • Sharding plans are generated once for the smaller sharding group; once passed in, we communicate the plan to the appropriate ranks across the global group and remap the ranks to fit the new sharding group ranks.
  • Two new NCCL process groups are created, known as the sharding and replica process groups. The sharding process group is passed into the sharding and train-step components of TorchRec. The replica process group is used for weight and optimizer state synchronization; the all-reduce call happens over this process group.
    • The sub NCCL process groups allow us to efficiently communicate only between the ranks that are relevant for a particular collective. Each rank will have two associated process groups.

To the user, the change is very simple, while all of the complexity of applying the parallelism strategies to the model is taken care of internally.

How do we create these sharding and replication groups?

These process groups are one of the keys to DMPCollection’s performant implementation. Our earlier diagram showed a simple 2×2 GPU setup; at scale, however, how do we decide which ranks are part of a given sharding group, and what are their replica ranks across the sharding groups?

Consider the following setup with 2 nodes, each with 4 GPUs. The sharding and replication groups under 2D parallelism will be:

Sharding Group | Sharding Ranks
0              | 0, 2, 4, 6
1              | 1, 3, 5, 7

Replication Group | Replication Ranks
0                 | 0, 1
1                 | 2, 3
2                 | 4, 5
3                 | 6, 7

We use the following formulation,

  1. Divide all trainers into G sharding groups, each with L trainers.
    1. The number of groups is G = T / L, where T is the total number of trainers.
  2. For each group i, we assign non-contiguous trainer ranks following:
    1. [i, G+i, 2G+i, …, (L-1)G+i], where i = 0 to G-1.
  3. From the groups, we create the replication groups, which are every G contiguous ranks:
    1. (0 to G-1, G to 2G-1, …); each contiguous set stores the duplicate embedding table shards.

This means each of our G sharding groups has size L, which is the number of ranks we apply model parallelism across. This, in turn, gives us replica groups, each of size G, which are the ranks we apply data parallelism across.
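
The formulation is easy to check with a few lines of Python; the values below reproduce the 2-node, 8-GPU example from the table above.

T, L = 8, 4  # total trainers, trainers per sharding group
G = T // L   # number of sharding groups

sharding_groups = [[k * G + i for k in range(L)] for i in range(G)]
replication_groups = [list(range(j * G, (j + 1) * G)) for j in range(L)]

print(sharding_groups)     # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(replication_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]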

In DMPCollection, we are able to create these process groups efficiently using DeviceMesh. We lay out the entire GPU topology as a 2D matrix, with each row representing a group of sharding ranks and each column representing the corresponding replica ranks:

import torch
from torch.distributed.device_mesh import DeviceMesh

# global_world_size and sharding_group_size are assumed to be provided by the trainer.
num_groups = global_world_size // sharding_group_size

# Build the peer matrix: row i holds sharding group i; each column holds the
# corresponding replica ranks across groups.
peer_matrix = [
    [num_groups * rank + group_rank for rank in range(sharding_group_size)]
    for group_rank in range(num_groups)
]

# Initialize a DeviceMesh with two dimensions (replicate, shard), then slice it
# to obtain the sharding and replica process groups.
mesh = DeviceMesh("cuda", torch.tensor(peer_matrix), mesh_dim_names=("replicate", "shard"))
sharding_pg = mesh["shard"].get_group()
replica_pg = mesh["replicate"].get_group()

With our DeviceMesh approach, should we want to change the topology or provide further flexibility in the future, we can easily extend our creation logic to other forms of topology, and even extend it to further dimensions of parallelism if needed.

Performance of 2D parallel

Our rank partitioning strategy optimizes communication patterns by strategically placing model replica ranks for each shard within the same compute node. This architecture provides significant performance benefits for the weight synchronization operation. After the backward pass, we perform all-reduce operations to synchronize model weights, which is an expensive process given the large parameter counts we have to communicate and sync. By placing replicas on the same node, we leverage the node's high intra-node bandwidth instead of relying on slower inter-node bandwidth.

This design choice also generally improves the latency of the other communication collectives. The improvement stems from two factors:

  1. By sharding the embedding tables over a reduced number of ranks and conducting the model's communications within the smaller group, we achieve a lower all-to-all latency.
  2. With replication in 2D parallelism, the embedding lookup latency on each rank is reduced because we can shrink the local batch size to 1/Nth of the equivalent global batch size, where N is the number of model replicas.

A production model trace exemplifies these two factors. Here we run the 2D parallel job on 1024 GPUs, with a sharding group size of 256 GPUs.

State diagram

Figure 2: Comparing latencies between non 2D parallel and 2D parallel workloads

There are two key levers users have to tune to maximize performance for their workloads:

  1. The size of the model sharding group relative to the global world size. The global world size divided by the sharding group size represents the number of model replicas we will have.
    1. To maximize performance, users can look to scale up their model by up to 8x; this scaling factor maintains the intra-host all-reduce.
      1. For further scaling, the all-reduce would have to happen inter-host. In our experiments we did not see an obvious performance regression, and in fact we noted advantages of an inter-host all-reduce: changing our sharding and replica topology to an inter-host all-reduce can help us introduce fault-tolerance strategies should a particular host go down.
  2. The frequency of all-reduce synchronization. DMPCollection comes with a sync() call, which can be tuned to be called every N training steps, performing a form of local SGD training. At scale, reducing the frequency of synchronization can bring significant performance gains.

Future Work

Readers should note that 2D sparse parallel training differs from non-parallelized training because we synchronize the embedding table weights rather than the gradients. This approach is made possible by TorchRec’s use of FBGEMM, which provides optimized kernels under the hood. One of FBGEMM’s key optimizations is the fusion of the optimizer in the backward pass. Instead of fully materializing the embedding table gradients—which would consume significant memory—they are passed directly to the optimizer update. Attempting to materialize and synchronize these gradients would create substantial overhead, making that approach impractical.

Our exploration revealed that to achieve training results comparable to the baseline, we synchronize optimizer states on a delayed schedule, with the timing dependent on the number of sharding/replica groups (i.e., for Adagrad we update the momentum behind by one sync step). This approach also enables users to implement local SGD or semi-synchronized training strategies, which can achieve convergence and potentially produce better loss curves than the baseline.

We thank you for reading our post! This is an exciting direction we have come across that we hope to develop further to maximize performance of recommendation systems and push the state of the art.

Read More

Powering AI with PyTorch, Fedora, and Open Source Communities

Powering AI with PyTorch, Fedora, and Open Source Communities

man speaking at a conference

At DevConf.IN 2025 in Pune, I had the opportunity to host a PyTorch Meetup on February 28th. The session, titled “Powering AI with PyTorch, Fedora, and Open Source Communities”, was aimed at introducing PyTorch to students and professionals and explaining why PyTorch+Fedora form an ideal AI development platform. The other key aspect I covered was collaboration between open source communities.

Introduction to PyTorch

The Power of Deep Learning made simple

With the explosion of GPTs, there is renewed interest in the field of AI and ML. The myth that developing AI/ML technologies and their applications is rocket science and far-fetched needs correction. Only open source has the power to demystify this myth and further evolve the technology to make it versatile and developer friendly. Since its inception, PyTorch has evolved and has been a driving force in making AI/ML development extremely simple. I covered PyTorch's key components, its features, and why PyTorch is the best choice as a deep learning framework.

man speaking at a conference

The code walkthrough was designed to showcase how easy and simple it is to utilise the power of GPUs, create a simple neural network, and train the model. The walkthrough was very well received, and it was great to hear from the attendees that they never knew how powerful PyTorch is for deep learning. The real-world examples showed how this powerful framework can be used beyond the common GPTs and has the power to influence a broad spectrum of applications.
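
For readers curious about what such a walkthrough looks like, here is a minimal sketch of the kind of example shown: a small network trained on random data, moved to a GPU when one is available. The network and data are purely illustrative.

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# A tiny fully connected network and some random data, purely for illustration.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
x = torch.randn(256, 10, device=device)
y = torch.randn(256, 1, device=device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")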

Fedora+PyTorch the Ideal AI/ML Development Platform

man speaking at a conference

One of the highlights of the event was the discussion of Fedora’s role as an AI platform. Fedora’s reliability, flexibility, and strong community support make it an ideal partner for PyTorch, allowing developers to focus on model-building without worrying about infrastructure. The students were intrigued by the idea of contributing to Fedora’s AI/ML ecosystem while building their own projects. Sumantro Mukherjee spoke about the AI policy in Fedora and how one can start contributing to AI/ML using Fedora as a platform. He highlighted how Fedora is evolving to meet the needs of AI practitioners. The idea that an open-source operating system could provide the perfect foundation for AI research sparked an engaging conversation.

Innovation in Open Source When Communities Come Together

charts

It is important that we learn from history and repeat the good things! When open source communities come together they can create seismic shifts in the industry. To drive this home, I took the audience on a journey through history, revisiting a pivotal moment when Apache and Linux came together, solving common problems and fundamentally reshaping enterprise computing. That moment was not just about technology; it was about collaboration. It was about two powerful communities recognizing that they were stronger together. Today, we stand at the cusp of another such moment – PyTorch and Linux, particularly Fedora, are coming together to shape the future of AI/ML. This is not just an opportunity but a responsibility for contributors, developers, and AI/ML enthusiasts to be part of this movement.

Looking Ahead

man speaking at a conference

One of the best parts of the event was the enthusiasm it generated. The diverse audience included students, AI enthusiasts, and industry professionals. Notably, Vincent Caldeira (CTO, APAC, Red Hat) and Chris Butler (Senior Principal Chief Architect, Red Hat) were present, reinforcing the growing interest in open-source AI/ML. Many students were eager to explore PyTorch and Fedora, contribute to open-source AI projects, and start their own AI experiments. Industry experts saw the potential for scalable, community-driven AI innovation. The session sparked curiosity and conversations that continued long after the event ended.

Read More