Introducing the PyTorch Ecosystem Working Group and Project Spotlights

The PyTorch Ecosystem goes back several years, with some of its earliest projects, like Hugging Face, Fast.ai, and PyTorch Lightning, going on to grow incredible communities of their own. The goal from the beginning was to bring together innovative open source AI projects that extend, integrate with, or build upon PyTorch. Key criteria included that projects were well tested and maintained (including CI), easy for users to onboard, and backed by a growing community. Fast forward several years, and the ecosystem continues to thrive, with a vibrant landscape of dozens of projects spanning privacy, computer vision, reinforcement learning, and more. Enter the PyTorch Ecosystem Working Group.

In early 2025, the PyTorch Foundation created the PyTorch Ecosystem Working Group to showcase projects that could be of interest to the community: mature, healthy projects that stand out in their respective domains. The working group, composed of members from across the ecosystem, was tasked with defining a clear bar, including functional requirements (e.g., CI, licensing), measurable requirements (e.g., commits and contributors), and best practices for how projects structure their repos. The working group also implemented a streamlined submission and review process and a transparent lifecycle. It’s still very early, but the reception from the community has been great, with 21 submissions so far and a strong pipeline of projects in review. You can learn more about this working group’s goals here, including the requirements and application process.

As part of this new blog series, every quarter we will update the community on new entries in the PyTorch Ecosystem and highlight up-and-coming projects under consideration that would benefit from more eyes and contributors.

Ecosystem Project Spotlights

We’re happy to welcome SGLang and docTR to the PyTorch Ecosystem. Here’s a short intro to both.

SGLang

SGLang is a fast-serving engine for large language models and vision language models. It makes the interaction with models faster and more controllable by co-designing the backend runtime and frontend language.

The core features include:

  • Fast Backend Runtime: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, and quantization (FP8/INT4/AWQ/GPTQ).
  • Flexible Frontend Language: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
  • Extensive Model Support: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
  • Active Community: SGLang is open source and backed by an active community with industry adoption.

SGLang is known for its speed: it often significantly outperforms other state-of-the-art frameworks in serving throughput and latency. Learn more.
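
To give a feel for the frontend language, here is a brief sketch based on SGLang’s documented Python API (the model name, port, and exact call signatures are assumptions that may vary across versions; see the SGLang docs):

# Launch a local server first, e.g. (shell):
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
import sglang as sgl

@sgl.function
def qa(s, question):
    # Chained generation: append a user turn, then generate the assistant's answer.
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

# Point the frontend language at the locally running SGLang runtime.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="Summarize RadixAttention in one sentence.")
print(state["answer"])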

docTR

docTR is an Apache 2.0 project developed and distributed by Mindee to help developers integrate OCR capabilities into applications with no prior knowledge required.

To quickly and efficiently extract text information, docTR uses a two-stage approach:

  1. First, it performs text detection to localize words.
  2. Then, it conducts text recognition to identify all characters in a word.

Detection and recognition are performed by state-of-the-art models written in PyTorch. Learn more.
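
As a quick illustration of that two-stage pipeline, here is a minimal sketch using docTR’s documented Python API (the file path is a placeholder):

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Pretrained detection + recognition models, implemented in PyTorch.
model = ocr_predictor(pretrained=True)

# Load a document (PDF or images) and run OCR end to end.
doc = DocumentFile.from_pdf("path/to/document.pdf")
result = model(doc)

print(result.render())   # plain-text export of the recognized words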

Up and Coming Project Spotlights

As part of this series, we highlight projects that are in consideration for the PyTorch Ecosystem, and that we believe will benefit from more eyes and contributors. This time it’s the turn of EIR and torchcvnn.

EIR

EIR is a comprehensive deep learning framework built on PyTorch that enables researchers and developers to perform supervised modeling, sequence generation, image/array generation, and survival analysis across multiple data modalities. EIR specializes in handling complex data types, including genotype, tabular, sequence, image, array, and binary inputs. While it has particular strengths in genomics and biomedical applications, its versatile handling of these diverse data types allows for broader applications across various sectors. For example, EIR’s multi-modal approach can enhance tasks such as detecting manufacturing defects by linking images with equipment readings (e.g., for an imperfect silicon wafer), monitoring infrastructure by analyzing site photos along with operational logs (e.g., to identify cracks in a pipeline), or improving retail insights by combining product images with their descriptions and sales figures. This demonstrates how EIR’s multi-modal capabilities can bring value to a wide range of industries.

The framework provides a high-level, yet modular API that reduces the amount of boilerplate code and pre-processing required to train models, allowing users to focus on their end goals rather than implementation details. To learn more and explore practical examples, please refer to the documentation. 

Key features include:

  • Multi-modal inputs: Seamless integration of genotype, tabular, sequence, image, array, and binary data.
  • Varied modeling options: Use any of the input modalities above for supervised learning, sequence generation, image/array generation, and survival analysis.
  • Scaling: Capabilities for custom data streaming for model training.
  • Explainability: Built-in explainability functionality for supervised learning and survival analysis.
  • Model Deployment: Serve any of your trained models with just one command, allowing you or others to interact with your models via web services.

To explore EIR and consider how it might enhance your work with multi-modal data, check out the documentation and its practical examples.
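
For a rough sense of the workflow, the sketch below shows what a training run can look like from the command line. Treat the exact flags and configuration files as assumptions and consult the EIR documentation for working examples.

# Hypothetical sketch of a training run driven by YAML configs. The eirtrain command
# follows the pattern in EIR's tutorials, but the file names and config schema here
# are placeholders; see the EIR documentation for working examples.
eirtrain \
  --global_configs configs/globals.yaml \
  --input_configs configs/inputs.yaml \
  --output_configs configs/outputs.yaml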

torchcvnn

torchcvnn is a library that helps researchers, developers, and organizations easily experiment with Complex-valued Neural Networks (CVNNs)! In several domains, such as remote sensing and MRI, data are naturally represented in real-imaginary form. These domains benefit from direct complex-valued computation, which exposes critical physical characteristics to the neural network during learning.

torchcvnn gives you easy access to: 

  • Standard datasets for both remote sensing (SLC and ALOS2 formats) and MRI, covering different tasks (classification, segmentation, reconstruction, super-resolution);
  • Various activation functions, either operating independently on the real/imaginary components or fully exploiting the complex nature of the representations;
  • Normalization layers, including the complex-valued BatchNorm of Trabelsi et al. (2018), LayerNorm, and RMSNorm;
  • A complex-valued attention layer, as introduced in Eilers et al. (2023).

PyTorch already supports optimization of complex-valued neural networks through Wirtinger calculus. However, several complex-valued building blocks are still missing for fully exploring the capabilities of complex-valued neural networks. The objective of torchcvnn is to fill this gap and provide a library that helps PyTorch users dig into the realm of complex-valued neural networks.
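
As a minimal illustration of that existing support (plain PyTorch here, not torchcvnn code), a real-valued loss on a complex-valued “layer” already yields Wirtinger gradients through the usual backward() call:

import torch

torch.manual_seed(0)

# Complex-valued "linear layer": both the weight and the input are complex tensors.
weight = torch.nn.Parameter(torch.randn(16, 8, dtype=torch.cfloat))
x = torch.randn(4, 16, dtype=torch.cfloat)

out = x @ weight                  # complex matrix multiply, shape [4, 8]
loss = out.abs().pow(2).mean()    # the loss must be real-valued to call backward()
loss.backward()

print(weight.grad.dtype)          # torch.complex64: the (conjugate) Wirtinger gradient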

torchcvnn warmly welcomes contributions to both the core torchcvnn library and the examples repository, whether you are reporting a bug, suggesting improvements, or contributing source code. All the components are described in the documentation of the project. The torchcvnn team will be present at IJCNN 2025 in July in Rome during the special session on “Complex- and Hypercomplex-valued Neural Networks.”

How to Join the PyTorch Ecosystem

If you’re developing a project that supports the PyTorch community, you’re welcome to apply for inclusion in the Ecosystem. Please review the PyTorch Ecosystem review process to ensure that you meet the minimum expectations before applying.

Cheers!

The PyTorch Ecosystem Working Group

Read More

Open Source AI is Transforming the Economy—Here’s What the Data Shows

The Economic and Workforce Impacts of Open Source AI

Blog cross-posted on the Linux Foundation blog.

As we approach the midpoint of 2025, the potential of AI to transform businesses, economies, and industries is not only widely and nearly universally anticipated but also well documented. In a commissioned project by Meta, LF Research set out to capture existing evidence on this topic, with the specific aim of understanding how open source is playing a role in this transformation.

In its latest publication, The Economic and Workforce Impacts of Open Source AI, LF Research describes the nuances of how and to what extent open source AI (OSAI) is impacting the global economy and workforce. By examining existing evidence from industry, academic, and open source research, the authors found important insights on OSAI’s adoption rates, cost effectiveness, innovation-boosting potential, and more. Here are the big takeaways.

First, the adoption of open source AI is already widespread. Nearly all software developers have experimented with open models, and about 63% of companies are actively using them. In fact, among organizations that have embraced AI in any form, a striking 89% incorporate open source AI somewhere in their infrastructure. It’s no longer a fringe approach—it’s becoming the standard.

Why? Cost is a huge factor. Open source tools often come with significantly lower price tags than their proprietary counterparts. My prior research with Manuel Hoffmann and Yanuo Zhou has shown that if open source didn’t exist, companies would spend 3.5 times more on software than they currently do. The new LF report shows that two-thirds of organizations say OSAI is cheaper to deploy, and nearly half cite cost savings as a primary reason for choosing open source. Combine that with studies showing AI’s ability to cut business unit costs by over 50%, while still being user friendly and maintaining high performance, and it’s clear that OSAI represents a strategic advantage for boosting margins and scaling innovation.

Innovation and entrepreneurship are other major benefits of open source. In research with Nataliya Langburd Wright and Shane Greenstein, we found that when open source contributions increase at the country level, so do new startups; at the company level, there is a positive relationship between contributing to open source and startup growth. Open source encourages collaboration, inviting contributions from a global pool of developers and researchers. This external input helps accelerate the development of high-quality models. As Daniel Yue and I found when Meta donated the machine learning library PyTorch to the Linux Foundation, there was a notable increase in corporate contributions, especially from chip manufacturers.

AI’s cost-cutting capabilities are linked not only to the increased productivity that comes from freed-up resources, but also to a re-orienting of the way people work, much as the steam engine’s full impact on the industrial revolution arrived only after factories re-oriented their entire workflow around it. Manuel Hoffmann, Sam Boysel, Kevin Xu, Sida Peng, and I found this to be the case with software developers. When GitHub rolled out its GenAI coding tool Copilot, developers changed the way they worked, spending more time writing code and substantially less time on project management. However, according to existing research identified in the LF study, this has not translated to substantial layoffs: 95% of surveyed hiring managers over the past two years said they do not plan to reduce headcount due to AI. What’s more, being able to use AI tools effectively may actually increase wages by over 20%.

Looking ahead, open source AI is likely to become foundational in areas like edge computing, where smaller, privacy-preserving models need to run efficiently on local devices. OSAI is also making big inroads in industry-specific applications. In manufacturing, for instance, open models offer the flexibility required to integrate AI into complex operational workflows. And in healthcare—a traditionally conservative and risk-averse field—open models are already matching proprietary ones in performance, giving institutions confidence to adopt without compromising on quality. OSAI is an important avenue to level the playing field, no matter your organization’s size or financial resources—as the report found, small businesses are adopting OSAI at higher rates than their larger counterparts.

OSAI is an economic force. It’s reducing costs, accelerating innovation, and empowering a wider range of players to shape the future of technology.

Read the Report

What’s Next for OSAI? Five Areas Ripe for Research

While the impact of OSAI is starting to take shape, the full scope of its influence is just beginning to unfold. To better understand and harness the potential of OSAI, the report outlines five key areas for future research, each crucial to shaping smart policy, business strategy, and innovation ecosystems.

  1. Tracking the Bigger Picture: OSAI’s Role in Market Growth
    One pressing question is how open models are influencing the overall AI market. Beyond the tools themselves, OSAI may be driving complementary innovation, spurring growth in services, applications, and platforms built on top of open infrastructure. Understanding this broader ripple effect is essential for grasping the true economic footprint of open AI.
  2. Making the Case for Investment
    To help make informed decisions, researchers are encouraged to analyze the return on investment in OSAI infrastructure at both country and company levels. Quantifying the long-term value of these open components, from datasets and compute to developer tooling, can guide resource allocation and policy decisions in a fast-moving field.
  3. Connecting Openness to Innovation
    Does OSAI directly foster more startups, patents, or efficient R&D? Future studies should explore how open access to models and tools correlates with concrete innovation metrics. This could provide evidence for how openness accelerates not just adoption, but invention.
  4. Crunching the Cost Numbers
    A detailed comparison of costs between open and proprietary AI solutions across sectors, company sizes, and global regions would shed light on who benefits most from going open. These insights would be invaluable for organizations navigating tight budgets and evaluating technology strategies.
  5. Understanding Workforce Impacts
    Finally, the human side matters. As AI tools reshape work, it’s vital to measure how open models affect worker productivity, satisfaction, and work patterns. Do open tools empower workers in certain tasks or industries more than others? Do they lead to more flexible, fulfilling roles? Answers to these questions will help ensure that AI benefits not just business, but people.

By exploring these future research areas, we can unlock a deeper understanding of how open source AI is transforming the global economy and workforce. The era of open source AI is here—and it’s time to study its impact with depth and rigor.

Read More

Build Responsible AI Products with your own Yellow Teaming LLM

The tools we use to build AI are evolving fast, with PyTorch at the heart of many advances. But unless we evolve the way we approach building AI systems, we risk amplifying harm as fast as we’re scaling up performance. Building AI responsibly means designing systems that not only perform well but do so fairly, safely, and transparently—like making sure an AI hiring tool doesn’t favor one demographic over another.

One useful approach to developing responsible AI systems is Yellow Teaming: a proactive exercise that surfaces potential unintended consequences before deployment. Yellow Teaming helps companies stand out in a crowded market by making more thoughtful, impact-aware design choices that lead to an overall better product.

In this blog, we show how you can quickly create a PyTorch-based LLM Yellow Teaming assistant running on AWS Graviton4 with a reusable system prompt. We also give you an example to show you how to use your new assistant to explore unintended business-critical consequences of feature design and ultimately build better products.

Let’s get started.

What is Yellow Teaming?

You may already be aware of the more popular term Red Teaming in cybersecurity, which involves simulating how adversaries might attack your product and fixing vulnerabilities before launch. Other color-coded approaches exist (like Blue Teams that defend against attacks), but Yellow Teaming is distinct in focusing on thoughtful design and implementation from the start of the product’s lifecycle. Red Teaming practices have already been adapted to the AI domain. Yellow Teaming principles are now becoming an important part of AI development as well.

The practice of Yellow Teaming asks a set of probing questions to help reveal the broader, unintended impacts of your product on your business, your users, and society at large. This application of Yellow Teaming, and the rationale behind it, are explained eloquently in the Development in Progress essay by The Consilience Project. A closely related practice is also offered in the module, Minimizing Harmful Consequences, in the Center for Humane Technology free course.

Why Does Yellow Teaming Matter?

The central idea is that by analyzing the consequences of your product decisions with a wide view, you can design better products that create positive feedback loops for your company’s bottom line and your users’ well-being. For example, it helps you avoid building a chatbot that unintentionally reinforces bias.

Traditional product development practices often solve for narrowly defined success metrics. Creating specific product measurables is good for focus and accountability, but can lead to over-optimization on metrics while ignoring other signals that matter to your company. For instance, building an app with AI-driven recommendations that boosts engagement in the short term but makes people feel worse and fails to retain users over time.

Narrow product optimization tends to cause unmeasured negative effects. These include users getting burnt out or frustrated when using your product, reputational harm or less overall engagement with your company, and society fracturing from lack of trust and meaningful communication.

In many cases, what looks like product success on paper is actually harming your users, your company, and your long-term goals.

How to Implement Yellow Teaming Practices

Yellow Teaming is straightforward and powerful. Pick a product you are building, and systematically evaluate the various consequences for your users, your business, and society when it is adopted at scale. Start with direct consequences, then move to second- and third-order consequences by asking ‘what happens as a result of the previous effects?’ You should think through these consequences across multiple axes:

  1. Good and bad
  2. Short-term and long-term
  3. Intended and unintended
  4. Your company and your users
  5. A single user and groups of users

These types of questions help foster productive brainstorming:

  • What kinds of behaviors will this feature incentivize in users?
  • What affordances does this technology provide (what can users now do that they couldn’t before, even if unintended)?
  • Will this improve or degrade trust in our platform?
  • What social groups might benefit—or be left behind?

Yellow Teaming is based on complex systems thinking and externality analysis—fields that have traditionally felt far removed from engineering workflows. But by incorporating a lightweight Yellow Teaming assistant to help your ideation processes, it can become an intuitive, high ROI part of product development.

Building Your PyTorch YellowTeamGPT

The good news is that you don’t need a PhD in philosophy or a panel of advisors to Yellow Team your AI project. You just need to be willing to act and, in this implementation of Yellow Teaming, use a good LLM with the right prompt. There are several advantages to running your LLM locally. The biggest is that you can safely feed in confidential product plans without worrying about your data being leaked. Another benefit is that the smaller model is not perfect and makes mistakes, forcing us as users to apply critical thinking to every output, and putting us in the right headspace to analyze non-obvious product consequences.

Here is how you can set up a PyTorch-based 8-billion-parameter Llama 3.1 model on your Graviton instance. First, create an r8g.4xlarge instance running Ubuntu 24.04 with at least 50 GB of storage, then follow these three steps:

1. Set up your machine with the torchchat repo and other requirements:

sudo apt-get update && sudo apt install gcc g++ build-essential python3-pip python3-venv google-perftools -y

git clone https://github.com/pytorch/torchchat.git && cd torchchat

python3 -m venv .venv && source .venv/bin/activate

./install/install_requirements.sh

2. Download the model from Hugging Face (HF) by entering your HF access token (note the max sequence length parameter, which you can increase to enable longer conversations with a linear increase in memory usage):

pip install -U "huggingface_hub[cli]"

huggingface-cli login

python torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so --device cpu --max-seq-length 8192

3. Run the model with Arm CPU optimizations and 700 max token length per response:

LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --device cpu --max-new-tokens 700 --chat

For more details on these commands and additional code snippets to add a UI to this chatbot, review this Arm Learning Path.

You can then enter a custom system prompt. Below is a simple prompt that turns your local LLM into a Yellow Teaming assistant. Feel free to review and tweak it to get the most out of it for your specific needs. Here’s what it does:
  1. Gathers key product details: What you’re building, how it makes money, who your users are.
  2. Analyzes direct and indirect consequences: YellowTeamGPT presents one at a time, considering non-obvious impacts to your business, users, and beyond (you’ll likely start to think of more impacts on your own).
  3. Iterates with you: You are in control, telling YellowTeamGPT to continue listing general direct consequences, identifying specific company risks, moving to 2nd-order effects, and even brainstorming features to make your product better.

Here is the YellowTeamGPT system prompt for you to copy. If directly copying, make sure to copy as one line into your terminal or the new lines may cause issues.

You are an expert in complex systems thinking and AI product design, called YellowTeamGPT. You help technologists build better products that users love, and lower company risk. You do this by helping the user evaluate their product design decisions via the Yellow Teaming methodology, which identifies the consequences of design decisions on their business, their users, and society.

You will request from the user information about their product under development. Once you have enough information, you will analyze the product’s consequences that arise if deployed at scale. Structure your thinking to first review direct consequences, then 2nd order consequences that follow from the identified direct effects (by asking ‘what might happen next as a result?’). Consider consequences that impact the company, users, and society; are short and long term; are across categories like truth and understanding, human well-being, capability growth, economics, and more.

You are here to constructively challenge users, not reinforce their existing ideas. Play devil’s advocate to help users think in ways they are not currently.

You will output in this format: For each identified consequence, tie the impact to product quality, and prompt the user with a question that helps them design the product better to mitigate that consequence (or turn a negative impact into a positive one). List one consequence at a time and ask the user to continue listing them or explore that consequence further.

Example Yellow Teaming

Give your LLM the provided system prompt and hit enter. Next, your YellowTeamGPT assistant will ask for some product details. Here is a hypothetical example product I used:

I’m building an app that turns a group chat conversation into a catchy pop song. Targeting any user, like WhatsApp users. Key functionality is importing a group chat conversation and outputting a tune with lyrics and beat to match. It is an app on any smartphone. Ideally, millions of users. Would make money by targeted advertising of the users.

You’ll notice, as YellowTeamGPT thinks and generates its reply, that it is notably slower than ChatGPT or other popular GPTs. Like the model’s inaccuracy, its slow speed can be perceived as a benefit. The point of this exercise is to slow down, think through non-obvious product impacts, and brainstorm enhancements that create positive value across the systems your product touches. While your YellowTeamGPT is ‘thinking,’ you should be too.

And below are snippets of my conversation. First, it starts with one direct consequence:

I then instruct it to continue to another consequence:

I ask to explore the second-order effects of having misinformation spread at scale from this app:

Finally, I ask for help brainstorming product features to mitigate this harm. It generates a few interesting concepts that are not product-ready, but easily spark further ideation:

Using YellowTeamGPT for this use case, we were able to rapidly explore product impacts we may not have considered. We could then brainstorm features solving previously unconsidered problems, leading to an improved product experience that also mitigates the risk of reputational harm to our hypothetical company.

Integrating Yellow Teaming Into Your Practices

Anywhere you’re making decisions that shape your product’s features and the user experience, Yellow Teaming fits. Here are a few examples of where you can leverage your new YellowTeamGPT:

  • New product ideation sessions to expand your thinking.
  • Feature planning docs to stress-test your specs.
  • Code review workflows for flagging potential misuse.
  • Sprint retrospectives to reflect on design choices at scale.
  • Product pitch decks to show responsible AI due diligence.

It can be as formal or informal as you want. The more you and your team think about unintended, Nth-order product consequences across multiple axes, the better your product will be. By incorporating Yellow Teaming into your work, you don’t just do the right thing, you build products that:

  • Users engage with and trust more
  • Mitigate harmful impacts
  • Minimize company risk
  • Create lasting business value

Let’s stop thinking of responsible AI practices as something to check off a list and start thinking of them as what they really are: a competitive edge that creates positive outcomes for your company, for your users, and for our shared society.

Read More

PyTorch Hangzhou Meetup Recap: Exploring the AI Open Source Ecosystem and Cutting-Edge Technology Practices

On May 17, the PyTorch Meetup was successfully held in Hangzhou, drawing nearly 60 developers and industry experts from companies including Huawei, Tencent, Ant Group, and ByteDance. The event focused on the development of the PyTorch ecosystem, AI acceleration technologies, and industry practices. Through keynote speeches and technical sessions, in-depth discussions were held with participants, providing a valuable platform for exchange and collaboration.

Session Highlights:

Latest Developments in the PyTorch Community and Ecosystem Outlook

Yikun Jiang, a member of the PyTorch Technical Advisory Council (TAC), shared the latest updates from the PyTorch community. Topics included the general progress of PyTorch, PyTorch Foundation Expands to an Umbrella Foundation, the Ambassador Program, and PyTorch Conference planning. He emphasized how PyTorch continues to drive innovation and real-world adoption of AI open source technologies through technical iteration, ecosystem expansion, and global collaboration. He called on developers to actively engage in community building and help shape the future of the AI open source ecosystem.

Torchair: A torch.compile Backend Optimized for Ascend NPU

Peng Xue, Senior Engineer at Huawei, presented technical practices around graph mode optimization on Ascend NPUs. He introduced the two Torchair modes—Reduce-overhead and Max-autotune—and detailed deep optimizations in memory management, dynamic shapes, multi-stream parallelism, and compile-time caching. These improvements aim to enhance model training and inference performance while maintaining ease of use.

PyTorch Ecosystem on Ascend

Yuanhao Ji, Software Engineer at Huawei, discussed support for PyTorch ecosystem projects on Ascend NPUs. Focusing on model training, fine-tuning, and inference, he introduced TorchTitan, TorchTune, and vLLM as case studies. He explained their core features and adaptation strategies for Ascend, offering practical guidance for deploying PyTorch projects on this hardware.

Production Prefill/Decode Disaggregation Based on vLLM at Tencent

Chao Zhang, Senior Engineer at Tencent, presented the practice of Prefill/Decode (PD) separation in large model inference. This technique decouples the compute-intensive prefill stage from the memory-intensive decode stage, significantly improving system throughput and resource utilization. His talk covered key technical implementations such as KV cache transmission optimization, intelligent load balancing, and multi-turn dialogue caching. Real-world deployments on both homogeneous GPUs and heterogeneous setups like Ascend A2 + H20 showed performance improvements of 20%–50%. Tencent has further optimized the vLLM framework for CPUs and GPUs, using pipeline decomposition, low-precision KV caches, and graph compilers to enhance adaptability and performance across hardware platforms.

Key Reinforcement Learning (RL) Acceleration Techniques and Training Practices

Chenyi Pan, Senior Engineer at Huawei, shared Ascend’s breakthroughs in reinforcement learning and ecosystem development. Addressing the challenge of low resource utilization in RL systems, he introduced a training-inference co-card solution that allows for efficient switching between the two tasks. This approach not only saves 50% in compute resources but also doubles single-card throughput and improves inference memory availability by 80%. To enrich the technical ecosystem, Ascend also launched TransferDock, a streaming data engine that employs dynamic load balancing strategies to improve task efficiency by over 10% compared to traditional caching mechanisms.

On the framework side, MindSpeed-RL combines the MindSpeed training backend with the vLLM inference engine, supporting dynamic weight partitioning and time-sharing of cluster resources while maintaining compatibility with mainstream open source ecosystems. Benchmarks using the Qwen2.5-32B model showed that this setup outperformed the SimpleRL-Zoo baseline on evaluations such as MATH500, demonstrating its technical leadership.

Ray’s Practice and Exploration in Ant Group’s AI Infra Ecosystem

Senlin Zhu, Senior Technical Expert at Ant Group and Head of Ant Ray, shared the practice and exploration of Ray within Ant’s AI Infra ecosystem. He outlined Ray’s architectural design and programming paradigm. Over time, Ray has evolved into critical infrastructure for AI systems, supporting training, inference, hyperparameter tuning, and reinforcement learning.

Since 2017, Ant Group has continuously invested in Ray, which now supports applications at the scale of 2 million cores. Ant has also contributed key features to the community, such as multi-tenancy support and the Flow Insight visual debugging tool. Flow Insight, in particular, has alleviated “black box” issues in complex AI systems and significantly improved observability and deployment efficiency at scale.

Challenges and Standardization in PyTorch Ecosystem Accelerator Development

Zesheng Zong, a community developer from Huawei, provided a systematic overview of the challenges, solutions, and case studies in developing accelerators for the PyTorch ecosystem. Developers integrating out-of-tree hardware face version compatibility issues and a lack of standardized quality benchmarks, making it hard to quantify new device support. In early 2025, a new exploration group was formed in the PyTorch community to tackle these challenges.

Key improvements include:

  • Establishing a standardized testing framework using the public repository pytorch-fdn/oota for daily plugin testing.
  • Developing the OpenReg module to simulate backend behavior and validate it with test cases.
  • Optimizing the PrivateUse1 plugin mechanism to reduce integration complexity.
  • Supporting automatic plugin loading to simplify device access.
  • Improving the torch.accelerator device-agnostic API for broader compatibility.
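
For context, here is a short sketch of the device-agnostic torch.accelerator API mentioned above, as it appears in recent PyTorch releases (helper names may differ slightly between versions):

import torch

# Query the active accelerator without hard-coding a backend such as "cuda" or "npu".
if torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator()
    print(f"{torch.accelerator.device_count()} accelerator(s) found; using {device}")
else:
    device = torch.device("cpu")

x = torch.randn(4, 4, device=device)   # tensors land on whichever backend is present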

Intel’s community developer Chuanqi Wang followed up with a case study on integrating and running CI infrastructure using Intel Gaudi. He described how to leverage CI from code compilation and unit testing to TorchBench automated benchmarking, ensuring quality for new backend integrations. He also noted plans to reduce testing time, clarify required test items, and define quality standards to improve ecosystem compatibility and development efficiency.

This PyTorch Meetup served as a technical bridge for in-depth developer exchanges and demonstrated the vibrant energy of the PyTorch ecosystem in AI’s cutting-edge domains. Through diverse perspectives, the attendees sketched a picture of how open source collaboration drives technological progress. We look forward to more developers joining this open and thriving wave of innovation, where each exchange can spark new ideas in the age of intelligence.

Read More

The Open Source Legacy and AI’s Licensing Challenge

Open source licensing revolutionized software development, creating a thriving ecosystem built on shared innovation and collaboration. Licenses like MIT and Apache-2.0 gave developers a standard, legally robust way to share code, reducing friction and accelerating adoption.

Today, we stand at a similar inflection point with open AI models. These models, increasingly foundational to research and industry, lack an equivalent licensing standard. Existing open source software licenses weren’t designed with AI models in mind, while most model-specific licenses are either too complex, overly restrictive, or legally ambiguous.

To fully unlock the potential of open AI, we need a license purpose-built for the realities of machine learning. That’s where OpenMDW comes in.

Why AI Models Need a New License

AI models differ fundamentally from traditional software. They are:

  • Composites of multiple types of components: including code, architecture, training data, weights, documentation, and evaluation protocols.
  • Subject to overlapping IP regimes: such as copyright, database rights, and trade secrets, which vary across jurisdictions.
  • Distributed without a consistent definition of “open”: resulting in a fragmented licensing landscape.

This complexity has led to a proliferation of bespoke, incompatible licenses that often:

  • Limit redistribution, reuse, or modification.
  • Fail to address legal nuances unique to models.
  • Create uncertainty for developers and adopters alike.

The result? Friction in open ecosystems, legal ambiguity, and a significant barrier to collaboration and innovation.

The Origins of OpenMDW

OpenMDW, short for Open Model, Data and Weights License, was born out of the effort to implement the Model Openness Framework (MOF). The MOF is a 3-tier classification system that defines what it means for a model to be truly “open”: not just available with limitations or use restrictions, but licensed openly across its code, architecture, parameters, training data, and documentation.

To make MOF practical, model developers needed a simple, standard license they could drop into any repository, just as Apache-2.0 or MIT is used in software: something purpose-built for many types of content, including models, not just code.

What Makes OpenMDW Different

OpenMDW is the first truly permissive license designed from the ground up for machine learning models. Here’s what sets it apart:

Covers the Entire Model Stack

It’s designed to apply to all components of a model release:

  • Model architecture
  • Parameters and checkpoints
  • Training and inference code
  • Preprocessing and evaluation data
  • Documentation (e.g., model cards, data cards)

Importantly, OpenMDW does not require inclusion of all components. It applies only to what is distributed, while remaining compatible with many other licenses that may govern certain parts of the repository.

(OpenMDW users will of course have to continue to comply with any other third-party licenses that apply to other pre-existing materials in their repos, such as by providing license text and notices, source code where applicable, etc.)

Comprehensive and Legally Grounded

OpenMDW grants expansive permissions under copyright, patent, database, and trade secret law, covering the broad legal spectrum of rights relevant to AI artifacts.

It also includes:

  • Patent litigation termination clauses to deter patent assertions by users of the model’s materials
  • Attribution requirements to maintain provenance and trust

Compatible with Policy and Open Source Principles

  • Intended to be fully aligned with the EU AI Act’s references to “free and open-source licenses”
  • Supports the Open Source Initiative’s (OSI) 10 principles, including free redistribution, source availability, derived works, and no discrimination against persons or groups

Designed for Simplicity

  • One license, one file, one place: a LICENSE file at the root of your repo
  • No complex licensing matrix: no confusion for downstream users
  • Easy integration into any repo: just like MIT or Apache-2.0.

Understanding the OpenMDW License

Definitions and Scope

Model Materials under OpenMDW include:

  • Model architecture and trained parameters; and
  • all other related materials provided under OpenMDW, which can include:
    • Preprocessing, training and inference code
    • Datasets and evaluation scripts
    • Documentation, metadata, and tools

This comprehensive scope maps directly to the Model Openness Framework (MOF), ensuring that all critical elements of a model are covered if they are included with the distribution.

The Model Materials definition is not intended as a requirement for what must be included in a distribution. It only specifies that whatever is included in the distribution is covered by the license, excluding anything covered by other licenses in the distribution.

Grant of Rights

OpenMDW grants broad rights to “deal in the Model Materials without restriction,” including for example:

  • Use, modify and distribute the Model Materials
  • Operate under copyright, patent, database, and trade secret laws

These rights are granted free of charge, with no field-of-use restrictions, removing ambiguity for developers and enterprises alike.

Attribution, Not Copyleft

OpenMDW imposes only minimal obligations:

  • Retain the license text
  • Preserve original copyright and attribution notices

There are no copyleft or share-alike conditions, meaning derivative models and integrations can remain fully permissive. This allows for maximum reuse and interoperability.

Patent Protection

To prevent misuse of the commons, OpenMDW includes a patent-litigation termination clause: if a licensee initiates offensive patent litigation over the Model Materials, their license is revoked.

This mirrors best practices in open source software and helps preserve a collaborative ecosystem.

Outputs Are Unrestricted

A major innovation: outputs generated by using a model under OpenMDW are completely free of licensing restrictions imposed by the provider of the Model Materials.

This eliminates confusion over whether generated text, images, code, or predictions are encumbered by the model provider, a common point of uncertainty in existing licenses.

How to Adopt OpenMDW

Adopting OpenMDW is straightforward:

  1. Add the OpenMDW-1.0 license file to your repository: LICENSE
  2. Clearly indicate that your release is under OpenMDW-1.0 in the README
  3. Ensure all components of the model package are covered and disclosed, including prominently highlighting any components that are subject to other licenses

Why This Matters Now

The AI community is reaching an inflection point. Open models, from AI2’s Molmo to Mistral, and open reasoning models like DeepSeek’s R1, as well as multimodal agents, are reshaping what’s possible in the open. But their licensing status remains hard to characterize, since software licenses may not map cleanly onto AI models.

Some open-weights models that use restrictive licenses have gradually become more permissive; but without a strong legal framework available for licensing, model producers have been forced to err on the side of caution in designing their own licenses.

In his recent post, Nathan Lambert of AI2 rightly notes: “One of the long standing todo items for open-source AI is better licenses.” OpenMDW helps to fill that need.

Just as Apache-2.0 and MIT became foundational licenses for open source software, OpenMDW is positioned to become the standard for open models. Its clarity, scope, and permissiveness lower barriers for developers and create certainty for companies and researchers looking to build responsibly on open foundations.

This isn’t just about legal clarity; it’s about enabling an innovation-rich and open source AI ecosystem.

Visit openmdw.ai for more details including the FAQ.

Read More

Featured Sessions: Exploring Innovation at PyTorch Day China 2025

PyTorch Day China 2025, proudly hosted by the PyTorch Foundation, will take place on June 7 in Beijing, China, co-located with the BAAI Conference. This will be the second event in the new PyTorch Day series, following the inaugural PyTorch Day France last month in Paris. PyTorch Days are focused on regional communities and provide a forum for sharing technical advances, project updates and tutorials, and showcasing impactful innovations across research and industry.

PyTorch Day China will highlight cutting-edge tools, frameworks, and practices across the PyTorch ecosystem. The full-day event will feature insightful talks across a multitude of domains, along with technical discussions on the most relevant challenges and projects in the open source AI lifecycle.

PyTorch Day China Featured Sessions:

Running Large Models on Any AI Chip: PyTorch + Open-Source Stack (FlagOS)
Yonghua Lin, VP and Chief Engineer, BAAI
A deep dive into architecture-free deployment of large models using FlagOS and PyTorch—part of BAAI’s open-source stack for cross-hardware model execution.

torch.accelerator: A Unified Runtime API for Accelerators
Yu Guangye, AI Framework Engineer, Intel
Learn how Intel is helping unify PyTorch’s runtime interface across diverse hardware accelerators, streamlining portable and scalable AI workloads.

vLLM: Easy, Fast, and Cheap LLM Serving for Everyone
Kaichao You, Tsinghua University
Explore the design and performance of vLLM, a popular open-source project for efficient inference and serving of large language models.

PyTorch in Production: Boosting LLM Performance on Ascend NPU
Jiawei Li, Huawei
A look at how PyTorch is being deployed in Huawei’s large-scale heterogeneous environments, with a focus on performance tuning and production readiness.

This is just a sample of what PyTorch Day China will offer. To explore the full agenda, visit the BAAI Conference event page.

Whether you’re contributing to the PyTorch ecosystem or deploying it at scale, PyTorch Day China is an opportunity to connect with a growing community and shape the future of AI development.

Read More

Accelerating GPU Performance with Triton: April 30th PyTorch ATX Event

The PyTorch ATX Triton event, sponsored by Red Hat, was held on April 30, 2025, at the University of Texas. It was an exciting gathering focused on the Triton framework and its role in optimizing and democratizing GPU performance. A key purpose of the event was to highlight the awesome Triton contributors based in Austin and working for companies like Red Hat, Intel, AMD, IBM Research, and the University of Texas. Bringing contributors together helped to share insights and foster a stronger community. 

More than 50 attendees gathered to hear experts from these organizations discuss the growing importance of Triton in optimizing GPU efficiency for various algorithms. Key topics included understanding how to write, optimize, and troubleshoot Triton kernels to maximize GPU utilization and kernel portability.
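
For readers new to Triton, a minimal vector-add kernel gives a feel for the programming model those sessions covered (a generic sketch, not taken from the event material):

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # one program instance per block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)                # number of program instances
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)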

Presentations covered a range of subjects from an introduction to Triton and its significance in vendor-neutral hardware acceleration, new sub-projects exploring increased developer productivity and runtime performance, to specific use cases such as Triton for vLLM and the Triton implementations by AMD and Intel. All session videos can be found here (YouTube). Speakers also examined the Triton framework itself, along with its release process, providing attendees with a comprehensive overview of the technology and its application.

This event aimed to equip the PyTorch ATX community with the knowledge and skills necessary to leverage Triton effectively and foster a deeper understanding of Triton’s capabilities by introducing and connecting local contributors. And guess what? This event worked out so well that we’re going to be hosting another large PyTorch ATX event focused on vLLM and the future of inferencing, coming up in August! Sign up here.

Read More

How OpenSynth Uses PyTorch to Accelerate Compute for Energy Modelling Applications

OpenSynth has recently leveraged PyTorch to improve the experience of its users and community. OpenSynth is an open source community hosted by LF Energy that is democratising access to synthetic energy demand data. 

Access to smart meter data is essential to rapid and successful energy transitions. Researchers, modelers and policymakers need to understand how energy demand profiles are changing, in a system that requires greater real time optimization of demand and supply on the grid. Yet current global energy modeling and policymaking is still largely based on static and highly aggregated data from the past – when energy flowed in one direction, consumer profiles were relatively predictable, and power generation was highly controllable.

The major challenge is that access to demand data is highly restrictive, as a result of privacy protections. Rather than joining industry calls to unlock raw smart meter data through existing mechanisms, by tackling current data regulations and smart meter legislation, OpenSynth believes generating synthetic data is the fastest way to achieve widespread, global access to smart meter datasets.

The community empowers holders of raw smart meter (i.e. demand) data to generate and share synthetic data and models that can be used by researchers, industry innovators and policy-makers. 

PyTorch allowed the OpenSynth community to use GPU compute to speed up computation and use distributed training. End users with access to multiple GPUs can split the dataset into multiple smaller datasets to parallelise compute, further speeding up compute. This allows scaling of training to much bigger datasets than before.

The Business Challenge

Centre for Net Zero, the non-profit that originally developed OpenSynth before it was contributed to LF Energy, has also developed an algorithm called Faraday, available via OpenSynth to its users, that can generate synthetic smart meter data. The Faraday algorithm consists of two components: an AutoEncoder module and a Gaussian Mixture Model (GMM) module.

The GMM component of Faraday was originally implemented using scikit-learn, a popular library among data scientists for training many different machine learning algorithms. However, that implementation does not scale well on large datasets, as it only supports CPUs (Central Processing Units); it does not allow accelerated computation using GPUs (Graphics Processing Units). GPUs are more powerful chips that can perform mathematical operations much faster and are commonly used to train deep learning models.

Furthermore, it also does not allow any parallelisation. Parallelising compute means splitting the original dataset into multiple independent, smaller datasets, training smaller models on each individual dataset, and then combining the smaller models into a single model.

A different implementation was needed that supports both parallel computation and GPU acceleration. 

How OpenSynth Used PyTorch

The OpenSynth community recently ported the GMM module from Faraday to PyTorch. Originally implemented using scikit-learn, this reimplementation enables the use of GPUs for training GMMs, significantly accelerating computational performance.

By leveraging PyTorch’s powerful GPU capabilities, the new GMM module can now handle much larger datasets and faster computation, making it an invaluable tool for practitioners working with large datasets that cannot fit into memory. This update allows users to scale their models and processes more efficiently, leading to faster insights and improved results in energy modeling applications.
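
To illustrate why the port matters, below is a bare-bones EM loop for a diagonal-covariance GMM written directly with PyTorch tensors, so the same code runs on CPU or GPU. It is illustrative only and is not the Faraday/OpenSynth implementation:

import math
import torch

def fit_gmm(x, n_components=8, n_iter=50, eps=1e-6):
    # x: [N, D] tensor on CPU or GPU; returns mixture weights, means, and variances.
    n, d = x.shape
    means = x[torch.randperm(n, device=x.device)[:n_components]]         # [K, D]
    vars_ = torch.ones(n_components, d, device=x.device)                 # [K, D]
    weights = torch.full((n_components,), 1.0 / n_components, device=x.device)

    for _ in range(n_iter):
        # E-step: responsibilities under diagonal Gaussians (computed in log space).
        diff = x.unsqueeze(1) - means.unsqueeze(0)                        # [N, K, D]
        log_prob = -0.5 * ((diff ** 2) / (vars_ + eps)
                           + torch.log(2 * math.pi * vars_ + eps)).sum(-1)
        resp = torch.softmax(log_prob + torch.log(weights + eps), dim=1)  # [N, K]

        # M-step: update weights, means, and variances from the responsibilities.
        nk = resp.sum(0) + eps                                            # [K]
        weights = nk / n
        means = (resp.t() @ x) / nk.unsqueeze(1)
        diff = x.unsqueeze(1) - means.unsqueeze(0)
        vars_ = (resp.unsqueeze(-1) * diff ** 2).sum(0) / nk.unsqueeze(1)
    return weights, means, vars_

# Fit on GPU when available; a scikit-learn GMM would be CPU-only here.
device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.randn(10_000, 24, device=device)
w, mu, var = fit_gmm(data)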

A Word from OpenSynth

“Open source is a powerful catalyst for change. Our open data community, OpenSynth, is democratising global access to synthetic energy demand data – unlocking a diversity of downstream applications that can accelerate the decarbonisation of energy systems. PyTorch has an incredible open source ecosystem that enables us to significantly speed up computation for OpenSynth’s users, using distributed GPUs. Without this open source ecosystem, it would have been impossible to implement this change – and slowed down the efforts of those seeking to affect net zero action.” – Sheng Chai, Senior Data Scientist, Centre for Net Zero

Learn More

For more information, visit the LF Energy OpenSynth website.

Read More

PyTorch/XLA 2.7 Release: Usability, vLLM Boosts, JAX Bridge, GPU Build

PyTorch/XLA is a Python package that uses the XLA deep learning compiler to enable PyTorch deep learning workloads on various hardware backends, including Google Cloud TPUs, GPUs, and AWS Inferentia/Trainium. The PyTorch/XLA team has been working hard to bring new capabilities to researchers and developers using TPUs/GPUs and XLA backends. In this update, we’ve made many additions and improvements to the framework. Some of the notable highlights are: 

  • Usability improvements
  • Experimental bridge with JAX operations
  • A new Pallas-based kernel for ragged paged attention, enabling further optimizations on vLLM TPU

These features, bug fixes, and other details are outlined in the release notes. Let’s now delve into the highlights in detail!

Usability Improvements

Developers can now better target the areas of code whose performance they want to measure by marking the exact regions they would like to profile. An example of this is:

import torch_xla.debug.profiler as xp

# Start the profiling server, then trace only the region of interest.
server = xp.start_server(8001)
xp.start_trace(profiling_dir)
# Run some computation
...
xp.stop_trace()

PyTorch/XLA 2.7 also introduces an API to query the number of cached compilation graphs, aiding in the detection of unexpected compilations during production inference or training. An additional enhancement optimizes host-to-device transfers by avoiding unnecessary tensor copying, thus improving performance.

JAX Bridge in PyTorch/XLA (Prototype)

We’re experimenting with integrating JAX operations directly into PyTorch/XLA graphs as a way to enable a bridge between the frameworks — this method enables users to call JAX functions inside PyTorch models running with XLA.

As a use case, we’ve explored calling `jax.experimental.shard_alike` from PyTorch/XLA. This function improves sharding propagation in certain code patterns like scan, and we’ve integrated it as part of the GSPMD (Generalized SPMD) workflow in the compiler. This tool is utilized in torchprime to enable support for the SplashAttention Pallas kernel.

import torch_xla.core.xla_builder as xb

# Native function written in JAX
def jax_function(...):
  import jax
  ...
  return ...

res = xb.call_jax(...)

Ragged Paged Attention Pallas Kernel

Efficient attention for variable-length sequences is critical for scaling large language models, and the new Pallas kernel for ragged paged attention brings a major performance and usability upgrade to vLLM TPU.

This update introduces a custom kernel implemented using the Pallas custom kernel language and is lowered to Mosaic for TPU. It supports ragged (variable-length) input sequences and implements a paged attention pattern. Below are the key features:

  • Supports mixed prefill and decode operations to increase inference throughput (e.g., up to a 5x speedup compared to the padded Multi-Queries Paged Attention implementation for llama-3-8b).
  • No GMM (Grouped Matmul) Metadata required! We calculate the metadata on the fly in the kernel. This can increase performance by 10%.
  • Provides a CUDA Flash Attention equivalent with Paged Attention support and a similar interface.

We are continuously collaborating with the vLLM community to further optimize performance, expand kernel coverage, and streamline TPU inference at scale.

GPU Build is Back

The GPU build was paused in the PyTorch/XLA 2.6 release, but we’ve now re-enabled GPU Continuous Integration (CI) in version 2.7. The current release includes GPU builds with CUDA 12.6, marking an important step forward for GPU support.

While CUDA support is still considered experimental in this release, we plan to expand coverage to additional CUDA versions in upcoming releases.

Get Involved

Please check out the latest changes on GitHub. As always, we’re actively seeking feedback and contributions from the community.

Read More

MetaShuffling: Accelerating Llama 4 MoE Inference

Mixture-of-Experts (MoE) is a popular model architecture for large language models (LLMs). Although it reduces computation in training and inference by activating fewer parameters per token, it imposes additional challenges in achieving optimal computation efficiency under high memory and communication pressure, along with the complexity of handling the dynamic and sparse nature of the model. Here we introduce a new MoE inference solution, MetaShuffling, which enables us to efficiently deploy Llama 4 models for production inference.

Llama 4 Architecture

Llama 4 Scout and Maverick models are officially released. Scout and Maverick have a shared expert and 16 and 128 routed experts, respectively, with dropless token-choice routing and Top-1 selection for each MoE layer. In addition, both shared and routed experts use SwiGLU activation with 3 linear layers. Please refer to The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation for more information about the model.

Key Concept

There are multiple common solutions to handle the dynamism and sparsity problems introduced in MoE layers. Here we walk through different solutions for token-choice routing with Top-1 selection.

The above diagram shows the padding design. Each box represents a token; yellow and green represent valid tokens assigned to different routed experts, and grey represents padded tokens. Each row of boxes in the second step represents a different routed expert. Ti represents the i-th token from the current rank of the data parallel group.

  • Padding: In this approach, we pad activation to maximum sequence length for each expert and run a single batched matrix multiplication (BMM). It incurs:
    • Increased memory on holding paddings.
    • Increased latency on processing paddings. Note that it is possible to avoid processing padding through jagged kernels, but jagged kernels may also incur high overhead when the number of experts is large. 
  • Slicing: In this approach, we slice activation to exact sequence length for each expert and run multiple matrix multiplications (MM). It avoids the problems in padding, but it incurs:
    • Reduced kernel efficiency, caused by repeated kernel launches on small shapes.
    • Reduced device utilization, caused by frequent host and device synchronizations on dynamic shapes, plus extra kernel launch overheads, as it is incompatible with graph capturing mechanisms (e.g. CUDAGraph and torch.compile).

  • Concatenation: In this approach, we further concatenate the activations after slicing and run a single grouped matrix multiplication (GMM). It avoids the kernel efficiency problem in slicing, but still incurs:
    • Reduced device utilization, as it still requires host and device synchronization and remains incompatible with graph capturing mechanisms.

To further improve the solution, we propose a shuffling-based mechanism:

  • Shuffling: In this approach, we directly sort the tokens so that routed tokens are ordered by routed expert ID. By doing so, no padding or splitting is introduced, and tokens assigned to the same expert are stored together and can be processed together inside GroupedGEMM. It provides a dense model interface and avoids all the problems mentioned above (a minimal sketch follows this list).
    • No paddings as the activation remains a dense tensor.
    • No host and device synchronization, as the activation remains a static-shaped tensor.
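
As a minimal sketch of the idea in plain PyTorch (Top-1 routing, toy sizes; the production kernels described later keep all of this on device and fused):

import torch

T, E, D = 8, 4, 16                      # tokens, experts, feature dimension
scores = torch.randn(T, E)              # routing scores
expert_ids = scores.argmax(dim=-1)      # Top-1 expert per token

# Sort token indices by routed expert ID so tokens of the same expert are contiguous.
sorted_expert_ids, token_order = torch.sort(expert_ids, stable=True)
tokens = torch.randn(T, D)
shuffled_tokens = tokens[token_order]   # dense, no padding

# Per-expert token counts are the only metadata a GroupedGEMM needs.
tokens_per_expert = torch.bincount(expert_ids, minlength=E)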

We built an end-to-end MoE inference solution, MetaShuffling, based on this design.

Runtime Design

No Parallelism for Single-GPU Inference

Above is the overall runtime design for single-GPU inference without model parallelism. Note that, to optimize performance, the first and third linear layers of SwiGLU activation are merged together as GroupedGEMM13 / GEMM13.

  • Solid dark blue/orange boxes represent tensor core heavy kernels on routed/shared expert streams.
  • Solid light blue/orange boxes represent CUDA core or memory traffic-heavy kernels on routed/shared expert streams.
  • Red arrows represent data flows of activation tensors.
  • Green arrows represent data flows of metadata tensors.

All metadata tensors are placed on the device. There is no blocking device to host synchronization. All kernels are launched back to back without bubbles. The diagram shows data flows only, not a demonstration of actual profiling traces.

Kernel Interfaces And Data Flows

  • RoutingScores: A function or fused kernel that handles routing scores calculation.
    • Input: input_tokens: [T, D] (T: number of tokens; D: feature dimension); router_weights: [D, E] (E: number of experts); router_biases: [E];
    • Output: routing_scores: [T, E]; scaling_factors: [T, E];
  •  IndexShuffling: A fused kernel that handles shuffling and sorting of indices. We will introduce an optimized implementation in the Kernel Design section.
    • Input: routing_scores: [T, E]; K (threshold for top-k routing);
    • Output: routed_token_indices: [K * T]; routed_expert_indices: [K * T]; routed_token_counts_per_expert: [E];
  • GatherMul: A fused kernel that shuffles tokens based on sorted indices and scales them.
    • Input: input_tokens: [T, D]; routed_token_indices: [K * T]; routed_expert_indices: [K * T]; scaling_factors: [T, E];
    • Output: scaled_routed_tokens: [K * T, D]
  • GroupedGEMM: An optimized GroupedGEMM kernel that handles on-device shape information about batches along M dimension without restrictions. We will introduce an optimized implementation in the Kernel Design section.
    • Input: tokens: [K * T, D]; weights: [E, D, HD] (HD: hidden dimension); routed_token_counts_per_expert: [E];
    • Output: tokens: [K * T, HD]
  • GEMM: An optimized GEMM kernel. Similar interface to dense model.
  • NonLinearity: A fused kernel that handles non-linearity. Similar interface to dense model.
  • ScatterAdd: An optimized kernel that reverses token shuffling based on sorted indices and directly performs scatter add to shared expert output without materializing an unshuffled tensor.
    • Input: shared_output_tokens: [T, D]; routed_output_tokens: [K * T, D]; routed_token_indices: [K * T]; 
    • Output: combined_output_tokens: [T, D]
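
To make the data flow concrete, below is an unfused, eager-mode reference sketch that follows the interfaces above using plain PyTorch ops (Top-1 routing, no quantization, and assumed weight layouts w13: [E, D, 2*HD] and w2: [E, HD, D]). It deliberately gives up the properties the custom kernels provide (the per-expert loop even reads the token counts on the host) and is meant for illustration only.

import torch
import torch.nn.functional as F

def moe_forward_reference(input_tokens, router_weights, router_biases,
                          w13, w2, shared_output_tokens):
    # input_tokens: [T, D]; router_weights: [D, E]; router_biases: [E]
    # w13: [E, D, 2*HD]; w2: [E, HD, D]; shared_output_tokens: [T, D]
    T, D = input_tokens.shape
    E = router_weights.shape[1]

    # RoutingScores
    routing_scores = torch.sigmoid(input_tokens @ router_weights + router_biases)  # [T, E]

    # IndexShuffling (Top-1): sort tokens by routed expert
    expert_ids = routing_scores.argmax(dim=-1)                                      # [T]
    routed_expert_indices, routed_token_indices = torch.sort(expert_ids, stable=True)
    routed_token_counts_per_expert = torch.bincount(expert_ids, minlength=E)        # [E]

    # GatherMul: shuffle tokens and apply scaling factors
    scaling = routing_scores.gather(1, expert_ids.unsqueeze(1))[routed_token_indices]
    routed_tokens = input_tokens[routed_token_indices] * scaling                    # [T, D]

    # GroupedGEMM13 -> SwiGLU -> GroupedGEMM2, processed expert by expert here
    routed_out = torch.empty_like(routed_tokens)
    start = 0
    for e in range(E):
        n = int(routed_token_counts_per_expert[e])  # host read; avoided by the real kernels
        if n > 0:
            h = routed_tokens[start:start + n] @ w13[e]      # [n, 2*HD]
            gate, up = h.chunk(2, dim=-1)
            routed_out[start:start + n] = (F.silu(gate) * up) @ w2[e]
        start += n

    # ScatterAdd: unshuffle routed output into the shared expert output
    return shared_output_tokens.index_add(0, routed_token_indices, routed_out)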

Note that if quantization is applied, then activation quantization kernels are fused into the preceding non-GEMM kernels, which means fusing into GatherMul for GroupedGEMM13 and fusing into NonLinearity for GroupedGEMM2, etc.

Note that with a large K * T, the GatherMul and ScatterAdd operations could be further fused into the following/preceding GroupedGEMM operations, completed as the global-memory-to-shared-memory/register step in the prologue or the shared-memory-to-global-memory step in the epilogue. However, this adds additional challenges for overlapping with tensor core execution at the kernel design level. Besides, fusing ScatterAdd requires shared experts to complete before routed experts, which might not be a good design choice if these kernels can be used to hide AlltoAll latency.

Tensor Parallelism for Single-Host Inference

Above is the overall runtime design for single-host inference with tensor parallelism (TP). Compared to single-GPU inference, the addition is:

  • Solid light mint boxes represent network traffic-heavy communication kernels.

Still, all metadata tensors are placed on the device, and there is no device-to-host synchronization. All kernels are launched back to back without bubbles. The diagram shows data flows only, not a demonstration of actual profiling traces.

Workload Sharding and Additional Kernels

No additional custom kernel is introduced compared to the single-GPU inference use case. For GEMM, GroupedGEMM, and non-linearity kernels, the activations and weights are both sharded to 1/TP along different dimensions, and the computation/memory overhead is also reduced to 1/TP.

The final step is an AllReduce if only tensor parallelism is applied, or a ReduceScatter if tensor parallelism is combined with sequence parallelism.

Expert Parallelism for Multi-Host Inference

To enable expert parallelism (EP), we swap the data parallelism dimension outside the routed experts for the expert parallelism dimension inside the routed experts. Note that tensor parallelism can be further swapped with expert parallelism for better GEMM efficiency at the cost of increased routing imbalance risk, but we won't cover this design in this blog.

If expert parallelism is enabled with token-choice routing, then we must decide between using dense tensors with dynamic shapes and using padded (sparse) tensors with static shapes, because the number of tokens routed to different expert groups is dynamic.

  • We use dense tensors with dynamic shapes when eager mode is preferred: running unpadded AlltoAlls avoids wasting network traffic and memory space on padding.
  • We use sparse (padded) tensors with static shapes when graph mode is preferred: running with CUDAGraph avoids GPU bubbles caused by CPU launch overheads and device-to-host synchronization.

Note that wasted network traffic with padded activations can also be avoided using a custom AlltoAll implementation, but we won’t cover any topics on custom communication or communication and computation fusion kernels in this blog.

Above is the overall runtime design for multi-host inference with tensor parallelism and expert parallelism. Compared to single-host inference with tensor parallelism, the additions are:

  • Solid red arrows represent intra-node communication.
  • Solid purple arrows represent inter-node communication.

Kernel Interfaces And Data Flows

For added expert parallelism-based communication, we use 3-shot All2All communication to exchange shapes and tokens:

  • 1st A2A: Exchange on-device metadata tensor about number of tokens routed to each expert, which is `routed_token_counts_per_expert: [E]`, the output generated from IndexShuffling kernel.
  • 2nd A2A: Exchange tokens from a data-parallel layout to an expert-parallel layout, dispatching them to different EP ranks based on routing.
  • 3rd A2A: Exchange tokens from an expert-parallel layout back to a data-parallel layout, combining them from different EP ranks based on routing.

Besides, we added 2 additional shuffling kernels and 1 special scatter kernel:

  • CombineShuffling (Dense or Padded): Reshuffles received tokens from rank-first order to expert-first order. In the following, T* indicates the total number of tokens received from all peers, which can be further interpreted as a jagged dimension with shape information from the routed_token_counts_per_rank_per_expert tensor.
    • Input: received_tokens: [T*, D] (first ordered by dp ranks, then ordered by expert indices); routed_token_counts_per_rank_per_expert: [EP, E // EP];
    • Output: reshuffled_tokens: [T*, D] (first ordered by expert indices, then ordered by dp ranks); routed_token_counts_per_expert: [E // EP];
  • SplitShuffling (Dense or Padded): Reverse process of CombineShuffling. Reshuffles to-send tokens from expert-first order to rank-first order.
    • Input: reshuffled_tokens: [T*, D] (first ordered by expert indices, then ordered by dp ranks); routed_token_counts_per_rank_per_expert: [EP, E // EP];
    • Output: to_send_tokens: [T*, D] (first ordered by dp ranks, then ordered by expert indices);
  • ScatterAdd (Padded): Scatter adds valid tokens from padded tensors.
    • Input: shared_output_tokens: [T, D]; received_padded_routed_output_tokens: [EP, K*T, D];  routed_token_indices: [K * T];  routed_token_counts_per_expert: [E]; 
    • Output: combined_output_tokens: [T, D]

We will provide a better demonstration of the above kernels in detail in the `Padded Communication with Static Shapes In Graph Mode` section.
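
In the meantime, a rough, unoptimized reference for what CombineShuffling computes in the dense case may help. The real kernel works from on-device counts and never runs the host-side loop shown here.

import torch

def combine_shuffling_reference(received_tokens, counts_per_rank_per_expert):
    # received_tokens: [T*, D], rank-first layout (all tokens from rank 0, then rank 1, ...)
    # counts_per_rank_per_expert: [EP, E // EP]
    EP, E_local = counts_per_rank_per_expert.shape
    flat_counts = counts_per_rank_per_expert.reshape(-1)     # rank-major chunk sizes
    offsets = torch.cumsum(flat_counts, 0) - flat_counts     # start of each (rank, expert) chunk
    chunks = []
    for e in range(E_local):                                 # expert-first output order
        for r in range(EP):
            start = int(offsets[r * E_local + e])
            n = int(flat_counts[r * E_local + e])
            chunks.append(received_tokens[start:start + n])
    reshuffled_tokens = torch.cat(chunks, dim=0)             # [T*, D], expert-first
    routed_token_counts_per_expert = counts_per_rank_per_expert.sum(dim=0)  # [E // EP]
    return reshuffled_tokens, routed_token_counts_per_expert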

Unpadded Communication with Dynamic Shapes In Eager Mode

High-level diagram of runtime behavior. The actual runtime of different components might vary based on software and hardware.

Minimize Usage of Dynamic Shapes

As the routing is dynamic per MoE layer, the minimal amount of device/host synchronization required is once per layer. To achieve this, we delayed the D2H copy of `send_sizes` and concatenated it with `recv_sizes`, transferring both with a single D2H copy. This reduces the device/host synchronization to once per layer.
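
A minimal sketch of this trick, with tensor names following the text (in practice it sits inside the eager-mode dispatch path):

import torch

def read_sizes_with_one_sync(send_sizes: torch.Tensor, recv_sizes: torch.Tensor):
    # Concatenate on device, then do a single blocking device-to-host copy,
    # instead of synchronizing once for send_sizes and once for recv_sizes.
    sizes_host = torch.cat([send_sizes, recv_sizes]).cpu()
    split = send_sizes.numel()
    return sizes_host[:split].tolist(), sizes_host[split:].tolist()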

Minimize Negative Impact on Dynamic Shapes

To further hide the device/host synchronization overhead, we split the shared experts into 2 parts.

  • We dispatch the first part right after routing, but before the dispatch A2As. Then, when the device/host synchronization happens, the device is still kept busy running shared experts.
  • We dispatch the second part right after the MoE but before the combine A2A. This further helps overlap the second A2A.

Padded Communication with Static Shapes In Graph Mode

Minimize Usage of Padding

With a dropless token choice design, the maximum possible number of tokens routed to any single expert is T. However, if we group multiple experts together and place them on a single GPU through expert parallelism sharding, for TopK routing,

  • The maximum number of tokens routed to 1 expert is T.
  • The maximum number of tokens routed to 2 experts is 2 * T.
  • The maximum number of tokens routed to K experts is K * T.
  • The maximum number of tokens routed to K + 1 experts is, still, K * T. 

So the maximum number of tokens routed to an expert group of N experts will be capped at min(N, K) * T tokens. 

For Top1 routing, the number of tokens routed to an expert group of any size will always be capped at T tokens, and the minimal required memory to allocate and hold for dynamic tokens is EP * T tokens, as there are EP expert groups. 
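
Put as a small helper that reflects the reasoning above (not an actual API):

def padded_recv_buffer_tokens(T: int, K: int, experts_per_group: int, EP: int) -> int:
    # Each of the EP peer ranks can route at most min(experts_per_group, K) * T of its
    # T tokens to the local expert group; with Top-1 routing this reduces to EP * T.
    return EP * min(experts_per_group, K) * T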

To achieve the minimal required padding, we directly use AllGather to gather all active tokens from different EP ranks, and then split and reshuffle the routed tokens locally through custom kernels. The activation size is compressed to 1 / (E // EP), with corresponding reductions in memory and network traffic.

The above diagram shows the padding design. Each box represents a token, the blue / green color represents valid tokens with expert assignments and the grey color represents padded tokens. RiTj represents the j-th token from the i-th rank of expert parallelism group.

Minimize Negative Impact on Padding

Even though the padding is reduced to the minimal allowance, we also ensure that it only costs memory space (allocation) and network traffic (communication), without causing redundant computation (GroupedGEMM / NonLinear) or redundant memory bandwidth (CombineShuffling / SplitShuffling / ScatterAdd), by taking the on-device shape information `routed_token_counts_per_expert` or `routed_token_counts_per_rank_per_expert`.

Activation Conceptual Interpretation

Most importantly, skipping the padded tokens matters in both regimes:

  • When the total number of active tokens is small across all EP ranks, it avoids activating redundant experts in GroupedGEMM and causing extra memory traffic.
  • When the total number of active tokens is large across all EP ranks, it avoids converting GroupedGEMM from memory bound to compute bound.

CombineShuffling: The tokens assigned to the current EP rank are reshuffled from expert first order to rank first order right after AllGather. The tokens not assigned are not copied, and the remaining allocated memory space at the end of the tensor remains untouched.

SplitShuffling: The tokens assigned to the current EP rank are reshuffled from rank-first order to expert-first order right before AlltoAll. The tokens not assigned are not copied, and the reshuffled tensors have paddings stored in an interleaved fashion.

ScatterAdd (Padded): Each EP rank finally receives activations computed from all other ranks; it knows where the valid tokens and the padded tokens are, and reads only the valid tokens for the scatter_add.

Communication Deduplication

Different tensor parallelism ranks have the same activation before 1st GroupedGEMM and after 2nd GroupedGEMM, so the same tokens are exchanged across nodes repeatedly. 

We enabled communication deduplication to evenly distribute the inter-node communication workload to different ranks with extra intra-node communication introduced. Example of DP2/TP8/EP2:

  • For first AlltoAll in eager mode, split T*D inter-node AlltoAll to T*D/8 inter-node AlltoAll and T*D intra-node AllGather.

  • For second AlltoAll in eager / graph mode, split T*D inter-node AlltoAll to T*D/8 intra-node ReduceScatter and T*D/8 inter-node AlltoAll.

  • For first AllGather in graph mode, split 2*T*D inter-node AllGather to 2*T*D/8 inter-node AllGather and 2*T*D intra-node AllGather.

Kernel Design

We implemented more than 10 custom kernels to support the MetaShuffling MoE inference design in different use cases, running on both the Nvidia H100 GPU and the AMD MI300X GPU. We open sourced all computation kernels as PyTorch operators in the FBGEMM Generative AI Kernel Library. We hope they can help users efficiently serve Llama 4 models in their preferred framework and on their preferred accelerators, for example, vLLM / SGLang. In this blog, we focus on the 2 most interesting kernel designs that are key to improving inference performance: GroupedGEMM and IndexShuffling.

GroupedGEMM

We implemented Triton-based GroupedGEMM kernels for BF16 / FP16 / FP8 Rowwise.

Interface

def grouped_gemm_fp8_rowwise(
	x: torch.Tensor, 		# shape: [M, K]
	w: torch.Tensor, 		# shape: [G*N, K]
	m_sizes: torch.Tensor, 	# shape: [G]
	x_scales: torch.Tensor,	# shape: [M]
	w_scales: torch.Tensor, 	# shape: [G*N]
) -> torch.Tensor:               # shape: [M, N]
	...

The interface is quite similar to single GEMM in that it takes a single LHS, a single RHS tensor, and produces a single output. There is no dynamism or sparsity from the runtime point of view.

However, the kernel dynamically splits the M dimension of the LHS tensor using the data of `m_sizes` and statically splits the N dimension of the RHS tensor using the shape of `m_sizes`. This design has several advantages:

  • No additional padding or alignment requirement within different batches of Ms. So `m_sizes` can store any non-negative values as long as its total does not exceed `M`.
  • The `m_sizes` can be zero values to skip loading weights of unactivated experts.
  • The `m_sizes` can have a total sum less than `M` to skip computation on padded tokens at the end without extra overhead.
  • The `m_sizes`, or the splitting of the LHS activation, is known to the device but unknown to the host. So it supports dynamic routing information without incurring device-to-host synchronization. 
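
The semantics can be summarized with an unoptimized reference (BF16 case, scales omitted; unlike this sketch, the Triton kernel never reads `m_sizes` on the host):

import torch

def grouped_gemm_reference(x, w, m_sizes, N):
    # x: [M, K]; w: [G*N, K]; m_sizes: [G]; returns [M, N]
    M, K = x.shape
    G = m_sizes.numel()
    out = torch.zeros(M, N, dtype=x.dtype, device=x.device)
    start = 0
    for g in range(G):
        m = int(m_sizes[g])                        # host read, for illustration only
        if m > 0:
            # Group g multiplies its rows of x against its own N-row slice of w.
            out[start:start + m] = x[start:start + m] @ w[g * N:(g + 1) * N].T
        start += m
    # Rows beyond sum(m_sizes) stay zero: padded tokens at the end are skipped.
    return out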

Workload Partition

We adopt the persistent kernel design to launch 1 CTA per SM and have all the CTAs running through all the partitioned tiles in an interleaved fashion. Conceptually, the workload partition happens as follows.


from typing import List

def partition_workload(G: int, Ms: List[int], N: int):
	# Enumerate all output tiles across groups, N blocks, and M blocks.
	partitions = []
	for g in range(G):
		for n in range(0, N, BLOCK_N):
			for m in range(0, Ms[g], BLOCK_M):
				partitions.append((g, m, n))
	# Assign tiles to CTAs round-robin; 1 persistent CTA per SM.
	partitions_per_cta = [[] for _ in range(NUM_SMS)]
	for i, part in enumerate(partitions):
		partitions_per_cta[i % NUM_SMS].append(part)
	return partitions_per_cta

The partitions are dynamically calculated on the device side at runtime with a small overhead. However, by doing so, we can achieve:

  • Balanced workload across different SMs.
  • Small launching overhead as each SM will only launch 1 CTA.
  • High L2 cache hit rate. The order of workload partition makes sure the weights/activations will most likely be loaded once from HBM and cached on L2. Because usages of the same weight/activation tile will almost always happen concurrently / consecutively from different SMs.

Persistent Kernel with Warp Specialization

We adopted host-side tensor map-based loading of activations and weights, and optional device-side tensor map-based storing of outputs, to reduce memory transfer overhead on Hopper GPUs. With a contiguous storage format of activations, we can use a single host-side TMA (Tensor Memory Accelerator) descriptor to load activations and mask out the tokens that belong to other experts. However, we need to create multiple device-side TMA descriptors to store outputs without dynamic masking support.

We adopted a warp specialization-based kernel design so that the kernel runs in a truly persistent fashion, with each SM switching between 3 warp groups (1 producer and 2 consumers). This design keeps TMA engine, Tensor Core, and CUDA core execution overlapping with each other, utilizing asynchronous TMA instructions and WGMMA (Asynchronous Warpgroup Level Matrix Multiply-Accumulate) instructions with memory barriers on shared memory. We received tremendous help from Meta's Triton compiler team to enable it. Hiding the prologue and epilogue is only possible with warp specialization, as the traditional software pipelining approach cannot handle complicated control flow with pointer chasing.

IndexShuffling

We implemented CUDA / HIP-based index shuffling kernels.

Interface

def index_shuffling(
	scores: torch.Tensor,			        # shape: [T, E]
):
	token_counts: torch.Tensor = ...		# shape: [E]
	expert_indices: torch.Tensor = ...	        # shape: [T]
	token_indices: torch.Tensor = ...		# shape: [T]
	return token_counts, expert_indices, token_indices

The kernel takes routing scores of all tokens on all experts, figures out the specific expert each token is routed to, reorders the token indices such that all the tokens routed to the same expert are placed contiguously, and returns:

  • `token_counts`: The number of tokens routed to each expert. It is fed into the GroupedGEMM kernel discussed above.
  • `expert_indices`: The expert index each shuffled token belongs to. It is fed into the GatherMul kernel discussed above.
  • `token_indices`: The original token index each shuffled token belongs to. It is fed into the GatherMul and ScatterAdd kernels discussed above.

Cooperative Kernel

We adopted the cooperative kernel design, and split the kernel into 2 major phases, top-k reduction phase and bucket sort phase, with a global synchronization in the middle.

  • 1. Load scores
    • It loads a tile of routing scores from global memory (HBM) to shared memory (SMEM) and stores associated expert indices along with it on SMEM.
  • 2. Reduction
    • Performs TopK reduction on SMEM across the E dimension. For Llama 4 use cases, it performs ArgMax as the Top-1 reduction, which includes a 2D parallel tree reduction on the scores and associated expert indices on SMEM. Between different tree reduction phases:
      • All threads will concurrently work on reductions of multiple tokens on SMEM.
      • Each thread will sequentially work on reductions of multiple tokens on SMEM.
  • 3. Counting & Store Buffers:  
    • It iterates over all the tokens on the tile, reading the selected expert index from SMEM, storing it to a buffer (`buf_expert_index`) on HBM, and performing an `atomicAdd` operation on the output counter (`token_counts`) on HBM.
    • The interesting part is that the `atomicAdd` operation returns the value previously stored at the memory location, which indicates the place of the token within its group; we store this value in a buffer (`buf_local_token_index`) and use it to determine the global order among all the tokens.
  • Repeat 1-3 iteratively until all the tokens assigned to the CTA are processed.
  • 4. Global Synchronization: 
    • It performs an `atomicAdd` operation on the global counter on HBM. Afterwards, all CTAs wait until the global counter reaches the total number of tokens, with an `st.release` + `ld.acquire` barrier guarding the preceding store operations and following load operations to ensure correctness.
  • 5. Scan
    • It performs a simple load and prefix sum of `token_counts` and transforms it into `token_counts_cumsums` on SMEM.
  • 6. Load Buffer & Store Output
    • It iterates over all the tokens assigned to this CTA. For each token, it loads the expert index the token is assigned to from `buf_expert_index`, and then figures out the new token index after shuffling as a sum of
      • The number of tokens before it that belong to previous experts, using the SMEM tensor `token_counts_cumsums`.
      • The number of tokens before it that belong to the same expert, using the HBM tensor `buf_local_token_index`.
    • Afterwards, it simply does a direct store on `expert_indices` and `token_indices` output at the new token index after shuffling.
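
A small host-side emulation of the ordering trick in steps 3-6 may make it easier to follow; the sequential loop here stands in for the concurrent `atomicAdd` operations and is not the kernel itself:

import torch

def final_positions_reference(expert_ids: torch.Tensor, num_experts: int):
    # expert_ids: [T] long tensor with the selected expert per token.
    token_counts = torch.zeros(num_experts, dtype=torch.long)
    local_index = torch.empty_like(expert_ids)
    for t in range(expert_ids.numel()):       # emulates the per-token atomicAdd
        e = int(expert_ids[t])
        local_index[t] = token_counts[e]      # value before the increment = position within expert
        token_counts[e] += 1
    # Exclusive prefix sum gives each expert's base offset (step 5).
    token_counts_cumsums = torch.cumsum(token_counts, 0) - token_counts
    # New shuffled position = expert base offset + position within expert (step 6).
    new_token_index = token_counts_cumsums[expert_ids] + local_index
    return new_token_index, token_counts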

Performance

Example Kernel Performance

Our setup used H100 80GB SMX5 HBM3 700W SKUs, Python 3.12, and CUDA 12.8. The theoretical peak HBM memory bandwidth on a single H100 is 3.35 TB/s.

GroupedGEMM

Prefill Performance

The following table shows the prefill performance of the kernel on Llama 4 Scout and Maverick single host serving. The experiment setup assumes 16,384 total number of tokens and tensor parallelism sharding. 

Precision G M N K Time (us) Compute (TFlops) Memory (GB/s)
BF16 16 1,024 2,048 5,120 523.85 655.90 1,088.90
BF16 16 1,024 5,120 1,024 294.95 582.46 1,251.39
BF16 128 128 2,048 5,120 975.41 352.26 2,992.82
BF16 128 128 5,120 1,024 510.78 336.35 3,021.86
FP8 16 1,024 2,048 5,120 286.89 1,197.64 1,111.10
FP8 16 1,024 5,120 1,024 182.41 941.83 1,471.62
FP8 128 128 2,048 5,120 517.16 664.40 2,887.28
FP8 128 128 5,120 1,024 290.25 591.90 2,947.93

Note: G indicates the number of groups. M indicates the number of tokens per group. N indicates the output feature dimension per group. K indicates the input feature dimension per group. FP8 indicates FP8 rowwise scaling (per-token scaling on activation and per-channel scaling on weight) with fast accumulation. Quantization kernels are not included in benchmarking. Scales are not included in memory bandwidth calculation. Benchmarked with rotating buffers and CUDAGraphs.

Decode Performance

The following table shows the decode performance of the kernel on Llama 4 Scout and Maverick single host serving. The experiment setup assumes 128 total number of tokens and tensor parallelism sharding. 

Precision G M N K Time (us) Compute (TFlops) Memory (GB/s)
BF16 16 8 2,048 5,120 112.54 23.85 2,997.82
BF16 16 8 5,120 1,024 60.00 22.37 2,822.50
BF16 128 1 2,048 5,120 861.22 3.12 3,119.07
BF16 128 1 5,120 1,024 433.15 3.10 3,102.26
FP8 16 8 2,048 5,120 59.81 44.88 2,824.60
FP8 16 8 5,120 1,024 34.86 38.50 2,447.64
FP8 128 1 2,048 5,120 440.53 6.09 3,049.44
FP8 128 1 5,120 1,024 225.14 5.96 2,987.15

IndexShuffling

The following table shows the performance of the kernel on Llama 4 Scout and Maverick single-host serving, comparing against native PyTorch implementations.

Num Tokens Num Experts IndexShuffling (us) Unfused Ops (us) Speedup
128 16 5.08 36.71 722.53%
128 128 8.95 34.36 384.05%
2048 16 7.06 57.10 808.51%
2048 128 13.53 69.84 516.18%
4096 16 7.42 68.96 929.98%
4096 128 18.89 87.46 463.09%
8192 16 9.26 123.94 1339.16%
8192 128 30.56 165.21 540.71%

Note: Benchmarked with rotating buffers and CUDAGraphs.

Example Trace Analysis

Llama 4 Scout BF16 Decode

Here is an example decoding trace of Llama 4 Scout BF16 with 64 tokens using our MetaShuffling MoE inference solution. 

  • The total memory traffic of MoE is (ignoring activations):
    • Router: 5120x16x2 = 163,840 Bytes
    • Shared Experts: (2048×5120 + 5120×1024)x2=31,457,280 Bytes
    • Routed Experts: 16x(2048×5120 + 5120×1024)x2=503,316,480 Bytes
    • Total combined: 163,840 + 31,457,280 + 503,316,480=534,937,600 Bytes
    • The total execution time of MoE is 197.456 us.
    • The memory bandwidth achieved is 534,937,600 / (197.456 * 10^-6) = 2,709,148,367,231 Bytes/s ~= 2.71 TB/s, which is 80.90% of the 3.35 TB/s theoretical peak HBM memory bandwidth of H100 80GB SMX5 HBM3.

Here is a breakdown of different components of the trace.

First is the breakdown of routing and shared experts. Both components are running concurrently on 2 different streams to achieve better resource utilization.

For the router stream (marked with red boxes):

  • 1. Router GEMM: CuBLAS-based GEMM with a split-k design. It launches 2 kernels with the second kernel being the reduction kernel.
  • 2. Sigmoid (Router Activation): PyTorch native sigmoid.
  • 3. IndexShuffling: FBGEMM-based index shuffling with a cooperative kernel design. It can be viewed as a fusion of 3 operations, topk, bincount, and sort. It launches 2 kernels with the first kernel being the setup kernel.
  • 4. GatherMul: FBGEMM-based gather scaling. It can be viewed as a fusion of 3 operations: gather (tokens), gather (scores), and mul operations.

For the shared expert stream (marked with orange boxes):

  • 5. SharedExpert GEMM13: CuBLAS-based GEMM with a split-k design. It launches 2 kernels, with the second kernel being the reduction kernel.
  • 6. SwiGLU: Fused SwiGLU. It can be viewed as a fusion of 2 operations, sigmoid and mul.
  • 7. SharedExpert GEMM2: CuBLAS based GEMM.

Second is the breakdown of routed experts. This component is running exclusively on 1 stream to let the GroupedGEMM kernels take full ownership of all SMs.

For the routed expert stream (marked with red boxes):

  • 8. RoutedExperts GroupedGEMM13: FBGEMM-based GroupedGEMM with a persistent kernel design. 
  • 9. SwiGLU: Fused SwiGLU. As mentioned in 6.
  • 10. RoutedExperts GroupedGEMM2: FBGEMM-based GroupedGEMM with a persistent kernel design, fused with scatter add in the epilogue.

The decoding step is running on dense tensors with static shapes using CUDAGraph.

Llama 4 Maverick FP8 Prefill

Here is an example prefill trace of Llama 4 Maverick FP8 with 5,000 tokens using our MetaShuffling MoE inference solution. Note that FP8 rowwise scaling is used for the routed experts, and BF16 for the router and shared experts.

Compared to the decode trace:

  • It uses a single stream to avoid interactions of kernels between router and shared experts. As the kernels are working on a large enough problem size that can saturate compute resources, having additional overlapping simply causes contentions, especially on L2 cache.
  • It runs on dense tensors with static shapes, but in eager mode. As the kernel execution time is large enough and there is no device/host synchronization, the kernels can be launched back to back without bubbles.

Here we highlight the kernel differences between these two traces, other than execution time.

  • Router GEMM and SharedExpertGEMM13: CuBLAS-based GEMM without using split-k design. So it launches 1 kernel instead of 2.

  • 4. GatherMul (FP8 Rowwise Quantize): FBGEMM-based gather scaling and quantization. It can be viewed as a fusion of 8 operations: gather (tokens), gather (scores), mul, max, divide, mul, clamp, and typecast.
  • 9. SwiGLU (FP8 Rowwise Quantize): Fused SwiGLU and quantization. It can be viewed as a fusion of 7 operations: sigmoid and mul, max, divide, mul, clamp, and typecast.

Takeaway

We take the following steps progressively to optimize the inference performance of our MoE solution:

    • Improve device-level utilization by avoiding host and device synchronization.
    • Reduce wasted resources by removing paddings or avoiding processing paddings.
    • Reduce kernel launch and I/O overhead by aggressive kernel fusion.
    • Improve computation and memory efficiency by various kernel optimizations, pushing performance towards hardware limits.
    • Improve hardware component level utilization by concurrent execution of computation, memory traffic, or network traffic heavy kernels, but avoiding undesirable contention at the same time.

Single Host Serving

We benchmarked the single-host serving performance of Llama 4 Maverick and Llama 4 Scout with our internal MetaShuffling-based MoE inference stack using 1000 requests with random prompts. To compare against openly available data from vLLM and SGLang, we adopted the same experiment setup (i.e., Maverick with FP8, Scout with BF16, on an 8xH100 host with a maximum batch size of 64). Our setup used H100 80GB SMX5 HBM3 700W SKUs, Python 3.12, and CUDA 12.8. We open sourced all computation kernels used in the MetaShuffling MoE inference stack in FBGEMM, along with an example implementation of MetaShuffling as a reference.

To preserve the best accuracy, we benchmarked Llama 4 Maverick with FP8 precision on the routed experts and BF16 precision on attention linear layers, attention, shared experts, the router, and the KV cache.

To match external benchmark numbers, we benchmarked Llama 4 Scout with BF16 precision on all linear layers (attention linear, shared experts, router, and routed experts), attention, and KV cache.

Disclaimer: Here we use datapoints released from official channels as a reference. However, as all inference frameworks are rapidly evolving, they might already be outdated at the time of publication. We hope the community can continuously break records in improving the efficiency of serving Llama 4 models.

Acknowledgement

We would like to thank Jing Zhang, Ying Zhang, and Manman Ren for providing technical review and guidance.

We would also like to thank Bradley Davis, Yan Cui, Rengan Xu, Josh Fromm, Jiawen Liu, Sarunya Pumma, Jie Wang, Xinfeng Xie, Benson Ma, Michael Shu, Bingzhe Liu, Jingyi Yang, Min Si, Pavan Balaji, and Dhruva Kaushal for their contributions to this project.

Read More