Enforcing a hierarchical clustering of semantically related labels improves performance on rare “long-tail” classification categories.
NVIDIA and Microsoft Drive Innovation for Windows PCs in New Era of Generative AI
Generative AI — in the form of large language model (LLM) applications like ChatGPT, image generators such as Stable Diffusion and Adobe Firefly, and game rendering techniques like NVIDIA DLSS 3 Frame Generation — is rapidly ushering in a new era of computing for productivity, content creation, gaming and more.
At the Microsoft Build developer conference, NVIDIA and Microsoft today showcased a suite of advancements in Windows 11 PCs and workstations with NVIDIA RTX GPUs to meet the demands of generative AI.
More than 400 Windows apps and games already employ AI technology, accelerated by dedicated processors on RTX GPUs called Tensor Cores. Today’s announcements, which include tools to develop AI on Windows PCs, frameworks to optimize and deploy AI, and driver performance and efficiency improvements, will empower developers to build the next generation of Windows apps with generative AI at their core.
“AI will be the single largest driver of innovation for Windows customers in the coming years,” said Pavan Davuluri, corporate vice president of Windows silicon and system integration at Microsoft. “By working in concert with NVIDIA on hardware and software optimizations, we’re equipping developers with a transformative, high-performance, easy-to-deploy experience.”
Develop Models With Windows Subsystem for Linux
AI development has traditionally taken place on Linux, requiring developers to either dual-boot their systems or use multiple PCs to work in their AI development OS while still accessing the breadth and depth of the Windows ecosystem.
Over the past few years, Microsoft has been building a powerful capability to run Linux directly within the Windows OS, called Windows Subsystem for Linux (WSL). NVIDIA has been working closely with Microsoft to deliver GPU acceleration and support for the entire NVIDIA AI software stack inside WSL. Now developers can use Windows PCs for all their local AI development needs, with support for GPU-accelerated deep learning frameworks on WSL.
With NVIDIA RTX GPUs delivering up to 48GB of RAM in desktop workstations, developers can now work with models on Windows that were previously only available on servers. The large memory also improves the performance and quality for local fine-tuning of AI models, enabling designers to customize them to their own style or content. And because the same NVIDIA AI software stack runs on NVIDIA data center GPUs, it’s easy for developers to push their models to Microsoft Azure Cloud for large training runs.
Rapidly Optimize and Deploy Models
With trained models in hand, developers need to optimize and deploy AI for target devices.
Microsoft released the Microsoft Olive toolchain for optimization and conversion of PyTorch models to ONNX, enabling developers to automatically tap into GPU hardware acceleration such as RTX Tensor Cores. Developers can optimize models via Olive and ONNX, and deploy Tensor Core-accelerated models to PC or cloud. Microsoft continues to invest in making PyTorch and related tools and frameworks work seamlessly with WSL to provide the best AI model development experience.
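As a rough illustration of the ONNX deployment path that Olive targets (this is a sketch of the underlying PyTorch-to-ONNX flow, not the Olive toolchain itself, and the model and tensor names are placeholders), a PyTorch model can be exported to ONNX and then run through ONNX Runtime's DirectML execution provider, which dispatches work to the GPU on Windows:
import numpy as np
import onnxruntime as ort
import torch
import torchvision

# Export a PyTorch model to ONNX; Olive automates and tunes this kind of conversion.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx", input_names=["input"], output_names=["output"])

# Run the exported model on the GPU via DirectML (requires the onnxruntime-directml package).
session = ort.InferenceSession("model.onnx", providers=["DmlExecutionProvider", "CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
print(outputs[0].shape)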
Improved AI Performance, Power Efficiency
Once deployed, generative AI models demand incredible inference performance. RTX Tensor Cores deliver up to 1,400 Tensor TFLOPS for AI inferencing. Over the last year, NVIDIA has worked to improve DirectML performance to take full advantage of RTX hardware.
On May 24, we’ll release our latest optimizations in the Release 532.03 drivers, which combine with Olive-optimized models to deliver big boosts in AI performance. Using an Olive-optimized version of the Stable Diffusion text-to-image generator with the popular Automatic1111 distribution, performance more than doubles with the new driver.
With AI coming to nearly every Windows application, efficiently delivering inference performance is critical — especially for laptops. Coming soon, NVIDIA will introduce new Max-Q low-power inferencing for AI-only workloads on RTX GPUs. It optimizes Tensor Core performance while keeping power consumption of the GPU as low as possible, extending battery life and maintaining a cool, quiet system. The GPU can then dynamically scale up for maximum AI performance when the workload demands it.
Join the PC AI Revolution Now
Top software developers — like Adobe, DxO, ON1 and Topaz — have already incorporated NVIDIA AI technology with more than 400 Windows applications and games optimized for RTX Tensor Cores.
“AI, machine learning and deep learning power all Adobe applications and drive the future of creativity. Working with NVIDIA we continuously optimize AI model performance to deliver the best possible experience for our Windows users on RTX GPUs.” — Ely Greenfield, CTO of digital media at Adobe
“NVIDIA is helping to optimize our WinML model performance on RTX GPUs, which is accelerating the AI in DxO DeepPRIME, as well as providing better denoising and demosaicing, faster.” — Renaud Capolunghi, senior vice president of engineering at DxO
“Working with NVIDIA and Microsoft to accelerate our AI models running in Windows on RTX GPUs is providing a huge benefit to our audience. We’re already seeing 1.5x performance gains in our suite of AI-powered photography editing software.” — Dan Harlacher, vice president of products at ON1
“Our extensive work with NVIDIA has led to improvements across our suite of photo- and video-editing applications. With RTX GPUs, AI performance has improved drastically, enhancing the experience for users on Windows PCs.” — Suraj Raghuraman, head of AI engine development at Topaz Labs
NVIDIA and Microsoft are making several resources available for developers to test drive top generative AI models on Windows PCs. An Olive-optimized version of the Dolly 2.0 large language model is available on Hugging Face. And a PC-optimized version of NVIDIA NeMo large language model for conversational AI is coming soon to Hugging Face.
Developers can also learn how to optimize their applications end-to-end to take full advantage of GPU-acceleration via the NVIDIA AI for accelerating applications developer site.
The complementary technologies behind Microsoft’s Windows platform and NVIDIA’s dynamic AI hardware and software stack will help developers quickly and easily develop and deploy generative AI on Windows 11.
Microsoft Build runs through Thursday, May 25. Tune in to learn more about shaping the future of work with AI.
No Programmers? No Problem: READY Robotics Simplifies Robot Coding, Rollouts
Robotics hardware traditionally requires programmers to deploy it. READY Robotics wants to change that with its “no code” software aimed at people working in manufacturing who don’t have programming skills.
The Columbus, Ohio, startup is a spinout of robotics research from Johns Hopkins University. Kel Guerin was a PhD candidate there leading this research when he partnered with Benjamin Gibbs, then at Johns Hopkins Technology Ventures, to secure funding and launch the company, which Gibbs now leads as CEO.
“There was this a-ha moment where we figured out that we could take these types of visual languages that are very easy to understand and use them for robotics,” said Guerin, who’s now chief innovation officer at the startup.
READY’s “no code” ForgeOS operating system is designed to enable anyone to program any type of robot hardware or automation device. ForgeOS works seamlessly with plug-ins for most major robot hardware and, similar to other operating systems like Android, it allows running third-party apps and plugins, providing a robust ecosystem of partners and developers working to make robots more capable, said Guerin.
Implementing apps in robotics allows new capabilities to be added to a robotic system in a few clicks, improving user experience and usability. Users can install their own apps, such as Task Canvas, which provides an intuitive building-block programming interface inspired by Scratch, the simple block-based visual language for kids developed at the MIT Media Lab.
Task Canvas allows users to show the actions of the robot, as well as all the other devices in an automation cell (such as grippers, programmable logic controllers, and machine tools) as blocks in a flow chart. The user can easily create powerful logic by tying these blocks together — without writing a single line of code. The interface offers nonprogrammers a more “drag-and-drop” experience for programming and deploying robots, whether working directly on the factory floor with real robots on a tablet device or with access to simulation from Isaac Sim, powered by NVIDIA Omniverse.
Robot System Design in Simulation for Real-World Deployments
READY is making robotics system design easier for nonprogrammers, helping to validate robots and systems for accelerated deployments.
The company is developing Omniverse Extensions — Omniverse kit applications based on Isaac Sim — and can deploy them on the cloud. It uses Omniverse Nucleus — the platform’s database and collaboration engine — in the cloud as well.
Isaac Sim is an application framework that enables simulation training for testing out robots in virtual manufacturing lines before deployment into the real world.
“Bigger companies are moving to a sim-first approach to automation because these systems cost a lot of money to install. They want to simulate them first to make sure it’s worth the investment,” said Guerin.
The startup charges users of its platform licensing per software seat and also offers support services to help roll out and develop systems.
It’s a huge opportunity. Roughly 90 percent of the world’s factories haven’t yet embraced automation, which is a trillion-dollar market.
READY is a member of NVIDIA Inception, a free program that provides startups with technical training, go-to-market support and AI platform guidance.
From Industrial Automation Giants to Stanley Black & Decker
The startup operates in an ecosystem of world-leading industrial automation providers, and these global partners are actively developing integrations with platforms like NVIDIA Omniverse and are investing in READY, said Guerin.
“Right now we are starting to work with large enterprise customers who want to automate but they can’t find the expertise to do it,” he said.
Stanley Black & Decker, a global supplier of tools, is relying on READY to automate machines, including CNC lathes and mills.
Robotic automation had been hard to deploy in its factories until Stanley Black & Decker started using READY’s ForgeOS with its Station setup, which makes it possible to deploy robots in a day.
Creating Drag-and-Drop Robotic Systems in Simulation
READY is putting simulation capabilities into the hands of nonprogrammers, who can learn its Task Canvas interface for drag-and-drop programming of industrial robots in about an hour, according to the company.
The company also runs READY Academy, which offers a catalog of free training for manufacturing professionals to learn the skills to design, deploy, manage and troubleshoot robotic automation systems.
“For potential customers interested in our technology, being able to try it out with a robot simulated in Omniverse before they get their hands on the real thing — that’s something we’re really excited about,” said Guerin.
Learn more about NVIDIA Isaac Sim, Jetson Orin, Omniverse Enterprise.
GPT-4 + Stable-Diffusion = ?: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
TL;DR: Text Prompt -> LLM -> Intermediate Representation (such as an image layout) -> Stable Diffusion -> Image.
Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, despite their impressive capabilities, diffusion models, such as Stable Diffusion, often struggle to accurately follow the prompts when spatial or common sense reasoning is required.
The following figure lists four scenarios in which Stable Diffusion falls short in generating images that accurately correspond to the given prompts, namely negation, numeracy, attribute assignment, and spatial relationships. In contrast, our method, LLM-grounded Diffusion (LMD), delivers much better prompt understanding in text-to-image generation in those scenarios.
Figure 1: LLM-grounded Diffusion enhances the prompt understanding ability of text-to-image diffusion models.
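Conceptually, the pipeline in the TL;DR can be sketched as two stages; the function names and layout format below are illustrative assumptions, not the authors’ actual implementation:
def llm_generate_layout(prompt: str) -> dict:
    """Stage 1 (assumed interface): ask an LLM (e.g. GPT-4) to turn the prompt into an
    intermediate representation such as a background caption plus (phrase, bounding box) pairs."""
    # Hard-coded example output for illustration only.
    return {
        "background": "a realistic photo of a wooden table",
        "objects": [("a red apple", (0.10, 0.45, 0.35, 0.75)),
                    ("a green apple", (0.60, 0.45, 0.85, 0.75))],
    }

def layout_grounded_diffusion(layout: dict):
    """Stage 2 (assumed interface): condition Stable Diffusion on the layout so that each
    phrase is generated inside its bounding box."""
    raise NotImplementedError("Placeholder for a layout-conditioned diffusion model.")

layout = llm_generate_layout("a red apple to the left of a green apple on a wooden table")
# image = layout_grounded_diffusion(layout)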
Privateer Space: The Final Frontier in AI Space Junk Management
It’s time to take out the space trash.
In this episode of the NVIDIA AI Podcast, host Noah Kravitz dives into an illuminating conversation with Alex Fielding, co-founder and CEO of Privateer Space.
Fielding is a tech industry veteran, having previously worked alongside Apple co-founder Steve Wozniak on several projects, and holds a deep expertise in engineering, robotics, machine learning and AI.
Privateer Space, Fielding’s latest venture, aims to address one of the most daunting challenges facing our world today: space debris.
The company is creating a data infrastructure to monitor and clean up space debris, ensuring sustainable growth for the budding space economy. In essence, they’re the sanitation engineers of the cosmos.
Privateer Space is also a part of NVIDIA Inception, a free program that offers go-to-market support, expertise and technology for AI startups.
During the podcast, Fielding shares the genesis of Privateer Space, his journey from Apple to the space industry, and his subsequent work on communication between satellites at different altitudes.
He also addresses the severity of space debris, explaining how every launch adds more debris, including minute yet potentially dangerous fragments like frozen propellant and paint chips.
Tune in to the podcast for more on what the future holds for the intersection of AI and space.
You Might Also Like
Jules Anh Tuan Nguyen Explains How AI Lets Amputee Control Prosthetic Hand, Video Games
A postdoctoral researcher at the University of Minnesota discusses his efforts to allow amputees to control their prosthetic limb — right down to the finger motions — with their minds.
Overjet’s Wardah Inam on Bringing AI to Dentistry
Overjet, a member of NVIDIA Inception, is moving fast to bring AI to dentists’ offices. Dr. Wardah Inam, CEO of the company, discusses using AI to improve patient care.
Immunai CTO and Co-Founder Luis Voloch on Using Deep Learning to Develop New Drugs
Luis Voloch, co-founder and chief technology officer of Immunai, talks about tackling the challenges of the immune system with a machine learning and data science mindset.
Subscribe to the AI Podcast: Now Available on Amazon Music
The AI Podcast is now available through Amazon Music.
In addition, get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.
Make the AI Podcast better. Have a few minutes to spare? Fill out this listener survey.
Instruction fine-tuning for FLAN T5 XL with Amazon SageMaker Jumpstart
Generative AI is in the midst of a period of stunning growth. Increasingly capable foundation models are being released continuously, with large language models (LLMs) being one of the most visible model classes. LLMs are models composed of billions of parameters trained on extensive corpora of text, up to hundreds of billions or even a trillion tokens. These models have proven extremely effective for a wide range of text-based tasks, from question answering to sentiment analysis.
The power of LLMs comes from their capacity to learn and generalize from extensive and diverse training data. The initial training of these models is performed with a variety of objectives, supervised, unsupervised, or hybrid. Text completion or imputation is one of the most common unsupervised objectives: given a chunk of text, the model learns to accurately predict what comes next (for example, predict the next sentence). Models can also be trained in a supervised fashion using labeled data to accomplish a set of tasks (for example, is this movie review positive, negative, or neutral). Whether the model is trained for text completion or some other task, it is frequently not the task customers want to use the model for.
To improve the performance of a pre-trained LLM on a specific task, we can tune the model using examples of the target task in a process known as instruction fine-tuning. Instruction fine-tuning uses a set of labeled examples in the form of {prompt, response} pairs to further train the pre-trained model in adequately predicting the response given the prompt. This process modifies the weights of the model.
This post describes how to perform instruction fine-tuning of an LLM, namely FLAN T5 XL, using Amazon SageMaker Jumpstart. We demonstrate how to accomplish this using both the Jumpstart UI and a notebook in Amazon SageMaker Studio. You can find the accompanying notebook in the amazon-sagemaker-examples GitHub repository.
Solution overview
The target task in this post is to, given a chunk of text in the prompt, return questions that are related to the text but can’t be answered based on the information it contains. This is a useful task to identify missing information in a description or identify whether a query needs more information to be answered.
FLAN T5 models are instruction fine-tuned on a wide range of tasks to increase the zero-shot performance of these models on many common tasks[1]. Additional instruction fine-tuning for a particular customer task can further increase the accuracy of these models, especially if the target task wasn’t previously used to train a FLAN T5 model, as is the case for our task.
In our example task, we’re interested in generating relevant but unanswered questions. To this end, we use a subset of version 2 of the Stanford Question Answering Dataset (SQuAD2.0)[2] to fine-tune the model. This dataset contains questions posed by human annotators on a set of Wikipedia articles. In addition to questions with answers, SQuAD2.0 contains about 50,000 unanswerable questions. Such questions are plausible but can’t be directly answered from the articles’ content. We only use the unanswerable questions. Our data is structured as a JSON Lines file, with each line containing a context and a question.
Prerequisites
To get started, all you need is an AWS account in which you can use Studio. You will need to create a user profile for Studio if you don’t already have one.
Fine-tune FLAN-T5 with the Jumpstart UI
To fine-tune the model with the Jumpstart UI, complete the following steps:
- On the SageMaker console, open Studio.
- Under SageMaker Jumpstart in the navigation pane, choose Models, notebooks, solutions.
You will see a list of foundation models, including FLAN T5 XL, which is marked as fine-tunable.
- Choose View model.
- Under Data source, you can provide the path to your training data. The source for the data used in this post is provided by default.
- You can keep the default value for the deployment configuration (including instance type), security, and the hyperparameters, but you should increase the number of epochs to at least three to get good results.
- Choose Train to train the model.
You can track the status of the training job in the UI.
- When training is complete (after about 53 minutes in our case), choose Deploy to deploy the fine-tuned model.
After the endpoint is created (a few minutes), you can open a notebook and start using your fine-tuned model.
Fine-tune FLAN-T5 using a Python notebook
Our example notebook shows how to use Jumpstart and SageMaker to programmatically fine-tune and deploy a FLAN T5 XL model. It can be run in Studio or locally.
In this section, we first walk through some general setup. Then you fine-tune the model using the SQuADv2 datasets. Next, you deploy the pre-trained version of the model behind a SageMaker endpoint, and do the same with the fine-tuned model. Finally, you can query the endpoints and compare the quality of the output of the pre-trained and fine-tuned model. You will find that the output of the fine-tuned model is of much higher quality.
Set up prerequisites
Begin by installing and upgrading the necessary packages. Restart the kernel after running the following code:
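The following is a minimal sketch; the exact package list in the official example notebook may differ.
!pip install --upgrade pip
!pip install --upgrade sagemaker datasets ipywidgets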
Next, obtain the execution role associated with the current notebook instance:
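A sketch using the standard SageMaker SDK helpers:
import sagemaker
from sagemaker import get_execution_role

aws_role = get_execution_role()  # IAM execution role attached to this Studio notebook
aws_region = sagemaker.Session().boto_region_name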
You can define a convenient drop-down menu that will list the model sizes available for fine-tuning:
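A sketch using ipywidgets; the model IDs listed here are assumed JumpStart identifiers for the FLAN T5 family, and the notebook builds this list programmatically.
import ipywidgets as widgets
from IPython.display import display

# Assumed JumpStart model IDs for the FLAN T5 family (for illustration).
model_ids = [
    "huggingface-text2text-flan-t5-small",
    "huggingface-text2text-flan-t5-base",
    "huggingface-text2text-flan-t5-xl",
]
model_dropdown = widgets.Dropdown(options=model_ids, value="huggingface-text2text-flan-t5-xl", description="Model:")
display(model_dropdown)

model_id, model_version = model_dropdown.value, "*"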
Jumpstart automatically retrieves appropriate training and inference instance types for the model that you chose:
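A sketch using the JumpStart utilities in the SageMaker SDK:
from sagemaker.instance_types import retrieve_default

training_instance_type = retrieve_default(model_id=model_id, model_version=model_version, scope="training")
inference_instance_type = retrieve_default(model_id=model_id, model_version=model_version, scope="inference")
print(training_instance_type, inference_instance_type)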
You’re now ready to start fine-tuning.
Retrain the model on the fine-tuning dataset
After your setup is complete, complete the following steps:
Use the following code to retrieve the URI for the artifacts needed:
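A minimal sketch following the usual JumpStart retrieval pattern for the training container, script, and base model artifacts:
from sagemaker import image_uris, model_uris, script_uris

train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")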
The training data is located in a public Amazon Simple Storage Service (Amazon S3) bucket.
Use the following code to point to the location of the data and set up the output location in a bucket in your account:
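A minimal sketch; the path of the public dataset below is a placeholder, and the accompanying notebook provides the real location.
import sagemaker

original_data_location = "s3://example-public-bucket/squad2/"  # placeholder for the public SQuAD2.0 data path
output_bucket = sagemaker.Session().default_bucket()
output_prefix = "jumpstart-flan-t5-finetune"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"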
The original data is not in a format that corresponds to the task for which you are fine-tuning the model, so you can reformat it:
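A sketch that builds a {context, question} JSON Lines file from the unanswerable questions in SQuAD2.0 via the Hugging Face datasets library; the field names are the ones assumed by this walkthrough.
import json

import sagemaker
from datasets import load_dataset

# Keep only questions flagged as unanswerable (no answer text) and write them as JSON Lines.
squad_v2 = load_dataset("squad_v2", split="train")
unanswerable = squad_v2.filter(lambda ex: len(ex["answers"]["text"]) == 0)

with open("data.jsonl", "w") as f:
    for ex in unanswerable:
        f.write(json.dumps({"context": ex["context"], "question": ex["question"]}) + "\n")

# Upload the reformatted file to your bucket so the training job can read it.
train_data_location = sagemaker.Session().upload_data("data.jsonl", bucket=output_bucket, key_prefix=f"{output_prefix}/train")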
Now you can define some hyperparameters for the training:
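A sketch starting from the JumpStart default hyperparameters, with the epoch count raised as recommended above:
from sagemaker import hyperparameters

hp = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)
hp["epochs"] = "3"  # at least three epochs for good results
print(hp)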
You are now ready to launch the training job:
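A sketch of launching the training job with the SageMaker Estimator; the entry point file name is an assumption based on the usual JumpStart script bundle.
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

training_job_name = name_from_base("jumpstart-flan-t5-finetune")

estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",  # assumed entry point in the JumpStart training script bundle
    instance_count=1,
    instance_type=training_instance_type,
    hyperparameters=hp,
    output_path=s3_output_location,
)
estimator.fit({"training": train_data_location}, job_name=training_job_name)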
Depending on the size of the fine-tuning data and model chosen, the fine-tuning could take up to a couple of hours.
You can monitor performance metrics such as training and validation loss using Amazon CloudWatch during training. Conveniently, you can also fetch the most recent snapshot of metrics by running the following code:
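A sketch using the SageMaker analytics helper:
from sagemaker.analytics import TrainingJobAnalytics

metrics_df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
print(metrics_df.tail())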
When the training is complete, you have a fine-tuned model at model_uri. Let’s use it!
You can create two inference endpoints: one for the original pre-trained model, and one for the fine-tuned model. This allows you to compare the output of both versions of the model. In the next step, you deploy an inference endpoint for the pre-trained model. Then you deploy an endpoint for your fine-tuned model.
Deploy the pre-trained model
Let’s start by deploying the pre-trained model. First, retrieve the inference Docker image URI; this is the base Hugging Face container image. Use the following code:
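A sketch following the JumpStart retrieval pattern for the inference image:
from sagemaker import image_uris

deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="inference",
    instance_type=inference_instance_type,
)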
You can now create the endpoint and deploy the pre-trained model. Note that you need to pass the Predictor class when deploying the model through the Model class to be able to run inference through the SageMaker API. See the following code:
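A sketch of the deployment; the inference entry point name is an assumption based on the usual JumpStart script bundle.
from sagemaker import model_uris, script_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

pretrained_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="inference")
deploy_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="inference")

pretrained_name = name_from_base("jumpstart-flan-t5-pretrained")
pretrained_model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    entry_point="inference.py",  # assumed entry point in the JumpStart inference script bundle
    model_data=pretrained_model_uri,
    role=aws_role,
    predictor_cls=Predictor,  # needed to run inference through the SageMaker API
    name=pretrained_name,
)
pretrained_predictor = pretrained_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    endpoint_name=pretrained_name,
)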
The endpoint creation and model deployment can take a few minutes; after that, your endpoint is ready to receive inference calls.
Deploy the fine-tuned model
Let’s deploy the fine-tuned model to its own endpoint. The process is almost identical to the one we used earlier for the pre-trained model. The only difference is that we use the fine-tuned model name and URI:
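The same sketch as above, pointing at the fine-tuned artifacts produced by the training job:
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

finetuned_name = name_from_base("jumpstart-flan-t5-finetuned")
finetuned_model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    entry_point="inference.py",  # assumed entry point, as above
    model_data=estimator.model_data,  # S3 location of the fine-tuned model artifacts
    role=aws_role,
    predictor_cls=Predictor,
    name=finetuned_name,
)
finetuned_predictor = finetuned_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    endpoint_name=finetuned_name,
)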
When this process is complete, both pre-trained and fine-tuned models are deployed behind their own endpoints. Let’s compare their outputs.
Generate output and compare the results
Define some utility functions to query the endpoint and parse the response:
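A sketch of the utilities; the payload and response field names (such as text_inputs and generated_texts) are assumptions about the text2text container’s JSON contract.
import json

def query_endpoint(predictor, text, max_length=150, num_return_sequences=3):
    """Send a prompt to an endpoint and return the list of generated questions."""
    payload = {
        "text_inputs": text,
        "max_length": max_length,
        "num_return_sequences": num_return_sequences,
        "do_sample": True,
        "top_k": 50,
    }
    response = predictor.predict(
        json.dumps(payload).encode("utf-8"),
        {"ContentType": "application/json", "Accept": "application/json"},
    )
    return json.loads(response)["generated_texts"]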
In the next code snippet, we define the prompt and the test data. The prompt describes our target task, which is to generate questions that are related to the provided text but can’t be answered based on it.
The test data consists of three different paragraphs: one on the Australian city of Adelaide from the first two paragraphs of its Wikipedia page, one regarding Amazon Elastic Block Store (Amazon EBS) from the Amazon EBS documentation, and one on Amazon Comprehend from the Amazon Comprehend documentation. We expect the model to identify questions related to these paragraphs but that can’t be answered with the information provided therein.
You can now test the endpoints using the example articles.
Test data: Adelaide
We use the following context:
The pre-trained model response is as follows:
The fine-tuned model responses are as follows:
Test data: Amazon EBS
We use the following context:
The pre-trained model responses are as follows:
The fine-tuned model responses are as follows:
Test data: Amazon Comprehend
We use the following context:
The pre-trained model responses are as follows:
The fine-tuned model responses are as follows:
The difference in output quality between the pre-trained model and the fine-tuned model is stark. The questions provided by the fine-tuned model touch on a wider range of topics. They are systematically meaningful questions, which isn’t always the case for the pre-trained model, as illustrated with the Amazon EBS example.
Although this doesn’t constitute a formal and systematic evaluation, it’s clear that the fine-tuning process has improved the quality of the model’s responses on this task.
Clean up
Lastly, remember to clean up and delete the endpoints:
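A sketch that deletes both deployments:
# Delete the SageMaker models and endpoints created above.
pretrained_predictor.delete_model()
pretrained_predictor.delete_endpoint()
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()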
Conclusion
In this post, we showed how to use instruction fine-tuning with FLAN T5 models using the Jumpstart UI or a Jupyter notebook running in Studio. We provided code explaining how to retrain the model using data for the target task and deploy the fine-tuned model behind an endpoint. The target task in this post was to identify questions that relate to a chunk of text provided in the input but can’t be answered based on the information provided in that text. We demonstrated that a model fine-tuned for this specific task returns better results than a pre-trained model.
Now that you know how to instruction fine-tune a model with Jumpstart, you can create powerful models customized for your application. Gather some data for your use case, upload it to Amazon S3, and use either the Studio UI or the notebook to tune a FLAN T5 model!
References
[1] Chung, Hyung Won, et al. “Scaling Instruction-Finetuned Language Models.” arXiv preprint arXiv:2210.11416 (2022).
[2] Rajpurkar, Pranav, Robin Jia, and Percy Liang. “Know What You Don’t Know: Unanswerable Questions for SQuAD.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018.

About the authors
Laurent Callot is a Principal Applied Scientist and manager at AWS AI Labs who has worked on a variety of machine learning problems, from foundational models and generative AI to forecasting, anomaly detection, causality, and AI Ops.
Andrey Kan is a Senior Applied Scientist at AWS AI Labs with interests and experience in different fields of machine learning. These include research on foundation models, as well as ML applications for graphs and time series.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from the University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference and has published many papers at NeurIPS, ICML, ICLR, ACL, and EMNLP, and in JMLR.
Baris Kurt is an Applied Scientist at AWS AI Labs. His interests are in time series anomaly detection and foundation models. He loves developing user-friendly ML systems.
Jonas Kübler is an Applied Scientist at AWS AI Labs. He is working on foundation models with the goal to facilitate use-case specific applications.
How Amazon’s delivery network bridges cities and rural areas
Amazon scientists use a combination of data, machine learning, and algorithm design to meet last-mile delivery challenges.
Helping more people stay safe with flood forecasting
AI-powered Flood Hub is expanding to nearly 80 countries worldwide.
Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0
As part of the PyTorch 2.0 release, an accelerated implementation of the attention mechanism developed as part of the “Better Transformer” project (and known in PyTorch as Accelerated Transformers) has been added natively into PyTorch as torch.nn.functional.scaled_dot_product_attention. This implementation leverages fused kernels from FlashAttention and memory-efficient attention, and supports both training and inference.
We also release a notebook showcasing an example of this integration here.
After seeing 20-30% speedups at inference for diffusion models, we went ahead and implemented an integration with 🤗 Transformers models through the 🤗 Optimum library. Similar to the previous integration for encoder models, the integration replaces modules from Transformers with efficient implementations that use torch.nn.functional.scaled_dot_product_attention. The usage is as follows:
import torch
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained("gpt2-large", torch_dtype=torch.float16)

model = BetterTransformer.transform(model)

# do your inference or training here

# if training and you want to save the model
model = BetterTransformer.reverse(model)
model.save_pretrained("fine_tuned_model")
model.push_to_hub("fine_tuned_model")
Summarizing our findings below about torch.nn.functional.scaled_dot_product_attention:
- It is most useful for fitting larger models, sequence lengths, or batch sizes when training on given hardware.
- Memory footprint savings on GPU during training range from 20% to 110%+.
- Speedups during training range from 10% to 70%.
- Speedups during inference range from 5% to 20%.
- Standalone, for small head dimensions, scaled_dot_product_attention speedups go up to 3x, and memory savings go as high as 40x (depending on the sequence length).
You may be surprised by the wide range of memory savings and speedups. In this blog post, we discuss our benchmarks, where this feature shines and upcoming improvements in future PyTorch releases.
In the next release of transformers, you will just need to install the proper version of optimum and run the following to convert your model using the BetterTransformer API:
model = model.to_bettertransformer()
You can already try this feature out by installing transformers from source.
Benchmark and usage with 🤗 Transformers
torch.nn.functional.scaled_dot_product_attention is usable with any architecture that uses standard attention, and replaces the boilerplate code:
# native scaled_dot_product_attention is equivalent to the following:
import math
import torch

def eager_sdpa(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None):
    L, S = query.size(-2), key.size(-2)
    scale_factor = 1 / math.sqrt(query.size(-1)) if scale is None else scale
    attn_mask = torch.ones(L, S, dtype=torch.bool, device=query.device).tril(diagonal=0) if is_causal else attn_mask
    # Convert a boolean mask into an additive float mask (-inf where attention is disallowed).
    if attn_mask is not None and attn_mask.dtype == torch.bool:
        attn_mask = torch.zeros_like(attn_mask, dtype=query.dtype).masked_fill(~attn_mask, float("-inf"))
    attn_weight = query @ key.transpose(-2, -1) * scale_factor
    attn_weight = torch.softmax(attn_weight if attn_mask is None else attn_weight + attn_mask, dim=-1)
    attn_weight = torch.dropout(attn_weight, dropout_p, train=True)
    return attn_weight @ value
In the 🤗 Optimum integration with Transformers models, the following architectures are supported for now: gpt2, gpt-neo, gpt-neox, gptj, t5, bart, codegen, pegasus, opt, LLaMA, blenderbot, m2m100. You can expect this list to be extended in the near future!
To validate the benefits from the native scaled dot-product attention, we ran inference and training benchmarks, whose results are presented below.
Inference benchmark on a single A10G GPU, AWS g5.4xlarge instance
Training benchmark on a single A10G GPU, AWS g5.4xlarge instance
Training benchmark on a single A100-SXM4-80GB, Nvidia DGX
Out of this benchmark, the most interesting finding is that native SDPA allows the use of longer sequence lengths and larger batch sizes without running into out-of-memory issues. Moreover, up to 20% speedups can be seen during inference, and even larger ones during training.
As seen in the training benchmarks, smaller head dimensions appear to bring higher speedups and memory savings, which we discuss in the following section.
The implementation supports multi-GPU settings as well, thanks to the 🤗 Accelerate library, by passing device_map="auto" to the from_pretrained method. Here are some results for training on two A100-SXM4-80GB.
Training benchmark on two A100-SXM4-80GB, Nvidia DGX, using 🤗 Accelerate library for distributed training
Note that some kernels support only the sm_80 compute capability (which is the one from A100 GPUs), which limits usability on a wide range of hardware, notably if the head dimension is not a power of two. For example, as of PyTorch 2.0.0 during training, opt-2.7b (headdim=80) and gpt-neox-20b (headdim=96) cannot dispatch to a kernel using flash attention, unless run on an A100 GPU. Better kernels may be developed in the future: https://github.com/pytorch/pytorch/issues/98140#issuecomment-1518101895
Flash Attention, Memory-efficient attention & math differences
The native scaled_dot_product_attention relies on three possible backend implementations: flash attention, memory-efficient attention, and the so-called math implementation, which provides a hardware-neutral fallback for all PyTorch platforms.
When fused kernels are available for a given problem size, flash-attention or memory-efficient attention will be used, effectively allowing for a lower memory footprint, as in the memory-efficient attention case O(N) memory allocations are done on the GPU global memory instead of the classic O(N^2) for the traditional eager attention implementation. With flash attention, a reduced number of memory accesses (read and writes) is expected, hence both giving speedups and memory savings.
The “math” implementation is simply an implementation using PyTorch’s C++ API. It is interesting to note that in this implementation, the query and key tensors are scaled individually for numerical stability, thus launching two aten::div operations instead of possibly only one in an eager implementation that does not contain this optimization.
Head dimension influence on speedups, memory savings
Benchmarking torch.nn.functional.scaled_dot_product_attention, we notice a decrease in the speedup / memory gains as the head dimension increases. This is an issue for some architectures like EleutherAI/gpt-neo-2.7B, which has a relatively large head dimension of 128, or EleutherAI/gpt-j-6B (and derived models such as PygmalionAI/pygmalion-6b), which has a head dimension of 256 (and currently does not dispatch to fused kernels, as the head dimension is too large).
This trend can be seen in the figures below, where torch.nn.functional.scaled_dot_product_attention is benchmarked standalone versus the above eager implementation. Moreover, we use the torch.backends.cuda.sdp_kernel context manager to force the usage of, respectively, the math, flash attention, and memory-efficient attention implementations.
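For instance, a minimal sketch of forcing a single backend looks as follows (the tensor shapes are illustrative):
import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dimension)
query = torch.randn(2, 16, 1024, 64, device="cuda", dtype=torch.float16)
key = torch.randn(2, 16, 1024, 64, device="cuda", dtype=torch.float16)
value = torch.randn(2, 16, 1024, 64, device="cuda", dtype=torch.float16)

# Allow only the flash attention kernel; flip the flags to benchmark the math or memory-efficient backends.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(query, key, value, is_causal=True)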
Using memory-efficient attention SDP kernel (forward-only), A100
Using math (without dropout), A100
Using flash attention SDP kernel (without dropout), A100
Using memory-efficient attention SDP kernel (without dropout), A100
We see that for the same problem size, be it for inference-only or training, the speedup decreases with higher head dimension, e.g. from 3.4x for headdim=8 to 1.01x for headdim=128 using flash attention kernel.
The reduced memory saving is expected with larger head dimensions. Recall the standard attention computation:
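Given Q, K, V of shape N x d (sequence length N, head dimension d):

S = Q K^T / sqrt(d),   P = softmax(S),   O = P V

The intermediates S and P are each of size N x N, and the output O is of size N x d.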
Due to the intermediate computations, the global memory footprint is 2 * N * N + N * d in this standard step-by-step computation. Memory-efficient attention proposes to iteratively update the softmax renormalization constant, moving its computation to the very end, which allows for only a constant output memory allocation of N * d.
Thus, the memory saving ratio is 2 * N / d + 1, which decreases with larger head dimension.
In flash attention, the tradeoff is between the head dimension d and the shared memory size M of a GPU streaming multiprocessor, with a total number of memory accesses of O(N² * d²/M). Thus, the memory accesses scale quadratically in the head dimension, contrary to the standard attention that scales linearly. The reason is that in flash attention, for larger head dimension d, the key and value K, V need to be split into more blocks to fit into shared memory, and in turn each block needs to load the full query Q and output O.
Thus, the highest speedups for flash attention are in a regime where the ratio d² / M is small enough.
Current limitations as of PyTorch 2.0.0
Absence of a scale argument
As of PyTorch 2.0.0, torch.nn.functional.scaled_dot_product_attention has no scale argument and always uses the default 1/sqrt(d_k) scaling.
However, some architectures, such as OPT or T5, do not use scaling in the attention, which as of PyTorch 2.0.0 forces them to artificially rescale before the scaled_dot_product_attention call. This introduces an unnecessary overhead, as an additional multiplication is necessary, on top of unneeded divisions in the attention.
A fix for this issue has been merged in PyTorch repository.
Support of flash attention / memory-efficient attention with custom mask
As of PyTorch 2.0.0, when a custom attention mask is passed, flash attention and memory-efficient attention cannot be used. In this case, scaled_dot_product_attention automatically dispatches to the C++ implementation.
However, as we have seen, some architectures require a custom attention mask, such as T5, which uses positional bias. Moreover, in the case of a batch size larger than one where some inputs may be padded, a custom attention mask also needs to be passed. For this latter case, an alternative would be to use NestedTensor, which SDPA supports.
This limited support for custom masks thus limits the benefits from SDPA in these specific cases, although we can hope for an extended support in the future.
Note that xformers, from which PyTorch’s SDPA partially takes inspiration, currently supports arbitrary attention masks: https://github.com/facebookresearch/xformers/blob/658ebab39545f180a6075385b3897921623d6c3b/xformers/ops/fmha/cutlass.py#L147-L156 . HazyResearch implementation of flash attention also supports an equivalent implementation of padding, as a cumulative sequence length array is used along with packed query/key/values – similar in essence to NestedTensor.
In conclusion
Using torch.nn.functional.scaled_dot_product_attention is a free-lunch optimization: it makes your code more readable, uses less memory, and is in most common cases faster.
Although the implementation in PyTorch 2.0.0 still has minor limitations, inference and training already massively benefit from SDPA in most cases. We encourage you to use this native implementation, be it to train or deploy your PyTorch models, and for 🤗 Transformers models as a one-line transformation!
In the future, we would like to adapt the API to enable users to use SDPA in encoder-based models as well.
We thank Benjamin Lefaudeux, Daniel Haziza and Francisco Massa for their advice on the head dimension influence, as well as Michael Gschwind, Christian Puhrsch and Driss Guessous for their feedback on the blog post!
Benchmark reproduction
The benchmark presented in this post was done using torch==2.0.0, transformers==4.27.4, accelerate==0.18.0 and optimum==1.8.0.
The benchmarks can be easily reproduced using the scripts for inference, training for 🤗 Transformers models, and standalone SDPA.
Governance of superintelligence
Now is a good time to start thinking about the governance of superintelligence—future AI systems dramatically more capable than even AGI.