NVIDIA and Microsoft Drive Innovation for Windows PCs in New Era of Generative AI

Generative AI — in the form of large language model (LLM) applications like ChatGPT, image generators such as Stable Diffusion and Adobe Firefly, and game rendering techniques like NVIDIA DLSS 3 Frame Generation — is rapidly ushering in a new era of computing for productivity, content creation, gaming and more.

At the Microsoft Build developer conference, NVIDIA and Microsoft today showcased a suite of advancements in Windows 11 PCs and workstations with NVIDIA RTX GPUs to meet the demands of generative AI.

More than 400 Windows apps and games already employ AI technology, accelerated by dedicated processors on RTX GPUs called Tensor Cores. Today’s announcements, which include tools to develop AI on Windows PCs, frameworks to optimize and deploy AI, and driver performance and efficiency improvements, will empower developers to build the next generation of Windows apps with generative AI at their core.

“AI will be the single largest driver of innovation for Windows customers in the coming years,” said Pavan Davuluri, corporate vice president of Windows silicon and system integration at Microsoft. “By working in concert with NVIDIA on hardware and software optimizations, we’re equipping developers with a transformative, high-performance, easy-to-deploy experience.”

Develop Models With Windows Subsystem for Linux

AI development has traditionally taken place on Linux, requiring developers to either dual-boot their systems or use multiple PCs to work in their AI development OS while still accessing the breadth and depth of the Windows ecosystem.

Over the past few years, Microsoft has been building a powerful capability to run Linux directly within the Windows OS, called Windows Subsystem for Linux (WSL). NVIDIA has been working closely with Microsoft to deliver GPU acceleration and support for the entire NVIDIA AI software stack inside WSL. Now developers can use Windows PC for all their local AI development needs with support for GPU-accelerated deep learning frameworks on WSL.
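
As a quick sanity check, a developer can confirm from inside WSL that a GPU-accelerated framework sees the RTX GPU. The snippet below is a minimal sketch using PyTorch and assumes the NVIDIA driver and a CUDA-enabled PyTorch build are already installed in the WSL distribution:

import torch

# Verify that the RTX GPU is visible from inside WSL
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    # Run a small matrix multiply on the GPU to confirm end-to-end acceleration
    x = torch.randn(1024, 1024, device="cuda")
    print("Result device:", (x @ x).device)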

With NVIDIA RTX GPUs delivering up to 48GB of RAM in desktop workstations, developers can now work with models on Windows that were previously only available on servers. The large memory also improves the performance and quality for local fine-tuning of AI models, enabling designers to customize them to their own style or content. And because the same NVIDIA AI software stack runs on NVIDIA data center GPUs, it’s easy for developers to push their models to Microsoft Azure Cloud for large training runs.

Rapidly Optimize and Deploy Models

With trained models in hand, developers need to optimize and deploy AI for target devices.

Microsoft released the Microsoft Olive toolchain for optimization and conversion of PyTorch models to ONNX, enabling developers to automatically tap into GPU hardware acceleration such as RTX Tensor Cores. Developers can optimize models via Olive and ONNX, and deploy Tensor Core-accelerated models to PC or cloud. Microsoft continues to invest in making PyTorch and related tools and frameworks work seamlessly with WSL to provide the best AI model development experience.
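
To illustrate the export-and-deploy flow described above, the sketch below exports a PyTorch model to ONNX and runs it through ONNX Runtime’s DirectML execution provider, which is GPU-accelerated on RTX hardware. It is a simplified stand-in for what Olive automates; the example model, file names and the onnxruntime-directml package are illustrative assumptions rather than part of today’s announcements:

import numpy as np
import torch
import torchvision
import onnxruntime as ort  # DirectML support comes from the onnxruntime-directml package

# Export a PyTorch model to ONNX (Olive automates and further optimizes this step)
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx", input_names=["input"], output_names=["output"])

# Run inference through the DirectML execution provider on the GPU
session = ort.InferenceSession("model.onnx", providers=["DmlExecutionProvider"])
outputs = session.run(None, {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
print(outputs[0].shape)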

Improved AI Performance, Power Efficiency

Once deployed, generative AI models demand incredible inference performance. RTX Tensor Cores deliver up to 1,400 Tensor TFLOPS for AI inferencing. Over the last year, NVIDIA has worked to improve DirectML performance to take full advantage of RTX hardware.

On May 24, we’ll release our latest optimizations in the Release 532.03 drivers, which combine with Olive-optimized models to deliver big boosts in AI performance. Using an Olive-optimized version of the Stable Diffusion text-to-image generator with the popular Automatic1111 distribution, performance more than doubles with the new driver.

Chart: Stable Diffusion performance with updated NVIDIA drivers, tested on a GeForce RTX 4090 using Automatic1111 and the text-to-image function.

With AI coming to nearly every Windows application, efficiently delivering inference performance is critical — especially for laptops. Coming soon, NVIDIA will introduce new Max-Q low-power inferencing for AI-only workloads on RTX GPUs. It optimizes Tensor Core performance while keeping GPU power consumption as low as possible, extending battery life and maintaining a cool, quiet system. The GPU can then dynamically scale up for maximum AI performance when the workload demands it.

Join the PC AI Revolution Now

Top software developers — like Adobe, DxO, ON1 and Topaz — have already incorporated NVIDIA AI technology with more than 400 Windows applications and games optimized for RTX Tensor Cores.

“AI, machine learning and deep learning power all Adobe applications and drive the future of creativity. Working with NVIDIA we continuously optimize AI model performance to deliver the best possible experience for our Windows users on RTX GPUs.” — Ely Greenfield, CTO of digital media at Adobe

“NVIDIA is helping to optimize our WinML model performance on RTX GPUs, which is accelerating the AI in DxO DeepPRIME, as well as providing better denoising and demosaicing, faster.” — Renaud Capolunghi, senior vice president of engineering at DxO

“Working with NVIDIA and Microsoft to accelerate our AI models running in Windows on RTX GPUs is providing a huge benefit to our audience. We’re already seeing 1.5x performance gains in our suite of AI-powered photography editing software.” — Dan Harlacher, vice president of products at ON1

“Our extensive work with NVIDIA has led to improvements across our suite of photo- and video-editing applications. With RTX GPUs, AI performance has improved drastically, enhancing the experience for users on Windows PCs.” — Suraj Raghuraman, head of AI engine development at Topaz Labs

NVIDIA and Microsoft are making several resources available for developers to test drive top generative AI models on Windows PCs. An Olive-optimized version of the Dolly 2.0 large language model is available on Hugging Face. And a PC-optimized version of NVIDIA NeMo large language model for conversational AI is coming soon to Hugging Face.

Developers can also learn how to optimize their applications end-to-end to take full advantage of GPU-acceleration via the NVIDIA AI for accelerating applications developer site.

The complementary technologies behind Microsoft’s Windows platform and NVIDIA’s dynamic AI hardware and software stack will help developers quickly and easily develop and deploy generative AI on Windows 11.

Microsoft Build runs through Thursday, May 25. Tune in to learn more about shaping the future of work with AI.

Read More

No Programmers? No Problem: READY Robotics Simplifies Robot Coding, Rollouts

Robotics hardware traditionally requires programmers to deploy it. READY Robotics wants to change that with its “no code” software, aimed at people working in manufacturing who don’t have programming skills.

The Columbus, Ohio, startup is a spinout of robotics research from Johns Hopkins University. Kel Guerin was a PhD candidate there leading this research when he partnered with Benjamin Gibbs, then at Johns Hopkins Technology Ventures, to secure funding and launch the company, which is now led by Gibbs as CEO.

“There was this a-ha moment where we figured out that we could take these types of visual languages that are very easy to understand and use them for robotics,” said Guerin, who’s now chief innovation officer at the startup.

READY’s “no code” ForgeOS operating system is designed to let anyone program any type of robot hardware or automation device. ForgeOS works seamlessly with plug-ins for most major robot hardware, and, like other operating systems such as Android, it can run third-party apps and plug-ins, supporting a robust ecosystem of partners and developers working to make robots more capable, says Guerin.

Implementing apps in robotics allows new capabilities to be added to a robotic system in a few clicks, improving user experience and usability. Users can install their own apps, such as Task Canvas, which provides an intuitive building-block programming interface. Its design was influenced by Scratch, the simple block-based visual language for kids developed at the MIT Media Lab.

Task Canvas allows users to show the actions of the robot, as well as all the other devices in an automation cell (such as grippers, programmable logic controllers, and machine tools) as blocks in a flow chart. The user can easily create powerful logic by tying these blocks together — without writing a single line of code. The interface offers nonprogrammers a more “drag-and-drop” experience for programming and deploying robots, whether working directly on the factory floor with real robots on a tablet device or with access to simulation from Isaac Sim, powered by NVIDIA Omniverse.

Robot System Design in Simulation for Real-World Deployments 

READY is making robotics system design easier for nonprogrammers, helping to validate robots and systems for accelerated deployments.

The company is developing Omniverse Extensions — Omniverse kit applications based on Isaac Sim — and can deploy them on the cloud. It uses Omniverse Nucleus — the platform’s database and collaboration engine — in the cloud as well.

Isaac Sim is an application framework that enables simulation training for testing out robots in virtual manufacturing lines before deployment into the real world.

“Bigger companies are moving to a sim-first approach to automation because these systems cost a lot of money to install. They want to simulate them first to make sure it’s worth the investment,” said Guerin.

The startup charges users of its platform licensing per software seat and also offers support services to help roll out and develop systems.

It’s a huge opportunity. Roughly 90 percent of the world’s factories haven’t yet embraced automation, a trillion-dollar market.

READY is a member of NVIDIA Inception, a free program that provides startups with technical training, go-to-market support and AI platform guidance.

From Industrial Automation Giants to Stanley Black & Decker

The startup operates in an ecosystem of world-leading industrial automation providers, and these global partners are actively developing integrations with platforms like NVIDIA Omniverse and are investing in READY, said Guerin.

“Right now we are starting to work with large enterprise customers who want to automate but they can’t find the expertise to do it,” he said.

Stanley Black & Decker, a global supplier of tools, is relying on READY to automate machines, including CNC lathes and mills.

Robotic automation had been hard to deploy in its factories until Stanley Black & Decker started using READY’s ForgeOS with its Station setup, which makes it possible to deploy robots in a day.

Creating Drag-and-Drop Robotic Systems in Simulation 

READY is putting simulation capabilities into the hands of nonprogrammers, who can learn its Task Canvas interface for drag-and-drop programming of industrial robots in about an hour, according to the company.

The company also runs READY Academy, which offers a catalog of free training for manufacturing professionals to learn the skills to design, deploy, manage and troubleshoot robotic automation systems.

“For potential customers interested in our technology, being able to try it out with a robot simulated in Omniverse before they get their hands on the real thing — that’s something we’re really excited about,” said Guerin.

Learn more about NVIDIA Isaac Sim, Jetson Orin, Omniverse Enterprise.

Read More

GPT-4 + Stable-Diffusion = ?: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

TL;DR: Text Prompt -> LLM -> Intermediate Representation (such as an image layout) -> Stable Diffusion -> Image.
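
Expanded slightly, the pipeline in the TL;DR looks roughly like the sketch below. The helper names are hypothetical placeholders rather than the actual LMD API; the sketch only illustrates the flow from text prompt to LLM-generated layout to layout-conditioned diffusion:

# Hypothetical sketch of the LMD pipeline; function names are placeholders, not the released API.

def request_layout_from_llm(prompt: str) -> list:
    """Ask an LLM (e.g., GPT-4) to turn the prompt into an intermediate representation:
    a list of objects with captions and bounding boxes."""
    # In a real pipeline this would be a templated LLM call; here we return a fixed example layout.
    return [
        {"caption": "a cat", "box": [40, 120, 220, 300]},
        {"caption": "a red ball", "box": [260, 200, 360, 300]},
    ]

def layout_grounded_diffusion(layout: list, prompt: str) -> dict:
    """Stand-in for the layout-conditioned Stable Diffusion stage that places each
    object inside its box and composes the final image."""
    return {"prompt": prompt, "layout": layout}

prompt = "a cat to the left of a red ball"
layout = request_layout_from_llm(prompt)            # Text Prompt -> LLM -> layout
image = layout_grounded_diffusion(layout, prompt)   # layout -> Stable Diffusion -> Image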

Recent advancements in text-to-image generation with diffusion models have yielded remarkable results in synthesizing highly realistic and diverse images. However, despite their impressive capabilities, diffusion models such as Stable Diffusion often struggle to accurately follow prompts when spatial or common-sense reasoning is required.

The following figure lists four scenarios in which Stable Diffusion falls short in generating images that accurately correspond to the given prompts: negation, numeracy, attribute assignment, and spatial relationships. In contrast, our method, LLM-grounded Diffusion (LMD), delivers much better prompt understanding in text-to-image generation in these scenarios.

Figure 1: LLM-grounded Diffusion enhances the prompt understanding ability of text-to-image diffusion models.

Privateer Space: The Final Frontier in AI Space Junk Management

It’s time to take out the space trash.

In this episode of the NVIDIA AI Podcast, host Noah Kravitz dives into an illuminating conversation with Alex Fielding, co-founder and CEO of Privateer Space.

Fielding is a tech industry veteran, having previously worked alongside Apple co-founder Steve Wozniak on several projects, and holds a deep expertise in engineering, robotics, machine learning and AI.

Privateer Space, Fielding’s latest venture, aims to address one of the most daunting challenges facing our world today: space debris.

The company is creating a data infrastructure to monitor and clean up space debris, ensuring sustainable growth for the budding space economy. In essence, they’re the sanitation engineers of the cosmos.

Privateer Space is also a part of NVIDIA Inception, a free program that offers go-to-market support, expertise and technology for AI startups.

During the podcast, Fielding shares the genesis of Privateer Space, his journey from Apple to the space industry, and his subsequent work on communication between satellites at different altitudes.

He also addresses the severity of space debris, explaining how every launch adds more debris, including minute yet potentially dangerous fragments like frozen propellant and paint chips.

Tune in to the podcast for more on what the future holds for the intersection of AI and space.

You Might Also Like

Jules Anh Tuan Nguyen Explains How AI Lets Amputee Control Prosthetic Hand, Video Games

A postdoctoral researcher at the University of Minnesota discusses his efforts to allow amputees to control their prosthetic limb — right down to the finger motions — with their minds.

Overjet’s Wardah Inam on Bringing AI to Dentistry

Overjet, a member of NVIDIA Inception, is moving fast to bring AI to dentists’ offices. Dr. Wardah Inam, CEO of the company, discusses using AI to improve patient care.

Immunai CTO and Co-Founder Luis Voloch on Using Deep Learning to Develop New Drugs

Luis Voloch, co-founder and chief technology officer of Immunai, talks about tackling the challenges of the immune system with a machine learning and data science mindset.

Subscribe to the AI Podcast: Now Available on Amazon Music

The AI Podcast is now available through Amazon Music.

In addition, get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better. Have a few minutes to spare? Fill out this listener survey.

Read More

Instruction fine-tuning for FLAN T5 XL with Amazon SageMaker Jumpstart

Generative AI is in the midst of a period of stunning growth. Increasingly capable foundation models are being released continuously, with large language models (LLMs) being one of the most visible model classes. LLMs are models composed of billions of parameters trained on extensive corpora of text, up to hundreds of billions or even a trillion tokens. These models have proven extremely effective for a wide range of text-based tasks, from question answering to sentiment analysis.

The power of LLMs comes from their capacity to learn and generalize from extensive and diverse training data. The initial training of these models is performed with a variety of objectives: supervised, unsupervised, or hybrid. Text completion or imputation is one of the most common unsupervised objectives: given a chunk of text, the model learns to accurately predict what comes next (for example, predict the next sentence). Models can also be trained in a supervised fashion using labeled data to accomplish a set of tasks (for example, is this movie review positive, negative, or neutral). Whether the model is trained for text completion or some other task, it is frequently not the exact task customers want to use the model for.

To improve the performance of a pre-trained LLM on a specific task, we can tune the model using examples of the target task in a process known as instruction fine-tuning. Instruction fine-tuning uses a set of labeled examples in the form of {prompt, response} pairs to further train the pre-trained model in adequately predicting the response given the prompt. This process modifies the weights of the model.

This post describes how to perform instruction fine-tuning of an LLM, namely FLAN T5 XL, using Amazon SageMaker Jumpstart. We demonstrate how to accomplish this using both the Jumpstart UI and a notebook in Amazon SageMaker Studio. You can find the accompanying notebook in the amazon-sagemaker-examples GitHub repository.

Solution overview

The target task in this post is, given a chunk of text in the prompt, to return questions that are related to the text but can’t be answered based on the information it contains. This is useful for identifying missing information in a description or determining whether a query needs more information to be answered.

FLAN T5 models are instruction fine-tuned on a wide range of tasks to increase the zero-shot performance of these models on many common tasks[1]. Additional instruction fine-tuning for a particular customer task can further increase the accuracy of these models, especially if the target task wasn’t previously used to train a FLAN T5 model, as is the case for our task.

In our example task, we’re interested in generating relevant but unanswered questions. To this end, we use a subset of version 2 of the Stanford Question Answering Dataset (SQuAD2.0)[2] to fine-tune the model. This dataset contains questions posed by human annotators on a set of Wikipedia articles. In addition to questions with answers, SQuAD2.0 contains about 50,000 unanswerable questions. Such questions are plausible but can’t be directly answered from the articles’ content. We only use the unanswerable questions. Our data is structured as a JSON Lines file, with each line containing a context and a question.

Screenshot of a few entries of the SQuADv2 dataset.
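
For illustration, a single record of the resulting JSON Lines file, together with the prompt template that maps each record to a {prompt, completion} pair, looks roughly like the following (the context and question values here are placeholders; the template text matches the one built later in this post):

{"context": "<paragraph from a Wikipedia article>", "question": "<plausible question that the paragraph does not answer>"}

{"prompt": "Ask a question which is related to the following text, but cannot be answered based on the text. Text: {context}", "completion": "{question}"}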

Prerequisites

To get started, all you need is an AWS account in which you can use Studio. You will need to create a user profile for Studio if you don’t already have one.

Fine-tune FLAN-T5 with the Jumpstart UI

To fine-tune the model with the Jumpstart UI, complete the following steps:

  1. On the SageMaker console, open Studio.
  2. Under SageMaker Jumpstart in the navigation pane, choose Models, notebooks, solutions.

You will see a list of foundation models, including FLAN T5 XL, which is marked as fine-tunable.

  3. Choose View model.

The JumpStart UI with FLAN-T5 XL.

  4. Under Data source, you can provide the path to your training data. The source for the data used in this post is provided by default.
  5. You can keep the default values for the deployment configuration (including instance type), security, and the hyperparameters, but you should increase the number of epochs to at least three to get good results.
  6. Choose Train to train the model.

The JumpStart train UI for the FLAN-T5 XL model.

You can track the status of the training job in the UI.

Jumpstart UI for training in progress.

  7. When training is complete (after about 53 minutes in our case), choose Deploy to deploy the fine-tuned model.

JumpStart UI training complete.

After the endpoint is created (a few minutes), you can open a notebook and start using your fine-tuned model.

Fine-tune FLAN-T5 using a Python notebook

Our example notebook shows how to use Jumpstart and SageMaker to programmatically fine-tune and deploy a FLAN T5 XL model. It can be run in Studio or locally.

In this section, we first walk through some general setup. Then you fine-tune the model using the SQuADv2 dataset. Next, you deploy the pre-trained version of the model behind a SageMaker endpoint, and do the same with the fine-tuned model. Finally, you can query the endpoints and compare the quality of the output of the pre-trained and fine-tuned models. You will find that the output of the fine-tuned model is of much higher quality.

Set up prerequisites

Begin by installing and upgrading the necessary packages. Restart the kernel after running the following code:

!pip install nest-asyncio==1.5.5 --quiet
!pip install ipywidgets==8.0.4 --quiet
!pip install --upgrade sagemaker --quiet

Next, obtain the execution role associated with the current notebook instance:

import boto3
import sagemaker
# Get current region, role, and default bucket
aws_region = boto3.Session().region_name
aws_role = sagemaker.session.Session().get_caller_identity_arn()
output_bucket = sagemaker.Session().default_bucket()
# This will be useful for printing
newline, bold, unbold = "\n", "\033[1m", "\033[0m"
print(f"{bold}aws_region:{unbold} {aws_region}")
print(f"{bold}aws_role:{unbold} {aws_role}")
print(f"{bold}output_bucket:{unbold} {output_bucket}")

You can define a convenient drop-down menu that will list the model sizes available for fine-tuning:

import IPython
from ipywidgets import Dropdown
from sagemaker.jumpstart.filters import And
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
# Default model choice
model_id = "huggingface-text2text-flan-t5-xl"
# Identify FLAN T5 models that support fine-tuning
filter_value = And(
    "task == text2text", "framework == huggingface", "training_supported == true"
)
model_list = [m for m in list_jumpstart_models(filter=filter_value) if "flan-t5" in m]
# Display the model IDs in a dropdown, for user to select
dropdown = Dropdown(
    value=model_id,
    options=model_list,
    description="FLAN T5 models available for fine-tuning:",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(IPython.display.Markdown("### Select a pre-trained model from the dropdown below"))
display(dropdown)

Jumpstart automatically retrieves appropriate training and inference instance types for the model that you chose:

from sagemaker.instance_types import retrieve_default
model_id, model_version = dropdown.value, "*"
# Instance types for training and inference
training_instance_type = retrieve_default(
    model_id=model_id, model_version=model_version, scope="training"
)
inference_instance_type = retrieve_default(
    model_id=model_id, model_version=model_version, scope="inference"
)
print(f"{bold}model_id:{unbold} {model_id}")
print(f"{bold}training_instance_type:{unbold} {training_instance_type}")
print(f"{bold}inference_instance_type:{unbold} {inference_instance_type}")

If you have chosen the FLAN T5 XL, you will see the following output:

model_id: huggingface-text2text-flan-t5-xl

training_instance_type: ml.p3.16xlarge

inference_instance_type: ml.g5.2xlarge

You’re now ready to start fine-tuning.

Retrain the model on the fine-tuning dataset

Once your setup is complete, follow these steps.

Use the following code to retrieve the URIs of the artifacts needed:

from sagemaker import image_uris, model_uris, script_uris
# Training instance will use this image
train_image_uri = image_uris.retrieve(
    region=aws_region,
    framework=None,  # automatically inferred from model_id
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)
# Pre-trained model
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)
# Script to execute on the training instance
train_script_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)
print(f"{bold}image uri:{unbold} {train_image_uri}")
print(f"{bold}model uri:{unbold} {train_model_uri}")
print(f"{bold}script uri:{unbold} {train_script_uri}")

The training data is located in a public Amazon Simple Storage Service (Amazon S3) bucket.

Use the following code to point to the location of the data and set up the output location in a bucket in your account:

from sagemaker.s3 import S3Downloader

# We will use the train split of SQuAD2.0
original_data_file = "train-v2.0.json"

# The data was mirrored in the following bucket
original_data_location = f"s3://sagemaker-sample-files/datasets/text/squad2.0/{original_data_file}"
S3Downloader.download(original_data_location, ".")

The original data is not in a format that corresponds to the task for which you are fine-tuning the model, so you can reformat it:

import json

local_data_file = "task-data.jsonl"  # any name with .jsonl extension

with open(original_data_file) as f:
    data = json.load(f)

with open(local_data_file, "w") as f:
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            # iterate over questions for a given paragraph
            for qas in paragraph["qas"]:
                if qas["is_impossible"]:
                    # the question is relevant, but cannot be answered
                    example = {"context": paragraph["context"], "question": qas["question"]}
                    json.dump(example, f)
                    f.write("\n")

template = {
    "prompt": "Ask a question which is related to the following text, but cannot be answered based on the text. Text: {context}",
    "completion": "{question}",
}
with open("template.json", "w") as f:
    json.dump(template, f)

from sagemaker.s3 import S3Uploader

train_data_location = f"s3://{output_bucket}/train_data"
S3Uploader.upload(local_data_file, train_data_location)
S3Uploader.upload("template.json", train_data_location)
print(f"{bold}training data:{unbold} {train_data_location}")

Now you can define some hyperparameters for the training:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# We will override some default hyperparameters with custom values
hyperparameters["epochs"] = "3"
# Optionally, you can also control input/output truncation, for example:
# hyperparameters["max_input_length"] = "300"  # data inputs will be truncated at this length
# hyperparameters["max_output_length"] = "40"  # data outputs will be truncated at this length
# hyperparameters["generation_max_length"] = "40"  # max length of generated output
print(hyperparameters)

You are now ready to launch the training job:

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

model_name = "-".join(model_id.split("-")[2:])  # get the most informative part of ID
training_job_name = name_from_base(f"js-demo-{model_name}-{hyperparameters['epochs']}")
print(f"{bold}job name:{unbold} {training_job_name}")

# S3 location for the training outputs; the prefix name below is an example
output_location = f"s3://{output_bucket}/demo-fine-tune-flan-t5/"

training_metric_definitions = [
    {"Name": "val_loss", "Regex": "'eval_loss': ([0-9\.]+)"},
    {"Name": "train_loss", "Regex": "'loss': ([0-9\.]+)"},
    {"Name": "epoch", "Regex": "'epoch': ([0-9\.]+)"},
]

# Create SageMaker Estimator instance
sm_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    model_uri=train_model_uri,
    source_dir=train_script_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    volume_size=300,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=output_location,
    metric_definitions=training_metric_definitions,
)

# Launch a SageMaker training job over data located in the given S3 path
# Training jobs can take hours, it is recommended to set wait=False,
# and monitor job status through SageMaker console
sm_estimator.fit({"training": train_data_location}, job_name=training_job_name, wait=False)

Depending on the size of the fine-tuning data and model chosen, the fine-tuning could take up to a couple of hours.

You can monitor performance metrics such as training and validation loss using Amazon CloudWatch during training. Conveniently, you can also fetch the most recent snapshot of metrics by running the following code:

from sagemaker import TrainingJobAnalytics

# This can be called while the job is still running
df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df.head(10)

model uri: s3://sagemaker-us-west-2-802376408542/avkan/training-huggingface-text2text-huggingface-text2text-flan-t5-xl-repack.tar.gz
job name: jumpstart-demo-xl-3-2023-04-06-08-16-42-738
INFO:sagemaker:Creating training-job with name: jumpstart-demo-xl-3-2023-04-06-08-16-42-738

When the training is complete, you have a fine-tuned model stored at the S3 output path of the training job. Let’s use it!

You can create two inference endpoints: one for the original pre-trained model, and one for the fine-tuned model. This allows you to compare the output of both versions of the model. In the next step, you deploy an inference endpoint for the pre-trained model. Then you deploy an endpoint for your fine-tuned model.

Deploy the pre-trained model

Let’s start by deploying the pre-trained model. First, retrieve the inference Docker image URI; this is the base Hugging Face container image. Use the following code:

from sagemaker import image_uris

# Retrieve the inference docker image URI. This is the base HuggingFace container image
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    model_id=model_id,
    model_version=model_version,
    image_scope="inference",
    instance_type=inference_instance_type,
)

You can now create the endpoint and deploy the pre-trained model. Note that you need to pass the Predictor class when deploying a model through the Model class to be able to run inference through the SageMaker API. See the following code:

from sagemaker import model_uris, script_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

# Retrieve the URI of the pre-trained model
pre_trained_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)

pre_trained_name = name_from_base(f"jumpstart-demo-pre-trained-{model_id}")

# Create the SageMaker model instance of the pre-trained model
if ("small" in model_id) or ("base" in model_id):
    deploy_source_uri = script_uris.retrieve(
        model_id=model_id, model_version=model_version, script_scope="inference"
    )
    pre_trained_model = Model(
        image_uri=deploy_image_uri,
        source_dir=deploy_source_uri,
        entry_point="inference.py",
        model_data=pre_trained_model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=pre_trained_name,
    )
else:
    # For those large models, we already repack the inference script and model
    # artifacts for you, so the `source_dir` argument to Model is not required.
    pre_trained_model = Model(
        image_uri=deploy_image_uri,
        model_data=pre_trained_model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=pre_trained_name,
    )

print(f"{bold}image URI:{unbold}{newline} {deploy_image_uri}")
print(f"{bold}model URI:{unbold}{newline} {pre_trained_model_uri}")
print("Deploying an endpoint ...")

# Deploy the pre-trained model. Note that we need to pass Predictor class when we deploy model
# through Model class, for being able to run inference through the SageMaker API
pre_trained_predictor = pre_trained_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=pre_trained_name,
)
print(f"{newline}Deployed an endpoint {pre_trained_name}")

The endpoint creation and model deployment can take a few minutes; after that, your endpoint is ready to receive inference calls.

Deploy the fine-tuned model

Let’s deploy the fine-tuned model to its own endpoint. The process is almost identical to the one we used earlier for the pre-trained model. The only difference is that we use the fine-tuned model name and URI:

from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

fine_tuned_name = name_from_base(f"jumpstart-demo-fine-tuned-{model_id}")
fine_tuned_model_uri = f"{output_location}{training_job_name}/output/model.tar.gz"

# Create the SageMaker model instance of the fine-tuned model
fine_tuned_model = Model(
    image_uri=deploy_image_uri,
    model_data=fine_tuned_model_uri,
    role=aws_role,
    predictor_cls=Predictor,
    name=fine_tuned_name,
)

print(f"{bold}image URI:{unbold}{newline} {deploy_image_uri}")
print(f"{bold}model URI:{unbold}{newline} {fine_tuned_model_uri}")
print("Deploying an endpoint ...")

# Deploy the fine-tuned model.
fine_tuned_predictor = fine_tuned_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=fine_tuned_name,
)
print(f"{newline}Deployed an endpoint {fine_tuned_name}")

When this process is complete, both pre-trained and fine-tuned models are deployed behind their own endpoints. Let’s compare their outputs.

Generate output and compare the results

Define some utility functions to query the endpoint and parse the response:

import boto3
import json

# Parameters of (output) text generation. A great introduction to generation
# parameters can be found at https://huggingface.co/blog/how-to-generate
parameters = {
    "max_length": 40,  # restrict the length of the generated text
    "num_return_sequences": 5,  # we will inspect several model outputs
    "num_beams": 10,  # use beam search
}

# Helper functions for running inference queries
def query_endpoint_with_json_payload(payload, endpoint_name):
    encoded_json = json.dumps(payload).encode("utf-8")
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="application/json", Body=encoded_json
    )
    return response

def parse_response_multiple_texts(query_response):
    model_predictions = json.loads(query_response["Body"].read())
    generated_text = model_predictions["generated_texts"]
    return generated_text

def generate_questions(endpoint_name, text):
    expanded_prompt = prompt.replace("{context}", text)
    payload = {"text_inputs": expanded_prompt, **parameters}
    query_response = query_endpoint_with_json_payload(payload, endpoint_name=endpoint_name)
    generated_texts = parse_response_multiple_texts(query_response)
    for i, generated_text in enumerate(generated_texts):
        print(f"Response {i}: {generated_text}{newline}")

In the next code snippet, we define the prompt and the test data. The prompt describes our target task, which is to generate questions that are related to the provided text but can’t be answered based on it.

The test data consists of three different paragraphs: one on the Australian city of Adelaide, taken from the first two paragraphs of its Wikipedia page; one on Amazon Elastic Block Store (Amazon EBS), from the Amazon EBS documentation; and one on Amazon Comprehend, from the Amazon Comprehend documentation. We expect the model to identify questions that relate to these paragraphs but can’t be answered with the information provided therein.

prompt = "Ask a question which is related to the following text, but cannot be answered based on the text. Text: {context}"

test_paragraphs = [
"""
Adelaide is the capital city of South Australia, the state's largest city and the fifth-most populous city in Australia.
"Adelaide" may refer to either Greater Adelaide (including the Adelaide Hills) or the Adelaide city centre.
The demonym Adelaidean is used to denote the city and the residents of Adelaide. The Traditional Owners of the Adelaide
region are the Kaurna people. The area of the city centre and surrounding parklands is called Tarndanya in the Kaurna language.

Adelaide is situated on the Adelaide Plains north of the Fleurieu Peninsula, between the Gulf St Vincent in the west and
the Mount Lofty Ranges in the east. Its metropolitan area extends 20 km (12 mi) from the coast to the foothills of
the Mount Lofty Ranges, and stretches 96 km (60 mi) from Gawler in the north to Sellicks Beach in the south.
""",
"""
Amazon Elastic Block Store (Amazon EBS) provides block level storage volumes for use with EC2 instances. EBS volumes behave like raw, unformatted block devices. You can mount these volumes as devices on your instances. EBS volumes that are attached to an instance are exposed as storage volumes that persist independently from the life of the instance. You can create a file system on top of these volumes, or use them in any way you would use a block device (such as a hard drive). You can dynamically change the configuration of a volume attached to an instance.

We recommend Amazon EBS for data that must be quickly accessible and requires long-term persistence. EBS volumes are particularly well-suited for use as the primary storage for file systems, databases, or for any applications that require fine granular updates and access to raw, unformatted, block-level storage. Amazon EBS is well suited to both database-style applications that rely on random reads and writes, and to throughput-intensive applications that perform long, continuous reads and writes.
""",
"""
Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases. 
You can access Amazon Comprehend document analysis capabilities using the Amazon Comprehend console or using the Amazon Comprehend APIs. You can run real-time analysis for small workloads or you can start asynchronous analysis jobs for large document sets. You can use the pre-trained models that Amazon Comprehend provides, or you can train your own custom models for classification and entity recognition. 
All of the Amazon Comprehend features accept UTF-8 text documents as the input. In addition, custom classification and custom entity recognition accept image files, PDF files, and Word files as input. 
Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature. For more information, see Languages supported in Amazon Comprehend. Amazon Comprehend's Dominant language capability can examine documents and determine the dominant language for a far wider selection of languages.
"""
]

You can now test the endpoints using the example paragraphs:

print(f"{bold}Prompt:{unbold} {repr(prompt)}")
for paragraph in test_paragraphs:
    print("-" * 80)
    print(paragraph)
    print("-" * 80)
    print(f"{bold}pre-trained{unbold}")
    generate_questions(pre_trained_name, paragraph)
    print(f"{bold}fine-tuned{unbold}")
    generate_questions(fine_tuned_name, paragraph)

Test data: Adelaide

We use the following context:

Adelaide is the capital city of South Australia, the state's largest city and the fifth-most populous city in Australia.
"Adelaide" may refer to either Greater Adelaide (including the Adelaide Hills) or the Adelaide city centre.
The demonym Adelaidean is used to denote the city and the residents of Adelaide. The Traditional Owners of the Adelaide
region are the Kaurna people. The area of the city centre and surrounding parklands is called Tarndanya in the Kaurna language.

Adelaide is situated on the Adelaide Plains north of the Fleurieu Peninsula, between the Gulf St Vincent in the west and
the Mount Lofty Ranges in the east. Its metropolitan area extends 20 km (12 mi) from the coast to the foothills of
the Mount Lofty Ranges, and stretches 96 km (60 mi) from Gawler in the north to Sellicks Beach in the south.

The pre-trained model response is as follows:

Response 0: What is the area of the city centre and surrounding parklands called in the Kaurna language?
Response 1: What is the area of the city centre and surrounding parklands is called Tarndanya in the Kaurna language?
Response 2: What is the area of the city centre and surrounding parklands called in Kaurna?
Response 3: What is the capital city of South Australia?
Response 4: What is the area of the city centre and surrounding parklands known as in the Kaurna language?

The fine-tuned model responses are as follows:

Response 0: What is the second most populous city in Australia?
Response 1: What is the fourth most populous city in Australia?
Response 2: What is the population of Gawler?
Response 3: What is the largest city in Australia?
Response 4: What is the fifth most populous city in the world?

Test data: Amazon EBS

We use the following context:

Amazon Elastic Block Store (Amazon EBS) provides block level storage volumes for use with EC2 instances. EBS volumes behave like raw, unformatted block devices. You can mount these volumes as devices on your instances. EBS volumes that are attached to an instance are exposed as storage volumes that persist independently from the life of the instance. You can create a file system on top of these volumes, or use them in any way you would use a block device (such as a hard drive). You can dynamically change the configuration of a volume attached to an instance.

We recommend Amazon EBS for data that must be quickly accessible and requires long-term persistence. EBS volumes are particularly well-suited for use as the primary storage for file systems, databases, or for any applications that require fine granular updates and access to raw, unformatted, block-level storage. Amazon EBS is well suited to both database-style applications that rely on random reads and writes, and to throughput-intensive applications that perform long, continuous reads and writes.

The pre-trained model responses are as follows:

Response 0: What is the difference between Amazon EBS and Amazon Elastic Block Store (Amazon EBS)?
Response 1: What is the difference between Amazon EBS and Amazon Elastic Block Store?
Response 2: What is the difference between Amazon EBS and Amazon Simple Storage Service (Amazon S3)?
Response 3: What is Amazon Elastic Block Store (Amazon EBS)?
Response 4: What is the difference between Amazon EBS and a hard drive?

The fine-tuned model responses are as follows:

Response 0: What type of applications are not well suited to Amazon EBS?
Response 1: What behaves like formatted block devices?
Response 2: What type of applications are not suited to Amazon EBS?
Response 3: What type of applications are not well suited for Amazon EBS?
Response 4: What type of applications are not suited for Amazon EBS?

Test data: Amazon Comprehend

We use the following context:

Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases. 
You can access Amazon Comprehend document analysis capabilities using the Amazon Comprehend console or using the Amazon Comprehend APIs. You can run real-time analysis for small workloads or you can start asynchronous analysis jobs for large document sets. You can use the pre-trained models that Amazon Comprehend provides, or you can train your own custom models for classification and entity recognition. 
All of the Amazon Comprehend features accept UTF-8 text documents as the input. In addition, custom classification and custom entity recognition accept image files, PDF files, and Word files as input. 
Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature. For more information, see Languages supported in Amazon Comprehend. Amazon Comprehend's Dominant language capability can examine documents and determine the dominant language for a far wider selection of languages.

The pre-trained model responses are as follows:

Response 0: What does Amazon Comprehend use to extract insights about the content of documents?
Response 1: How does Amazon Comprehend extract insights about the content of documents?
Response 2: What does Amazon Comprehend use to develop insights about the content of documents?
Response 3: How does Amazon Comprehend develop insights about the content of documents?
Response 4: What does Amazon Comprehend use to extract insights about the content of a document?

The fine-tuned model responses are as follows:

Response 0: What does Amazon Comprehend use to extract insights about the structure of documents?
Response 1: How does Amazon Comprehend recognize sentiments in a document?
Response 2: What does Amazon Comprehend use to extract insights about the content of social networking feeds?
Response 3: What does Amazon Comprehend use to extract insights about the content of documents?
Response 4: What type of files does Amazon Comprehend reject as input?

The difference in output quality between the pre-trained model and the fine-tuned model is stark. The questions provided by the fine-tuned model touch on a wider range of topics, and they are consistently meaningful, which isn’t always the case for the pre-trained model, as illustrated by the Amazon EBS example.

Although this doesn’t constitute a formal and systematic evaluation, it’s clear that the fine-tuning process has improved the quality of the model’s responses on this task.

Clean up

Lastly, remember to clean up and delete the endpoints:

# Delete resources
pre_trained_predictor.delete_model()
pre_trained_predictor.delete_endpoint()
fine_tuned_predictor.delete_model()
fine_tuned_predictor.delete_endpoint()

Conclusion

In this post, we showed how to use instruction fine-tuning with FLAN T5 models using the Jumpstart UI or a Jupyter notebook running in Studio. We provided code explaining how to retrain the model using data for the target task and deploy the fine-tuned model behind an endpoint. The target task in this post was to identify questions that relate to a chunk of text provided in the input but can’t be answered based on the information provided in that text. We demonstrated that a model fine-tuned for this specific task returns better results than a pre-trained model.

Now that you know how to instruction fine-tune a model with Jumpstart, you can create powerful models customized for your application. Gather some data for your use case, upload it to Amazon S3, and use either the Studio UI or the notebook to tune a FLAN T5 model!

References

[1] Chung, Hyung Won, et al. “Scaling Instruction-Finetuned Language Models.” arXiv preprint arXiv:2210.11416 (2022).

[2] Rajpurkar, Pranav, Robin Jia, and Percy Liang. “Know What You Don’t Know: Unanswerable Questions for SQuAD.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018.


About the authors

Laurent Callot is a Principal Applied Scientist and manager at AWS AI Labs who has worked on a variety of machine learning problems, from foundational models and generative AI to forecasting, anomaly detection, causality, and AI Ops.

Andrey Kan is a Senior Applied Scientist at AWS AI Labs with interests and experience in different fields of machine learning. These include research on foundation models, as well as ML applications for graphs and time series.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He received his PhD from the University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference and has published many papers in venues such as NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP.

Baris Kurt is an Applied Scientist at AWS AI Labs. His interests are in time series anomaly detection and foundation models. He loves developing user friendly ML systems.

Jonas Kübler is an Applied Scientist at AWS AI Labs. He is working on foundation models with the goal to facilitate use-case specific applications.

Read More

Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0

As part of PyTorch 2.0 release, an accelerated implementation of the attention mechanism as part of the “Better Transformer” project (and known in PyTorch as Accelerated Transformers) has been added natively into PyTorch as torch.nn.functional.scaled_dot_product_attention. This implementation leverages fused kernels from FlashAttention and Memory-efficient attention, and supports both training and inference.

We also release a notebook showcasing an example of this integration here.

After seeing 20-30% speedups at inference for diffusion models, we went ahead and implemented an integration with 🤗 Transformers models through the 🤗 Optimum library. Similar to the previous integration for encoder models, the integration replaces modules from Transformers with efficient implementations that use torch.nn.functional.scaled_dot_product_attention. The usage is as follows:

import torch
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained("gpt2-large", torch_dtype=torch.float16)

model = BetterTransformer.transform(model)

# do your inference or training here

# if training and want to save the model
model = BetterTransformer.reverse(model)
model.save_pretrained("fine_tuned_model")
model.push_to_hub("fine_tuned_model")

Summarizing our findings below about torch.nn.functional.scaled_dot_product_attention:

  • It is most useful for fitting larger models, longer sequence lengths, or bigger batch sizes during training on given hardware.
  • Memory footprint savings on GPU during training range from 20% to 110%+.
  • Speedups during training range from 10% to 70%.
  • Speedups during inference range from 5% to 20%.
  • Standalone, for small head dimensions, scaled_dot_product_attention speedups go up to 3x, memory savings go as high as 40x (depending on the sequence length).

You may be surprised by the wide range of memory savings and speedups. In this blog post, we discuss our benchmarks, where this feature shines and upcoming improvements in future PyTorch releases.

In the next release of transformers you will just need to install the proper version of optimum and run:

model = model.to_bettertransformer()

to convert your model using the BetterTransformer API. You can already try this feature out by installing transformers from source.

Benchmark and usage with 🤗 Transformers

torch.nn.functional.scaled_dot_product_attention is usable with any architecture that uses standard attention, and it replaces the following boilerplate code:

import math
import torch

# native scaled_dot_product_attention is equivalent to the following:
def eager_sdpa(query, key, value, attn_mask, dropout_p, is_causal, scale):
    L, S = query.size(-2), key.size(-2)
    scale_factor = 1 / math.sqrt(query.size(-1)) if scale is None else scale
    if is_causal:
        attn_mask = torch.ones(L, S, dtype=torch.bool, device=query.device).tril(diagonal=0)
    if attn_mask is not None and attn_mask.dtype == torch.bool:
        attn_mask = torch.zeros(L, S, dtype=query.dtype, device=query.device).masked_fill(~attn_mask, float("-inf"))
    attn_bias = attn_mask if attn_mask is not None else 0.0
    attn_weight = torch.softmax(query @ key.transpose(-2, -1) * scale_factor + attn_bias, dim=-1)
    attn_weight = torch.dropout(attn_weight, dropout_p, train=True)
    return attn_weight @ value
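
As a quick sanity check of this equivalence, the two implementations can be compared on random tensors (a minimal sketch; with dropout disabled and a causal mask, the outputs should match up to numerical precision):

import torch
import torch.nn.functional as F

# batch, heads, sequence length, head dimension
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

out_native = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=True)
out_eager = eager_sdpa(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=True, scale=None)
print(torch.allclose(out_native, out_eager, atol=1e-5))  # expected: True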

In the 🤗 Optimum integration with Transformers models, the following architectures are supported for now: gpt2, gpt-neo, gpt-neox, gptj, t5, bart, codegen, pegasus, opt, LLaMA, blenderbot, m2m100. You can expect this list to be extended in the near future!

To validate the benefits from the native scaled dot-product attention, we ran inference and training benchmarks, whose results are presented below.

Inference benchmark on a single A10G GPU, AWS g5.4xlarge instance

Training benchmark on a single A10G GPU, AWS g5.4xlarge instance

Training benchmark on a single A100-SXM4-80GB, Nvidia DGX

The most interesting finding from these benchmarks is that native SDPA allows the use of longer sequence lengths and larger batch sizes without running into out-of-memory issues. Moreover, up to 20% speedups can be seen during inference, and even larger ones during training.

As seen in the training benchmarks, smaller head dimensions bring higher speedups and memory savings, which we discuss in the following section.

The implementation supports multi-GPU settings as well, thanks to the 🤗 Accelerate library, by passing device_map="auto" to the from_pretrained method. Here are some results for training on two A100-SXM4-80GB.

Training benchmark on two A100-SXM4-80GB, Nvidia DGX, using 🤗 Accelerate library for distributed training

Note that some kernels support only the sm_80 compute capability (which is the one of A100 GPUs), which limits usability on a wide range of hardware, notably if the head dimension is not a power of two. For example, as of PyTorch 2.0.0, during training, opt-2.7b (headdim=80) and gpt-neox-20b (headdim=96) cannot dispatch to a kernel using flash attention, unless run on an A100 GPU. Better kernels may be developed in the future: https://github.com/pytorch/pytorch/issues/98140#issuecomment-1518101895

Flash Attention, Memory-efficient attention & math differences

The native scaled_dot_product_attention relies on three possible backend implementations: flash attention, memory-efficient attention, and the so-called math implementation which provides a hardware-neutral fallback for all PyTorch platforms.

When fused kernels are available for a given problem size, flash-attention or memory-efficient attention will be used, effectively allowing for a lower memory footprint, as in the memory-efficient attention case O(N) memory allocations are done on the GPU global memory instead of the classic O(N^2) for the traditional eager attention implementation. With flash attention, a reduced number of memory accesses (read and writes) is expected, hence both giving speedups and memory savings.

The “math” implementation is simply an implementation using PyTorch’s C++ API. It is interesting to note that in this implementation, the query and key tensors are scaled individually for numerical stability, thus launching two aten::div operations instead of possibly only one in an eager implementation that does not include this optimization.

Head dimension influence on speedups, memory savings

Benchmarking torch.nn.functional.scaled_dot_product_attention, we notice a decrease in the speedup and memory gains as the head dimension increases. This is an issue for some architectures like EleutherAI/gpt-neo-2.7B, which has a relatively large head dimension of 128, or EleutherAI/gpt-j-6B (and derived models such as PygmalionAI/pygmalion-6b), which has a head dimension of 256 (these currently do not dispatch to fused kernels, as the head dimension is too large).

This trend can be seen in the figures below, where torch.nn.functional.scaled_dot_product_attention is benchmarked standalone against the above eager implementation. Moreover, we use the torch.backends.cuda.sdp_kernel context manager to force the usage of, respectively, the math, flash attention, and memory-efficient attention implementations.
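
For reference, forcing a single backend in PyTorch 2.0.0 looks roughly like the following sketch (the flags shown here select the memory-efficient kernel; flipping them selects flash attention or the math fallback, and a CUDA device with half-precision inputs is assumed):

import torch
import torch.nn.functional as F

q = torch.randn(8, 16, 512, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 16, 512, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 16, 512, 64, device="cuda", dtype=torch.float16)

# Force the memory-efficient backend while benchmarking
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=False, enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)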

Using memory-efficient attention SDP kernel (forward-only), A100

Using math (without dropout), A100

Using flash attention SDP kernel (without dropout), A100

Using memory-efficient attention SDP kernel (without dropout), A100

We see that for the same problem size, be it for inference-only or training, the speedup decreases with higher head dimension, e.g. from 3.4x for headdim=8 to 1.01x for headdim=128 using flash attention kernel.

The reduced memory saving is expected with larger head dimensions. Recall the standard attention computation:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, with Q, K and V of shape N x d

Due to the intermediate computations, the global memory footprint is 2 * N * N + N * d in this standard step-by-step computation. Memory-efficient attention proposes to iteratively update the softmax renormalization constant and to move its computation to the very end, allowing for only a constant output memory allocation of N * d.

Thus, the memory saving ratio is 2 * N / d + 1, which decreases with larger head dimension.
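
Spelled out, the standard path keeps both N x N intermediates (the raw scores and the softmax probabilities) plus the N x d output, so the ratio to the N * d footprint of memory-efficient attention is (2 * N * N + N * d) / (N * d) = 2 * N / d + 1. For example, with N = 2048 and d = 64 this ratio is 65, while for d = 128 it drops to 33.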

In flash attention, the tradeoff is between the head dimension d and the shared memory size M of a GPU streaming multiprocessor, with a total number of memory accesses of O(N² * d²/M). Thus, the memory accesses scale quadratically in the head dimension, contrary to the standard attention that scales linearly. The reason is that in flash attention, for larger head dimension d, the key and value K, V need to be split into more blocks to fit into shared memory, and in turn each block needs to load the full query Q and output O.

Thus, the highest speedups for flash attention are in a regime where the ratio d² / M is small enough.

Current limitations as of PyTorch 2.0.0

Absence of a scale argument

As of PyTorch 2.0.0, torch.nn.functional.scaled_dot_product_attention has no scale argument and always divides by the square root of the head size, sqrt(d_k):

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

However, some architectures, such as OPT or T5, do not use a scaling in the attention, which as of PyTorch 2.0.0 forces them to artificially rescale the query before the scaled_dot_product_attention call. This introduces an unnecessary overhead, as an additional multiplication is necessary, on top of the unneeded division inside the attention.
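
Concretely, the workaround looks like the following sketch: the query is pre-multiplied by sqrt(d_k) so that the operator’s internal division by sqrt(d_k) cancels out and the attention is effectively unscaled:

import math
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# The architecture expects softmax(Q K^T) V with no scaling, but SDPA always divides by sqrt(d_k),
# so the query is rescaled up front to cancel the internal factor.
d_k = q.size(-1)
out = F.scaled_dot_product_attention(q * math.sqrt(d_k), k, v)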

A fix for this issue has been merged in the PyTorch repository.

Support of flash attention / memory-efficient attention with custom mask

As of PyTorch 2.0.0, when passing a custom attention mask, flash attention and memory-efficient attention cannot be used. In this case, scaled_dot_product_attention automatically dispatches to the C++ implementation.

However, as we have seen, some architectures require a custom attention mask, such as T5, which uses positional bias. Moreover, in the case of a batch size larger than one where some inputs may be padded, a custom attention mask also needs to be passed. For this latter case, an alternative would be to use NestedTensor, which SDPA supports.

This limited support for custom masks thus limits the benefits from SDPA in these specific cases, although we can hope for an extended support in the future.

Note that xformers, from which PyTorch’s SDPA partially takes inspiration, currently supports arbitrary attention masks: https://github.com/facebookresearch/xformers/blob/658ebab39545f180a6075385b3897921623d6c3b/xformers/ops/fmha/cutlass.py#L147-L156 . HazyResearch implementation of flash attention also supports an equivalent implementation of padding, as a cumulative sequence length array is used along with packed query/key/values – similar in essence to NestedTensor.

In conclusion

Using torch.nn.functional.scaled_dot_product_attention is a free-lunch optimization: it makes your code more readable, uses less memory, and is in most common cases faster.

Although the implementation in PyTorch 2.0.0 still has minor limitations, inference and training already benefit massively from SDPA in most cases. We encourage you to use this native implementation whether you train or deploy your PyTorch models, and, for 🤗 Transformers models, as a one-line transformation!

In the future, we would like to adapt the API to enable users to use SDPA in encoder-based models as well.

We thank Benjamin Lefaudeux, Daniel Haziza and Francisco Massa for their advice on the head dimension influence, as well as Michael Gschwind, Christian Puhrsch and Driss Guessous for their feedback on the blog post!

Benchmark reproduction

The benchmark presented in this post was done using torch==2.0.0, transformers==4.27.4, accelerate==0.18.0 and optimum==1.8.0.

The benchmarks can be easily reproduced using the scripts for inference, training for 🤗 Transformers models, and standalone SDPA.

Read More