Modeling and improving text stability in live captions

Automatic speech recognition (ASR) technology has made conversations more accessible with live captions in remote conferencing software, mobile applications, and head-worn displays. However, to maintain real-time responsiveness, live caption systems often display interim predictions that are updated as new utterances are received. This can cause text instability (a “flicker” where previously displayed text is updated, shown in the captions on the left in the video below), which can impair users’ reading experience due to distraction, fatigue, and difficulty following the conversation.

In “Modeling and Improving Text Stability in Live Captions”, presented at ACM CHI 2023, we formalize this problem of text stability through a few key contributions. First, we quantify text instability with a vision-based flicker metric that uses luminance contrast and the discrete Fourier transform. Second, we introduce a stability algorithm that stabilizes the rendering of live captions via tokenized alignment, semantic merging, and smooth animation. Finally, we conducted a user study (N=123) to understand viewers’ experience with live captioning. Our statistical analysis demonstrates a statistically significant correlation between our proposed flicker metric and viewers’ experience. Furthermore, it shows that our proposed stabilization techniques significantly improve viewers’ experience (e.g., the captions on the right in the video).

Raw ASR captions vs. stabilized captions


Inspired by previous work, we propose a flicker-based metric to quantify text stability and objectively evaluate the performance of live captioning systems. Specifically, our goal is to quantify the flicker in a grayscale live caption video. We achieve this by comparing the difference in luminance between the individual frames (frames in the figures below) that constitute the video. Large visual changes in luminance are obvious (e.g., addition of the word “bright” in the figure on the bottom), but subtle changes (e.g., update from “… this gold. Nice..” to “… this. Gold is nice”) may be difficult for readers to discern. However, converting the change in luminance to its constituent frequencies exposes both the obvious and the subtle changes.

Thus, for each pair of contiguous frames, we convert the difference in luminance into its constituent frequencies using a discrete Fourier transform. We then sum over the low and high frequencies to quantify the flicker in this pair. Finally, we average over all of the frame pairs to get a per-video flicker.

For instance, we can see below that two identical frames (top) yield a flicker of 0, while two non-identical frames (bottom) yield a non-zero flicker. Higher values of the metric indicate more flicker in the video and thus a worse user experience than lower values.

Illustration of the flicker metric between two identical frames.
Illustration of the flicker between two non-identical frames.
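
To make the steps above concrete, here is a minimal sketch of a flicker score computed from grayscale frames. It is an illustration under stated assumptions rather than the exact metric from the paper: the function names are hypothetical, the transform is applied spatially to each frame difference, and the normalization by pixel count is our own choice.

```python
import numpy as np


def pair_flicker(prev_frame: np.ndarray, next_frame: np.ndarray) -> float:
    """Flicker between two grayscale frames of equal shape (luminance values)."""
    # Difference in luminance between the two contiguous frames.
    diff = next_frame.astype(np.float64) - prev_frame.astype(np.float64)
    # Decompose the difference into its constituent spatial frequencies.
    spectrum = np.fft.fft2(diff)
    # Sum the magnitudes over all (low and high) frequencies. Dividing by the
    # pixel count is an assumption that keeps scores comparable across sizes.
    return float(np.abs(spectrum).sum() / diff.size)


def video_flicker(frames: list) -> float:
    """Average the pairwise flicker over all contiguous frame pairs."""
    if len(frames) < 2:
        return 0.0
    scores = [pair_flicker(a, b) for a, b in zip(frames, frames[1:])]
    return float(np.mean(scores))


# Tiny synthetic example: a blank caption area, then a "word" appearing.
blank = np.zeros((4, 8))
with_word = blank.copy()
with_word[1:3, 2:6] = 1.0

print(video_flicker([blank, blank]))      # identical frames
print(video_flicker([blank, with_word]))  # a visible change
```

Identical frames have a zero difference image, so every frequency component is zero and the score is 0. Any change in luminance produces a non-zero spectrum and therefore a positive score.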

Stability algorithm

To improve the stability of live captions, we propose an algorithm that takes as input the already rendered sequence of tokens (e.g., “Previous” in the figure below) and the new sequence of ASR predictions, and outputs an updated, stabilized text (e.g., “Updated text (with stabilization)” below). It considers both the natural language understanding (NLU) aspect and the ergonomic aspect (display, layout, etc.) of the user experience in deciding when and how to produce a stable updated text. Specifically, our algorithm performs tokenized alignment, semantic merging, and smooth animation to achieve this goal. In what follows, a token is defined as a word or punctuation mark produced by ASR.

We show (a) the previously already rendered text, (b) the baseline layout of updated text without our merging algorithm, and (c) the updated text as generated by our stabilization algorithm.

Our algorithm addresses the challenge of producing stabilized updated text by first identifying three classes of changes (highlighted in red, green, and blue below):

  1. Case A (red): Addition of tokens to the end of previously rendered captions (e.g., “How about”).
  2. Case B (green): Addition / deletion of tokens in the middle of already rendered captions.
    • B1: Addition of tokens (e.g., “I” and “friends”). These may or may not affect the overall comprehension of the captions, but may lead to layout changes. Such layout changes are not desired in live captions because they cause significant jitter and a poorer user experience. Here “I” does not add to the comprehension but “friends” does. Thus, it is important to balance updates with stability, especially for B1 type tokens.
    • B2: Removal of tokens, e.g., “in” is removed in the updated sentence.
  3. Case C (blue): Re-captioning of tokens. This includes token edits that may or may not have an impact on the overall comprehension of the captions.
    • C1: Proper nouns like “disney land” are updated to “Disneyland”.
    • C2: Grammatical shorthands like “it’s” are updated to “It was”.

Classes of changes between previously displayed and updated text.

Alignment, merging, and smoothing

To maximize text stability, our goal is to align the old sequence with the new sequence using updates that make minimal changes to the existing layout while ensuring accurate and meaningful captions. To achieve this, we leverage a variant of the Needleman-Wunsch algorithm with dynamic programming to merge the two sequences depending on the class of tokens as defined above:

  • Case A tokens: We directly add case A tokens, and line breaks as needed to fit the updated captions.
  • Case B tokens: Our preliminary studies showed that users preferred stability over accuracy for previously displayed captions. Thus, we only update case B tokens if the updates do not break an existing line layout.
  • Case C tokens: We compare the semantic similarity of case C tokens by transforming original and updated sentences into sentence embeddings, measuring their dot-product, and updating them only if they are semantically different (similarity < 0.85) and the update will not cause new line breaks.
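
The alignment and merging rules above can be sketched as follows. This is a simplified illustration, not our production algorithm: it uses a textbook Needleman-Wunsch alignment with assumed scores (match = 1, mismatch = -1, gap = -1), the function names are hypothetical, and the Case C check assumes unit-length sentence embeddings supplied by some external encoder.

```python
def align_tokens(old, new, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists; returns (kind, old_token, new_token) ops."""
    n, m = len(old), len(new)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if old[i - 1] == new[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if old[i - 1] == new[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                kind = "keep" if old[i - 1] == new[j - 1] else "edit"
                ops.append((kind, old[i - 1], new[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            ops.append(("delete", old[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, new[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def classify(ops):
    """Label each op: trailing inserts are Case A, other inserts/deletes are
    Case B, and edits are Case C."""
    last_other = max((k for k, op in enumerate(ops) if op[0] != "insert"), default=-1)
    labels = []
    for k, (kind, _, _) in enumerate(ops):
        if kind == "keep":
            labels.append("keep")
        elif kind == "edit":
            labels.append("C")
        elif kind == "insert" and k > last_other:
            labels.append("A")
        else:
            labels.append("B")
    return labels


def should_update_case_c(old_vec, new_vec, causes_line_break, threshold=0.85):
    """Update a Case C token only if the sentences differ semantically
    (dot product below the threshold) and no new line break results."""
    similarity = sum(a * b for a, b in zip(old_vec, new_vec))
    return similarity < threshold and not causes_line_break


print(classify(align_tokens(["I", "love", "this"],
                            ["I", "love", "this", "How", "about"])))
print(classify(align_tokens(["we", "are", "in", "here"],
                            ["we", "are", "here"])))
```

Case B updates would additionally be gated on whether they break the existing line layout, which depends on the renderer and is not modeled here.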

Finally, we leverage animations to reduce visual jitter. We implement smooth scrolling and fading of newly added tokens to further stabilize the overall layout of the live captions.

User evaluation

We conducted a user study with 123 participants to (1) examine the correlation of our proposed flicker metric with viewers’ experience of the live captions, and (2) assess the effectiveness of our stabilization techniques.

We manually selected 20 videos on YouTube to obtain broad coverage of topics, including video conferences, documentaries, academic talks, tutorials, news, comedy, and more. For each video, we selected a 30-second clip with at least 90% speech.

We prepared four types of renderings of live captions to compare:

  1. Raw ASR: raw speech-to-text results from a speech-to-text API.
  2. Raw ASR + thresholding: only display interim speech-to-text result if its confidence score is higher than 0.85.
  3. Stabilized captions: captions using our algorithm described above with alignment and merging.
  4. Stabilized and smooth captions: stabilized captions with smooth animation (scrolling + fading) to assess whether softened display experience helps improve the user experience.
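
As an illustration of the thresholding baseline (rendering 2), the sketch below keeps the previously displayed text whenever an interim result falls at or below the 0.85 confidence threshold. The result fields ("text", "confidence", "is_final") are hypothetical, and always showing final results is our assumption; a real speech-to-text API will have its own response format.

```python
def displayed_captions(results, threshold=0.85):
    """Return the caption shown after each ASR result arrives."""
    shown = ""
    history = []
    for result in results:
        # Assumption: final results are always displayed; interim results
        # are displayed only when their confidence exceeds the threshold.
        if result["is_final"] or result["confidence"] > threshold:
            shown = result["text"]
        history.append(shown)
    return history


stream = [
    {"text": "hello", "confidence": 0.90, "is_final": False},
    {"text": "hello word", "confidence": 0.50, "is_final": False},
    {"text": "hello world", "confidence": 0.60, "is_final": True},
]
print(displayed_captions(stream))
```

The low-confidence interim result is never rendered, which avoids one flicker at the cost of a slightly delayed update.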

We collected user ratings by asking the participants to watch the recorded live captions and rate their assessments of comfort, distraction, ease of reading, ease of following the video, fatigue, and whether the captions impaired their experience.

Correlation between flicker metric and user experience

We calculated Spearman’s coefficient between the flicker metric and each of the behavioral measurements (values range from -1 to 1, where negative values indicate a negative relationship between the two variables, positive values indicate a positive relationship, and zero indicates no relationship). As shown below, our study demonstrates statistically significant (𝑝 < 0.001) correlations between our flicker metric and users’ ratings. The absolute values of the coefficient are around 0.3, indicating a moderate relationship.

Behavioral measurement    Correlation to flicker metric*
Comfort                   -0.29
Distraction                0.33
Easy to read              -0.31
Easy to follow videos     -0.29
Fatigue                    0.36
Impaired experience        0.31

Spearman correlation tests of our proposed flicker metric. *p < 0.001.
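
For readers unfamiliar with Spearman’s coefficient, the sketch below shows how it can be computed: rank both variables (averaging ranks for ties) and take the Pearson correlation of the ranks. The flicker values and ratings in the example are made up for illustration and are not data from our study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-video flicker values and comfort ratings (1 to 7).
flicker = [0.1, 0.2, 0.3, 0.4]
comfort = [7, 6, 4, 2]
print(spearman(flicker, comfort))
```

In this toy example comfort falls monotonically as flicker rises, so the coefficient is at the negative extreme; real ratings are far noisier, which is why the study values sit around 0.3 in absolute terms.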

Stabilization of live captions

Our proposed technique (stabilized smooth captions) received consistently better ratings, significant as measured by the Mann-Whitney U test (p < 0.01 in the figure below), in five out of six aforementioned survey statements. That is, users considered the stabilized captions with smoothing to be more comfortable and easier to read, while feeling less distraction, fatigue, and impairment to their experience than other types of rendering.

User ratings from 1 (Strongly Disagree) – 7 (Strongly Agree) on survey statements. (**: p<0.01, ***: p<0.001; ****: p<0.0001; ns: non-significant)

Conclusion and future direction

Text instability in live captioning significantly impairs users’ reading experience. This work proposes a vision-based metric to model caption stability that statistically significantly correlates with users’ experience, and an algorithm to stabilize the rendering of live captions. Our proposed solution can be potentially integrated into existing ASR systems to enhance the usability of live captions for a variety of users, including those with translation needs or those with hearing accessibility needs.

Our work represents a substantial step towards measuring and improving text stability. This can be evolved to include language-based metrics that focus on the consistency of the words and phrases used in live captions over time. These metrics may provide a reflection of user discomfort as it relates to language comprehension and understanding in real-world scenarios. We are also interested in conducting eye-tracking studies (e.g., videos shown below) to track viewers’ gaze patterns, such as eye fixation and saccades, allowing us to better understand the types of errors that are most distracting and how to improve text stability for those.

Illustration of tracking a viewer’s gaze when reading raw ASR captions.

Illustration of tracking a viewer’s gaze when reading stabilized and smoothed captions.

By improving text stability in live captions, we can create more effective communication tools and improve how people connect in everyday conversations in familiar or, through translation, unfamiliar languages.


This work is a collaboration across multiple teams at Google. Key contributors include Xingyu “Bruce” Liu, Jun Zhang, Leonardo Ferrer, Susan Xu, Vikas Bahirwani, Boris Smus, Alex Olwal, and Ruofei Du. We wish to extend our thanks to our colleagues who provided assistance, including Nishtha Bhatia, Max Spear, and Darcy Philippon. We would also like to thank Lin Li, Evan Parker, and CHI 2023 reviewers.

Automatically generate impressions from findings in radiology reports using generative AI on AWS

Radiology reports are comprehensive, lengthy documents that describe and interpret the results of a radiological imaging examination. In a typical workflow, the radiologist supervises, reads, and interprets the images, and then concisely summarizes the key findings. The summarization (or impression) is the most important part of the report because it helps clinicians and patients focus on the critical contents of the report that contain information for clinical decision-making. Creating a clear and impactful impression involves much more effort than simply restating the findings. The entire process is therefore laborious, time consuming, and prone to error. It often takes years of training for doctors to accumulate enough expertise in writing concise and informative radiology report summarizations, further highlighting the significance of automating the process. Additionally, automatic generation of report findings summarization is critical for radiology reporting. It enables translation of reports into human readable language, thereby alleviating the patients’ burden of reading through lengthy and obscure reports.

To solve this problem, we propose the use of generative AI, a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. Generative AI is powered by machine learning (ML) models—very large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). Recent advancements in ML (specifically the invention of the transformer-based neural network architecture) have led to the rise of models that contain billions of parameters or variables. The proposed solution in this post uses fine-tuning of pre-trained large language models (LLMs) to help generate summarizations based on findings in radiology reports.

This post demonstrates a strategy for fine-tuning publicly available LLMs for the task of radiology report summarization using AWS services. LLMs have demonstrated remarkable capabilities in natural language understanding and generation, serving as foundation models that can be adapted to various domains and tasks. There are significant benefits to using a pre-trained model. It reduces computation costs, reduces carbon footprints, and allows you to use state-of-the-art models without having to train one from scratch.

Our solution uses the FLAN-T5 XL FM through Amazon SageMaker JumpStart, an ML hub offering algorithms, models, and ML solutions. We demonstrate how to accomplish this using a notebook in Amazon SageMaker Studio. Fine-tuning a pre-trained model involves further training on specific data to improve performance on a different but related task. This solution involves fine-tuning the FLAN-T5 XL model, an enhanced version of the T5 (Text-to-Text Transfer Transformer) family of general-purpose LLMs. T5 reframes natural language processing (NLP) tasks into a unified text-to-text format, in contrast to BERT-style models that can only output either a class label or a span of the input. In this post, the model is fine-tuned for a summarization task on 91,544 free-text radiology reports obtained from the MIMIC-CXR dataset.

Overview of solution

In this section, we discuss the key components of our solution: choosing the strategy for the task, fine-tuning an LLM, and evaluating the results. We also illustrate the solution architecture and the steps to implement the solution.

Identify the strategy for the task

There are various strategies to approach the task of automating clinical report summarization. For example, we could use a specialized language model pre-trained on clinical reports from scratch. Alternatively, we could directly fine-tune a publicly available general-purpose language model to perform the clinical task. Using a fine-tuned domain-agnostic model may be necessary in settings where training a language model from scratch is too costly. In this solution, we demonstrate the latter approach of using a FLAN-T5 XL model, which we fine-tune for the clinical task of summarizing radiology reports. The following diagram illustrates the model workflow.

A typical radiology report is well-organized and succinct. Such reports often have three key sections:

  • Background – Provides general information about the demographics of the patient with essential information about the patient, clinical history, and relevant medical history and details of exam procedures
  • Findings – Presents detailed exam diagnosis and results
  • Impression – Concisely summarizes the most salient findings or interpretation of the findings with an assessment of significance and potential diagnosis based on the observed abnormalities

Using the findings section in the radiology reports, the solution generates the impression section, which corresponds to the doctors’ summarization. The following figure is an example of a radiology report.

Fine-tune a general-purpose LLM for a clinical task

In this solution, we fine-tune a FLAN-T5 XL model (tuning all the parameters of the model and optimizing them for the task). We fine-tune the model using the clinical domain dataset MIMIC-CXR, which is a publicly available dataset of chest radiographs. To fine-tune this model through SageMaker JumpStart, labeled examples must be provided in the form of {prompt, completion} pairs. In this case, we use pairs of {Findings, Impression} from the original reports in the MIMIC-CXR dataset. For inferencing, we use a prompt as shown in the following example:

The model is fine-tuned on an accelerated computing ml.p3.16xlarge instance with 64 virtual CPUs and 488 GiB memory. For validation, 5% of the dataset was randomly selected. The elapsed time of the SageMaker training job with fine-tuning was 38,468 seconds (approximately 11 hours).

Evaluate the results

When the training is complete, it’s critical to evaluate the results. For a quantitative analysis of the generated impression, we use ROUGE (Recall-Oriented Understudy for Gisting Evaluation), the most commonly used metric for evaluating summarization. This metric compares an automatically produced summary against one or more human-produced reference summaries. ROUGE1 refers to the overlap of unigrams (single words) between the candidate (the model’s output) and reference summaries. ROUGE2 refers to the overlap of bigrams (pairs of consecutive words) between the candidate and reference summaries. ROUGEL is a sentence-level metric based on the longest common subsequence (LCS) between two pieces of text; it ignores newlines in the text. ROUGELsum is a summary-level metric: newlines in the text aren’t ignored but are interpreted as sentence boundaries. The LCS is then computed between each pair of reference and candidate sentences, and then the union-LCS is computed. To aggregate these scores over a given set of reference and candidate summaries, the average is computed.
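
To make these definitions concrete, the following is a simplified sketch of ROUGE1, ROUGE2, and ROUGEL computed as F1 scores over whitespace tokens. It is for illustration only: it omits the stemming and tokenization rules of established ROUGE implementations, it does not cover ROUGELsum, and the function names and example sentences are our own.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall for a given overlap count."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """N-gram overlap F1 between a candidate and a reference summary."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """LCS-based F1 between a candidate and a reference summary."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


candidate = "no acute cardiopulmonary process"
reference = "no acute cardiopulmonary abnormality"
print(rouge_n(candidate, reference, 1))  # 3 of 4 unigrams overlap
print(rouge_n(candidate, reference, 2))  # 2 of 3 bigrams overlap
print(rouge_l(candidate, reference))     # LCS has length 3
```

Averaging these per-report scores over an evaluation set gives the aggregated values reported later in this post.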

Walkthrough and architecture

The overall solution architecture as shown in the following figure primarily consists of a model development environment that uses SageMaker Studio, model deployment with a SageMaker endpoint, and a reporting dashboard using Amazon QuickSight.

In the following sections, we demonstrate fine-tuning an LLM available on SageMaker JumpStart for summarization of a domain-specific task via the SageMaker Python SDK. In particular, we discuss the following topics:

  • Steps to set up the development environment
  • An overview of the radiology report datasets on which the model is fine-tuned and evaluated
  • A demonstration of fine-tuning the FLAN-T5 XL model using SageMaker JumpStart programmatically with the SageMaker Python SDK
  • Inferencing and evaluation of the pre-trained and fine-tuned models
  • Comparison of results from pre-trained model and fine-tuned models

The solution is available in the Generating Radiology Report Impression using generative AI with Large Language Model on AWS GitHub repo.


To get started, you need an AWS account in which you can use SageMaker Studio. You will need to create a user profile for SageMaker Studio if you don’t already have one.

The training instance type used in this post is ml.p3.16xlarge. Note that the p3 instance type requires a service quota limit increase.

The MIMIC-CXR dataset can be accessed through a data use agreement, which requires user registration and completion of a credentialing process.

Set up the development environment

To set up your development environment, you create an S3 bucket, configure a notebook, create endpoints and deploy the models, and create a QuickSight dashboard.

Create an S3 bucket

Create an S3 bucket called llm-radiology-bucket to host the training and evaluation datasets. This will also be used to store the model artifact during model development.

Configure a notebook

Complete the following steps:

  1. Launch SageMaker Studio from either the SageMaker console or the AWS Command Line Interface (AWS CLI).

For more information about onboarding to a domain, see Onboard to Amazon SageMaker Domain.

  2. Create a new SageMaker Studio notebook for cleaning the report data and fine-tuning the model. We use an ml.t3.medium (2 vCPU, 4 GiB) notebook instance with a Python 3 kernel.
  3. Within the notebook, install the relevant packages such as nest-asyncio, IPyWidgets (for interactive widgets for Jupyter notebook), and the SageMaker Python SDK:
!pip install nest-asyncio==1.5.5 --quiet
!pip install ipywidgets==8.0.4 --quiet
!pip install sagemaker==2.148.0 --quiet

Create endpoints and deploy the models for inference

For inferencing the pre-trained and fine-tuned models, create an endpoint and deploy each model in the notebook as follows:

  1. Create a model object from the Model class that can be deployed to an HTTPS endpoint.
  2. Create an HTTPS endpoint with the model object’s pre-built deploy() method:
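from sagemaker import model_uris, script_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

# Retrieve the URI of the pre-trained model
pre_trained_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)

pre_trained_name = name_from_base(f"jumpstart-demo-pre-trained-{model_id}")

# Create the SageMaker model instance of the pre-trained model.
# Assumptions: deploy_image_uri (inference image URI), aws_role (execution role),
# and inference_instance_type are defined elsewhere in the notebook, and the
# Model/deploy arguments below are a plausible completion, not verbatim.
if ("small" in model_id) or ("base" in model_id):
    deploy_source_uri = script_uris.retrieve(
        model_id=model_id, model_version=model_version, script_scope="inference"
    )
    pre_trained_model = Model(
        image_uri=deploy_image_uri,
        source_dir=deploy_source_uri,
        entry_point="inference.py",  # assumed inference script name
        model_data=pre_trained_model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=pre_trained_name,
    )
else:
    # For those large models, we already repack the inference script and model
    # artifacts for you, so the `source_dir` argument to Model is not required.
    pre_trained_model = Model(
        image_uri=deploy_image_uri,
        model_data=pre_trained_model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=pre_trained_name,
    )

# Deploy the pre-trained model. Note that we need to pass Predictor class when we deploy model
# through Model class, for being able to run inference through the SageMaker API
pre_trained_predictor = pre_trained_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    endpoint_name=pre_trained_name,
)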

Create a QuickSight dashboard

Create a QuickSight dashboard with an Athena data source with inference results in Amazon Simple Storage Service (Amazon S3) to compare the inference results with the ground truth. The following screenshot shows our example dashboard.

Radiology report datasets

The model is fine-tuned, with all model parameters tuned, on 91,544 reports downloaded from the MIMIC-CXR v2.0 dataset. Because we used only the radiology report text data, we downloaded just one compressed report file from the MIMIC-CXR website. We evaluate the fine-tuned model on 2,000 reports (referred to as the dev1 dataset) from a separate held-out subset of this dataset. We use another 2,000 radiology reports (referred to as dev2) from the chest X-ray collection of the Indiana University hospital network for evaluating the fine-tuned model. All the datasets are read as JSON files and uploaded to the newly created S3 bucket llm-radiology-bucket. Note that none of the datasets contain any Protected Health Information (PHI) by default; all sensitive information is replaced with three consecutive underscores (___) by the providers.
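
As a rough illustration of preparing such data, the sketch below turns report sections into {prompt, completion} records serialized one JSON object per line. The input field names ("findings", "impression"), the function names, and the one-object-per-line layout are assumptions for this example; check the SageMaker JumpStart documentation for the exact file format the training script expects.

```python
import json


def to_training_record(findings, impression):
    """Build one {prompt, completion} pair from a report's sections."""
    return {"prompt": findings.strip(), "completion": impression.strip()}


def to_json_lines(reports):
    """Serialize reports as one JSON object per line."""
    records = (to_training_record(r["findings"], r["impression"]) for r in reports)
    return "\n".join(json.dumps(record) for record in records)


# Hypothetical report; ___ marks de-identified text, as in the source data.
reports = [
    {
        "findings": " Lungs are clear. No pleural effusion. Comparison to ___. ",
        "impression": "No acute cardiopulmonary process.",
    }
]
print(to_json_lines(reports))
```

The resulting text can be written to a local file and uploaded to the S3 bucket alongside the evaluation sets.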

Fine-tune with the SageMaker Python SDK

For fine-tuning, the model_id is specified as huggingface-text2text-flan-t5-xl from the list of SageMaker JumpStart models. The training_instance_type is set as ml.p3.16xlarge and the inference_instance_type as ml.g5.2xlarge. The training data in JSON format is read from the S3 bucket. The next step is to use the selected model_id to extract the SageMaker JumpStart resource URIs, including image_uri (the Amazon Elastic Container Registry (Amazon ECR) URI for the Docker image), model_uri (the pre-trained model artifact Amazon S3 URI), and script_uri (the training script):

from sagemaker import image_uris, model_uris, script_uris

# Training instance will use this image.
# The arguments after `framework` are an assumed completion of the call.
train_image_uri = image_uris.retrieve(
    region=None,  # assumption: fall back to the session's default region
    framework=None,  # automatically inferred from model_id
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Pre-trained model
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

# Script to execute on the training instance
train_script_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)

output_location = f"s3://{output_bucket}/demo-llm-rad-fine-tune-flan-t5/"

Also, an output location is set up as a folder within the S3 bucket.

Only one hyperparameter, epochs, is changed (to 3); the rest are left at their default values:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# We will override some default hyperparameters with custom values
hyperparameters["epochs"] = "3"

The training metrics such as eval_loss (for validation loss), loss (for training loss), and epoch to be tracked are defined and listed:

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

model_name = "-".join(model_id.split("-")[2:])  # get the most informative part of ID
training_job_name = name_from_base(f"js-demo-{model_name}-{hyperparameters['epochs']}")
print(f"job name: {training_job_name}")

# Regular expressions that extract each metric from the training logs
training_metric_definitions = [
    {"Name": "val_loss", "Regex": r"'eval_loss': ([0-9\.]+)"},
    {"Name": "train_loss", "Regex": r"'loss': ([0-9\.]+)"},
    {"Name": "epoch", "Regex": r"'epoch': ([0-9\.]+)"},
]

We use the SageMaker JumpStart resource URIs (image_uri, model_uri, script_uri) identified earlier to create an estimator and fine-tune it on the training dataset by specifying the S3 path of the dataset. The Estimator class requires an entry_point parameter, which names the JumpStart training script; the training job fails to run if this value is not set.

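# Create SageMaker Estimator instance (assumes aws_role and entry_point are defined earlier)
sm_estimator = Estimator(
    role=aws_role, image_uri=train_image_uri, model_uri=train_model_uri,
    source_dir=train_script_uri, entry_point=entry_point,
    instance_count=1, instance_type=training_instance_type,
    hyperparameters=hyperparameters, output_path=output_location,
    metric_definitions=training_metric_definitions,
)

# Launch a SageMaker training job over data located in the given S3 path. Training jobs can take
# hours; it is recommended to set wait=False and monitor job status through the SageMaker console.
sm_estimator.fit({"training": train_data_location}, job_name=training_job_name, wait=True)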

This training job can take hours to complete; therefore, it’s recommended to set the wait parameter to False and monitor the training job status on the SageMaker console. Use the TrainingJobAnalytics class to keep track of the training metrics at various timestamps:

from sagemaker import TrainingJobAnalytics

# Wait for a couple of minutes for the job to start before running this cell
# This can be called while the job is still running
df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()

Deploy inference endpoints

In order to draw comparisons, we deploy inference endpoints for both the pre-trained and fine-tuned models.

First, retrieve the inference Docker image URI using model_id, and use this URI to create a SageMaker model instance of the pre-trained model. Deploy the pre-trained model by creating an HTTPS endpoint with the model object’s pre-built deploy() method. In order to run inference through SageMaker API, make sure to pass the Predictor class.

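from sagemaker import image_uris, model_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor

# Retrieve the inference docker image URI. This is the base HuggingFace container image.
# The arguments after `framework` are an assumed completion of the call.
deploy_image_uri = image_uris.retrieve(
    region=None,  # assumption: fall back to the session's default region
    framework=None,  # automatically inferred from model_id
    image_scope="inference", model_id=model_id, model_version=model_version,
    instance_type=inference_instance_type,
)

# Retrieve the URI of the pre-trained model
pre_trained_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)

# Assumption: aws_role (the execution role) is defined earlier in the notebook
pre_trained_model = Model(
    image_uri=deploy_image_uri, model_data=pre_trained_model_uri,
    role=aws_role, predictor_cls=Predictor, name=pre_trained_name,
)

# Deploy the pre-trained model. Note that we need to pass Predictor class when we deploy model
# through Model class, for being able to run inference through the SageMaker API
pre_trained_predictor = pre_trained_model.deploy(
    initial_instance_count=1, instance_type=inference_instance_type,
    endpoint_name=pre_trained_name,
)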

Repeat the preceding step to create a SageMaker model instance of the fine-tuned model and create an endpoint to deploy the model.

Evaluate the models

First, set the length of summarized text, number of model outputs (should be greater than 1 if multiple summaries need to be generated), and number of beams for beam search.

Construct the inference request as a JSON payload and use it to query the endpoints for the pre-trained and fine-tuned models.
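
A request payload along these lines can be sketched as follows. The key names ("text_inputs", "max_length", "num_return_sequences", "num_beams") follow the convention we assume for JumpStart text2text models, and the default values and function name are placeholders chosen for this example rather than the settings used in our evaluation.

```python
import json


def build_payload(findings, max_length=100, num_return_sequences=1, num_beams=4):
    """Build the JSON request body for one summarization request."""
    # Beam search cannot return more sequences than it keeps beams.
    if num_return_sequences > num_beams:
        raise ValueError("num_return_sequences must not exceed num_beams")
    return {
        "text_inputs": findings,
        "max_length": max_length,
        "num_return_sequences": num_return_sequences,
        "num_beams": num_beams,
    }


payload = build_payload("Lungs are clear. No pleural effusion or pneumothorax.")
encoded = json.dumps(payload).encode("utf-8")  # bytes to send to the endpoint
print(encoded.decode("utf-8"))
```

The same encoded payload can be sent to both endpoints with a JSON content type so that the pre-trained and fine-tuned models are compared on identical inputs and generation settings.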

Compute the aggregated ROUGE scores (ROUGE1, ROUGE2, ROUGEL, ROUGELsum) as described earlier.

Compare the results

The following table shows the evaluation results for the dev1 and dev2 datasets. On dev1 (2,000 findings from the MIMIC-CXR radiology reports), fine-tuning improves the aggregated average ROUGE1 and ROUGE2 scores by approximately 38 and 37 percentage points, respectively, compared to the pre-trained model. For dev2, an improvement of 31 percentage points and 25 percentage points is observed in ROUGE1 and ROUGE2 scores. Overall, fine-tuning led to an improvement of 38.2 percentage points and 31.3 percentage points in ROUGELsum scores for the dev1 and dev2 datasets, respectively.



          Pre-trained model                        Fine-tuned model
Dataset   ROUGE1   ROUGE2   ROUGEL   ROUGELsum     ROUGE1   ROUGE2   ROUGEL   ROUGELsum
dev1      0.2239   0.1134   0.1891   0.1891        0.6040   0.4800   0.5705   0.5708
dev2      0.1583   0.0599   0.1391   0.1393        0.4660   0.3125   0.4525   0.4525
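
The percentage-point improvements quoted above can be reproduced from the table with a few lines of arithmetic. The dictionary layout and function name below are just one convenient way to hold the published numbers.

```python
# Aggregated ROUGE scores copied from the table above.
scores = {
    "dev1": {
        "pre_trained": {"ROUGE1": 0.2239, "ROUGE2": 0.1134, "ROUGEL": 0.1891, "ROUGELsum": 0.1891},
        "fine_tuned": {"ROUGE1": 0.6040, "ROUGE2": 0.4800, "ROUGEL": 0.5705, "ROUGELsum": 0.5708},
    },
    "dev2": {
        "pre_trained": {"ROUGE1": 0.1583, "ROUGE2": 0.0599, "ROUGEL": 0.1391, "ROUGELsum": 0.1393},
        "fine_tuned": {"ROUGE1": 0.4660, "ROUGE2": 0.3125, "ROUGEL": 0.4525, "ROUGELsum": 0.4525},
    },
}


def improvement_pp(pre_trained, fine_tuned):
    """Absolute improvement in percentage points for each metric."""
    return {
        name: round((fine_tuned[name] - pre_trained[name]) * 100, 1)
        for name in pre_trained
    }


for dataset, result in scores.items():
    print(dataset, improvement_pp(result["pre_trained"], result["fine_tuned"]))
```

These are absolute differences between scores on a 0 to 1 scale, expressed in percentage points, not relative percentage gains.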

The following box plots depict the distribution of ROUGE scores for the dev1 and dev2 datasets evaluated using the fine-tuned model.

(a): dev1 (b): dev2

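The following table shows summary statistics of the ROUGE scores for the evaluation datasets. For most scores the mean and median (50th percentile) are broadly similar, which suggests roughly symmetric distributions; ROUGE2 on dev1 (mean 0.4798, median 0.4000) is the clearest exception.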

Dataset  Score      Count  Mean    Std dev  Min     25th pct  50th pct  75th pct  Max
dev1     ROUGE1     2000   0.6038  0.3065   0.0000  0.3653    0.6000    0.9384    1.0000
dev1     ROUGE2     2000   0.4798  0.3578   0.0000  0.1818    0.4000    0.8571    1.0000
dev1     ROUGEL     2000   0.5706  0.3194   0.0000  0.3000    0.5345    0.9101    1.0000
dev1     ROUGELsum  2000   0.5706  0.3194   0.0000  0.3000    0.5345    0.9101    1.0000
dev2     ROUGE1     2000   0.4659  0.2525   0.0000  0.2500    0.5000    0.7500    1.0000
dev2     ROUGE2     2000   0.3123  0.2645   0.0000  0.0664    0.2857    0.5610    1.0000
dev2     ROUGEL     2000   0.4529  0.2554   0.0000  0.2349    0.4615    0.7500    1.0000
dev2     ROUGELsum  2000   0.4529  0.2554   0.0000  0.2349    0.4615    0.7500    1.0000

Clean up

To avoid incurring future charges, delete the resources you created with the following code:

# Delete resources
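
One way to structure the cleanup is sketched below. It relies on the delete_model() and delete_endpoint() methods of SageMaker Predictor objects; the helper name and the stand-in predictor class are hypothetical and exist only so the example can run without an AWS account.

```python
def delete_endpoint_resources(predictors):
    """Delete the model and endpoint behind each predictor."""
    for predictor in predictors:
        # Delete the model first: the predictor looks up its model through
        # the endpoint configuration, which delete_endpoint() also removes.
        predictor.delete_model()
        predictor.delete_endpoint()


class RecordingPredictor:
    """Stand-in for a SageMaker Predictor that only records calls (dry run)."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def delete_model(self):
        self.log.append((self.name, "delete_model"))

    def delete_endpoint(self):
        self.log.append((self.name, "delete_endpoint"))


calls = []
delete_endpoint_resources([
    RecordingPredictor("pre-trained", calls),
    RecordingPredictor("fine-tuned", calls),
])
print(calls)
```

In the notebook, you would pass the real predictors instead, for example pre_trained_predictor and the predictor for the fine-tuned endpoint (whose variable name is not shown in this post).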


In this post, we demonstrated how to fine-tune a FLAN-T5 XL model for a clinical domain-specific summarization task using SageMaker Studio. To increase confidence in the results, we compared the predictions with ground truth and evaluated them using ROUGE metrics. We demonstrated that a model fine-tuned for a specific task returns better results than a model pre-trained on a generic NLP task. We would like to point out that fine-tuning a general-purpose LLM avoids the cost of pre-training a model from scratch.

Although the work presented here focuses on chest X-ray reports, it has the potential to be expanded to bigger datasets with varied anatomies and modalities, such as MRI and CT, for which radiology reports might be more complex with multiple findings. In such cases, radiologists could generate impressions in order of criticality and include follow-up recommendations. Furthermore, setting up a feedback loop for this application would enable radiologists to improve the performance of the model over time.

As we showed in this post, the fine-tuned model generates impressions for radiology reports with high ROUGE scores. You can try to fine-tune LLMs on other domain-specific medical reports from different departments.

About the authors

Dr. Adewale Akinfaderin is a Senior Data Scientist in Healthcare and Life Sciences at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global healthcare customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in Physics and a Doctorate degree in Engineering.

Priya Padate is a Senior Partner Solutions Architect with extensive expertise in Healthcare and Life Sciences at AWS. Priya drives go-to-market strategies with partners and drives solution development to accelerate AI/ML-based development. She is passionate about using technology to transform the healthcare industry to drive better patient care outcomes.

Ekta Walia Bhullar, PhD, is a senior AI/ML consultant with AWS Healthcare and Life Sciences (HCLS) professional services business unit. She has extensive experience in the application of AI/ML within the healthcare domain, especially in radiology. Outside of work, when not discussing AI in radiology, she likes to run and hike.

Read More

Research Focus: Week of August 28, 2023

Microsoft Research Focus 23 | Week of August 28, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


An illusion of predictability in scientific results: Even experts confuse inferential uncertainty and outcome variability

In many fields, practitioners focus on inference (precisely estimating an unknown quantity, such as a population average) instead of prediction (forecasting individual outcomes). In a newly published article, researchers from Microsoft demonstrate that this focus on inference over prediction can mislead readers into thinking that the results of scientific studies are more definitive than they actually are.

Through a series of randomized experiments, the researchers demonstrate that this confusion arises for one of the most basic ways of presenting statistical findings and affects even experts whose jobs involve producing and interpreting such results, including medical professionals, data scientists, and tenure-track faculty.  In contrast, the paper shows that communicating both inferential and predictive information side by side provides a simple and effective alternative, leading to calibrated interpretations of scientific results.

This article was published in the Proceedings of the National Academy of Sciences (PNAS).


FiGURe: Simple and Efficient Unsupervised Node Representations with Filter Augmentations

Contrastive learning is a powerful method for unsupervised graph representation learning. It is typically deployed on homophilic tasks, where task labels strongly correlate with the graph’s structure. However, these representations struggle when dealing with heterophilic tasks, where edges tend to connect nodes with different labels.

Several papers have tackled the problem of heterophily by leveraging information from both low and high frequency components. Yet these methods operate in semi-supervised settings, and the extension of these ideas in unsupervised learning still needs to be explored.

In a new paper: FiGURe: Simple and Efficient Unsupervised Node Representations with Filter Augmentations, researchers from Microsoft propose using filter banks for learning representations that can cater to both heterophilic and homophilic tasks. They address the related computational and storage burdens by sharing the encoder across these various filter views, and by learning a low-dimensional representation which is projected to high dimensions using Random Fourier Features. FiGURe achieves a gain of up to 4.4%, compared to the state-of-the-art unsupervised models, across all datasets in consideration, both homophilic and heterophilic.
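As a rough illustration of the last step, the sketch below projects a low-dimensional vector to a higher dimension with Random Fourier Features. It is a generic textbook construction for approximating a Gaussian (RBF) kernel, not the FiGURe implementation; the function names, the `gamma` parameter, and the dimensions are assumptions chosen for the example.

```python
import math
import random

def make_rff(in_dim, out_dim, gamma=1.0, seed=0):
    """Build a feature map z with z(x).z(y) ~ exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)  # frequencies follow N(0, 2 * gamma) for this kernel
    weights = [[rng.gauss(0.0, std) for _ in range(in_dim)] for _ in range(out_dim)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(out_dim)]
    norm = math.sqrt(2.0 / out_dim)

    def project(x):
        # One cosine feature per random frequency vector.
        return [
            norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return project

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 4-dimensional embeddings projected to 4000 dimensions.
project = make_rff(in_dim=4, out_dim=4000)
x = [0.1, 0.2, 0.3, 0.4]
y = [0.2, 0.1, 0.4, 0.3]
approx = dot(project(x), project(y))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))
print(f"approx={approx:.3f} exact={exact:.3f}")
```

The inner product of the projected vectors approximates the kernel value, so only the small representation needs to be learned and stored while downstream tasks work with the high-dimensional features.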


Kathleen Sullivan named to Insider’s 30 under 40 in healthcare list

Microsoft Research congratulates Kathleen Sullivan for being named to Insider’s list of 30 under 40 forging a new future in healthcare. After a competitive nomination and interview, Kathleen was selected for this inspiring list of “entrepreneurs, scientists, doctors, and business leaders who are transforming the healthcare industry.”

As senior director of strategy and operations within the health and life sciences division of Microsoft Research, Sullivan helps steer the company’s investments in AI. She helped engineer a Microsoft collaboration with Nuance Technologies, a precursor to Microsoft’s acquisition of Nuance in 2021. In 2018, Sullivan helped secure Microsoft’s partnership with Adaptive Biotechnologies to map the human immune system.

Read the Insider article (subscription required)

The post Research Focus: Week of August 28, 2023 appeared first on Microsoft Research.

Read More

Building a “heavy metal quartet” of AI compilers

By MSR Editor 

Compilation is an important process in program development, in which a program called a compiler translates source code written in a programming language into machine code executable on computer hardware. As AI technology and large-scale AI models become increasingly prevalent across the digital world, their unique characteristics are posing new challenges for compilers.

As AI models have evolved from early versions like recurrent neural networks (RNN) and convolutional neural networks (CNN) to more recent iterations like Transformer, their fundamental architecture is also constantly evolving. Meanwhile, the underlying hardware accelerators, such as graphics processing units (GPUs) and neural processing units (NPUs), are iterating rapidly as well, with some designs disrupting previous architectures. Therefore, an AI compiler plays a critical role in helping new AI models run efficiently on new hardware.

In response, researchers from Microsoft Research, in collaboration with academic colleagues, conducted a series of research projects and released the “heavy-metal quartet” of AI compilers: Rammer, Roller, Welder, and Grinder[1]. This quartet provides systematic and innovative compilation solutions for current mainstream AI models and hardware.

The left diagram shows the unified compiler abstraction with a tile-based intermediate representation (IR) as the core. The right diagram shows the four core AI compilation technologies.
Figure 1: The four core AI compilation technologies based on unified tile abstraction

AI compilation “Rammer” improves hardware parallel utilization

Deep neural networks (DNNs) are widely adopted in image classification, natural language processing, and many other intelligence tasks. Because of their importance, many computing devices such as CPUs, GPUs, and specially designed DNN accelerators are being used to perform DNN computations. One key variable for DNN computation efficiency is scheduling, which determines the order in which computational tasks are performed on hardware. Conventional AI compilers typically treat DNN computation as a data flow graph where each node represents a DNN operator. These operators are implemented as opaque library functions and are scheduled to run on the accelerator separately. At the same time, this process also relies on another layer of schedulers, usually implemented in hardware, to take advantage of the parallelism available in operators. This two-level approach incurs significant scheduling overhead and often does not fully utilize hardware resources.

To address this issue, researchers proposed a new DNN compiler, Rammer, which can optimize the execution of DNN workloads on the massively parallel units of accelerators. Rammer imagines the scheduling space for AI compilation as a two-dimensional plane, where computational tasks are “bricks” that can be divided into different shapes and sizes. The purpose of scheduling in Rammer is to arrange these bricks tightly, as if building a wall, on the computational units of the two-dimensional plane. The arrangement should not leave any gaps, which would hurt hardware utilization and thus reduce execution speed. Rammer works like a compactor in this two-dimensional space: when a DNN program is translated into bricks, Rammer can place them on different computing units of the accelerator to compact them.

A schematic diagram illustrating Rammer’s technical framework. The input to Rammer is a data-flow graph in which each node is an rOperator. Rammer then introduces an rTask-aware DFG compiler to manage inter- and intra-operator scheduling in one place. The rTask-aware DFG compiler generates a static execution plan for runtime execution. Rammer abstracts a hardware accelerator as a virtualized parallel device (vDevice), which includes multiple virtualized execution units (vEUs). The vDevice provides the scheduling and synchronization capabilities at the rTask level so that the rProgram can be mapped to the corresponding vEUs at compile time. The vEUs, together with the vDevice, are mapped to the hardware at runtime.
Figure 2: Rammer’s technical framework

In other words, Rammer generates an efficient static spatiotemporal schedule for DNNs ahead of time (during compilation), minimizing runtime scheduling overhead. Meanwhile, through new hardware-independent abstractions for computing tasks and hardware accelerators, Rammer exposes a larger scheduling space and provides a novel way to implement cooperative intra- and inter-operator scheduling. This allows Rammer to find more efficient schedules, thereby greatly improving hardware utilization.
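To make the brick-packing picture concrete, here is a minimal sketch of building a static schedule ahead of time. It uses a simple greedy longest-task-first heuristic over independent tasks, which is an assumption made for illustration and not Rammer's actual scheduling algorithm; `static_schedule`, the task costs, and the number of units are hypothetical, and data-flow dependencies between tasks are ignored.

```python
import heapq

def static_schedule(task_costs, num_units):
    """Greedily assign each task to the execution unit that frees up first.

    task_costs: estimated duration of each task (hypothetical units).
    Returns (plan, makespan); plan[u] is a list of (task_id, start, end).
    """
    # Min-heap of (time at which the unit becomes free, unit id).
    free_at = [(0.0, unit) for unit in range(num_units)]
    heapq.heapify(free_at)
    plan = [[] for _ in range(num_units)]
    # Placing the longest tasks first tends to leave fewer gaps.
    order = sorted(range(len(task_costs)), key=lambda i: -task_costs[i])
    for task_id in order:
        start, unit = heapq.heappop(free_at)
        end = start + task_costs[task_id]
        plan[unit].append((task_id, start, end))
        heapq.heappush(free_at, (end, unit))
    makespan = max(end for end, _ in free_at)
    return plan, makespan

costs = [4.0, 3.0, 3.0, 2.0, 2.0, 2.0]
plan, makespan = static_schedule(costs, num_units=2)
serial_time = sum(costs)
print(f"serial={serial_time} packed={makespan}")
```

Because the whole plan is computed before execution, nothing needs to be decided at runtime: each unit simply runs its list of tasks in order.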

Researchers evaluated Rammer on multiple devices, including NVIDIA GPUs, AMD GPUs, and Graphcore intelligence processing units (IPUs). Experiments have shown that Rammer significantly outperforms state-of-the-art compilers, such as XLA and TVM, on NVIDIA and AMD GPUs, achieving a speedup of up to 20.1 times. And compared to TensorRT, NVIDIA’s proprietary DNN inference library, Rammer achieves a speedup of up to 3.1 times.

AI compilation “Roller” improves compilation efficiency

An accelerator is equipped with parallel computing units and multiple layers of memory hierarchy. The data needs to be passed upwards layer by layer from the bottom memory layer before computation. At each layer, the data is divided into smaller bricks. Eventually, these smaller bricks are handed over to the top-level processor for computation. The challenge lies in how to partition the data and fill the memory space with large bricks, so as to better utilize available memory and improve efficiency. The current approach involves using machine learning to identify better strategies for partitioning these bricks. However, this typically requires thousands of search steps, each of which is evaluated on the accelerator, in order to find a satisfactory solution. As a result, the process can take days or even weeks to compile a full AI model.

Given the computational logic and the specification of each memory layer, which together present a holistic view of the software and hardware, it is possible to formulate the best strategy for partitioning the bricks, as well as the best brick sizes. This enables faster compilation with good computation efficiency, and it is the key idea behind Roller. Like a road roller, the system lays down high-dimensional tensor data onto two-dimensional memory like tiling a floor, finding the optimal tile sizes given the memory characteristics. At the same time, it encapsulates the tensor shape that aligns with the hardware characteristics of the underlying accelerator, achieving efficient compilation by limiting the choices for shapes.
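The idea of restricting tile shapes to hardware-aligned choices and picking one from a cost model, rather than measuring each candidate on the device, can be sketched as follows. This is a toy model for a matrix-multiplication tile, not Roller's construction algorithm: the function name, the 16-element alignment, the 48 KB capacity, and the intensity formula are all assumptions for the example.

```python
def pick_tile(capacity_bytes, align=16, max_dim=256, elem_bytes=4):
    """Pick an aligned (tm, tn, tk) matmul tile that fits in one memory layer.

    Candidates are limited to multiples of `align`. Each is scored by a toy
    compute intensity: floating-point operations per byte of input loaded.
    """
    best, best_intensity = None, 0.0
    dims = range(align, max_dim + 1, align)
    for tm in dims:
        for tn in dims:
            for tk in dims:
                # A tile (tm x tk), B tile (tk x tn), and C tile (tm x tn).
                footprint = (tm * tk + tk * tn + tm * tn) * elem_bytes
                if footprint > capacity_bytes:
                    continue
                flops = 2 * tm * tn * tk
                loaded = (tm * tk + tk * tn) * elem_bytes
                intensity = flops / loaded
                if intensity > best_intensity:
                    best, best_intensity = (tm, tn, tk), intensity
    return best, best_intensity

tile, intensity = pick_tile(capacity_bytes=48 * 1024)
print(tile, intensity)  # (96, 96, 16) 24.0
```

The search here only evaluates a formula, so it finishes almost instantly. That is the contrast with search-based tuning, where every candidate is run on the accelerator.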

A schematic diagram illustrating Roller’s technical framework. Roller takes an operator described as a tensor expression. Roller extracts the tensor shapes from the tensor expression and leverages hardware specifications to construct rTiles. Based on rTiles, Roller proposes a scale-up-then-scale-out recursive construction algorithm to generate efficient tensor programs (named rPrograms) that describe the data processing pipeline. When generating an rProgram, the construction algorithm identifies good rTile configurations by evaluating the performance of a constructed rProgram through a micro-performance model. It is built on top of a device described through a hardware abstraction layer exposing only rTile-related interfaces: Load, Compute, and Store. The constructed rProgram is finally realized through a code generator to emit the final kernel code corresponding to the specific device.
Figure 3: Roller’s technical framework

Evaluations on six mainstream DNN models and 119 popular DNN operators demonstrated that Roller can generate highly optimized kernels in seconds, especially for large and expensive custom operators. Roller achieves a three-orders-of-magnitude improvement in compilation time compared to existing compilers. The performance of the kernels generated by Roller is comparable to that of state-of-the-art tensor compilers, including DNN libraries, with some operators performing even better. Roller has also been used to customize DNN kernels internally, where it has demonstrably improved development agility.

AI compilation “Welder” optimizes memory access and improves computing efficiency

With the growing demand for processing higher fidelity data and the use of faster computing cores in newer hardware accelerators, modern DNN models are becoming increasingly memory intensive. A disparity between underutilized computing cores and saturated memory bandwidth has been observed in various popular DNN models.

For example, profiling on a state-of-the-art DNN benchmark shows that the memory bandwidth utilization can be as high as 96.7% while the average utilization of computing cores is only 51.6%. More seriously, the ongoing evolution of hardware and DNN models keeps widening this gap. Modern AI models tend to process high-fidelity data, such as larger images, longer sentences, and higher-resolution graphics. Such data demands higher memory bandwidth during computation. Additionally, the introduction of more efficient specialized computing cores (such as NVIDIA Tensor Cores or AMD Matrix Cores) further increases memory pressure.

To address this issue, the researchers proposed the Welder deep learning compiler, which holistically optimizes the memory access efficiency of the end-to-end DNN model. Represented as a data flow graph, the end-to-end DNN computation involves multiple stages, where the input data is divided into blocks that flow through different operators. These blocks are transferred to processor cores for computation and then transferred back to memory, which results in significant overhead due to data movement across memory layers. Since it includes multiple stages, the entire process can be envisioned as a scenario where “workers” are moving bricks upwards layer by layer. The first worker takes the bricks up, processes them, and then puts them back in their original location. The second worker takes them up again, sculpts them, and then once again puts them back. The process continues with the third worker, the fourth worker, and so on, repeatedly moving the bricks.

Would it be possible for the first worker to finish a part of the subtask and then directly hand it over to the next worker at the top level? These tasks can then be “welded” together to achieve a pipelined operation with higher efficiency. Welder plays the role of such a welding tool. By connecting (welding) different operators, data blocks are processed in the manner of an assembly line, greatly reducing memory access traffic at lower-level memory layers. With AI models imposing increasingly high requirements for memory efficiency in recent years, Welder helps to significantly improve computational efficiency.
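The effect of welding operators together can be illustrated with a small sketch that counts loads and stores against the lower memory layer. It assumes a chain of element-wise operators, which is a simplification chosen for the example; the function names and the traffic counting are hypothetical and do not represent Welder's tile-graph scheduling.

```python
def run_unfused(data, ops):
    """Each operator reads its whole input from memory and writes it back."""
    traffic = 0
    for op in ops:
        data = [op(x) for x in data]
        traffic += 2 * len(data)  # one load and one store per element
    return data, traffic

def run_fused(data, ops, tile_size=4):
    """Each tile is loaded once, passed through every operator, then stored."""
    out, traffic = [], 0
    for start in range(0, len(data), tile_size):
        tile = data[start:start + tile_size]  # load the tile once
        for op in ops:
            tile = [op(x) for x in tile]      # intermediate stays in fast memory
        out.extend(tile)                      # store the tile once
        traffic += 2 * len(tile)
    return out, traffic

ops = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
data = list(range(10))
unfused_out, unfused_traffic = run_unfused(data, ops)
fused_out, fused_traffic = run_fused(data, ops)
print(unfused_traffic, fused_traffic)  # 60 20
```

Both versions produce the same result, but the fused version touches the lower memory layer once per tile instead of once per operator.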

A schematic diagram illustrating Welder’s technical framework. Welder takes a full DNN model as input and converts it into a data-flow graph of tile-based computing tasks, which is called a tile-graph. Then, a two-step scheduling algorithm, i.e., graph connecting and sub-graph scheduling, is proposed to recursively decide an efficient tile-graph execution plan for multiple memory layers, known as a hierarchical tile-graph. Finally, this plan is mapped to executable code for a specific hardware accelerator using four abstracted computing interfaces defined in the hardware layer.
Figure 4: Welder’s technical framework

Evaluations on 10 mainstream DNN models (including classic and recent AI model structures for various tasks, such as vision, natural language processing, and 3D graphics) demonstrated that Welder significantly exceeds the performance of existing mainstream frameworks and compilers on both NVIDIA and AMD GPUs. For example, it outperforms PyTorch, ONNXRuntime, and Ansor by up to 21.4 times, 8.7 times, and 2.8 times, respectively. Welder’s automatic optimization surpasses even TensorRT and Faster Transformer (a hand-crafted library), achieving speedups of up to 3.0 times and 1.7 times, respectively. Furthermore, when running these models on hardware with faster computing cores such as Tensor Cores, performance is improved even more, underscoring the significance of memory optimization for future AI accelerators.

AI compilation “Grinder” allows efficient control flow execution on accelerators

In AI computation, the movement of data blocks sometimes requires more complex control logic, i.e., control flow code. For example, a program could iteratively traverse each word in a sentence or dynamically determine which part of a program to execute based on input. Currently, most AI compilers focus on addressing data flow execution efficiency and do not provide efficient support for control flow. As a result, models with more complex control flow cannot effectively utilize accelerator performance. The researchers realized that control flow and data flow can be segmented and reorganized in order to execute more efficiently. Their solution is Grinder, which acts like a portable grinding and cutting machine. After cutting the data flow into parallel computing blocks of different sizes, it then integrates (grinds) control flow into data flow, so that control flow can also be executed efficiently on the accelerator.

A schematic diagram illustrating Grinder’s technical framework. The example loop structure is scheduled as a uProgram mapped on the 3-level accelerator. The uProgram consists of 4 loop-uTasks for 4 L1-Units respectively, and each loop-uTask is mapped to an L1-Unit for execution. Both the data flow operators and the loop are scheduled into the loop-uTasks.
Figure 5: Grinder’s technical framework

Grinder can jointly optimize the execution of control flow and data flow on hardware accelerators and unify the representation of AI models, including both control flow and data flow, through uTask, a new abstraction. This allows Grinder to expose the overall scheduling space for rescheduling control flow to lower levels of hardware parallelism. Grinder uses a heuristic strategy to find an effective scheduling scheme and can automatically move control flow into device kernels, thereby achieving optimizations across control flow boundaries. Experiments have shown that Grinder can achieve up to an 8.2x speedup on control flow-intensive DNN models, making it the fastest among DNN frameworks and compilers for control flow. 

These four AI compilers, based on a common compiler abstraction and unified intermediate representation (IR), solve multiple fundamental problems in current AI compilers, including parallelism, compilation efficiency, memory, and control flow. Together they constitute a comprehensive set of solutions for compilation and have played an important role in the customization and optimization of new AI models within Microsoft Research.

Jilong Xue, Principal Researcher at MSR Asia, summed up the project this way:

“On one hand, AI compilers must perform extreme optimizations like operator fusion and kernel specialization tailored for hardware resources. On the other hand, they must also provide systematic compilation support for new, large-scale hardware architectures, such as AI chips featuring on-chip network interconnection (NoC) or hybrid memory architectures, and even guiding hardware design using white-box compilation technologies. The AI compilers we developed have demonstrated a substantial improvement in AI compilation efficiency, thereby facilitating the training and deployment of AI models. At the same time, the evolution of large-scale models also presents opportunities for the next generation AI compiler. In the future, these large-scale models themselves may inherently assist in achieving optimization and compilation.”

The following researchers have contributed to this project:

(In alphabetical order) Wei Cui, Yuxiao Guo, Wenxiang Hu, Lingxiao Ma, Youshan Miao, Ziming Miao, Yuqing Xia, Jilong Xue, Fan Yang, Mao Yang, Lidong Zhou

[1] Grinder is the research project name. However, this system is referred to as Cocktailer in the paper.

The post Building a “heavy metal quartet” of AI compilers appeared first on Microsoft Research.

Read More

AI Lands at Bengaluru Airport With IoT Company’s Intelligent Video Analytics Platform

Each year, nearly 32 million people travel through the Bengaluru Airport, or BLR, one of the busiest airports in the world’s most populous nation.

To provide such multitudes with a safer, quicker experience, the airport in the city formerly known as Bangalore is tapping vision AI technologies powered by Industry.AI.

A member of the NVIDIA Metropolis vision AI partner ecosystem, Industry.AI has deployed its vision AI platform across BLR’s newest terminal, T2, known as the Garden Terminal for its green spaces, indoor gardens and waterfalls. It’s one of the first deployments of intelligent video analytics at scale in an Indian airport.

Greenery in BLR’s newest terminal.

Industry.AI increases the safety and efficiency of the terminal’s operations by using vision AI and object detection to track abandoned baggage, flag long passenger queues and alert security teams of potential issues, among other use cases.

By identifying congestion points and anticipating delays with vision AI, staff can proactively redirect passengers to less crowded areas or provide signals to open additional checkpoints, reducing wait times and enhancing passenger experiences.

“Deploying vision AI at this scale is a first for us,” said George Fanthome, chief information officer at BLR’s parent company. “By adopting such advanced deep learning technologies, we strive to be one of the best airports in the world and provide our customers the best experience.”

Smarter, Safer Airport Operations

The Industry.AI platform connects more than 500 live camera feeds across the BLR terminal to vision AI technologies that can accomplish nearly a dozen tasks in real time.

For one, the platform can detect when luggage or a purse is left unattended.

It also helps to manage passenger queues at terminal entries, check-in counters, security check lanes and other areas. Airport staff can be trained to proactively perform tasks based on historical data of passenger movement collected by the AI platform.

“Our platform speeds up passenger flow during peak hours of operation by alerting airport staff about longer-than-optimal lines,” said Tejpreet Chopra, CEO of Industry.AI. “This is done through a dashboard with a real-time visual and sensor feed that allows the airport staff to respond to the situation in the shortest possible time.”

Unauthorized people and vehicles in the airport can also be tracked, with alerts sent to the platform’s users in real time for enhanced security. In addition, Industry.AI detects speed violations made by vehicles outside the terminal, helping to manage safe transportation around the travel hub.

AI helps manage transportation inside and outside of BLR.

Industry.AI uses the NVIDIA TAO Toolkit and A100 Tensor Core GPUs to train its AI models. For AI inference, the company taps NVIDIA Triton Inference Server and A30 Tensor Core GPUs.

And with the NVIDIA DeepStream software development kit for AI-powered video analytics, along with technical expertise from NVIDIA — a benefit of being a member of the NVIDIA Inception program for cutting-edge startups — Industry.AI built and deployed the BLR solution in just three months.

“NVIDIA Metropolis enabled us to develop our vision AI applications more cost-effectively and bring them to market faster,” Chopra said.

Looking forward, Industry.AI plans to deploy NVIDIA-powered accelerated computing and vision AI technologies across BLR’s other terminals and at additional airports, too.

“BLR’s focus on adopting advanced AI technologies sets a new benchmark for passenger experience at airports,” Chopra said.

Learn more about the NVIDIA Metropolis platform and how it’s used to build smarter, safer airports.

Read More

Deepdub’s AI Redefining Dubbing from Hollywood to Bollywood

In the global entertainment landscape, TV show and film production stretches far beyond Hollywood or Bollywood — it’s a worldwide phenomenon.

However, while streaming platforms have broadened the reach of content, dubbing and translation technology still has plenty of room for growth.

Deepdub acts as a digital bridge, providing access to content by using generative AI to break down language and cultural barriers.

On the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with the Israel-based startup’s co-founder and CEO, Ofir Krakowski. Deepdub uses AI-driven dubbing to help entertainment companies boost efficiency and cut costs while increasing accessibility.

The company is a member of NVIDIA Inception, a free program that offers startups go-to-market support, expertise and technological assistance.

Traditional dubbing is slow, costly and often misses the mark, Krakowski says. Current technology struggles with the subtleties of language, leaving jokes, idioms or jargon lost in translation.

Deepdub offers a web-based platform that enables people to interact with sophisticated AI models to handle each part of the translation and dubbing process efficiently. It translates the text, generates a voice and mixes it into the original music and audio effects.

But as Krakowski points out, even the best AI models make mistakes, so the platform involves a human touchpoint to verify translations and ensure that generated voices sound natural and capture the right emotion.

Deepdub is also working on matching lip movements to dubbed voices.

Ultimately, Krakowski hopes to free the world from the restrictions placed by language barriers.

“I believe that the technology will enable people to enjoy the content that is created around the world,” he said. “It will globalize storytelling and knowledge, which are currently bound by language barriers.”

You Might Also Like

Jules Anh Tuan Nguyen Explains How AI Lets Amputee Control Prosthetic Hand, Video Games
A postdoctoral researcher at the University of Minnesota discusses his efforts to allow amputees to control their prosthetic limb — right down to the finger motions — with their minds.

Overjet’s Ai Wardah Inam on Bringing AI to Dentistry
Overjet, a member of NVIDIA Inception, is moving fast to bring AI to dentists’ offices. Dr. Wardah Inam, CEO of the company, discusses using AI to improve patient care.

Immunai CTO and Co-Founder Luis Voloch on Using Deep Learning to Develop New Drugs
Luis Voloch talks about tackling the challenges of the immune system with a machine learning and data science mindset.

Read More

SayTap: Language to quadrupedal locomotion

Simple and effective interaction between humans and quadrupedal robots paves the way towards creating intelligent and capable helper robots, forging a future where technology enhances our lives in ways beyond our imagination. Key to such human-robot interaction systems is enabling quadrupedal robots to respond to natural language instructions. Recent developments in large language models (LLMs) have demonstrated the potential to perform high-level planning. Yet, it remains a challenge for LLMs to comprehend low-level commands, such as joint angle targets or motor torques, especially for inherently unstable legged robots, which require high-frequency control signals. Consequently, most existing work presumes the provision of high-level APIs for LLMs to dictate robot behavior, inherently limiting the system’s expressive capabilities.

In “SayTap: Language to Quadrupedal Locomotion”, we propose an approach that uses foot contact patterns (which refer to the sequence and manner in which a four-legged agent places its feet on the ground while moving) as an interface to bridge human commands in natural language and a locomotion controller that outputs low-level commands. This results in an interactive quadrupedal robot system that allows users to flexibly craft diverse locomotion behaviors (e.g., a user can ask the robot to walk, run, jump or make other movements using simple language). We contribute an LLM prompt design, a reward function, and a method to expose the SayTap controller to the feasible distribution of contact patterns. We demonstrate that SayTap is a controller capable of achieving diverse locomotion patterns that can be transferred to real robot hardware.

SayTap method

The SayTap approach uses a contact pattern template, which is a 4 × T matrix of 0s and 1s, with 0s representing an agent’s feet in the air and 1s for feet on the ground. From top to bottom, each row in the matrix gives the foot contact patterns of the front left (FL), front right (FR), rear left (RL) and rear right (RR) feet. SayTap’s control frequency is 50 Hz, so each 0 or 1 lasts 0.02 seconds. In this work, a desired foot contact pattern is defined by a cyclic sliding window of size Lw and of shape 4 × Lw. The sliding window extracts from the contact pattern template four foot-ground contact flags, which indicate if a foot is on the ground or in the air between t + 1 and t + Lw. The figure below provides an overview of the SayTap method.
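The cyclic sliding window described above can be sketched in a few lines. The template values, the zero-based indexing of time steps, and the function name are assumptions made for the example, not the paper's code.

```python
CONTROL_HZ = 50  # control frequency stated in the post; each step lasts 1/50 s

def desired_contact_pattern(template, t, window):
    """Return the 4 x window slice for steps t+1 .. t+window, wrapping cyclically."""
    period = len(template[0])
    return [[row[(t + 1 + k) % period] for k in range(window)] for row in template]

# Hypothetical trot-like template with T = 8: diagonal foot pairs alternate.
template = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # FL
    [0, 0, 0, 0, 1, 1, 1, 1],  # FR
    [0, 0, 0, 0, 1, 1, 1, 1],  # RL
    [1, 1, 1, 1, 0, 0, 0, 0],  # RR
]

pattern = desired_contact_pattern(template, t=5, window=4)
cycle_seconds = len(template[0]) / CONTROL_HZ
for name, row in zip(["FL", "FR", "RL", "RR"], pattern):
    print(name, row)
```

With `t=5` the window covers steps 6, 7, 0 and 1, so it wraps around the end of the template. At 50 Hz this 8-step template repeats every 0.16 seconds.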

SayTap introduces these desired foot contact patterns as a new interface between natural language user commands and the locomotion controller. The locomotion controller is used to complete the main task (e.g., following specified velocities) and to place the robot’s feet on the ground at the specified time, such that the realized foot contact patterns are as close to the desired contact patterns as possible. To achieve this, the locomotion controller takes the desired foot contact pattern at each time step as its input in addition to the robot’s proprioceptive sensory data (e.g., joint positions and velocities) and task-related inputs (e.g., user-specified velocity commands). We use deep reinforcement learning to train the locomotion controller and represent it as a deep neural network. During controller training, a random generator samples the desired foot contact patterns, and the policy is then optimized to output low-level robot actions to achieve the desired foot contact pattern. Then, at test time, an LLM translates user commands into foot contact patterns.

SayTap approach overview.

SayTap uses foot contact patterns (e.g., 0 and 1 sequences for each foot in the inset, where 0s are foot in the air and 1s are foot on the ground) as an interface that bridges natural language user commands and low-level control commands. With a reinforcement learning-based locomotion controller that is trained to realize the desired contact patterns, SayTap allows a quadrupedal robot to take both simple and direct instructions (e.g., “Trot forward slowly.”) as well as vague user commands (e.g., “Good news, we are going to a picnic this weekend!”) and react accordingly.

We demonstrate that the LLM is capable of accurately mapping user commands into foot contact pattern templates in specified formats when given properly designed prompts, even in cases when the commands are unstructured or vague. In training, we use a random pattern generator to produce contact pattern templates with various pattern lengths T and foot-ground contact ratios within a cycle, based on a given gait type G, so that the locomotion controller learns from a wide distribution of movements, leading to better generalization. See the paper for more details.
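As a rough illustration of what such a random pattern generator might look like, the sketch below samples a gait type G, a pattern length T and a contact ratio, then builds a 4 × T template. Everything specific here is an assumption for the example: the leg groupings follow the gait descriptions in the prompt, the half-cycle phase offset between leg groups is a simplification, and the sampling ranges are arbitrary. The generator used in the paper may differ in all of these respects.

```python
import numpy as np

LEGS = ("FL", "FR", "RL", "RR")

# Assumed leg groupings: legs in a group touch down together.
GAIT_GROUPS = {
    "trot": (("FL", "RR"), ("FR", "RL")),   # diagonal pairs
    "pace": (("FL", "RL"), ("FR", "RR")),   # same-side pairs
    "bound": (("FL", "FR"), ("RL", "RR")),  # front pair, then rear pair
}


def make_template(gait: str, cycle_len: int, contact_ratio: float) -> np.ndarray:
    """Build a 4 x cycle_len template of 0s (air) and 1s (ground)."""
    ground_steps = int(round(cycle_len * contact_ratio))
    base = np.zeros(cycle_len, dtype=int)
    base[:ground_steps] = 1
    template = np.zeros((4, cycle_len), dtype=int)
    for group_idx, group in enumerate(GAIT_GROUPS[gait]):
        # Assumption: the second group lags the first by half a cycle.
        shifted = np.roll(base, group_idx * (cycle_len // 2))
        for leg in group:
            template[LEGS.index(leg)] = shifted
    return template


def sample_template(rng: np.random.Generator):
    """Sample a gait, a length T and a contact ratio (ranges are illustrative)."""
    gait = str(rng.choice(list(GAIT_GROUPS)))
    cycle_len = int(rng.integers(20, 31))        # T in [20, 30]
    contact_ratio = float(rng.uniform(0.4, 0.7))
    return gait, make_template(gait, cycle_len, contact_ratio)


rng = np.random.default_rng(0)
gait, template = sample_template(rng)
print(gait, template.shape)
```

Sampling a fresh template per training episode is one simple way to expose the policy to a broad range of contact patterns.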


With a simple prompt that contains only three in-context examples of commonly seen foot contact patterns, an LLM can translate various human commands accurately into contact patterns and even generalize to those that do not explicitly specify how the robot should react.

SayTap prompts are concise and consist of four components: (1) general instruction that describes the tasks the LLM should accomplish; (2) gait definition that reminds the LLM of basic knowledge about quadrupedal gaits and how they can be related to emotions; (3) output format definition; and (4) examples that give the LLM chances to learn in-context. We also specify five velocities that allow a robot to move forward or backward, fast or slow, or remain still.

General instruction block
You are a dog foot contact pattern expert.
Your job is to give a velocity and a foot contact pattern based on the input.
You will always give the output in the correct format no matter what the input is.

Gait definition block
The following are description about gaits:
1. Trotting is a gait where two diagonally opposite legs strike the ground at the same time.
2. Pacing is a gait where the two legs on the left/right side of the body strike the ground at the same time.
3. Bounding is a gait where the two front/rear legs strike the ground at the same time. It has a longer suspension phase where all feet are off the ground, for example, for at least 25% of the cycle length. This gait also gives a happy feeling.

Output format definition block
The following are rules for describing the velocity and foot contact patterns:
1. You should first output the velocity, then the foot contact pattern.
2. There are five velocities to choose from: [-1.0, -0.5, 0.0, 0.5, 1.0].
3. A pattern has 4 lines, each of which represents the foot contact pattern of a leg.
4. Each line has a label. "FL" is front left leg, "FR" is front right leg, "RL" is rear left leg, and "RR" is rear right leg.
5. In each line, "0" represents foot in the air, "1" represents foot on the ground.

Example block
Input: Trot slowly
Output: 0.5
FL: 11111111111111111000000000
FR: 00000000011111111111111111
RL: 00000000011111111111111111
RR: 11111111111111111000000000

Input: Bound in place
Output: 0.0
FL: 11111111111100000000000000
FR: 11111111111100000000000000
RL: 00000011111111111100000000
RR: 00000011111111111100000000

Input: Pace backward fast
Output: -1.0
FL: 11111111100001111111110000
FR: 00001111111110000111111111
RL: 11111111100001111111110000
RR: 00001111111110000111111111


SayTap prompt to the LLM. The block labels (shown in blue in the original figure) are used for illustration and are not input to the LLM.
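The output format rules in the prompt make the LLM's reply straightforward to validate before it is passed to a controller. The sketch below shows one way to do this in Python. The function name `parse_response`, the tolerance for an optional "Output:" prefix and the shortened toy reply are assumptions for the example and are not taken from the SayTap code.

```python
LEGS = ("FL", "FR", "RL", "RR")
ALLOWED_VELOCITIES = (-1.0, -0.5, 0.0, 0.5, 1.0)  # the five velocities in the prompt


def parse_response(text: str):
    """Parse a reply into (velocity, 4 x T list of 0/1 ints), or raise ValueError."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) != 5:
        raise ValueError("expected one velocity line and four leg lines")

    # Assumption: the reply may or may not repeat the "Output:" prefix.
    first = lines[0]
    if first.lower().startswith("output:"):
        first = first[len("output:"):]
    velocity = float(first)
    if velocity not in ALLOWED_VELOCITIES:
        raise ValueError(f"unexpected velocity: {velocity}")

    rows = []
    for leg, line in zip(LEGS, lines[1:]):
        label, _, bits = line.partition(":")
        bits = bits.strip()
        if label.strip() != leg or not bits or set(bits) - {"0", "1"}:
            raise ValueError(f"malformed line for {leg}: {line!r}")
        rows.append([int(b) for b in bits])

    if len({len(row) for row in rows}) != 1:
        raise ValueError("all four legs must have the same pattern length")
    return velocity, rows


# Shortened toy reply in the prompt's format (not a real model output).
reply = """Output: 0.5
FL: 111000
FR: 000111
RL: 000111
RR: 111000"""

velocity, pattern = parse_response(reply)
print(velocity, pattern)
```

Rejecting malformed replies at this stage keeps ill-formed patterns away from the locomotion controller, which was only trained on well-formed 4 × T templates.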

Following simple and direct commands

We demonstrate in the videos below that the SayTap system can successfully perform tasks where the commands are direct and clear. Although some commands are not covered by the three in-context examples, we are able to guide the LLM to express its internal knowledge from the pre-training phase via the “Gait definition block” (see the second block in our prompt above) in the prompt.

Following unstructured or vague commands

But what is more interesting is SayTap’s ability to process unstructured and vague instructions. With only a little hint in the prompt to connect certain gaits with general impressions of emotions, the robot bounds up and down when hearing exciting messages, like “We are going to a picnic!” Furthermore, it also presents the scenes accurately (e.g., moving quickly with its feet barely touching the ground when told the ground is very hot).

Conclusion and future work

We present SayTap, an interactive system for quadrupedal robots that allows users to flexibly craft diverse locomotion behaviors. SayTap introduces desired foot contact patterns as a new interface between natural language and the low-level controller. This new interface is straightforward and flexible. Moreover, it allows a robot to follow both direct instructions and commands that do not explicitly state how the robot should react.

One interesting direction for future work is to test if commands that imply a specific feeling will allow the LLM to output a desired gait. In the gait definition block shown in the results section above, we provide a sentence that connects a happy mood with bounding gaits. We believe that providing more information can augment the LLM’s interpretations (e.g., implied feelings). In our evaluation, the connection between a happy feeling and a bounding gait led the robot to act vividly when following vague human commands. Another interesting direction for future work is to introduce multi-modal inputs, such as videos and audio. Foot contact patterns translated from those signals will, in theory, still work with our pipeline and will unlock many more interesting use cases.


Yujin Tang, Wenhao Yu, Jie Tan, Heiga Zen, Aleksandra Faust and Tatsuya Harada conducted this research. This work was conceived and performed while the team was in Google Research and will be continued at Google DeepMind. The authors would like to thank Tingnan Zhang, Linda Luu, Kuang-Huei Lee, Vincent Vanhoucke and Douglas Eck for their valuable discussions and technical support in the experiments.


Wide Horizons: NVIDIA Keynote Points Way to Further AI Advances

Dramatic gains in hardware performance have spawned generative AI, and a rich pipeline of ideas for future speedups that will drive machine learning to new heights, Bill Dally, NVIDIA’s chief scientist and senior vice president of research, said today in a keynote.

Dally described a basket of techniques in the works — some already showing impressive results — in a talk at Hot Chips, an annual event for processor and systems architects.

“The progress in AI has been enormous, it’s been enabled by hardware and it’s still gated by deep learning hardware,” said Dally, one of the world’s foremost computer scientists and former chair of Stanford University’s computer science department.

He showed, for example, how ChatGPT, the large language model (LLM) used by millions, could suggest an outline for his talk. Such capabilities owe their prescience in large part to gains from GPUs in AI inference performance over the last decade, he said.

Gains in single-GPU performance are just part of a larger story that includes million-x advances in scaling to data-center-sized supercomputers.

Research Delivers 100 TOPS/Watt

Researchers are readying the next wave of advances. Dally described a test chip that demonstrated nearly 100 tera operations per watt on an LLM.

The experiment showed an energy-efficient way to further accelerate the transformer models used in generative AI. It applied four-bit arithmetic, one of several simplified numeric approaches that promise future gains.


Looking further out, Dally discussed ways to speed calculations and save energy using logarithmic math, an approach NVIDIA detailed in a 2021 patent.

Tailoring Hardware for AI

He explored a half dozen other techniques for tailoring hardware to specific AI tasks, often by defining new data types or operations.

Dally described ways to simplify neural networks, pruning synapses and neurons in an approach called structural sparsity, first adopted in NVIDIA A100 Tensor Core GPUs.

“We’re not done with sparsity,” he said. “We need to do something with activations and can have greater sparsity in weights as well.”

Researchers need to design hardware and software in tandem, making careful decisions on where to spend precious energy, he said. Memory and communications circuits, for instance, need to minimize data movements.

“It’s a fun time to be a computer engineer because we’re enabling this huge revolution in AI, and we haven’t even fully realized yet how big a revolution it will be,” Dally said.

More Flexible Networks

In a separate talk, Kevin Deierling, NVIDIA’s vice president of networking, described the unique flexibility of NVIDIA BlueField DPUs and NVIDIA Spectrum networking switches for allocating resources based on changing network traffic or user rules.

The chips’ ability to dynamically shift hardware acceleration pipelines in seconds enables load balancing with maximum throughput and gives core networks a new level of adaptability. That’s especially useful for defending against cybersecurity threats.

“Today with generative AI workloads and cybersecurity, everything is dynamic, things are changing constantly,” Deierling said. “So we’re moving to runtime programmability and resources we can change on the fly.”

In addition, NVIDIA and Rice University researchers are developing ways users can take advantage of the runtime flexibility using the popular P4 programming language.

Grace Leads Server CPUs

A talk by Arm on its Neoverse V2 cores included an update on the performance of the NVIDIA Grace CPU Superchip, the first processor implementing them.

Tests show that, at the same power, Grace systems deliver up to 2x more throughput than current x86 servers across a variety of CPU workloads. In addition, Arm’s SystemReady Program certifies that Grace systems will run existing Arm operating systems, containers and applications with no modification.

Grace gives data center operators a choice to deliver more performance or use less power.

Grace uses an ultra-fast fabric to connect 72 Arm Neoverse V2 cores in a single die, then a version of NVLink connects two of those dies in a package, delivering 900 GB/s of bandwidth. It’s the first data center CPU to use server-class LPDDR5X memory, delivering 50% more memory bandwidth at similar cost but one-eighth the power of typical server memory.

Hot Chips kicked off Aug. 27 with a full day of tutorials, including talks from NVIDIA experts on AI inference and protocols for chip-to-chip interconnects, and runs through today.
