The Balloon Learning Environment

Benchmark challenges have been a driving force in the advancement of machine learning (ML). In particular, difficult benchmark environments for reinforcement learning (RL) have been crucial for the rapid progress of the field by challenging researchers to overcome increasingly difficult tasks. The Arcade Learning Environment, Mujoco, and others have been used to push the envelope in RL algorithms, representation learning, exploration, and more.

In “Autonomous Navigation of Stratospheric Balloons Using Reinforcement Learning”, published in Nature, we demonstrated how deep RL can be used to create a high-performing flight agent that can control stratospheric balloons in the real world. This research confirmed that deep RL can be successfully applied outside of simulated environments, and contributed practical knowledge for integrating RL algorithms with complex dynamical systems. Today we are excited to announce the open-source release of the Balloon Learning Environment (BLE), a new benchmark emulating the real-world problem of controlling stratospheric balloons. The BLE is a high-fidelity simulator, which we hope will provide researchers with a valuable resource for deep RL research.

Station-Keeping Stratospheric Balloons
Stratospheric balloons are filled with a buoyant gas that allows them to float for weeks or months at a time in the stratosphere, about twice as high as a passenger plane’s cruising altitude. Though there are many potential variations of stratospheric balloons, the kind emulated in the BLE are equipped with solar panels and batteries, which allow them to adjust their altitude by controlling the weight of air in their ballast using an electric pump. However, they have no means to propel themselves laterally, which means that they are subject to wind patterns in the air around them.

By changing its altitude, a stratospheric balloon can surf winds moving in different directions.

The goal of an agent in the BLE is to station-keep — i.e., to control a balloon to stay within 50km of a fixed ground station — by changing its altitude to catch winds that it finds favorable. We measure how successful an agent is at station-keeping by measuring the fraction of time the balloon is within the specified radius, denoted TWR50 (i.e., the time within a radius of 50km).

A station-seeking balloon must navigate a changing wind field to stay above a ground station. Left: Side elevation of a station-keeping balloon. Right: Birds-eye-view of the same balloon.

The Challenges of Station-Keeping
To create a realistic simulator (without including copious amounts of historical wind data), the BLE uses a variational autoencoder (VAE) trained on historical data to generate wind forecasts that match the characteristics of real winds. A wind noise model is then used to make the windfields more realistic to match what a balloon would encounter in real-world conditions.

Navigating a stratospheric balloon through a wind field can be quite challenging. The winds at any given altitude rarely remain ideal for long, and a good balloon controller will need to move up and down through its wind column to discover more suitable winds. In RL parlance, the problem of station-keeping is partially observable because the agent only has access to forecasted wind data to make those decisions. An agent has access to wind forecasts at every altitude and the true wind at its current altitude. The BLE returns an observation which includes a notion of wind uncertainty.

A stratospheric balloon must explore winds at different altitudes in order to find favorable winds. The observation returned by the BLE includes wind predictions and a measure of uncertainty, made by mixing a wind forecast and winds measured at the balloon’s altitude.

In some situations, there may not be suitable winds anywhere in the balloon’s wind column. In this case, an expert agent is still able to fly towards the station by taking a more circuitous route through the wind field (a common example is when the balloon moves in a zig-zag fashion, akin to tacking on a sailboat). Below we demonstrate that even just remaining in range of the station usually requires significant acrobatics.

An agent must handle long planning horizons to succeed in station-keeping. In this case, StationSeeker (an expert-designed controller) heads directly to the center of the station-keeping area and is pushed out, while Perciatelli44 (an RL agent) is able to plan ahead and stay in range longer by hugging the edge of the area.

Night-time adds a fresh element of difficulty to station-keeping in the BLE, which reflects the reality of night-time changes in physical conditions and power availability. While during the day the air pump is powered by solar panels, at night the balloon relies on its on-board batteries for energy. Using too much power early in the night typically results in limited maneuverability in the hours preceding dawn. This is where RL agents can discover quite creative solutions — such as reducing altitude in the afternoon in order to store potential energy.

An agent needs to balance the station-keeping objective with a finite energy allowance at night.

Despite all these challenges, our research demonstrates that agents trained with reinforcement learning can learn to perform better than expert-designed controllers at station-keeping. Along with the BLE, we are releasing the main agents from our research: Perciatelli44 (an RL agent) and StationSeeker (an expert-designed controller). The BLE can be used with any reinforcement learning library, and to showcase this we include Dopamine’s DQN and QR-DQN agents, as well as Acme’s QR-DQN agent (supporting both standalone and distributed training with Launchpad).

Evaluation performance by the included benchmark agents on the BLE. “Finetuned” is a fine-tuned Perciatelli44 agent, and Acme is a QR-DQN agent trained with the Acme library.

The BLE source code contains information on how to get started with the BLE, including training and evaluating agents, documentation on the various components of the simulator, and example code. It also includes the historical windfield data (as a TensorFlow DataSet) used to train the VAE to allow researchers to experiment with their own models for windfield generation. We are excited to see the progress that the community will make on this benchmark.

Acknowledgements
We would like to thank the Balloon Learning Environment team: Sal Candido, Marc G. Bellemare, Vincent Dumoulin, Ross Goroshin, and Sam Ponda. We’d also like to thank Tom Small for his excellent animation in this blog post and graphic design help, along with our colleagues, Bradley Rhodes, Daniel Eisenberg, Piotr Staczyk, Anton Raichuk, Nikola Momchev, Geoff Hinton, Hugo Larochelle, and the rest of the Google Brain team in Montreal.

Read More

This Googler wants to ‘add every voice’ to AI

Early in his career, Laurence Moroney was working on an equation — not something related to his job in tech, but to his bank account. “At one point, I calculated I was about three weeks away from being homeless,” Laurence says. “My motivation was to put a meal on the table and keep a roof over my head.”

Today Laurence is a developer advocate at Google focusing on artificial intelligence (AI) and machine learning (ML). “It’s my goal to inform and inspire the world about what we can do with AI and ML, and help developers realize these possibilities.” Laurence applied at Google in 2013 after hearing then-CEO Larry Page talk about Google’s vision to make the world a better place. “I was hired on my third attempt — so yes, I failed twice!”

Now he focuses on inviting and introducing more people to roles in the AI and ML fields through coursework, workshops and bootcamps that help developers gain job skills through professional certificates. “I try to meet developers where they are, whether that’s on YouTube, social media or in-person events,” he says. He’s particularly motivated to reach out to groups who have been historically underrepresented in tech. “Often they look and see everyone is one ethnicity and one gender and they think they don’t belong, but that’s not the case: Everyone, all ages, disabilities, whatever your background is, you should be here,” he says. “It’s so important for AI and ML work to include the entire scope of people which is why I’m so motivated to try and make everyone feel like they belong in this work.”

But it wasn’t an easy or straightforward path: his early years were tumultuous. Originally from Cyprus, Laurence and his family were forced to leave their home when a civil war resulted in an invasion. Exposure to chemicals used in the war zone permanently stained Laurence’s teeth, and he was also left with shaky hands. After moving to four different countries before the age of 8 (and learning four different languages), they settled in Ireland. “When you’re young, you don’t notice how difficult these things are, you just think…this is your life and this is normal,” he says.

He didn’t have the luxury to find his “passion” at work. “I needed a job and I needed a career. And around that time, the internet was starting to open up all of these new possibilities and opportunities.” In 1992, while bouncing around between odd jobs after receiving his degree in physics, Laurence heard about a government AI training program in the U.K. — one that worked as a sort of fellowship helping participants earn their master’s degree while also working on ways that AI systems could benefit the country.

“Hundreds of people descended on the testing center, where they looked at things like IQ, reasoning skills and so on,” Laurence says. Laurence also went — and ended up with the highest score. “They signed me up without realizing my background or ethnicity, and I was glad for that because I had experinced a lot of discrimination for being Irish,” he notes. “By that time I had gotten in the habit of disguising my accent. I tried not to talk much when I spoke to the government official who was running the program.” Despite his nerves, Laurence was asked to be the first person to sign on…though, the program itself was short-lived.

Every voice we add enriches what we’re doing — and every voice we lose diminishes it.

Read More

Good News About the Carbon Footprint of Machine Learning Training

Machine learning (ML) has become prominent in information technology, which has led some to raise concerns about the associated rise in the costs of computation, primarily the carbon footprint, i.e., total greenhouse gas emissions. While these assertions rightfully elevated the discussion around carbon emissions in ML, they also highlight the need for accurate data to assess true carbon footprint, which can help identify strategies to mitigate carbon emission in ML.

In “The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink”, accepted for publication in IEEE Computer, we focus on operational carbon emissions — i.e., the energy cost of operating ML hardware, including data center overheads — from training of natural language processing (NLP) models and investigate best practices that could reduce the carbon footprint. We demonstrate four key practices that reduce the carbon (and energy) footprint of ML workloads by large margins, which we have employed to help keep ML under 15% of Google’s total energy use.

The 4Ms: Best Practices to Reduce Energy and Carbon Footprints
We identified four best practices that reduce energy and carbon emissions significantly — we call these the “4Ms” — all of which are being used at Google today and are available to anyone using Google Cloud services.

  • Model. Selecting efficient ML model architectures, such as sparse models, can advance ML quality while reducing computation by 3x–10x.
  • Machine. Using processors and systems optimized for ML training, versus general-purpose processors, can improve performance and energy efficiency by 2x–5x.
  • Mechanization. Computing in the Cloud rather than on premise reduces energy usage and therefore emissions by 1.4x–2x. Cloud-based data centers are new, custom-designed warehouses equipped for energy efficiency for 50,000 servers, resulting in very good power usage effectiveness (PUE). On-premise data centers are often older and smaller and thus cannot amortize the cost of new energy-efficient cooling and power distribution systems.
  • Map Optimization. Moreover, the cloud lets customers pick the location with the cleanest energy, further reducing the gross carbon footprint by 5x–10x. While one might worry that map optimization could lead to the greenest locations quickly reaching maximum capacity, user demand for efficient data centers will result in continued advancement in green data center design and deployment.

These four practices together can reduce energy by 100x and emissions by 1000x.

Note that Google matches 100% of its operational energy use with renewable energy sources. Conventional carbon offsets are usually retrospective up to a year after the carbon emissions and can be purchased anywhere on the same continent. Google has committed to decarbonizing all energy consumption so that by 2030, it will operate on 100% carbon-free energy, 24 hours a day on the same grid where the energy is consumed. Some Google data centers already operate on 90% carbon-free energy; the overall average was 61% carbon-free energy in 2019 and 67% in 2020.

Below, we illustrate the impact of improving the 4Ms in practice. Other studies examined training the Transformer model on an Nvidia P100 GPU in an average data center and energy mix consistent with the worldwide average. The recently introduced Primer model reduces the computation needed to achieve the same accuracy by 4x. Using newer-generation ML hardware, like TPUv4, provides an additional 14x improvement over the P100, or 57x overall. Efficient cloud data centers gain 1.4x over the average data center, resulting in a total energy reduction of 83x. In addition, using a data center with a low-carbon energy source can reduce the carbon footprint another 9x, resulting in a 747x total reduction in carbon footprint over four years.

Reduction in gross carbon dioxide equivalent emissions (CO2e) from applying the 4M best practices to the Transformer model trained on P100 GPUs in an average data center in 2017, as done in other studies. Displayed values are the cumulative improvement successively addressing each of the 4Ms: updating the model to Primer; upgrading the ML accelerator to TPUv4; using a Google data center with better PUE than average; and training in a Google Oklahoma data center that uses very clean energy.

Overall Energy Consumption for ML
Google’s total energy usage increases annually, which is not surprising considering increased use of its services. ML workloads have grown rapidly, as has the computation per training run, but paying attention to the 4Ms — optimized models, ML-specific hardware, efficient data centers — has largely compensated for this increased load. Our data shows that ML training and inference are only 10%–15% of Google’s total energy use for each of the last three years, each year split ⅗ for inference and ⅖ for training.

Prior Emission Estimates
Google uses neural architecture search (NAS) to find better ML models. NAS is typically performed once per problem domain/search space combination, and the resulting model can then be reused for thousands of applications — e.g., the Evolved Transformer model found by NAS is open sourced for all to use. As the optimized model found by NAS is often more efficient, the one time cost of NAS is typically more than offset by emission reductions from subsequent use.

A study from the University of Massachusetts (UMass) estimated carbon emissions for the Evolved Transformer NAS.

  • Without ready access to Google hardware or data centers, the study extrapolated from the available P100 GPUs instead of TPUv2s, and assumed US average data center efficiency instead of highly efficient hyperscale data centers. These assumptions increased the estimate by 5x over the energy used by the actual NAS computation that was performed in Google’s data center.
  • In order to accurately estimate the emissions for NAS, it’s important to understand the subtleties of how they work. NAS systems use a much smaller proxy task to search for the most efficient models to save time, and then scale up the found models to full size. The UMass study assumed that the search repeated full size model training thousands of times, resulting in emission estimates that are another 18.7x too high.

The overshoot for the NAS was 88x: 5x for energy-efficient hardware in Google data centers and 18.7x for computation using proxies. The actual CO2e for the one-time search were 3,223 kg versus 284,019 kg, 88x less than the published estimate.

Unfortunately, some subsequent papers misinterpreted the NAS estimate as the training cost for the model it discovered, yet emissions for this particular NAS are ~1300x larger than for training the model. These papers estimated that training the Evolved Transformer model takes two million GPU hours, costs millions of dollars, and that its carbon emissions are equivalent to five times the lifetime emissions of a car. In reality, training the Evolved Transformer model on the task examined by the UMass researchers and following the 4M best practices takes 120 TPUv2 hours, costs $40, and emits only 2.4 kg (0.00004 car lifetimes), 120,000x less. This gap is nearly as large as if one overestimated the CO2e to manufacture a car by 100x and then used that number as the CO2e for driving a car.

Outlook
Climate change is important, so we must get the numbers right to ensure that we focus on solving the biggest challenges. Within information technology, we believe these are much more likely the lifecycle costs — i.e., emission estimates that include the embedded carbon emitted from manufacturing all components involved, from chips to data center buildings — of manufacturing computing equipment of all types and sizes1 rather than the operational cost of ML training.

Expect more good news if everyone improves the 4Ms. While these numbers may currently vary across companies, these simple measures can be followed across the industry:

If the 4Ms become widely recognized, we predict a virtuous circle that will bend the curve so that the global carbon footprint of ML training is actually shrinking, not increasing.

Acknowledgements
Let me thank my co-authors who stayed with this long and winding investigation into a topic that was new to most of us: Jeff Dean, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, and Maud Texier. We also had a great deal of help from others along the way for an earlier study that eventually led to this version of the paper. Emma Strubell made several suggestions for the prior paper, including the recommendation to examine the recent giant NLP models. Christopher Berner, Ilya Sutskever, OpenAI, and Microsoft shared information about GPT-3. Dmitry Lepikhin and Zongwei Zhou did a great deal of work to measure the performance and power of GPUs and TPUs in Google data centers. Hallie Cramer, Anna Escuer, Elke Michlmayr, Kelli Wright, and Nick Zakrasek helped with the data and policies for energy and CO2e emissions at Google.



1Worldwide IT manufacturing for 2021 included 1700M cell phones, 340M PCs, and 12M data center servers.   

Read More

An International Scientific Challenge for the Diagnosis and Gleason Grading of Prostate Cancer

In recent years, machine learning (ML) competitions in health have attracted ML scientists to work together to solve challenging clinical problems. These competitions provide access to relevant data and well-defined problems where experienced data scientists come to compete for solutions and learn new methods. However, a fundamental difficulty in organizing such challenges is obtaining and curating high quality datasets for model development and independent datasets for model evaluation. Importantly, to reduce the risk of bias and to ensure broad applicability of the algorithm, evaluation of the generalisability of resulting algorithms should ideally be performed on multiple independent evaluation datasets by an independent group of scientists.

One clinical problem that has attracted substantial ML research is prostate cancer, a condition that 1 in 9 men develop in their lifetime. A prostate cancer diagnosis requires pathologists to examine biological tissue samples under a microscope to identify cancer and grade the cancer for signs of aggressive growth patterns in the cells. However, this cancer grading task (called Gleason grading) is difficult and subjective due to the need for visual assessment of cell differentiation and Gleason pattern predominance. Building a large dataset of samples with expert annotations can help with the development of ML systems to aid in prostate cancer grading.

To help accelerate and enable more research in this area, Google Health, Radboud University Medical Center and Karolinska Institutet joined forces to organize a global competition, the Prostate cANcer graDe Assessment (PANDA) Challenge, on the open Kaggle platform. In “Artificial Intelligence for Diagnosis and Gleason Grading of Prostate Cancer: the PANDA challenge”, published in Nature Medicine, we present the results of the challenge. The study design of the PANDA challenge provided the largest public whole-slide image dataset available and was open to participants from April 21st until July 23rd, 2020. The development datasets remain available for further research. In this effort, we compiled and publicly released a European cohort of prostate cancer cases for algorithm development and pioneered a standardized evaluation setup for digital pathology that enabled independent, blinded external validation of the algorithms on data from both the United States and EU.

The global competition attracted participants from 65 countries (the size of the circle for each country illustrates the number of participants).

Design of the Panda Challenge
The challenge had two phases: a development phase (i.e., the Kaggle competition) and a validation phase. During the competition, 1,290 developers from 65 countries competed in building the best performing Gleason grading algorithm, having full access to a development set for algorithm training. Throughout the competition teams submitted algorithms that were evaluated on a hidden tuning set.

In the validation phase, a selection of top performing algorithms were independently evaluated on internal and external validation datasets with high quality reference grades from panels of expert prostate pathologists. In addition, a group of general pathologists graded a subset of the same cases to put the difficulty of the task and dataset in context. The algorithms submitted by the teams were then compared to grades done by groups of international and US general pathologists on these subsets.

Overview of the PANDA challenge’s phases for development and validation.

Research Velocity During the Challenge
We found that a group of Gleason grading ML algorithms developed during a global competition could achieve pathologist-level performance and generalize well to intercontinental and multinational cohorts. On all external validation sets, these algorithms achieved high agreement with urologic pathologists (prostate specialists) and high sensitivity for detecting tumor in biopsies. The Kaggle platform enabled the tracking of teams’ performance throughout the competition. Impressively, the first team achieving high agreement with the prostate pathologists at above 0.90 (quadratically weighted Cohen’s kappa) on the internal validation set occurred within the first 10 days of the competition. By the 33rd day, the median performance of all teams exceeded a score of 0.85.

Progression of algorithms’ performances throughout the competition, as shown by the highest score on the tuning and internal validation sets among all participating teams. During the competition teams could submit their algorithm for evaluation on the tuning set, after which they received their score. At the same time, algorithms were evaluated on the internal validation set, without disclosing these results to the participating teams. The development of the top score obtained by any team shows the rapid improvement of the algorithms.

Learning from the Challenge
By moderating the discussion forum on the Kaggle platform, we learned that the teams’ openness in sharing code via colab notebooks led to rapid improvement across the board, a promising sign for future public challenges, and a clear indication of the power of sharing knowledge on a common platform.

Organizing a public challenge that evaluates algorithm generalization across independent cohorts using high quality reference standard panels presents substantial logistical difficulties. Assembling this size of a dataset across countries and organizations was a massive undertaking. This work benefited from an amazing collaboration between the three organizing institutions which have all contributed respective publications in this space, two in Lancet Oncology and one in JAMA Oncology. Combining these efforts provided a high quality foundation on which this competition could be based. With the publication, Radboud and Karolinska research groups are also open sourcing the PANDA challenge development datasets to facilitate the further improvement of prostate Gleason grading algorithms. We look forward to seeing many more advancements in this field, and more challenges that can catalyze extensive international knowledge sharing and collaborative research.

Acknowledgements
Key contributors to this project at Google include Po-Hsuan Cameron Chen, Kunal Nagpal, Yuannan Cai, David F. Steiner, Maggie Demkin, Sohier Dane, Fraser Tan, Greg S. Corrado, Lily Peng, Craig H. Mermel. Collaborators on this project include Wouter Bulten, Kimmo Kartasalo, Peter Ström, Hans Pinckaers, Hester van Boven, Robert Vink, Christina Hulsbergen-van de Kaa, Jeroen van der Laak, Mahul B. Amin, Andrew J. Evans, Theodorus van der Kwast, Robert Allan, Peter A. Humphrey, Henrik Grönberg, Hemamali Samaratunga, Brett Delahunt, Toyonori Tsuzuki, Tomi Häkkinen, Lars Egevad, Masi Valkonen, Pekka Ruusuvuori, Geert Litjens, Martin Eklund and the PANDA Challenge consortium. We thank Ellery Wulczyn, Annisah Um’rani, Yun Liu, and Dale Webster for their feedback on the manuscript and guidance on the project. We thank our collaborators at NMCSD, particularly Niels Olson, for internal re-use of de-identified data which contributed to the US external validation set. Sincere appreciation also goes to Sami Lachgar, Ashley Zlatinov, and Lauren Winer for their feedback on the blogpost.

Read More

Guiding Frozen Language Models with Learned Soft Prompts

Large pre-trained language models, which are continuing to grow in size, achieve state-of-art results on many natural language processing (NLP) benchmarks. Since the development of GPT and BERT, standard practice has been to fine-tune models on downstream tasks, which involves adjusting every weight in the network (i.e., model tuning). However, as models become larger, storing and serving a tuned copy of the model for each downstream task becomes impractical.

An appealing alternative is to share across all downstream tasks a single frozen pre-trained language model, in which all weights are fixed. In an exciting development, GPT-3 showed convincingly that a frozen model can be conditioned to perform different tasks through “in-context” learning. With this approach, a user primes the model for a given task through prompt design, i.e., hand-crafting a text prompt with a description or examples of the task at hand. For instance, to condition a model for sentiment analysis, one could attach the prompt, “Is the following movie review positive or negative?” before the input sequence, “This movie was amazing!

Sharing the same frozen model across tasks greatly simplifies serving and allows for efficient mixed-task inference, but unfortunately, this is at the expense of task performance. Text prompts require manual effort to design, and even well-designed prompts still far underperform compared to model tuning. For instance, the performance of a frozen GPT-3 175B parameter model on the SuperGLUE benchmark is 5 points below a fine-tuned T5 model that uses 800 times fewer parameters.

In “The Power of Scale for Parameter-Efficient Prompt Tuning”, presented at EMNLP 2021, we explore prompt tuning, a more efficient and effective method for conditioning frozen models using tunable soft prompts. Just like engineered text prompts, soft prompts are concatenated to the input text. But rather than selecting from existing vocabulary items, the “tokens” of the soft prompt are learnable vectors. This means a soft prompt can be optimized end-to-end over a training dataset. In addition to removing the need for manual design, this allows the prompt to condense information from datasets containing thousands or millions of examples. By comparison, discrete text prompts are typically limited to under 50 examples due to constraints on model input length. We are also excited to release the code and checkpoints to fully reproduce our experiments.

Prompt tuning retains the strong task performance of model tuning, while keeping the pre-trained model frozen, enabling efficient multitask serving.

Prompt Tuning
To create a soft prompt for a given task, we first initialize the prompt as a fixed-length sequence of vectors (e.g., 20 tokens long). We attach these vectors to the beginning of each embedded input and feed the combined sequence into the model. The model’s prediction is compared to the target to calculate a loss, and the error is back-propagated to calculate gradients, however we only apply these gradient updates to our new learnable vectors — keeping the core model frozen. While soft prompts learned in this way are not immediately interpretable, at an intuitive level, the soft prompt is extracting evidence about how to perform a task from the labeled dataset, performing the same role as a manually written text prompt, but without the need to be constrained to discrete language.

Our codebase, implemented in the new JAX-based T5X framework, makes it easy for anyone to replicate this procedure, and provides practical hyperparameter settings, including a large learning rate (0.3), which we found was important for achieving good results.

Since soft prompts have a small parameter footprint (we train prompts with as few as 512 parameters), one can easily pass the model a different prompt along with each input example. This enables mixed-task inference batches, which can streamline serving by sharing one core model across many tasks.

Left: With model tuning, incoming data are routed to task-specific models. Right: With prompt tuning, examples and prompts from different tasks can flow through a single frozen model in large batches, better utilizing serving resources.

Improvement with Scale
When evaluated on SuperGLUE and using a frozen T5 model, prompt tuning significantly outperforms prompt design using either GPT-3 or T5. Furthermore, as model size increases, prompt tuning catches up to the performance level of model tuning. Intuitively, the larger the pre-trained model, the less of a “push” it needs to perform a specific task, and the more capable it is of being adapted in a parameter-efficient way.

As scale increases, prompt tuning matches model tuning, despite tuning 25,000 times fewer parameters.

The effectiveness of prompt tuning at large model scales is especially important, since serving separate copies of a large model can incur significant computational overhead. In our paper, we demonstrate that larger models can be conditioned successfully even with soft prompts as short as 5 tokens. For T5 XXL, this means tuning just 20 thousand parameters to guide the behavior of an 11 billion parameter model.

Resilience to Domain Shift
Another advantage of prompt tuning is its resilience to domain shift. Since model tuning touches every weight in the network, it has the capacity to easily overfit on the provided fine-tuning data and may not generalize well to variations in the task at inference time. By comparison, our learned soft prompts have a small number of parameters, so the solutions they represent may be more generalizable.

To test generalizability, we train prompt tuning and model tuning solutions on one task, and evaluate zero-shot on a closely related task. For example, when we train on the Quora Question Pairs task (i.e., detecting if two questions are duplicates) and evaluate on MRPC (i.e., detecting if two sentences from news articles are paraphrases), prompt tuning achieves +3.2 points higher accuracy than model tuning.

Train    Eval    Tuning    Accuracy    F1
                          
QQP    MRPC    Model    73.1 ±0.9    81.2 ±2.1
Prompt    76.3 ±0.1    84.3 ±0.3
                          
MRPC    QQP    Model    74.9 ±1.3    70.9 ±1.2
Prompt    75.4 ±0.8    69.7 ±0.3   
On zero-shot domain transfer between two paraphrase detection tasks, prompt tuning matches or outperforms model tuning, depending on the direction of transfer.

Looking Forward
Prompt-based learning is an exciting new area that is quickly evolving. While several similar methods have been proposed — such as Prefix Tuning, WARP, and P-Tuningwe discuss their pros and cons and demonstrate that prompt tuning is the simplest and the most parameter efficient method.

In addition to the Prompt Tuning codebase, we’ve also released our LM-adapted T5 checkpoints, which we found to be better-suited for prompt tuning compared to the original T5. This codebase was used for the prompt tuning experiments in FLAN, and the checkpoints were used as a starting point for training the BigScience T0 model. We hope that the research community continues to leverage and extend prompt tuning in future research.

Acknowledgements
This project was a collaboration between Brian Lester, Rami Al-Rfou and Noah Constant. We are grateful to the following people for feedback, discussion and assistance: Waleed Ammar, Lucas Dixon, Slav Petrov, Colin Raffel, Adam Roberts, Sebastian Ruder, Noam Shazeer, Tu Vu and Linting Xue.

Read More

Nested Hierarchical Transformer: Towards Accurate, Data-Efficient, and Interpretable Visual Understanding

In visual understanding, the Visual Transformer (ViT) and its variants have received significant attention recently due to their superior performance on many core visual applications, such as image classification, object detection, and video understanding. The core idea of ViT is to utilize the power of self-attention layers to learn global relationships between small patches of images. However, the number of connections between patches increases quadratically with image size. Such a design has been observed to be data inefficient — although the original ViT can perform better than convolutional networks with hundreds of millions of images for pre-training, such a data requirement is not always practical, and it still underperforms compared to convolutional networks when given less data. Many are exploring to find more suitable architectural re-designs that can learn visual representations effectively, such as by adding convolutional layers and building hierarchical structures with local self-attention.

The principle of hierarchical structure is one of the core ideas in vision models, where bottom layers learn more local object structures on the high-dimensional pixel space and top layers learn more abstracted and high-level knowledge at low-dimensional feature space. Existing ViT-based methods focus on designing a variety of modifications inside self-attention layers to achieve such a hierarchy, but while these offer promising performance improvements, they often require substantial architectural re-designs. Moreover, these approaches lack an interpretable design, so it is difficult to explain the inner-workings of trained models.

To address these challenges, in “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding”, we present a rethinking of existing hierarchical structure–driven designs, and provide a novel and orthogonal approach to significantly simplify them. The central idea of this work is to decouple feature learning and feature abstraction (pooling) components: nested transformer layers encode visual knowledge of image patches separately, and then the processed information is aggregated. This process is repeated in a hierarchical manner, resulting in a pyramid network structure. The resulting architecture achieves competitive results on ImageNet and outperforms results on data-efficient benchmarks. We have shown such a design can meaningfully improve data efficiency with faster convergence and provide valuable interpretability benefits. Moreover, we introduce GradCAT, a new technique for interpreting the decision process of a trained model at inference time.

Architecture Design
The overall architecture is simple to implement by adding just a few lines of Python code to the source code of the original ViT. The original ViT architecture divides an input image into small patches, projects pixels of each patch to a vector with predefined dimension, and then feeds the sequences of all vectors to the overall ViT architecture containing multiple stacked identical transformer layers. While every layer in ViT processes information of the whole image, with this new method, stacked transformer layers are used to process only a region (i.e., block) of the image containing a few spatially adjacent image patches. This step is independent for each block and is also where feature learning occurs. Finally, a new computational layer called block aggregation then combines all of the spatially adjacent blocks. After each block aggregation, the features corresponding to four spatially adjacent blocks are fed to another module with a stack of transformer layers, which then process those four blocks jointly. This design naturally builds a pyramid hierarchical structure of the network, where bottom layers can focus on local features (such as textures) and upper layers focus on global features (such as object shape) at reduced dimensionality thanks to the block aggregation.

A visualization of the network processing an image: Given an input image, the network first partitions images into blocks, where each block contains 4 image patches. Image patches in every block are linearly projected as vectors and processed by a stack of identical transformer layers. Then the proposed block aggregation layer aggregates information from each block and reduces its spatial size by 4 times. The number of blocks is reduced to 1 at the top hierarchy and classification is conducted after the output of it.

Interpretability
This architecture has a non-overlapping information processing mechanism, independent at every node. This design resembles a decision tree-like structure, which manifests unique interpretability capabilities because every tree node contains independent information of an image block that is being received by its parent nodes. We can trace the information flow through the nodes to understand the importance of each feature. In addition, our hierarchical structure retains the spatial structure of images throughout the network, leading to learned spatial feature maps that are effective for interpretation. Below we showcase two kinds of visual interpretability.

First, we present a method to interpret the trained model on test images, called gradient-based class-aware tree-traversal (GradCAT). GradCAT traces the feature importance of each block (a tree node) from top to bottom of the hierarchy structure. The main idea is to find the most valuable traversal from the root node at the top layer to a child node at the bottom layer that contributes the most to the classification outcomes. Since each node processes information from a certain region of the image, such traversal can be easily mapped to the image space for interpretation (as shown by the overlaid dots and lines in the image below).

The following is an example of the model’s top-4 predictions and corresponding interpretability results on the left input image (containing 4 animals). As shown below, GradCAT highlights the decision path along the hierarchical structure as well as the corresponding visual cues in local image regions on the images.

Given the left input image (containing four objects), the figure visualizes the interpretability results of the top-4 prediction classes. The traversal locates the model decision path along the tree and simultaneously locates the corresponding image patch (shown by the dotted line on images) that has the highest impact to the predicted target class.

Moreover, the following figures visualize results on the ImageNet validation set and show how this approach enables some intuitive observations. For instance, the example of the lighter below (upper left panel) is particularly interesting because the ground truth class — lighter/matchstick — actually defines the bottom-right matchstick object, while the most salient visual features (with the highest node values) are actually from the upper-left red light, which conceptually shares visual cues with a lighter. This can also be seen from the overlaid red lines, which indicate the image patches with the highest impact on the prediction. Thus, although the visual cue is a mistake, the output prediction is correct. In addition, the four child nodes of the wooden spoon below have similar feature importance values (see numbers visualized in the nodes; higher indicates more importance), which is because the wooden texture of the table is similar to that of the spoon.

Visualization of the results obtained by the proposed GradCAT. Images are from the ImageNet validation dataset.

Second, different from the original ViT, our hierarchical architecture retains spatial relationships in learned representations. The top layers output low-resolution features maps of input images, enabling the model to easily perform attention-based interpretation by applying Class Attention Map (CAM) on the learned representations at the top hierarchical level. This enables high-quality weakly-supervised object localization with just image-level labels. See the following figure for examples.

Visualization of CAM-based attention results. Warmer colors indicate higher attention. Images are from the ImageNet validation dataset.

Convergence Advantages
With this design, feature learning only happens at local regions independently, and feature abstraction happens inside the aggregation function. This design and simple implementation is general enough for other types of visual understanding tasks beyond classification. It also improves the model convergence speed greatly, significantly reducing the training time to reach the desired maximum accuracy.

We validate this advantage in two ways. First, we compare the ViT structure on the ImageNet accuracy with a different number of total training epochs. The results are shown on the left side of the figure below, demonstrating much faster convergence than the original ViT, e.g., around 20% improvement in accuracy over ViT with 30 total training epochs.

Second, we modify the architecture to conduct unconditional image generation tasks, since training ViT-based models for image generation tasks is challenging due to convergence and speed issues. Creating such a generator is straightforward by transposing the proposed architecture: the input is an embedding vector, the output is a full image in RGB channels, and the block aggregation is replaced by a block de-aggregation component supported by Pixel Shuffling. Surprisingly, we find our generator is easy to train and demonstrates faster convergence speed, as well as better FID score (which measures how similar generated images are to real ones), than the capacity-comparable SAGAN.

Left: ImageNet accuracy given different number of total training epochs compared with standard ViT architecture. Right: ImageNet 64×64 image generation FID scores (lower is better) with single 1000-epoch training. On both tasks, our method shows better convergence speed.

Conclusion
In this work we demonstrate the simple idea that decoupled feature learning and feature information extraction in this nested hierarchy design leads to better feature interpretability through a new gradient-based class-aware tree traversal method. Moreover, the architecture improves convergence on not only classification tasks but also image generation tasks. The proposed idea is focusing on aggregation function and thereby is orthogonal to advanced architecture design for self-attention. We hope this new research encourages future architecture designers to explore more interpretable and data-efficient ViT-based models for visual understanding, like the adoption of this work for high-resolution image generation. We have also released the source code for the image classification portion of this work.

Acknowledgements
We gratefully acknowledge the contributions of other co-authors, including Han Zhang, Long Zhao, Ting Chen, Sercan Arik, Tomas Pfister. We also thank Xiaohua Zhai, Jeremy Kubica, Kihyuk Sohn, and Madeleine Udell for the valuable feedback of the work.

Read More

Robot See, Robot Do

People learn to do things by watching others — from mimicking new dance moves, to watching YouTube cooking videos. We’d like robots to do the same, i.e., to learn new skills by watching people do things during training. Today, however, the predominant paradigm for teaching robots is to remote control them using specialized hardware for teleoperation and then train them to imitate pre-recorded demonstrations. This limits both who can provide the demonstrations (programmers & roboticists) and where they can be provided (lab settings). If robots could instead self-learn new tasks by watching humans, this capability could allow them to be deployed in more unstructured settings like the home, and make it dramatically easier for anyone to teach or communicate with them, expert or otherwise. Perhaps one day, they might even be able to use Youtube videos to grow their collection of skills over time.

Our motivation is to have robots watch people do tasks, naturally with their hands, and then use that data as demonstrations for learning. Video by Teh Aik Hui and Nathaniel Lim. License: CC-BY

However, an obvious but often overlooked problem is that a robot is physically different from a human, which means it often completes tasks differently than we do. For example, in the pen manipulation task below, the hand can grab all the pens together and quickly transfer them between containers, whereas the two-fingered gripper must transport one at a time. Prior research assumes that humans and robots can do the same task similarly, which makes manually specifying one-to-one correspondences between human and robot actions easy. But with stark differences in physique, defining such correspondences for seemingly easy tasks can be surprisingly difficult and sometimes impossible.

Physically different end-effectors (i.e., “grippers”) (i.e., the part that interacts with the environment) induce different control strategies when solving the same task. Left: The hand grabs all pens and quickly transfers them between containers. Right: The two-fingered gripper transports one pen at a time.

In “XIRL: Cross-Embodiment Inverse RL”, presented as an oral paper at CoRL 2021, we explore these challenges further and introduce a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL). Rather than focusing on how individual human actions should correspond to robot actions, XIRL learns the high-level task objective from videos, and summarizes that knowledge in the form of a reward function that is invariant to embodiment differences, such as shape, actions and end-effector dynamics. The learned rewards can then be used together with reinforcement learning to teach the task to agents with new physical embodiments through trial and error. Our approach is general and scales autonomously with data — the more embodiment diversity presented in the videos, the more invariant and robust the reward functions become. Experiments show that our learned reward functions lead to significantly more sample efficient (roughly 2 to 4 times) reinforcement learning on new embodiments compared to alternative methods. To extend and build on our work, we are releasing an accompanying open-source implementation of our method along with X-MAGICAL, our new simulated benchmark for cross-embodiment imitation.

Cross-Embodiment Inverse Reinforcement Learning (XIRL)
The underlying observation in this work is that in spite of the many differences induced by different embodiments, there still exist visual cues that reflect progression towards a common task objective. For example, in the pen manipulation task above, the presence of pens in the cup but not the mug, or the absence of pens on the table, are key frames that are common to different embodiments and indirectly provide cues for how close to being complete a task is. The key idea behind XIRL is to automatically discover these key moments in videos of different length and cluster them meaningfully to encode task progression. This motivation shares many similarities with unsupervised video alignment research, from which we can leverage a method called Temporal Cycle Consistency (TCC), which aligns videos accurately while learning useful visual representations for fine-grained video understanding without requiring any ground-truth correspondences.

We leverage TCC to train an encoder to temporally align video demonstrations of different experts performing the same task. The TCC loss tries to maximize the number of cycle-consistent frames (or mutual nearest-neighbors) between pairs of sequences using a differentiable formulation of soft nearest-neighbors. Once the encoder is trained, we define our reward function as simply the negative Euclidean distance between the current observation and the goal observation in the learned embedding space. We can subsequently insert the reward into a standard MDP and use an RL algorithm to learn the demonstrated behavior. Surprisingly, we find that this simple reward formulation is effective for cross-embodiment imitation.

XIRL self-supervises reward functions from expert demonstrations using temporal cycle consistency (TCC), then uses them for downstream reinforcement learning to learn new skills from third-person demonstrations.

X-MAGICAL Benchmark
To evaluate the performance of XIRL and baseline alternatives (e.g., TCN, LIFS, Goal Classifier) in a consistent environment, we created X-MAGICAL, which is a simulated benchmark for cross-embodiment imitation. X-MAGICAL features a diverse set of agent embodiments, with differences in their shapes and end-effectors, designed to solve tasks in different ways. This leads to differences in execution speeds and state-action trajectories, which poses challenges for current imitation learning techniques, e.g., ones that use time as a heuristic for weak correspondences between two trajectories. The ability to generalize across embodiments is precisely what X-MAGICAL evaluates.

The SweepToTop task we considered for our experiments is a simplified 2D equivalent of a common household robotic sweeping task, where an agent has to push three objects into a goal zone in the environment. We chose this task specifically because its long-horizon nature highlights how different agent embodiments can generate entirely different trajectories (shown below). X-MAGICAL features a Gym API and is designed to be easily extendable to new tasks and embodiments. You can try it out today with pip install x-magical.

Different agent shapes in the SweepToTop task in the X-MAGICAL benchmark need to use different strategies to reposition objects into the target area (pink), i.e., to “clear the debris”. For example, the long-stick can clear them all in one fell swoop, whereas the short-stick needs to do multiple consecutive back-and-forths.
Left: Heatmap of state visitation for each embodiment across all expert demonstrations. Right: Examples of expert trajectories for each embodiment.

Highlights
In our first set of experiments, we checked whether our learned embodiment-invariant reward function can enable successful reinforcement learning, when the expert demonstrations are provided through the agent itself. We find that XIRL significantly outperforms alternative methods especially on the tougher agents (e.g., short-stick and gripper).

Same-embodiment setting: Comparison of XIRL with baseline reward functions, using SAC for RL policy learning. XIRL is roughly 2 to 4 times more sample efficient than some of the baselines on the harder agents (short-stick and gripper).

We also find that our approach shows great potential for learning reward functions that generalize to novel embodiments. For instance, when reward learning is performed on embodiments that are different from the ones on which the policy is trained, we find that it results in significantly more sample efficient agents compared to the same baselines. Below, in the gripper subplot (bottom right) for example, the reward is first learned on demonstration videos from long-stick, medium-stick and short-stick, after which the reward function is used to train the gripper agent.

Cross-embodiment setting: XIRL performs favorably when compared with other baseline reward functions, trained on observation-only demonstrations from different embodiments. Each agent (long-stick, medium-stick, short-stick, gripper) had its reward trained using demonstrations from the other three embodiments.

We also find that we can train on real-world human demonstrations, and use the learned reward to train a Sawyer arm in simulation to push a puck to a designated target zone. In these experiments as well, our method outperforms baseline alternatives. For example, our XIRL variant trained only on the real-world demonstrations (purple in the plots below) reaches 80% of the total performance roughly 85% faster than the RLV baseline (orange).

What Do The Learned Reward Functions Look Like?
To further explore the qualitative nature of our learned rewards in more challenging real-world scenarios, we collect a dataset of the pen transfer task using various household tools.

Below, we show rewards extracted from a successful (top) and unsuccessful (bottom) demonstration. Both demonstrations follow a similar trajectory at the start of the task execution. The successful one nets a high reward for placing the pens consecutively into the mug then into the glass cup, while the unsuccessful one obtains a low reward because it drops the pens outside the glass cup towards the end of the execution (orange circle). These results are promising because they show that our learned encoder can represent fine-grained visual differences relevant to a task.

Conclusion
We highlighted XIRL, our approach to tackling the cross-embodiment imitation problem. XIRL learns an embodiment-invariant reward function that encodes task progress using a temporal cycle-consistency objective. Policies learned using our reward functions are significantly more sample-efficient than baseline alternatives. Furthermore, the reward functions do not require manually paired video frames between the demonstrator and the learner, giving them the ability to scale to an arbitrary number of embodiments or experts with varying skill levels. Overall, we are excited about this direction of work, and hope that our benchmark promotes further research in this area. For more details, please check out our paper and download the code from our GitHub repository.

Acknowledgments
Kevin and Andy summarized research performed together with Pete Florence, Jonathan Tompson, Jeannette Bohg (faculty at Stanford University) and Debidatta Dwibedi. All authors would additionally like to thank Alex Nichol, Nick Hynes, Sean Kirmani, Brent Yi, Jimmy Wu, Karl Schmeckpeper and Minttu Alakuijala for fruitful technical discussions, and Sam Toyer for invaluable help with setting up the simulated benchmark.

Read More

Unlocking the Full Potential of Datacenter ML Accelerators with Platform-Aware Neural Architecture Search

Continuing advances in the design and implementation of datacenter (DC) accelerators for machine learning (ML), such as TPUs and GPUs, have been critical for powering modern ML models and applications at scale. These improved accelerators exhibit peak performance (e.g., FLOPs) that is orders of magnitude better than traditional computing systems. However, there is a fast-widening gap between the available peak performance offered by state-of-the-art hardware and the actual achieved performance when ML models run on that hardware.

One approach to address this gap is to design hardware-specific ML models that optimize both performance (e.g., throughput and latency) and model quality. Recent applications of neural architecture search (NAS), an emerging paradigm to automate the design of ML model architectures, have employed a platform-aware multi-objective approach that includes a hardware performance objective. While this approach has yielded improved model performance in practice, the details of the underlying hardware architecture are opaque to the model. As a result, there is untapped potential to build full capability hardware-friendly ML model architectures, with hardware-specific optimizations, for powerful DC ML accelerators.

In “Searching for Fast Model Families on Datacenter Accelerators”, published at CVPR 2021, we advanced the state of the art of hardware-aware NAS by automatically adapting model architectures to the hardware on which they will be executed. The approach we propose finds optimized families of models for which additional hardware performance gains cannot be achieved without loss in model quality (called Pareto optimization). To accomplish this, we infuse a deep understanding of hardware architecture into the design of the NAS search space for discovery of both single models and model families. We provide quantitative analysis of the performance gap between hardware and traditional model architectures and demonstrate the advantages of using true hardware performance (i.e., throughput and latency), instead of the performance proxy (FLOPs), as the performance optimization objective. Leveraging this advanced hardware-aware NAS and building upon the EfficientNet architecture, we developed a family of models, called EfficientNetX, that demonstrate the effectiveness of this approach for Pareto-optimized ML models on TPUs and GPUs.

Platform-Aware NAS for DC ML Accelerators
To achieve high performance, ML models need to adapt to modern ML accelerators. Platform-aware NAS integrates knowledge of the hardware accelerator properties into all three pillars of NAS: (i) the search objectives; (ii) the search space; and (iii) the search algorithm (shown below). We focus on the new search space because it contains the building blocks needed to compose the models and is the key link between the ML model architectures and accelerator hardware architectures.

We construct TPU/GPU specialized search spaces with TPU/GPU-friendly operations to infuse hardware awareness into NAS. For example, a key adaptation is maximizing parallelism to ensure different hardware components inside the accelerators work together as efficiently as possible. This includes the matrix multiplication units (MXUs) in TPUs and the TensorCore in GPUs for matrix/tensor computation, as well as the vector processing units (VPUs) in TPUs and CUDA cores in GPUs for vector processing. Maximizing model arithmetic intensity (i.e., optimizing the parallelism between computation and operations on the high bandwidth memory) is also critical to achieve top performance. To tap into the full potential of the hardware, it is crucial for ML models to achieve high parallelism inside and across these hardware components.

Overview of platform-aware NAS on TPUs/GPUs, highlighting the search space and search objectives.

Advanced platform-aware NAS has an optimized search space containing a set of complementary techniques to holistically improve parallelism for ML model execution on TPUs and GPUs:

  1. It uses specialized tensor reshaping techniques to maximize the parallelism in the MXUs / TensorCores.
  2. It dynamically selects different activation functions depending on matrix operation types to ensure overlapping of vector and matrix/tensor processing.
  3. It employs hybrid convolutions and a novel fusion strategy to strike a balance between total compute and arithmetic intensity to ensure that computation and memory access happens in parallel and to reduce the contention on VPUs / CUDA cores.
  4. With latency-aware compound scaling (LACS), which uses hardware performance instead of FLOPs as the performance objective to search for model depth, width and resolutions, we ensure parallelism at all levels for the entire model family on the Pareto-front.

EfficientNet-X: Platform-Aware NAS-Optimized Computer Vision Models for TPUs and GPUs
Using this approach to platform-aware NAS, we have designed EfficientNet-X, an optimized computer vision model family for TPUs and GPUs. This family builds upon the EfficientNet architecture, which itself was originally designed by traditional multi-objective NAS without true hardware-awareness as the baseline. The resulting EfficientNet-X model family achieves an average speedup of ~1.5x–2x over EfficientNet on TPUv3 and GPUv100, respectively, with comparable accuracy.

In addition to the improved speeds, EfficientNet-X has shed light on the non-proportionality between FLOPs and true performance. Many think FLOPs are a good ML performance proxy (i.e., FLOPs and performance are proportional), but they are not. While FLOPs are a good performance proxy for simple hardware such as scalar machines, they can exhibit a margin of error of up to 400% on advanced matrix/tensor machines. For example, because of its hardware-friendly model architecture, EfficientNet-X requires ~2x more FLOPs than EfficientNet, but is ~2x faster on TPUs and GPUs.

EfficientNet-X family achieves 1.5x–2x speedup on average over the state-of-the-art EfficientNet family, with comparable accuracy on TPUv3 and GPUv100.

Self-Driving ML Model Performance on New Accelerator Hardware Platforms
Platform-aware NAS exposes the inner workings of the hardware and leverages these properties when designing hardware-optimized ML models. In a sense, the “platform-awareness” of the model is a “gene” that preserves knowledge of how to optimize performance for a hardware family, even on new generations, without the need to redesign the models. For example, TPUv4i delivers up to 3x higher peak performance (FLOPS) than its predecessor TPUv2, but EfficientNet performance only improves by 30% when migrating from TPUv2 to TPUv4i. In comparison, EfficientNet-X retains its platform-aware properties even on new hardware and achieves a 2.6x speedup when migrating from TPUv2 to TPUv4i, utilizing almost all of the 3x peak performance gain expected when upgrading between the two generations.

Hardware peak performance ratio of TPUv2 to TPUv4i and the geometric mean speedup of EfficientNet-X and EfficientNet families, respectively, when migrating from TPUv2 to TPUv4i.

Conclusion and Future Work
We demonstrate how to improve the capabilities of platform-aware NAS for datacenter ML accelerators, especially TPUs and GPUs. Both platform-aware NAS and the EfficientNet-X model family have been deployed in production and materialize up to ~40% efficiency gains and significant quality improvements for various internal computer vision projects across Google. Additionally, because of its deep understanding of accelerator hardware architecture, platform-aware NAS was able to identify critical performance bottlenecks on TPUv2-v4i architectures and has enabled design enhancements to future TPUs with significant potential performance uplift. As next steps, we are working on expanding platform-aware NAS’s capabilities to the ML hardware and model design beyond computer vision.

Acknowledgements
Special thanks to our co-authors: Mingxing Tan, Ruoming Pang, Andrew Li, Liqun Cheng, Quoc Le. We also thank many collaborators including Jeff Dean, David Patterson, Shengqi Zhu, Yun Ni, Gang Wu, Tao Chen, Xin Li, Yuan Qi, Amit Sabne, Shahab Kamali, and many others from the broad Google research and engineering teams who helped on the research and the subsequent broad production deployment of platform-aware NAS.

Read More

Ask a Techspert: What does AI do when it doesn’t know?

As humans, we constantly learn from the world around us. We experience inputs that shape our knowledge — including the boundaries of both what we know and what we don’t know.

Many of today’s machines also learn by example. However, these machines are typically trained on datasets and information that doesn’t always include rare or out-of-the-ordinary examples that inevitably come up in real-life scenarios. What is an algorithm to do when faced with the unknown?

I recently spoke with Abhijit Guha Roy, an engineer on the Health AI team, and Ian Kivlichan, an engineer on the Jigsaw team, to hear more about using AI in real-world scenarios and better understand the importance of training it to know when it doesn’t know.

Abhijit, tell me about your recent research in the dermatology space.

We’re applying deep learning to a number of areas in health, including in medical imaging where it can be used to aid in the identification of health conditions and diseases that might require treatment. In the dermatological field, we have shown that AI can be used to help identify possible skin issues and are in the process of advancing research and products, including DermAssist, that can support both clinicians and people like you and me.

In these real-world settings, the algorithm might come up against something it’s never seen before. Rare conditions, while individually infrequent, might not be so rare in aggregate. These so-called “out-of-distribution” examples are a common problem for AI systems which can perform less well when it’s exposed to things they haven’t seen before in its training.

Can you explain what “out-distribution” means for AI?

Most traditional machine learning examples that are used to train AI deal with fairly unsubtle — or obvious — changes. For example, if an algorithm that is trained to identify cats and dogs comes across a car, then it can typically detect that the car — which is an “out-of-distribution” example — is an outlier. Building an AI system that can recognize the presence of something it hasn’t seen before in training is called “out-of-distribution detection,” and is an active and promising field of AI research.

Okay, let’s go back to how this applies to AI in medical settings.

Going back to our research in the dermatology space, the differences between skin conditions can be much more subtle than recognizing a car from a dog or a cat, even more subtle than recognizing a previously unseen “pick-up truck” from a “truck”. As such, the out-of-distribution detection task in medical AI demands even more of our focused attention.

This is where our latest research comes in. We trained our algorithm to recognize even the most subtle of outliers (a so-called “near-out of distribution” detection task). Then, instead of the model inaccurately guessing, it can take a safer course of action — like deferring to human experts.

Ian, you’re working on another area where AI needs to know when it doesn’t know something. What’s that?

The field of content moderation. Our team at Jigsaw used AI to build a free tool called Perspective that scores comments according to how likely they are to be considered toxic by readers. Our AI algorithms help identify toxic language and online harassment at scale so that human content moderators can make better decisions for their online communities. A range of online platforms use Perspective more than 600 million times a day to reduce toxicity and the human time required to moderate content.

In the real world, online conversations — both the things people say and even the ways they say them — are continually changing. For example, two years ago, nobody would have understood the phrase “non-fungible token (NFT).” Our language is always evolving, which means a tool like Perspective doesn’t just need to identify potentially toxic or harassing comments, it also needs to “know when it doesn’t know,” and then defer to human moderators when it encounters comments very different from anything it has encountered before.

In our recent research, we trained Perspective to identify comments it was uncertain about and flag them for separate human review. By prioritizing these comments, human moderators can correct more than 80% of the mistakes the AI might otherwise have made.

What connects these two examples?

We have more in common with the dermatology problem than you’d expect at first glance — even though the problems we try to solve are so different.

Building AI that knows when it doesn’t know something means you can prevent certain errors that might have unintended consequences. In both cases, the safest course of action for the algorithm entails deferring to human experts rather than trying to make a decision that could lead to potentially negative effects downstream.

There are some fields where this isn’t as important and others where it’s critical. You might not care if an automated vegetable sorter incorrectly sorts a purple carrot after being trained on orange carrots, but you would definitely care if an algorithm didn’t know what to do about an abnormal shadow on an X-ray that a doctor might recognize as an unexpected cancer.

How is AI uncertainty related to AI safety?

Most of us are familiar with safety protocols in the workplace. In safety-critical industries like aviation or medicine, protocols like “safety checklists” are routine and very important in order to prevent harm to both the workers and the people they serve.

It’s important that we also think about safety protocols when it comes to machines and algorithms, especially when they are integrated into our daily workflow and aid in decision-making or triaging that can have a downstream impact.

Teaching algorithms to refrain from guessing in unfamiliar scenarios and to ask for help from human experts falls within these protocols, and is one of the ways we can reduce harm and build trust in our systems. This is something Google is committed to, as outlined in its AI Principles.

Read More

Can Robots Follow Instructions for New Tasks?

People can flexibly maneuver objects in their physical surroundings to accomplish various goals. One of the grand challenges in robotics is to successfully train robots to do the same, i.e., to develop a general-purpose robot capable of performing a multitude of tasks based on arbitrary user commands. Robots that are faced with the real world will also inevitably encounter new user instructions and situations that were not seen during training. Therefore, it is imperative for robots to be trained to perform multiple tasks in a variety of situations and, more importantly, to be capable of solving new tasks as requested by human users, even if the robot was not explicitly trained on those tasks.

Existing robotics research has made strides towards allowing robots to generalize to new objects, task descriptions, and goals. However, enabling robots to complete instructions that describe entirely new tasks has largely remained out-of-reach. This problem is remarkably difficult since it requires robots to both decipher the novel instructions and identify how to complete the task without any training data for that task. This goal becomes even more difficult when a robot needs to simultaneously handle other axes of generalization, such as variability in the scene and positions of objects. So, we ask the question: How can we confer noteworthy generalization capabilities onto real robots capable of performing complex manipulation tasks from raw pixels? Furthermore, can the generalization capabilities of language models help support better generalization in other domains, such as visuomotor control of a real robot?

In “BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning”, published at CoRL 2021, we present new research that studies how robots can generalize to new tasks that they were not trained to do. The system, called BC-Z, comprises two key components: (i) the collection of a large-scale demonstration dataset covering 100 different tasks and (ii) a neural network policy conditioned on a language or video instruction of the task. The resulting system can perform at least 24 novel tasks, including ones that require interaction with pairs of objects that were not previously seen together. We are also excited to release the robot demonstration dataset used to train our policies, along with pre-computed task embeddings.

The BC-Z system allows a robot to complete instructions for new tasks that the robot was not explicitly trained to do. It does so by training the policy to take as input a description of the task along with the robot’s camera image and to predict the correct action.

Collecting Data for 100 Tasks
Generalizing to a new task altogether is substantially harder than generalizing to held-out variations in training tasks. Simply put, we want robots to have more generalization all around, which requires that we train them on large amounts of diverse data.

We collect data by teleoperating the robot with a virtual reality headset. This data collection follows a scheme similar to how one might teach an autonomous car to drive. First, the human operator records complete demonstrations of each task. Then, once the robot has learned an initial policy, this policy is deployed under close supervision where, if the robot starts to make a mistake or gets stuck, the operator intervenes and demonstrates a correction before allowing the robot to resume.

This mixture of demonstrations and interventions has been shown to significantly improve performance by mitigating compounding errors. In our experiments, we see a 2x improvement in performance when using this data collection strategy compared to only using human demonstrations.

Example demonstrations collected for 12 out of the 100 training tasks, visualized from the perspective of the robot and shown at 2x speed.

Training a General-Purpose Policy
For all 100 tasks, we use this data to train a neural network policy to map from camera images to the position and orientation of the robot’s gripper and arm. Crucially, to allow this policy the potential to solve new tasks beyond the 100 training tasks, we also input a description of the task, either in the form of a language command (e.g., “place grapes in red bowl”) or a video of a person doing the task.

To accomplish a variety of tasks, the BC-Z system takes as input either a language command describing the task or a video of a person doing the task, as shown here.

By training the policy on 100 tasks and conditioning the policy on such a description, we unlock the possibility that the neural network will be able to interpret and complete instructions for new tasks. This is a challenge, however, because the neural network needs to correctly interpret the instruction, visually identify relevant objects for that instruction while ignoring other clutter in the scene, and translate the interpreted instruction and perception into the robot’s action space.

Experimental Results
In language models, it is well known that sentence embeddings generalize on compositions of concepts encountered in training data. For instance, if you train a translation model on sentences like “pick up a cup” and “push a bowl”, the model should also translate “push a cup” correctly.

We study the question of whether the compositional generalization capabilities found in language encoders can be transferred to real robots, i.e., being able to compose unseen object-object and task-object pairs.

We test this method by pre-selecting a set of 28 tasks, none of which were among the 100 training tasks. For example, one of these new test tasks is to pick up the grapes and place them into a ceramic bowl, but the training tasks involve doing other things with the grapes and placing other items into the ceramic bowl. The grapes and the ceramic bowl never appeared in the same scene during training.

In our experiments, we see that the robot can complete many tasks that were not included in the training set. Below are a few examples of the robot’s learned policy.

The robot completes three instructions of tasks that were not in its training data, shown at 2x speed.

Quantitatively, we see that the robot can succeed to some degree on a total of 24 out of the 28 held-out tasks, indicating a promising capacity for generalization. Further, we see a notably small gap between the performance on the training tasks and performance on the test tasks. These results indicate that simply improving multi-task visuomotor control could considerably improve performance.

The BC-Z performance on held-out tasks, i.e., tasks that the robot was not trained to perform. The system correctly interprets the language command and translates that into action to complete many of the tasks in our evaluation.

Takeaways
The results of this research show that simple imitation learning approaches can be scaled in a way that enables zero-shot generalization to new tasks. That is, it shows one of the first indications of robots being able to successfully carry out behaviors that were not in the training data. Interestingly, language embeddings pre-trained on ungrounded language corpora make for excellent task conditioners. We demonstrated that natural language models can not only provide a flexible input interface to robots, but that pretrained language representations actually confer new generalization capabilities to the downstream policy, such as composing unseen object pairs together.

In the course of building this system, we confirmed that periodic human interventions are a simple but important technique for achieving good performance. While there is a substantial amount of work to be done in the future, we believe that the zero-shot generalization capabilities of BC-Z are an important advancement towards increasing the generality of robotic learning systems and allowing people to command robots. We have released the teleoperated demonstrations used to train the policy in this paper, which we hope will provide researchers with a valuable resource for future multi-task robotic learning research.

Acknowledgements
We would like to thank the co-authors of this research: Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, and Sergey Levine. This project was a collaboration between Google Research and the Everyday Robot Project. We would like to give special thanks to Noah Brown, Omar Cortes, Armando Fuentes, Kyle Jeffrey, Linda Luu, Sphurti Kirit More, Jornell Quiambao, Jarek Rettinghouse, Diego Reyes, Rosario Jau-regui Ruano, and Clayton Tan for overseeing robot operations and collecting human videos of the tasks, as well as Jeffrey Bingham, Jonathan Weisz, and Kanishka Rao for valuable discussions. We would also like to thank Tom Small for creating animations in this post and Paul Mooney for helping with dataset open-sourcing.

Read More