Real-time tracking of wildfire boundaries using satellite imagery

As global temperatures rise, wildfires around the world are becoming more frequent and more dangerous. Their effects are felt by many communities as people evacuate their homes or suffer harm even from proximity to the fire and smoke.

As part of Google’s mission to help people access trusted information in critical moments, we use satellite imagery and machine learning (ML) to track wildfires and inform affected communities. Our wildfire tracker was recently expanded: it provides updated fire boundary information every 10–15 minutes, is more accurate than similar satellite products, and improves on our previous work. These boundaries are shown for large fires in the continental US, Mexico, and most of Canada and Australia. They are displayed, together with additional information from local authorities, on Google Search and Google Maps, helping people stay safe and informed about potential dangers near them, their homes, or their loved ones.

Real-time boundary tracking of the 2021-2022 Wrattonbully bushfire, shown as a red polygon in Google Maps.

Inputs

Wildfire boundary tracking requires balancing spatial resolution and update frequency. The most scalable way to obtain frequent boundary updates is to use geostationary satellites, i.e., satellites that orbit the Earth once every 24 hours, matching Earth’s rotation. Such a satellite remains above a fixed point on Earth’s surface, providing continual coverage of the surrounding area. Specifically, our wildfire tracker models use the GOES-16 and GOES-18 satellites to cover North America, and the Himawari-9 and GK2A satellites to cover Australia. These provide continent-scale images every 10 minutes. The spatial resolution is 2 km at nadir (the point directly below the satellite) and coarser farther from nadir. The goal here is to provide people with warnings as soon as possible, and refer them to authoritative sources for spatially precise, on-the-ground data as necessary.

Smoke plumes obscuring the 2018 Camp Fire in California. [Image from NASA Worldview]

Determining the precise extent of a wildfire is nontrivial, since fires emit massive smoke plumes, which can spread far from the burn area and obscure the flames. Clouds and other meteorological phenomena further obscure the underlying fire. To overcome these challenges, it is common to rely on infrared (IR) frequencies, particularly in the 3–4 μm wavelength range. This is because wildfires (and similar hot surfaces) radiate strongly in this band, and these emissions pass through smoke and other atmospheric particulates with relatively little distortion. This is illustrated in the figure below, which shows a multispectral image of a wildfire in Australia. The visible channels (blue, green, and red) mostly show the triangular smoke plume, while the 3.85 μm IR channel shows the ring-shaped burn pattern of the fire itself. Even with the added information from the IR bands, however, determining the exact extent of the fire remains challenging, as the fire has variable emission strength, and multiple other phenomena emit or reflect IR radiation.

Himawari-8 multispectral image of a wildfire. Note the smoke plume in the visible channels (blue, green, and red), and the ring indicating the current burn area in the 3.85 μm band.

Model

Prior work on fire detection from satellite imagery is typically based on physics-based algorithms for identifying hotspots from multispectral imagery. For example, the National Oceanic and Atmospheric Administration (NOAA) fire product identifies potential wildfire pixels in each of the GOES satellites, primarily by relying on the 3.9 μm and 11.2 μm bands (with auxiliary information from two other frequency bands).

In our wildfire tracker, the model is trained on all satellite inputs, allowing it to learn the relative importance of different frequency bands. The model receives a sequence of the three most recent images from each band so as to compensate for temporary obstructions such as cloud cover. Additionally, the model receives inputs from two geostationary satellites, achieving a super-resolution effect whereby the detection accuracy improves upon the pixel size of either satellite. In North America, we also supply the aforementioned NOAA fire product as input. Finally, we compute the relative angles of the sun and the satellites, and provide these as additional input to the model.

All inputs are resampled to a uniform 1 km square grid and fed into a convolutional neural network (CNN). We experimented with several architectures and settled on a CNN followed by a 1×1 convolutional layer to yield separate classification heads for fire and cloud pixels (shown below). The number of layers and their sizes are hyperparameters, which are optimized separately for Australia and North America. When a pixel is identified as a cloud, we override any fire detection since heavy clouds obscure underlying fires. Even so, separating the cloud classification task improves the performance of fire detection as we incentivize the system to better identify these edge cases.

CNN architecture for the Australia model; a similar architecture was used for North America. Adding a cloud classification head improves fire classification performance.
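To make the structure concrete, here is a minimal sketch of such a two-headed network in Keras. The layer count, filter widths, and the 16-band input are illustrative assumptions, not the production hyperparameters, which were tuned separately per continent.

```python
import tensorflow as tf

def build_fire_cloud_model(num_input_bands: int = 16) -> tf.keras.Model:
    """Illustrative CNN with separate per-pixel fire and cloud heads."""
    inputs = tf.keras.Input(shape=(None, None, num_input_bands))  # resampled 1 km tiles
    x = inputs
    # A small convolutional trunk; depths and widths are placeholders.
    for filters in (32, 64, 64):
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    # 1x1 convolutions yield separate classification heads for fire and cloud.
    fire = tf.keras.layers.Conv2D(1, 1, activation="sigmoid", name="fire")(x)
    cloud = tf.keras.layers.Conv2D(1, 1, activation="sigmoid", name="cloud")(x)
    return tf.keras.Model(inputs, {"fire": fire, "cloud": cloud})

# At inference time, a cloud detection overrides a fire detection for that pixel:
# fire_mask = (fire_prob > fire_threshold) & ~(cloud_prob > cloud_threshold)
```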

To train the network, we used thermal anomalies data from the MODIS and VIIRS polar-orbiting satellites as labels. MODIS and VIIRS have higher spatial resolution (750–1,000 meters) than the geostationary satellites we use as inputs. However, they cover a given location only once every few hours, which occasionally causes them to miss rapidly advancing fires. Therefore, we use MODIS and VIIRS to construct a training set, but at inference time we rely on the high-frequency imagery from geostationary satellites.

Even when limiting attention to active fires, most pixels in an image are not currently burning. To reduce the model’s bias towards non-burning pixels, we upsampled fire pixels in the training set and applied focal loss to encourage improvements in the rare misclassified fire pixels.
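A sketch of the focal-loss idea for the fire head is below; the gamma and alpha values are the commonly used defaults, and the oversampling weight is a stand-in for however fire pixels were actually upsampled.

```python
import tensorflow as tf

def binary_focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Down-weights easy, well-classified pixels so that rare, hard fire
    pixels contribute more to the gradient."""
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    p_t = tf.where(y_true > 0.5, y_pred, 1.0 - y_pred)    # prob. of the true class
    alpha_t = tf.where(y_true > 0.5, alpha, 1.0 - alpha)  # class-balancing weight
    return -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)

# Oversampling can be approximated with per-pixel sample weights, e.g.:
# sample_weight = tf.where(y_true > 0.5, fire_oversample_weight, 1.0)
```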

The progressing boundary of the 2022 McKinney fire, and a smaller nearby fire.

Evaluation

High-resolution fire signals from polar-orbiting satellites are a plentiful source of training data. However, such satellites use sensors that are similar to those on geostationary satellites, which increases the risk of systematic labeling errors (e.g., cloud-related misdetections) being incorporated into the model. To evaluate our wildfire tracker model without such bias, we compared it against fire scars (i.e., the shape of the total burnt area) measured by local authorities. Fire scars are obtained after a fire has been contained and are more reliable than real-time fire detection techniques. We compare each fire scar to the union of all fire pixels detected in real time during the wildfire to obtain an image such as the one shown below. In this image, green represents correctly identified burn areas (true positive), yellow represents unburned areas detected as burn areas (false positive), and red represents burn areas that were not detected (false negative).

Example evaluation for a single fire. Pixel size is 1km x 1km.

We compare our models to official fire scars using the precision and recall metrics. To quantify the spatial severity of classification errors, we take the maximum distance between a false positive or false negative pixel and the nearest true positive fire pixel. We then average each metric across all fires. The results of the evaluation are summarized below. Most severe misdetections were found to be a result of errors in the official data, such as a missing scar for a nearby fire.

Test set metrics comparing our models to official fire scars.
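For illustration, the per-fire metrics described above can be computed from two boolean masks on the 1 km grid roughly as follows. This is a reconstruction under our own assumptions, using SciPy's distance transform; it is not the evaluation code used for the published numbers.

```python
import numpy as np
from scipy import ndimage

def fire_metrics(detected: np.ndarray, scar: np.ndarray, pixel_km: float = 1.0):
    """detected, scar: boolean 2D masks; detected is the union of real-time
    fire pixels, scar is the official post-containment fire scar."""
    tp = detected & scar
    fp = detected & ~scar
    fn = ~detected & scar
    precision = tp.sum() / max(detected.sum(), 1)
    recall = tp.sum() / max(scar.sum(), 1)
    # Distance from every cell to the nearest true-positive pixel.
    dist_to_tp = ndimage.distance_transform_edt(~tp)
    errors = fp | fn
    worst_error_km = dist_to_tp[errors].max() * pixel_km if errors.any() else 0.0
    return precision, recall, worst_error_km
```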

We performed two additional experiments on wildfires in the United States (see table below). First, we evaluated an earlier model that relies only on NOAA’s GOES-16 and GOES-17 fire products. Our model outperforms this approach in all metrics considered, demonstrating that the raw satellite measurements can be used to enhance the existing NOAA fire product.

Next, we collected a new test set consisting of all large fires in the United States in 2022. This test set was not available during training because the model launched before the fire season began. Evaluating the performance on this test set shows performance in line with expectations from the original test set.

Comparison between models on fires in the United States.

Conclusion

Boundary tracking is part of Google’s wider commitment to bring accurate and up-to-date information to people in critical moments. This work demonstrates how we use satellite imagery and ML to track wildfires and provide real-time support to affected people in times of crisis. In the future, we plan to keep improving the quality of our wildfire boundary tracking, expand this service to more countries, and continue helping fire authorities access critical information in real time.

Acknowledgements

This work is a collaboration between teams from Google Research, Google Maps and Crisis Response, with support from our partnerships and policy teams. We would also like to thank the fire authorities whom we partner with around the world.

Google Research, 2022 & beyond: ML & computer systems

(This is Part 3 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Great machine learning (ML) research requires great systems. With the increasing sophistication of the algorithms and hardware in use today and with the scale at which they run, the complexity of the software necessary to carry out day-to-day tasks only increases. In this post, we provide an overview of the numerous advances made across Google this past year in systems for ML that enable us to support the serving and training of complex models while easing the complexity of implementation for end users. This blog post also highlights our research on leveraging ML itself to help improve and design the next generations of system stacks.

Distributed systems for ML

This year, we’ve made significant strides in improving our systems to better support large-scale computation in ML and scientific computing in general. The Google TPU hardware has been designed with scaling in mind since its inception, and each year we strive to push the boundaries even further. This year, we designed state-of-the-art serving techniques for large models, improved automatic partitioning of tensor programs and reworked the APIs of our libraries to make sure all of those developments are accessible to a wide audience of users.

One of our biggest efficiency improvements this year is the CollectiveEinsum strategy for evaluating the large-scale matrix multiplication operations that are at the heart of neural networks. Unlike previously popular SPMD partitioning strategies that separate communication from device-local computation, this approach uses the fast TPU ICI links to overlap the two, leading to performance improvements of up to 1.38x. This algorithm was also a key component of our work on efficiently scaling Transformer inference, which presents a wide variety of strategies that trade off between latency and hardware utilization, reaching a state-of-the-art model FLOPs utilization (MFU) of 76% in throughput-optimized configurations.

An illustration of AllGather-Einsum with 2-way intra-layer model parallelism, proposed in the CollectiveEinsum strategy. Top: Illustration of non-overlapped execution. Bottom: Illustration of the CollectiveEinsum technique.

We have also integrated SPMD-style partitioning as a first class concept into both TensorFlow, with the DTensor extension, and JAX, with the redesigned array type. In both libraries, tensors that seem complete to the programmer can be transparently sharded over a number of devices just by attaching declarative layout annotations. In fact, both approaches are compatible with existing code written for single-device computations that can now scale into a multi-device program, usually without any code modifications!
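In JAX, for example, attaching a layout annotation is enough to shard an otherwise unmodified computation; the sketch below assumes eight available devices and an illustrative mesh shape and axis naming.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange the available devices into a logical 2D mesh (assumes 8 devices).
mesh = Mesh(np.array(jax.devices()).reshape(2, 4), axis_names=("data", "model"))

# Declarative layout annotations: rows sharded over "data", columns over "model".
x = jax.device_put(jnp.ones((8192, 8192)), NamedSharding(mesh, P("data", "model")))
w = jax.device_put(jnp.ones((8192, 1024)), NamedSharding(mesh, P("model", None)))

# The single-device program is unchanged; the compiler inserts the collectives.
y = jax.jit(lambda a, b: a @ b)(x, w)
```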

Integrating SPMD partitioning into the core of our ML frameworks means that the ability to infer and optimize how array programs are mapped onto a larger set of devices is critical for performance. In the past, this motivated the development of GSPMD, an important milestone in this area. However, GSPMD relies heavily on heuristics, and it still sometimes requires non-trivial decisions to be made manually, which often results in suboptimal performance. To make partitioning inference fully automatic, we collaborated with external colleagues to develop Alpa, a fully automated system that explores strategies for both operator-level (model) parallelism and pipeline parallelism between larger sub-computations. It successfully matches hand-tuned performance on popular models such as Transformers, and is also capable of scaling up other models, such as convolutional networks and mixture-of-experts models, that often cause existing automated methods to struggle.

Alpa overview. The inter-operator pass identifies the best way to assign a subgraph to a submesh. The intra-operator pass finds the best intra-operator parallelism plan for each pipeline stage. Finally, the runtime orchestration generates a static plan that orders the computation and communication.

In a similar vein, the recently published Pathways system adds an additional layer of virtualization on top of the usual TPU runtime — accelerators are managed by long-lived processes instead of being allocated directly to users. A single end user can then connect to an arbitrary number of Pathways-controlled devices and write their program as if all the devices were attached directly to their process, even though in reality they may even span multiple data centers. Thanks to Pathways: (1) job startup time can be reduced, (2) it is easier to achieve fault tolerance, and (3) multitenancy becomes a viable option, enabling multiple jobs to be executed simultaneously for even more efficient hardware utilization. The ease with which Pathways enables computation spanning multiple TPU pods is crucial, as it lets us avoid future scaling bottlenecks.

Pathways overview. Top Left: Distributed computation expressed as a directed acyclic graph. Top Right: The resource manager allocates virtual slices of accelerator meshes for each compiled function (e.g., A, B, and C). Bottom: Centralized schedulers gang-schedule computations that are then dispatched by per-shard executors. (See paper for details.)

Another notable release is TensorStore, a new library for multi-dimensional array storage. TensorStore is particularly useful for training large language models (LLMs) with multi-controller runtimes, where every process only manages a subset of all parameters, all of which must be collated into a consistent checkpoint. TensorStore provides database-grade guarantees (ACID) for efficient and concurrent multi-dimensional array serialization into many storage backends (e.g., Google Cloud Storage, various filesystems, HTTP servers) and has been successfully used for compute-intensive workloads such as PaLM and reconstructions of the human cortex and fruit fly brain.

A fly brain reconstruction for which the underlying data can be easily accessed and manipulated using TensorStore.

Programming languages for ML

The robustness and correctness of our technical infrastructure are vital for ML efforts, which is why we remain committed to ensuring that it is built on a sound technical and theoretical basis, backed by cutting-edge research in programming languages and compiler construction.

We continued investing in the open-source MLIR compiler infrastructure, building a more controllable, composable and modular compiler stack. In addition, much progress has been made in code generation for sparse linear algebra and it is now possible to generate both dense and sparse code from almost identical MLIR programs. Finally, we also continued the development of the IREE compiler, preparing it for use on both powerful computers located in data centers and mobile devices such as smartphones.

On the more theoretical side, we explored ways to formalize and verify the code-generation techniques we use. We also published a novel approach used to implement and formalize automatic differentiation (AD) systems, which are central to ML libraries. We decomposed the reverse-mode AD algorithm into three independent program transformations, which are significantly simpler and easier to verify, highlighting the unique features of JAX’s implementation.

Leveraging programming language techniques, such as abstract interpretation and program synthesis, we successfully reduced the number of resources required to perform a neural architecture search (NAS). This effort, 𝛼NAS, led to the discovery of more efficient models without degradation in accuracy.

In the past year, we published a number of new open-source libraries in the JAX ecosystem, Rax and T5X being just two examples. With the continued effort around jax2tf, JAX models can now be deployed on mobile devices using TensorFlow Lite and on the web using TensorFlow.js.

Hardware accelerators & ML

Hardware design for ML

The use of customized hardware, such as TPUs and GPUs, has shown tremendous benefits in terms of both performance and energy efficiency (hence reducing the carbon footprint). In a recent MLPerf competition, we set new performance records on five benchmarks on TPU v4, achieving speedups that are on average 1.42x faster than the next fastest submission. However, to keep up with recent advances, we are also developing customized hardware architectures for specific popular models.

TPUs demonstrated significant speedup in all five published benchmarks (MLPerf 2.0) over the fastest non-Google submission (NVIDIA on-premises). Taller bars are better. The numbers inside the bars represent the quantity of chips / accelerators used for each of the submissions.

However, building a new hardware accelerator incurs high initial costs and requires significant development and deployment time. To make single-workload accelerators viable, the design cycle time has to be reduced. The Full-stack Accelerator Search Technique (FAST) addresses this problem by introducing a hardware accelerator search framework that simultaneously optimizes the data path, scheduling, and important compiler decisions. FAST introduces an approximate template capable of describing diverse types of architectures and versatile memory hierarchies, resulting in accelerators that improve single-workload performance per Thermal Design Power (known to correlate strongly with performance per Total Cost of Ownership) by 3.7x compared to TPU v3. This shows that single-workload accelerators could be practical for moderate-sized datacenter deployments.

ML for hardware design

To automate the chip design process as much as possible, we continue to push the capabilities of ML at various stages of the hardware design, including high-level architectural exploration, verification, and placement and routing.

We recently open-sourced a distributed RL infrastructure called Circuit Training, along with a circuit environment described in our recent Nature paper. We used this infrastructure in production to produce macro placements for the latest generation of TPU chips. Tackling architectural exploration, PRIME introduces an ML-based approach for searching hardware design space that utilizes only existing data (e.g., from traditional accelerator design efforts) without any further hardware simulation. This approach alleviates the need to run time-consuming simulations, even when the set of target applications changes. PRIME improves performance over state-of-the-art simulation-driven methods by about 1.2x–1.5x while reducing the simulation time by 93%–99%. AutoApprox automatically generates approximate low-power deep learning accelerators without any accuracy loss by mapping each neural network layer to an appropriate approximation level.

PRIME uses logged accelerator data, consisting of both feasible and infeasible accelerators, to train a conservative model, which is used to design accelerators while meeting design constraints. PRIME designs accelerators with up to 1.5x smaller latency, while reducing the required hardware simulation time by up to 99%.

Hardware-dependent model design

While NAS has shown tremendous capability in discovering state-of-the-art models in terms of accuracy and efficiency, it is still limited by a lack of hardware knowledge. Platform-aware NAS addresses this gap by incorporating knowledge of the hardware architecture into the design of the NAS search space. The resulting EfficientNet-X model is 1.5x–2x faster than EfficientNet on TPU v3 and V100 GPUs, respectively, with similar accuracy. Both platform-aware NAS and EfficientNet-X have been deployed in production, demonstrating significant accuracy gains and up to ~40% efficiency improvements for various production vision models. NaaS goes even further by searching for neural network architectures and hardware architectures together. Using this approach on Edge TPUs, NaaS discovers vision models that are 2x more energy efficient with the same accuracy.

Overview of platform-aware NAS on TPUs/GPUs, highlighting the search space and search objectives.

ML for navigating constrained search spaces

Apart from changing the hardware and the workload for better efficiency, we can also optimize the middle layer, including the partitioner, which maps the workload onto multiple devices, and the compiler, which translates the workload into a low-level representation understood by the hardware. In previous years, we demonstrated how we can apply ML to find better device placement and compiler decisions. In the past year, we further explored this direction and found that many optimization search spaces are heavily constrained, where valid solutions are quite sparse.

To address this challenge, we developed several techniques to enable a learned model to effectively navigate a constrained search space. Telamalloc employs a combination of ML model plus heuristics to make a decision when multiple options are available, and leverages a constraint solver to infer further dependent decisions. Telamalloc speeds up the memory allocation pass in the Edge TPU compiler compared to a production Integer Linear Programming approach and enables important real-world models that could not otherwise be supported.

“A Transferable Approach for Partitioning Machine Learning Models on Multi-Chip-Modules” proposes a slightly different approach. It applies reinforcement learning (RL) to propose the decisions in a single step, and asks the constraint solver to adjust the proposed solution to be valid. For a BERT model on an Edge TPU-based multi-chip mesh, this approach discovers a better distribution of the model across devices using a much smaller time budget compared to non-learned search strategies.

ML for large-scale production systems

We also deployed ML to improve the efficiency of various large-scale systems running in production. We recently released MLGO, the first industrial-grade general framework for integrating ML techniques systematically in the LLVM infrastructure. MLGO can replace heuristics in LLVM with an RL policy to make optimization decisions. When testing on a set of internal large-scale applications, we found that the trained policy can reduce binary size by 3%–7% when optimizing inlining decisions and can improve throughput by 0.3%–1.5% when optimizing register allocation decisions. Within our production ML compiler, XLA, a learned cost model published a few years back was recently deployed to guide the selection of optimal tile sizes of TPU kernels for top ML workloads, saving ~2% of the total TPU compute time in our data centers overall. We also recently replaced an existing heuristic in the YouTube cache replacement algorithm with a new hybrid algorithm that combines a simple heuristic with a learned model, improving the byte miss ratio at peak by ~9%.

Illustration of MLGO during inlining. “#bbs”, “#users”, and “callsite height” are example caller-callee pair features.

AI & sustainability

Given the global climate change crisis, there has been understandable concern about the environmental impact of ML. In a recent paper, we showed that by following best practices, ML practitioners can reduce carbon dioxide equivalent emissions (CO2e) from training by orders of magnitude. We call these practices the “4Ms”:

  1. Model. The first step is to select the most efficient ML model architecture. For example, Primer runs ~4x faster on the same hardware while achieving the same quality scores as the popular Transformer developed four years earlier.
  2. Machine. The second practice is to use the most energy efficient computer available. For example, when the Transformer model was first published in 2017, a popular GPU was the Nvidia P100. Using a recent processor optimized for ML training, such as TPU v4, improves performance per Watt by ~15x.
  3. Mechanization. Computers used for training must be housed in a data center, and large cloud data centers are typically ~1.4x more energy efficient than typical smaller on-premise data centers.
  4. Map. The biggest surprise in our investigation was the impact of picking a location with a clean energy supply. Moreover, in the cloud, location is the easiest of the four factors to change. The difference between a typical location and a well-chosen location can be ~9x, even within the same country.

In this example, multiplying the 4Ms together yields a 4x × 15x × 1.4x × 9x, or ~750x, reduction in CO2e over four years by following these best practices, relative to training the original Transformer model on the GPUs of 2017.

We are continuing to explore this space and in 2023 we will be releasing a further study that demonstrates how to reduce the CO2e of current model training by up to 20x by carefully selecting the machine, mechanization and location of training.

Concluding thoughts

As the field of ML advances, we continue our investment in developing high-performance, energy-efficient, and easy-to-use systems and infrastructure to enable rapid exploration of new ideas. At the same time, we continue to explore the capability of ML to improve the performance of complex systems and automate labor-intensive tasks in system design.

Google Research, 2022 & beyond

This was the third blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models Computer Vision Multimodal Models
Generative Models Responsible AI ML & Computer Systems
Algorithms* Robotics Health
General Science & Quantum Community Engagement
* Articles will be linked as they are released.

Open Source Vizier: Towards reliable and flexible hyperparameter and blackbox optimization

Google Vizier is the de-facto system for blackbox optimization over objective functions and hyperparameters across Google, having serviced some of Google’s largest research efforts and optimized a wide range of products (e.g., Search, Ads, YouTube). For research, it has not only reduced language model latency for users, designed computer architectures, accelerated hardware, assisted protein discovery, and enhanced robotics, but also provided a reliable backend interface for users to search for neural architectures and evolve reinforcement learning algorithms. To operate at the scale of optimizing thousands of users’ critical systems and tuning millions of machine learning models, Google Vizier solved key design challenges in supporting diverse use cases and workflows, while remaining strongly fault-tolerant.

Today we are excited to announce Open Source (OSS) Vizier (with an accompanying systems whitepaper published at AutoML Conference 2022), a standalone Python package based on Google Vizier. OSS Vizier is designed for two main purposes: (1) managing and optimizing experiments at scale in a reliable and distributed manner for users, and (2) developing and benchmarking algorithms for automated machine learning (AutoML) researchers.

System design

OSS Vizier works by having a server provide services, namely the optimization of blackbox objectives (functions), to multiple clients. In the main workflow, a client sends a remote procedure call (RPC) asking for a suggestion (i.e., a proposed input for the client’s blackbox function); the service then spawns a worker that runs an algorithm (i.e., a Pythia policy) to compute this and subsequent suggestions. The suggestions are evaluated by clients to obtain their corresponding objective values and measurements, which are sent back to the service. This pipeline is repeated multiple times to form an entire tuning trajectory.
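From a client’s point of view, the loop looks roughly like the sketch below. The `study_client` object and its `suggest`/`complete_trial` methods are hypothetical stand-ins for the OSS Vizier client API; consult the package documentation for the exact names.

```python
# Hypothetical client-side tuning loop; `study_client` stands in for an
# OSS Vizier study client connected to a running Vizier server over gRPC.
def tune(study_client, evaluate, num_rounds: int):
    """evaluate: the user's blackbox function, mapping parameters -> objective."""
    for _ in range(num_rounds):
        # RPC to the service: a Pythia policy proposes the next trial(s).
        trials = study_client.suggest(count=1)
        for trial in trials:
            objective = evaluate(trial.parameters)  # run the experiment
            # Report the measurement; the service persists it in the SQL datastore.
            study_client.complete_trial(trial, {"objective": objective})
```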

The use of the ubiquitous gRPC library, which is compatible with most programming languages, such as C++ and Rust, allows maximum flexibility and customization, where the user can also write their own custom clients and even algorithms outside of the default Python interface. Since the entire process is saved to an SQL datastore, a smooth recovery is ensured after a crash, and usage patterns can be stored as valuable datasets for research into meta-learning and multitask transfer-learning methods such as the OptFormer and HyperBO.

In the distributed pipeline, multiple clients each send a “Suggest” request to the Service API, which produces Suggestions for the clients using Pythia. The clients evaluate these suggestions and return measurements. All transactions are stored to allow fault-tolerance.

Usage

Because OSS Vizier is designed as a service, in which clients can send requests to the server at any point in time, it supports a broad range of scenarios: the budget of evaluations, or trials, can range from tens to millions, and the evaluation latency can range from seconds to weeks. Evaluations can be done asynchronously (e.g., tuning an ML model) or in synchronous batches (e.g., wet lab settings involving multiple simultaneous experiments). Furthermore, evaluations may fail due to transient errors and be retried, or may fail due to persistent errors (e.g., the evaluation is impossible) and should not be retried.

This supports a broad variety of applications, including hyperparameter tuning of deep learning models and the optimization of non-computational objectives, which can be, e.g., physical, chemical, biological, mechanical, or even human-evaluated, such as cookie recipes.

The OSS Vizier API allows (1) developers to integrate other packages, with PyGlove and Vertex Vizier already included, and (2) users to optimize their experiments, such as machine learning pipelines and cookie recipes.

Integrations, algorithms, and benchmarks

As Google Vizier is heavily integrated with many of Google’s internal frameworks and products, OSS Vizier will naturally be heavily integrated with many of Google’s open source and external frameworks. Most prominently, OSS Vizier will serve as a distributed backend for PyGlove to allow large-scale evolutionary searches over combinatorial primitives such as neural architectures and reinforcement learning algorithms. Furthermore, OSS Vizier shares the same client-based API with Vertex Vizier, allowing users to quickly swap between open-source and production-quality services.

For AutoML researchers, OSS Vizier is also outfitted with a useful collection of algorithms and benchmarks (i.e., objective functions) unified under common APIs for assessing the strengths and weaknesses of proposed methods. Most notably, via TensorFlow Probability, researchers can now use the JAX-based Gaussian Process Bandit algorithm, based on the default algorithm in Google Vizier that tunes internal users’ objectives.

Resources and future direction

We provide links to the codebase, documentation, and systems whitepaper. We plan to allow user contributions, especially in the form of algorithms and benchmarks, and further integrate with the open-source AutoML ecosystem. Going forward, we hope to see OSS Vizier as a core tool for expanding research and development over blackbox optimization and hyperparameter tuning.

Acknowledgements

OSS Vizier was developed by members of the Google Vizier team in collaboration with the TensorFlow Probability team: Setareh Ariafar, Lior Belenki, Emily Fertig, Daniel Golovin, Tzu-Kuo Huang, Greg Kochanski, Chansoo Lee, Sagi Perel, Adrian Reyes, Xingyou (Richard) Song, and Richard Zhang.

In addition, we thank Srinivas Vasudevan, Jacob Burnim, Brian Patton, Ben Lee, Christopher Suter, and Rif A. Saurous for further TensorFlow Probability integrations, Daiyi Peng and Yifeng Lu for PyGlove integrations, Hao Li for Vertex/Cloud integrations, Yingjie Miao for AutoRL integrations, Tom Hennigan, Varun Godbole, Pavel Sountsov, Alexey Volkov, Mihir Paradkar, Richard Belleville, Bu Su Kim, Vytenis Sakenas, Yujin Tang, Yingtao Tian, and Yutian Chen for open source and infrastructure help, and George Dahl, Aleksandra Faust, Claire Cui, and Zoubin Ghahramani for discussions.

Finally we thank Tom Small for designing the animation for this post.

The Flan Collection: Advancing open source methods for instruction tuning

Language models are now capable of performing many new natural language processing (NLP) tasks by reading instructions, often for tasks they haven’t seen before. The ability to reason on new tasks is mostly credited to training models on a wide variety of unique instructions, known as “instruction tuning”, which was introduced by FLAN and extended in T0, Super-Natural Instructions, MetaICL, and InstructGPT. However, much of the data that drives these advances remains unreleased to the broader research community.

In “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning”, we closely examine and release a newer and more extensive publicly available collection of tasks, templates, and methods for instruction tuning to advance the community’s ability to analyze and improve instruction-tuning methods. This collection was first used in Flan-T5 and Flan-PaLM, the latter of which achieved significant improvements over PaLM. We show that training a model on this collection yields improved performance over comparable public collections on all tested evaluation benchmarks, e.g., a 3%+ improvement on the 57 tasks in the Massive Multitask Language Understanding (MMLU) evaluation suite and an 8% improvement on BigBench Hard (BBH). Analysis suggests the improvements stem both from the larger and more diverse set of tasks and from applying a set of simple training and data augmentation techniques that are cheap and easy to implement: mixing zero-shot, few-shot, and chain-of-thought prompts at training time, enriching tasks with input inversion, and balancing task mixtures. Together, these methods enable the resulting language models to reason more competently over arbitrary tasks, even those for which they haven’t seen any fine-tuning examples. We hope making these findings and resources publicly available will accelerate research into more powerful and general-purpose language models.

Public instruction tuning data collections

Since 2020, several instruction tuning task collections have been released in rapid succession, shown in the timeline below. Recent research has yet to coalesce around a unified set of techniques, with different sets of tasks, model sizes, and input formats all represented. This new collection, referred to below as “Flan 2022”, combines prior collections from FLAN, P3/T0, and Natural Instructions with new dialog, program synthesis, and complex reasoning tasks.

A timeline of public instruction tuning collections, including: UnifiedQA, CrossFit, Natural Instructions, FLAN, P3/T0, MetaICL, ExT5, Super-Natural Instructions, mT0, Unnatural Instructions, Self-Instruct, and OPT-IML Bench. The table describes the release date, the task collection name, the model name, the base model(s) that were finetuned with this collection, the model size, whether the resulting model is Public (green) or Not Public (red), whether they train with zero-shot prompts (“ZS”), few-shot prompts (“FS”), chain-of-thought prompts (“CoT”) together (“+”) or separately (“/”), the number of tasks from this collection in Flan 2022, the total number of examples, and some notable methods, related to the collections, used in these works. Note that the number of tasks and examples vary under different assumptions and so are approximations. Counts for each are reported using task definitions from the respective works.

In addition to scaling to more instructive training tasks, The Flan Collection combines training with different types of input-output specifications, including just instructions (zero-shot prompting), instructions with examples of the task (few-shot prompting), and instructions that ask for an explanation with the answer (chain of thought prompting). Except for InstructGPT, which leverages a collection of proprietary data, Flan 2022 is the first work to publicly demonstrate the strong benefits of mixing these prompting settings together during training. Instead of a trade-off between the various settings, mixing prompting settings during training improves all prompting settings at inference time, as shown below for both tasks held-in and held-out from the set of fine-tuning tasks.

Training jointly with zero-shot and few-shot prompt templates improves performance on both held-in and held-out tasks. The stars indicate the peak performance in each setting. Red lines denote the zero-shot prompted evaluation, lilac denotes few-shot prompted evaluation.
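Conceptually, the mixing amounts to rendering each training example in a randomly chosen prompt format, along the lines of the simplified sketch below; the format names and templates are illustrative, and the actual Flan templating is far richer.

```python
import random

def render_example(instruction, exemplars, question, answer, rationale=""):
    """Render one training example in a randomly chosen prompting format."""
    fmt = random.choice(["zero_shot", "few_shot", "chain_of_thought"])
    if fmt == "zero_shot":
        prompt, target = f"{instruction}\n{question}", answer
    elif fmt == "few_shot":
        shots = "\n\n".join(f"{q}\n{a}" for q, a in exemplars)
        prompt, target = f"{instruction}\n{shots}\n\n{question}", answer
    else:  # chain of thought: ask for an explanation along with the answer
        prompt = f"{instruction} Explain your reasoning.\n{question}"
        target = f"{rationale} So the answer is {answer}."
    return prompt, target
```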

Evaluating instruction tuning methods

To understand the overall effects of swapping one instruction tuning collection for another, we fine-tune equivalently-sized T5 models on popular public instruction-tuning collections, including Flan 2021, T0++, and Super-Natural Instructions. Each model is then evaluated on a set of tasks that are already included in each of the instruction tuning collections, a set of five chain-of-thought tasks, and then a set of 57 diverse tasks from the MMLU benchmark, both with zero-shot and few-shot prompts. In each case, the new Flan 2022 model, Flan-T5, outperforms these prior works, demonstrating a more powerful general-purpose NLP reasoner.

Comparing public instruction tuning collections on held-in, chain-of-thought, and held-out evaluation suites, such as BigBench Hard and MMLU. All models except OPT-IML-Max (175B) are trained by us, using T5-XL with 3B parameters. Green text indicates improvement over the next best comparable T5-XL (3B) model.

Single task fine-tuning

In applied settings, practitioners usually deploy NLP models fine-tuned specifically for one target task, where training data is already available. We examine this setting to understand how Flan-T5 compares to T5 models as a starting point for applied practitioners. Three settings are compared: fine-tuning T5 directly on the target task, using Flan-T5 without further fine-tuning on the target task, and fine-tuning Flan-T5 on the target task. For both held-in and held-out tasks, fine-tuning Flan-T5 offers an improvement over fine-tuning T5 directly. In some instances, usually where training data is limited for a target task, Flan-T5 without further fine-tuning outperforms T5 with direct fine-tuning.

Flan-T5 outperforms T5 on single-task fine-tuning. We compare single-task fine-tuned T5 (blue bars), single-task fine-tuned Flan-T5 (red), and Flan-T5 without any further fine-tuning (beige).

An additional benefit of using Flan-T5 as a starting point is that training is significantly faster and cheaper, converging more quickly than T5 fine-tuning, and usually peaking at higher accuracies. This suggests less task-specific training data may be necessary to achieve similar or better results on a particular task.

Flan-T5 converges faster than T5 on single-task fine-tuning, for each of five held-out tasks from Flan fine-tuning. Flan-T5’s learning curve is indicated with the solid lines, and T5’s learning curve with the dashed line. All tasks are held-out during Flan finetuning.

There are significant energy efficiency benefits for the NLP community in adopting instruction-tuned models like Flan-T5 for single-task fine-tuning, rather than conventional non-instruction-tuned models. While pre-training and instruction fine-tuning are financially and computationally expensive, they are a one-time cost, usually amortized over millions of subsequent fine-tuning runs, which can become more costly in aggregate for the most prominent models. Instruction-tuned models offer a promising solution by significantly reducing the number of fine-tuning steps needed to achieve the same or better performance.

Conclusion

The new Flan instruction tuning collection unifies the most popular prior public collections and their methods, while adding new templates and simple improvements like training with mixed prompt settings. The resulting method outperforms Flan, P3, and Super-Natural Instructions on held-in, chain-of-thought, MMLU, and BBH benchmarks by 3–17% across zero-shot and few-shot variants. Results suggest this new collection serves as a more performant starting point for researchers and practitioners interested in either generalizing to new instructions or fine-tuning on a single new task.

Acknowledgements

It was a privilege to work with Jason Wei, Barret Zoph, Le Hou, Hyung Won Chung, Tu Vu, Albert Webson, Denny Zhou, and Quoc V Le on this project.

Learning with Queried Hints

In many computing applications the system needs to make decisions to serve requests that arrive in an online fashion. Consider, for instance, the example of a navigation app that responds to driver requests. In such settings there is inherent uncertainty about important aspects of the problem. For example, the preferences of the driver with respect to features of the route are often unknown and the delays of road segments can be uncertain. The field of online machine learning studies such settings and provides various techniques for decision-making problems under uncertainty.

A navigation engine has to decide how to route this user’s request. The satisfaction of the user will depend on the (uncertain) congestion of the two routes and unknown preferences of the user on various features, such as how scenic, safe, etc., the route is.

A very well known problem in this framework is the multi-armed bandit problem, in which the system has a set of n available options (arms) from which it is asked to choose in each round (user request), e.g., a set of precomputed alternative routes in navigation. The user’s satisfaction is measured by a reward that depends on unknown factors such as user preferences and road segment delays. An algorithm’s performance over T rounds is compared against the best fixed action in hindsight by means of the regret (the difference between the reward of the best arm and the reward obtained by the algorithm over all T rounds). In the experts variant of the multi-armed bandit problem, all rewards are observed after each round and not just the one played by the algorithm.

An instance of the experts problem. The table presents the rewards obtained by following each of the 3 experts at each round t = 1, 2, 3, 4. The best expert in hindsight (and hence the benchmark to compare against) is the middle one, with total reward 21. If, for example, we had selected expert 1 in the first two rounds and expert 3 in the last two rounds (recall that we need to select before observing the rewards of each round), we would have obtained reward 17, giving a regret of 21 – 17 = 4.

These problems have been extensively studied, and existing algorithms can achieve sublinear regret. For example, in the multi-armed bandit problem, the best existing algorithms achieve regret on the order of √T. However, these algorithms focus on optimizing for worst-case instances and do not account for the abundance of real-world data that allows us to train machine-learned models capable of aiding algorithm design.

In “Online Learning and Bandits with Queried Hints” (presented at ITCS 2023), we show how an ML model that provides us with a weak hint can significantly improve the performance of an algorithm in bandit-like settings. Many ML models are trained accurately using relevant past data. In the routing application, for example, specific past data can be used to estimate road segment delays and past feedback from drivers can be used to learn the quality of certain routes. Models trained with such data can, in certain cases, give very accurate feedback. However, our algorithms achieve strong guarantees even when the feedback from the model is in the form of a less explicit weak hint. Specifically, we merely ask that the model predict which of two options will be better. In the navigation application this is equivalent to having the algorithm pick two routes and query an ETA model for which of the two is faster, or presenting the user with two routes with different characteristics and letting them pick the one that is best for them. By designing algorithms that leverage such a hint, we can improve the regret of the bandits setting exponentially in its dependence on T, and improve the regret of the experts setting from order √T to a bound that is independent of T altogether. Specifically, our upper bound depends only on the number of experts n and is at most log(n).

Algorithmic Ideas

Our algorithm for the bandits setting utilizes the well known upper confidence bound (UCB) algorithm. The UCB algorithm maintains, as a score for each arm, the average reward observed on that arm so far and adds to it an optimism parameter that becomes smaller with the number of times the arm has been pulled, thus balancing between exploration and exploitation. Our algorithm applies the UCB scores on pairs of arms, mainly in an effort to utilize the available pairwise comparison model that can designate the better of two arms. Each pair of arms i and j is grouped as a meta-arm (i, j) whose reward in each round is equal to the maximum reward between the two arms. Our algorithm observes the UCB scores of the meta-arms and picks the pair (i, j) that has the highest score. The pair of arms are then passed as a query to the ML auxiliary pairwise prediction model, which responds with the best of the two arms. This response is the arm that is finally used by the algorithm.

The decision problem considers three candidate routes. Our algorithm instead considers all pairs of the candidate routes. Suppose pair 2 is the one with the highest score in the current round. The pair is given to the auxiliary ML pairwise prediction model, which outputs whichever of the two routes is better in the current round.
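A sketch of this pairwise-UCB idea follows; the reward interface, the hint oracle, and the exploration constant are illustrative.

```python
import math
from itertools import combinations

def pairwise_ucb(n_arms, pull, hint, horizon):
    """pull(arm) -> observed reward in [0, 1]; hint(i, j) -> the better arm."""
    pairs = list(combinations(range(n_arms), 2))
    counts = {p: 0 for p in pairs}    # times each meta-arm (i, j) was chosen
    totals = {p: 0.0 for p in pairs}  # rewards observed for each meta-arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        def ucb(p):  # empirical mean plus an optimism bonus
            if counts[p] == 0:
                return float("inf")
            return totals[p] / counts[p] + math.sqrt(2 * math.log(t) / counts[p])
        i, j = max(pairs, key=ucb)
        arm = hint(i, j)              # query the pairwise comparison model
        reward = pull(arm)
        counts[(i, j)] += 1
        totals[(i, j)] += reward
        total_reward += reward
    return total_reward
```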

Our algorithm for the experts setting takes a follow-the-regularized-leader (FtRL) approach, which maintains the total reward of each expert and adds random noise to each, before picking the best for the current round. Our algorithm repeats this process twice, drawing random noise two times and picking the highest reward expert in each of the two iterations. The two selected experts are then used to query the auxiliary ML model. The model’s response for the best between the two experts is the one played by the algorithm.
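And a corresponding sketch for the experts setting, with two independent perturbed-leader draws per round; the noise distribution and scale are illustrative.

```python
import numpy as np

def experts_with_hint(n_experts, rounds, get_rewards, hint, noise_scale=1.0, seed=0):
    """get_rewards(t) -> full reward vector revealed after playing round t;
    hint(i, j) -> the better of experts i and j for the current round."""
    rng = np.random.default_rng(seed)
    cumulative = np.zeros(n_experts)
    total = 0.0
    for t in range(rounds):
        # Two independent noisy draws give two candidate experts.
        first = int(np.argmax(cumulative + rng.exponential(noise_scale, n_experts)))
        second = int(np.argmax(cumulative + rng.exponential(noise_scale, n_experts)))
        played = hint(first, second)      # ask the pairwise model which is better
        rewards = get_rewards(t)          # full feedback in the experts setting
        total += rewards[played]
        cumulative += rewards
    return total
```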

Results

Our algorithms utilize the concept of weak hints to achieve strong improvements in terms of theoretical guarantees, including an exponential improvement in the dependence of regret on the time horizon or even removing this dependence altogether. To illustrate how the algorithm can outperform existing baseline solutions, we present a setting where 1 of the n candidate arms is consistently marginally better than the n-1 remaining arms. We compare our ML probing algorithm against a baseline that uses the standard UCB algorithm to pick the two arms to submit to the pairwise comparison model. We observe that the UCB baseline keeps accumulating regret whereas the probing algorithm quickly identifies the best arm and keeps playing it, without accumulating regret.

An example in which our algorithm outperforms a UCB based baseline. The instance considers n arms, one of which is always marginally better than the remaining n-1.

Conclusion

In this work we explore how a simple pairwise comparison ML model can provide simple hints that prove very powerful in settings such as the experts and bandits problems. In our paper we further present how these ideas apply to more complex settings such as online linear and convex optimization. We believe our model of hints can have more interesting applications in ML and combinatorial optimization problems.

Acknowledgements

We thank our co-authors Aditya Bhaskara (University of Utah), Sungjin Im (University of California, Merced), and Kamesh Munagala (Duke University).

Deciphering Clinical Abbreviations with Privacy Protecting ML

Today many people have digital access to their medical records, including their doctor’s clinical notes. However, clinical notes are hard to understand because of the specialized language that clinicians use, which contains unfamiliar shorthand and abbreviations. In fact, there are thousands of such abbreviations, many of which are specific to certain medical specialities and locales or can mean multiple things in different contexts. For example, a doctor might write in their clinical notes, “pt referred to pt for lbp“, which is meant to convey the statement: “Patient referred to physical therapy for low back pain.” Coming up with this translation is tough for laypeople and computers because some abbreviations are uncommon in everyday language (e.g., “lbp” means “low back pain”), and even familiar abbreviations, such as “pt” for “patient”, can have alternate meanings, such as “physical therapy.” To disambiguate between multiple meanings, the surrounding context must be considered. It’s no easy task to decipher all the meanings, and prior research suggests that expanding the shorthand and abbreviations can help patients better understand their health, diagnoses, and treatments.

In “Deciphering clinical abbreviations with a privacy protecting machine learning system”, published in Nature Communications, we report our findings on a general method that deciphers clinical abbreviations in a way that is both state-of-the-art and on par with board-certified physicians at this task. We built the model using only public data on the web that wasn’t associated with any patient (i.e., no potentially sensitive data) and evaluated performance on real, de-identified notes from inpatient and outpatient clinicians from different health systems. To enable the model to generalize from web data to notes, we created a way to algorithmically rewrite large amounts of internet text to look as if it were written by a doctor (called web-scale reverse substitution), and we developed a novel inference method (called elicitive inference).

The model input is a string that may or may not contain medical abbreviations. We trained a model to output a corresponding string in which all abbreviations are simultaneously detected and expanded. If the input string does not contain an abbreviation, the model will output the original string. By Rajkomar et al., used under CC BY 4.0; cropped from original.

Rewriting Text to Include Medical Abbreviations

Building a system to translate doctors’ notes would usually start with a large, representative dataset of clinical text where all abbreviations are labeled with their meanings. But no such dataset for general use by researchers exists. We therefore sought to develop an automated way to create such a dataset but without the use of any actual patient notes, which might include sensitive data. We also wanted to ensure that models trained on this data would still work well on real clinical notes from multiple hospital sites and types of care, such as both outpatient and inpatient.

To do this, we referenced a dictionary of thousands of clinical abbreviations and their expansions, and found sentences on the web that contained uses of the expansions from this dictionary. We then “rewrote” those sentences by abbreviating each expansion, resulting in web data that looked like it was written by a doctor. For instance, if a website contained the phrase “patients with atrial fibrillation can have chest pain,” we would rewrite this sentence to “pts with af can have cp.” We then used the abbreviated text as input to the model, with the original text serving as the label. This approach provided us with large amounts of data to train our model to perform abbreviation expansion.

The idea of “reverse substituting” the long forms for their abbreviations was introduced in prior research, but our distributed algorithm allows us to extend the technique to large, web-sized datasets. Our algorithm, called web-scale reverse substitution (WSRS), is designed to ensure that rare terms occur more frequently and common terms are down-sampled across the public web, yielding a more balanced dataset. With this data in hand, we trained a series of large transformer-based language models to expand the web text.

We generate text to train our model on the decoding task by extracting phrases from public web pages that have corresponding medical abbreviations (shaded boxes on the left) and then substituting in the appropriate abbreviations (shaded dots, right). Since some words are found much more frequently than others (“patient” more than “posterior tibialis”, both of which can be abbreviated “pt”), we downsampled common expansions to derive a more balanced dataset across the thousands of abbreviations. By Rajkomar et al., used under CC BY 4.0.
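In outline, the rewriting step looks like the simplified, non-distributed sketch below; the toy dictionary and keep probabilities are placeholders for the real ones.

```python
import random
import re

# Toy slice of the abbreviation dictionary: expansion -> abbreviation.
EXPANSIONS = {
    "patients": "pts",
    "atrial fibrillation": "af",
    "chest pain": "cp",
}

def reverse_substitute(sentence, keep_prob):
    """Return (abbreviated model input, original sentence as label), or None
    if the sentence is dropped to down-sample over-represented expansions."""
    present = [e for e in EXPANSIONS if re.search(rf"\b{re.escape(e)}\b", sentence)]
    if not present:
        return None
    if all(random.random() > keep_prob.get(e, 1.0) for e in present):
        return None  # down-sampled: it only contains very common expansions
    rewritten = sentence
    for expansion in present:
        rewritten = re.sub(rf"\b{re.escape(expansion)}\b",
                           EXPANSIONS[expansion], rewritten)
    return rewritten, sentence

pair = reverse_substitute(
    "patients with atrial fibrillation can have chest pain",
    keep_prob={"patients": 0.1},  # "patients" is common, so keep it rarely
)
# pair[0] == "pts with af can have cp"; pair[1] is the unmodified label
```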

Adapting Protein Alignment Algorithms to Unstructured Clinical Text

Evaluating these models on the particular task of abbreviation expansion is difficult. Because they produce unstructured text as output, we had to figure out which abbreviations in the input correspond to which expansions in the output. To achieve this, we created a modified version of the Needleman–Wunsch algorithm, which was originally designed for divergent sequence alignment in molecular biology, to align the model input and output and extract the corresponding abbreviation-expansion pairs. Using this alignment technique, we were able to evaluate the model’s capacity to detect and expand abbreviations accurately. We evaluated Text-to-Text Transfer Transformer (T5) models of various sizes (ranging from 60 million to over 60 billion parameters) and found that larger models performed the translation better than smaller ones, with the biggest model achieving the best performance.
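A compact version of the alignment step, operating on word sequences, is shown below; the scoring values are illustrative, and the production system uses a modified scheme adapted to abbreviation expansion.

```python
def align(source, target, match=2, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two word sequences. Returns a list
    of (source_word, target_word) pairs, with None marking a gap."""
    n, m = len(source), len(target)
    score = [[0] * (m + 1) for _ in range(n + 1)]  # best score for each prefix pair
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if source[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback to recover the aligned word pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and source[i - 1] == target[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((source[i - 1], None))
            i -= 1
        else:
            pairs.append((None, target[j - 1]))
            j -= 1
    return pairs[::-1]

# Aligning the abbreviated input with the expanded output word by word makes it
# possible to read off which output words correspond to each input abbreviation.
aligned = align("pt referred to pt for lbp".split(),
                "patient referred to physical therapy for low back pain".split())
```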

Creating New Model Inference Techniques to Coax the Model

However, we did find something unexpected. When we evaluated the performance on multiple external test sets from real clinical notes, we found the models would leave some abbreviations unexpanded, and for larger models, the problem of incomplete expansion was even worse. This is mainly due to the fact that while we substitute expansions on the web for their abbreviations, we have no way of handling the abbreviations that are already present. This means that the abbreviations appear in both the original and rewritten text used as respective labels and input, and the model learns not to expand them.

To address this, we developed a new inference-chaining technique in which the model output is fed again as input to coax the model to make further expansions as long as the model is confident in the expansion. In technical terms, our best-performing technique, which we call elicitive inference, involves examining the outputs from a beam search above a certain log-likelihood threshold. Using elicitive inference, we were able to achieve state-of-the-art capability of expanding abbreviations in multiple external test sets.
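The chaining itself can be expressed as a short loop. In this sketch, `generate_with_score` is a hypothetical wrapper around the model’s beam search that returns the top hypothesis and its log-likelihood, and the threshold value is illustrative.

```python
def elicitive_expand(text, generate_with_score, log_likelihood_threshold=-1.0,
                     max_passes=5):
    """Repeatedly feed the model's output back in as input, accepting another
    pass only while the model stays confident in its expansion."""
    current = text
    for _ in range(max_passes):
        candidate, log_likelihood = generate_with_score(current)
        if log_likelihood < log_likelihood_threshold or candidate == current:
            break  # the model is unsure, or there is nothing left to expand
        current = candidate
    return current
```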

Real example of the model’s input (left) and output (right).

Comparative Performance

We also sought to understand how patients and doctors currently perform at deciphering clinical notes, and how our model compared. We found that lay people (people without specific medical training) demonstrated less than 30% comprehension of the abbreviations present in the sample medical texts. When we allowed them to use Google Search, their comprehension increased to nearly 75%, still leaving 1 out of 5 abbreviations indecipherable. Unsurprisingly, medical students and trained physicians performed much better at the task with an accuracy of 90%. We found that our largest model was capable of matching or exceeding experts, with an accuracy of 98%.

How does the model perform so well compared to physicians on this task? There are two important factors in the model’s high comparative performance. Part of the discrepancy is that there were some abbreviations that clinicians did not even attempt to expand (such as “cm” for centimeter), which partly lowered their measured performance. This might seem unimportant, but for non-English speakers these abbreviations may not be familiar, so it can be helpful to have them written out. In contrast, our model is designed to comprehensively expand abbreviations. In addition, clinicians are familiar with the abbreviations they commonly see in their specialty, but other specialists use shorthand that is not understood by those outside their fields. Our model is trained on thousands of abbreviations across multiple specialties and can therefore decipher a breadth of terms.

Towards Improved Health Literacy

We think there are numerous avenues in which large language models (LLMs) can help advance the health literacy of patients by augmenting the information they see and read. Most LLMs are trained on data that does not look like clinical note data, and the unique distribution of this data makes it challenging to deploy these models out of the box. We have demonstrated how to overcome this limitation. Our model also serves to “normalize” clinical note data, facilitating additional ML capabilities that make the text easier to understand for patients of all educational and health-literacy levels.

Acknowledgements

This work was carried out in collaboration with Yuchen Liu, Jonas Kemp, Benny Li, Ming-Jun Chen, Yi Zhang, Afroz Mohiddin, and Juraj Gottweis. We thank Lisa Williams, Yun Liu, Arelene Chung, and Andrew Dai for many useful conversations and discussions about this work.

Google Research, 2022 & Beyond: Responsible AI

The past year brought tremendous breakthroughs in artificial intelligence (AI), particularly in large language models (LLMs) and text-to-image models. These technological advances require that we be thoughtful and intentional in how they are developed and deployed. In this blog post, we share ways we have approached Responsible AI across our research in the past year and where we’re headed in 2023. We highlight four primary themes covering foundational and socio-technical research, applied research, and product solutions, as part of our commitment to build AI products in a responsible and ethical manner, in alignment with our AI Principles.

  · Theme 1: Responsible AI Research Advancements
  · Theme 2: Responsible AI Research in Products
  · Theme 3: Tools and Techniques
  · Theme 4: Demonstrating AI’s Societal Benefit

Theme 1: Responsible AI Research Advancements

Machine Learning Research

When machine learning (ML) systems are used in real-world contexts, they can fail to behave in expected ways, which reduces their realized benefit. Our research identifies situations in which unexpected behavior may arise, so that we can mitigate undesired outcomes.

Across several types of ML applications, we showed that models are often underspecified, which means they perform well in exactly the situation in which they are trained, but may not be robust or fair in new situations, because the models rely on “spurious correlations” (incidental patterns in the training data that do not generalize). This poses a risk to ML system developers, and demands new model evaluation practices.

We surveyed evaluation practices currently used by ML researchers and introduced improved evaluation standards in work addressing common ML pitfalls. We identified and demonstrated techniques to mitigate causal “shortcuts”, which lead to a lack of ML system robustness and dependency on sensitive attributes, such as age or gender.

Shortcut learning: Age impacts correct medical diagnosis.

To better understand the causes of and mitigations for robustness issues, we decided to dig deeper into model design in specific domains. In computer vision, we studied the robustness of new vision transformer models and developed new negative data augmentation techniques to improve their robustness. For natural language tasks, we similarly investigated how different data distributions improve generalization across different groups and how ensembles and pre-trained models can help.

Another key part of our ML work involves developing techniques to build models that are more inclusive. For example, we look to external communities to help us understand when and why our evaluations fall short, using participatory systems that explicitly enable joint ownership of predictions and allow people to choose whether to disclose information on sensitive topics.

Sociotechnical Research

In our quest to include a diverse range of cultural contexts and voices in AI development and evaluation, we have strengthened community-based research efforts, focusing on particular communities who are less represented or may experience unfair outcomes of AI. We specifically looked at evaluations of unfair gender bias, both in natural language and in contexts such as gender-inclusive health. This work is advancing more accurate evaluations of unfair gender bias so that our technologies evaluate and mitigate harms for people with queer and non-binary identities.

Alongside our fairness advancements, we also reached key milestones in our larger efforts to develop culturally-inclusive AI. We championed the importance of cross-cultural considerations in AI — in particular, cultural differences in user attitudes towards AI and mechanisms for accountability — and built data and techniques that enable culturally-situated evaluations, with a focus on the global south. We also described user experiences of machine translation, in a variety of contexts, and suggested human-centered opportunities for their improvement.

Human-Centered Research

At Google, we focus on advancing human-centered research and design. Recently, our work showed how LLMs can be used to rapidly prototype new AI-based interactions. We also published five new interactive explorable visualizations that introduce key ideas and guidance to the research community, including how to use saliency to detect unintended biases in ML models, and how federated learning can be used to collaboratively train a model with data from multiple users without any raw data leaving their devices.
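
As a toy illustration of the federated learning idea mentioned above, the sketch below runs federated averaging over a few hypothetical clients with a simple linear model; the client data, counts, and learning rates are made-up, and this is not the explorable’s actual implementation:

```python
import numpy as np

def local_update(weights, features, labels, lr=0.1, epochs=5):
    """One client's local training: a few gradient-descent steps on a linear
    model, using only that client's on-device data."""
    w = weights.copy()
    for _ in range(epochs):
        preds = features @ w
        grad = features.T @ (preds - labels) / len(labels)
        w -= lr * grad
    return w

def federated_averaging(clients, rounds=20, dim=3):
    """Server loop: broadcast the global model, let each client train locally,
    then average the returned weights. Raw data never leaves the clients."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        client_weights = [local_update(global_w, x, y) for x, y in clients]
        sizes = np.array([len(y) for _, y in clients], dtype=float)
        global_w = np.average(client_weights, axis=0, weights=sizes)
    return global_w

# Hypothetical on-device datasets for three clients.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for n in (40, 60, 80):
    x = rng.normal(size=(n, 3))
    clients.append((x, x @ true_w + 0.01 * rng.normal(size=n)))

print(federated_averaging(clients))  # should approach true_w
```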

Our interpretability research explored how we can trace the behavior of language models back to their training data, suggested new ways to compare differences in what models pay attention to, examined how we can explain emergent behavior, and showed how to identify human-understandable concepts learned by models. We also proposed a new approach for recommender systems that uses natural language explanations to make it easier for people to understand and control their recommendations.

Creativity and AI Research

We initiated conversations with creative teams on the rapidly changing relationship between AI technology and creativity. In the creative writing space, Google’s PAIR and Magenta teams developed a novel prototype for creative writing, and facilitated a writers’ workshop to explore the potential and limits of AI to assist creative writing. The stories from a diverse set of creative writers were published as a collection, along with workshop insights. In the fashion space, we explored the relationship between fashion design and cultural representation, and in the music space, we started examining the risks and opportunities of AI tools for music.

Theme 2: Responsible AI Research in Products

The ability to see yourself reflected in the world around you is important, yet image-based technologies often lack equitable representation, leaving people of color feeling overlooked and misrepresented. In addition to efforts to improve representation of diverse skin tones across Google products, we introduced a new skin tone scale designed to be more inclusive of the range of skin tones worldwide. Partnering with Harvard professor and sociologist Dr. Ellis Monk, we released the Monk Skin Tone (MST) Scale, a 10-shade scale that is available for the research community and industry professionals for research and product development. Further, this scale is being incorporated into features on our products, continuing a long line of our work to improve diversity and skin tone representation on Image Search and filters in Google Photos.

The 10 shades of the Monk Skin Tone Scale.

This is one of many examples of how Responsible AI in Research works closely with products across the company to inform research and develop new techniques. In another example, we leveraged our past research on counterfactual data augmentation in natural language to improve SafeSearch, reducing unexpected shocking Search results by 30%, especially on searches related to ethnicity, sexual orientation, and gender. To improve video content moderation, we developed new approaches for helping human raters focus their attention on segments of long videos that are more likely to contain policy violations. And, we’ve continued our research on developing more precise ways of evaluating equal treatment in recommender systems, accounting for the broad diversity of users and use cases.

In the area of large models, we incorporated Responsible AI best practices as part of the development process, creating Model Cards and Data Cards (more details below), Responsible AI benchmarks, and societal impact analysis for models such as GLaM, PaLM, Imagen, and Parti. We also showed that instruction fine-tuning results in many improvements for Responsible AI benchmarks. Because generative models are often trained and evaluated on human-annotated data, we focused on human-centric considerations like rater disagreement and rater diversity. We also presented new capabilities using large models for improving responsibility in other systems. For example, we have explored how language models can generate more complex counterfactuals for counterfactual fairness probing. We will continue to focus on these areas in 2023, also understanding the implications for downstream applications.

Theme 3: Tools and Techniques

Responsible Data

Data Documentation:

Extending our earlier work on Model Cards and the Model Card Toolkit, we released Data Cards and the Data Cards Playbook, providing developers with methods and tools to document appropriate uses and essential facts related to a model or dataset. We have also advanced research on best practices for data documentation, such as accounting for a dataset’s origins, annotation processes, intended use cases, ethical considerations, and evolution. We also applied this to healthcare, creating “healthsheets” that form the foundation of our international Standing Together collaboration, which brings together patients, health professionals, and policy-makers to develop standards that ensure datasets are diverse and inclusive and to democratize AI.

New Datasets:

Fairness: We released a new dataset to assist in ML fairness and adversarial testing tasks, primarily for generative text datasets. The dataset contains 590 words and phrases that show interactions between adjectives and terms that have been shown to have stereotypical associations with specific individuals and groups based on their sensitive or protected characteristics.

A partial list of sensitive characteristics in the dataset and their stereotypical associations with adjectives.

Toxicity: We constructed and publicly released a dataset of 10,000 posts to help identify when a comment’s toxicity depends on the comment it’s replying to. This improves the quality of moderation-assistance models and supports the research community working on better ways to remedy online toxicity.

Societal Context Data: We used our experimental societal context repository (SCR) to supply the Perspective team with auxiliary identity and connotation context data for terms relating to categories such as ethnicity, religion, age, gender, or sexual orientation — in multiple languages. This auxiliary societal context data can help augment and balance datasets to significantly reduce unintended biases, and was applied to the widely used Perspective API toxicity models.

Learning Interpretability Tool (LIT)

An important part of developing safer models is having the tools to help debug and understand them. To support this, we released a major update to the Learning Interpretability Tool (LIT), an open-source platform for visualization and understanding of ML models, which now supports images and tabular data. The tool has been widely used in Google to debug models, review model releases, identify fairness issues, and clean up datasets. It also now lets you visualize 10x more data than before, supporting up to hundreds of thousands of data points at once.

A screenshot of the Learning Interpretability Tool displaying generated sentences in a data table.

Counterfactual Logit Pairing

ML models are sometimes susceptible to flipping their prediction when a sensitive attribute referenced in the input is either removed or replaced. For example, in a toxicity classifier, examples such as “I am a man” and “I am a lesbian” may incorrectly produce different outputs. To enable users in the open-source community to address unintended bias in their ML models, we launched a new library, Counterfactual Logit Pairing (CLP), which improves a model’s robustness to such perturbations and can positively influence a model’s stability, fairness, and safety.

Illustration of fairness predictions that can be mitigated using counterfactual logit pairing.
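
For intuition, here is a minimal sketch of the counterfactual logit pairing idea on a toy linear classifier. The data, penalty weight, and exact loss form are illustrative assumptions; the released CLP library is a TensorFlow-based implementation, not this snippet:

```python
import numpy as np

def logits(weights, features):
    """Toy linear 'classifier' producing a single logit per example."""
    return features @ weights

def clp_loss(weights, x, y, x_counterfactual, clp_weight=1.0):
    """Primary task loss plus a counterfactual logit pairing penalty that
    discourages the model from changing its logit when a sensitive term in
    the input is swapped or removed (x vs. x_counterfactual)."""
    original = logits(weights, x)
    counterfactual = logits(weights, x_counterfactual)
    # Binary cross-entropy on the original examples (the task objective).
    probs = 1.0 / (1.0 + np.exp(-original))
    task = -np.mean(y * np.log(probs + 1e-9) + (1 - y) * np.log(1 - probs + 1e-9))
    # Pairing penalty: mean absolute difference between paired logits.
    pairing = np.mean(np.abs(original - counterfactual))
    return task + clp_weight * pairing

# Hypothetical featurized pairs, e.g. "I am a man" vs. "I am a lesbian".
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4))                      # original examples
x_cf = x + rng.normal(scale=0.1, size=x.shape)   # counterfactual versions
y = rng.integers(0, 2, size=8)
w = rng.normal(size=4)
print(clp_loss(w, x, y, x_cf))
```

Minimizing this combined loss during training keeps the paired logits close, which is the intuition behind CLP’s robustness to identity-term perturbations.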

Theme 4: Demonstrating AI’s Societal Benefit

We believe that AI can be used to explore and address hard, unanswered questions around humanitarian and environmental issues. Our research and engineering efforts span many areas, including accessibility, health, and media representation, with the end goal of promoting inclusion and meaningfully improving people’s lives.

Accessibility

Following many years of research, we launched Project Relate, an Android app that uses a personalized AI-based speech recognition model to enable people with non-standard speech to communicate more easily with others. The app is available to English speakers 18+ in Australia, Canada, Ghana, India, New Zealand, the UK, and the US.

To help catalyze advances in AI to benefit people with disabilities, we also launched the Speech Accessibility Project. This project represents the culmination of a collaborative, multi-year effort between researchers at Google, Amazon, Apple, Meta, Microsoft, and the University of Illinois Urbana-Champaign. Together, this group built a large dataset of impaired speech that is available to developers to empower research and product development for accessibility applications. This work also complements our efforts to assist people with severe motor and speech impairments through improvements to techniques that make use of a user’s eye gaze.

Health

We’re also focused on building technology to better the lives of people affected by chronic health conditions, while addressing systemic inequities, and allowing for transparent data collection. As consumer technologies — such as fitness trackers and mobile phones — become central in data collection for health, we’ve explored use of technology to improve interpretability of clinical risk scores and to better predict disability scores in chronic diseases, leading to earlier treatment and care. And, we advocated for the importance of infrastructure and engineering in this space.

Many health applications use algorithms that are designed to calculate biometrics and benchmarks, and generate recommendations based on variables that include sex at birth, but might not account for users’ current gender identity. To address this issue, we completed a large, international study of trans and non-binary users of consumer technologies and digital health applications to learn how data collection and algorithms used in these technologies can evolve to achieve fairness.

Media

We partnered with the Geena Davis Institute on Gender in Media (GDI) and the Signal Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC) to study 12 years of representation in TV. Based on an analysis of over 440 hours of TV programming, the report highlights findings and brings attention to significant disparities in screen and speaking time for light and dark skinned characters, male and female characters, and younger and older characters. This first-of-its-kind collaboration uses advanced AI models to understand how people-oriented stories are portrayed in media, with the ultimate goal to inspire equitable representation in mainstream media.

MUSE demo. Source: Video Collection / Getty Images.

Plans for 2023 and Beyond

We’re committed to creating research and products that exemplify positive, inclusive, and safe experiences for everyone. This begins by understanding the many aspects of AI risks and safety inherent in the innovative work that we do, and including diverse sets of voices in coming to this understanding.

  • Responsible AI Research Advancements: We will strive to understand the implications of the technology that we create, through improved metrics and evaluations, and devise methodology to enable people to use technology to become better world citizens.
  • Responsible AI Research in Products: As products leverage new AI capabilities for new user experiences, we will continue to collaborate closely with product teams to understand and measure their societal impacts and to develop new modeling techniques that enable the products to uphold Google’s AI Principles.
  • Tools and Techniques: We will develop novel techniques to advance our ability to discover unknown failures, explain model behaviors, and to improve model output through training, responsible generation, and failure mitigation.
  • Demonstrating AI’s Societal Benefit: We plan to expand our efforts on AI for the Global Goals, bringing together research, technology, and funding to accelerate progress on the Sustainable Development Goals. This commitment will include $25 million to support NGOs and social enterprises. We will further our work on inclusion and equity by forming more collaborations with community-based experts and impacted communities. This includes continuing the Equitable AI Research Roundtables (EARR), focused on the potential impacts and downstream harms of AI with community-based experts from the Othering and Belonging Institute at UC Berkeley, PolicyLink, and Emory University School of Law.

Building ML models and products in a responsible and ethical manner is both our core focus and core commitment.

Acknowledgements

This work reflects the efforts from across the Responsible AI and Human-Centered Technology community, from researchers and engineers to product and program managers, all of whom contribute to bringing our work to the AI community.

Google Research, 2022 & Beyond

This was the second blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models | Computer Vision | Multimodal Models
Generative Models | Responsible AI | Algorithms*
ML & Computer Systems | Robotics | Health
General Science & Quantum | Community Engagement
* Articles will be linked as they are released.
