HDR+ with Bracketing on Pixel Phones

Posted by Manfred Ernst and Bartlomiej Wronski, Software Engineers, Google Research

We’re continuously working to improve the Pixel — making it more helpful, more capable, and more fun — with regular updates, such as the recent V8.2 update to the Camera app. One such improvement (launched on Pixel 5 and Pixel 4a 5G in October) is a feature that operates “under the hood”, HDR+ with Bracketing. This feature works by merging images taken with different exposure times to improve image quality (especially in shadows), resulting in more natural colors, improved details and texture, and reduced noise.

Why Are HDR Scenes Hard to Capture?
The original HDR+ burst photography system is the engine behind high-quality mobile photography, which captures a rapid series of deliberately underexposed images, then combines and renders them in a way that preserves detail across the range of tones. But this system had one limitation: scenes with high dynamic range (HDR) like the one below were noisy in the shadows because all images captured are underexposed.

The same photo using HDR+ (red outline) and HDR+ with Bracketing (green outline). While the characteristic HDR+ look remains the same, bracketing improves image quality, especially in shadows, with more natural colors, improved details and texture, and reduced noise.

Capturing HDR scenes is difficult because of the physical constraints of image sensors combined with limited signal in the shadows. We can correctly expose either the shadows or the highlights, but not both at the same time.

The same scene shot with different exposure settings and tonemapped to similar overall brightness. Left/Top: Exposure set for the highlights. The bright blue sky is preserved, but the shadows are very noisy. Right/Bottom: Exposure set for the shadows. Noise in the shadows is reduced, but the sky is clipped (white).

Photographers sometimes work around these limitations by taking two different exposures and combining them. This approach, known as exposure bracketing, can deliver the best of both worlds, but it is time-consuming to do by hand. It is also challenging in computational photography because it requires:

  1. Capturing additional long exposure frames while maintaining the fast, predictable capture experience of the Pixel camera.
  2. Taking advantage of long exposure frames while avoiding ghosting artifacts caused by motion between frames.

To avoid these challenges, the original HDR+ system used a different approach to handle high dynamic range scenes.

The Limits of HDR+
The capture strategy used by HDR+ is based on underexposure, which avoids loss of detail in the highlights. While this strategy comes at the expense of noise in the shadows, HDR+ offsets the increased noise through the use of burst photography.

Using bursts to improve image quality. HDR+ starts from a burst of full-resolution raw images (left). Depending on conditions, between 2 and 15 images are aligned and merged into a computational raw image (middle). The merged image has reduced noise and increased dynamic range, leading to a higher quality final result (right).

This approach works well for scenes with moderate dynamic range, but breaks down for HDR scenes. To understand why, we need to take a closer look at how two types of noise get into an image.

Noise in Burst Photography
One important type of noise is called shot noise, which depends only on the total amount of light captured: the sum of N frames, each with E seconds of exposure time, has the same amount of shot noise as a single frame exposed for N × E seconds. If this were the only type of noise present in captured images, burst photography would be as efficient as taking longer exposures. Unfortunately, a second type of noise, read noise, is introduced by the sensor every time a frame is captured. Read noise doesn’t depend on the amount of light captured but instead depends on the number of frames taken; that is, with each frame taken, an additional fixed amount of read noise is added.

This is why using burst photography to reduce total noise isn’t as efficient as simply taking longer exposures: taking multiple frames can reduce the effect of shot noise, but will also increase read noise. Even though read noise increases with the number of frames, it is still possible to reduce the overall noisiness with burst photography, but it becomes less efficient. If one were to break a long exposure into N shorter exposures, the ratio of signal to noise in the final image would be lower because of the additional read noise. In this case, where read noise dominates (as it does in deep shadows), one would need to merge N² short-exposure frames to get back to the signal-to-noise ratio of the single long exposure. In the example below, if a long exposure were divided into 12 short exposures, we’d have to capture 144 (12 × 12) short frames to match the signal-to-noise ratio in the shadows! Capturing and processing this many frames would be much more time consuming: burst capture and processing could take over a minute and result in a poor user experience. Instead, with bracketing one can capture both short and long exposures, combining highlight protection and noise reduction.
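
The frame-count argument above can be checked with a few lines of arithmetic. The sketch below assumes a simple textbook noise model that is not spelled out in this post: shot noise is Poisson (variance equal to the collected signal) and every frame adds an independent read noise with a fixed standard deviation. The function names and the numeric values are illustrative only.

```python
import math

def snr(signal_per_second, exposure_s, n_frames, read_noise):
    """Signal-to-noise ratio of n_frames merged frames, each exposed for exposure_s seconds."""
    signal = signal_per_second * exposure_s * n_frames  # total light collected by the burst
    shot_variance = signal                              # Poisson assumption: variance equals the mean
    read_variance = n_frames * read_noise ** 2          # one fixed dose of read noise per frame
    return signal / math.sqrt(shot_variance + read_variance)

def short_frames_to_match(signal_per_second, long_exposure_s, n_split, read_noise):
    """Smallest number of short frames (each long_exposure_s / n_split long) whose
    merged SNR reaches the SNR of one long exposure."""
    target = snr(signal_per_second, long_exposure_s, 1, read_noise)
    short_exposure_s = long_exposure_s / n_split
    frames = 1
    while snr(signal_per_second, short_exposure_s, frames, read_noise) < target:
        frames += 1
    return frames

# Deep shadow: very little light, so read noise dominates.
frames_needed = short_frames_to_match(
    signal_per_second=0.001, long_exposure_s=1.0, n_split=12, read_noise=5.0)
print(frames_needed)
```

Under this model the required number of short frames always falls between N (no read noise at all) and N² (read noise dominates), which is why the penalty is felt most strongly in the shadows.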

Left: The result of merging 12 short-exposure frames in Night Sight mode. Right: A single frame whose exposure time is 12 times longer than an individual short exposure. The longer exposure has significantly less noise in the shadows but sacrifices the highlights.

Solving with Bracketing
While the challenges of bracketing prevented the original HDR+ system from using it, incremental improvements since then, plus a recent concentrated effort, have made it possible in the Camera app. To start, adding bracketing to HDR+ required redesigning the capture strategy. Capturing is complicated by zero shutter lag (ZSL), which underpins the fast capture experience on Pixel. With ZSL, the frames displayed in the viewfinder before the shutter press are the frames we use for HDR+ burst merging. For bracketing, we capture an additional long exposure frame after the shutter press, which is not shown in the viewfinder. Note that holding the camera still for half a second after the shutter press to accommodate the long exposure can help improve image quality, even with a typical amount of handshake.

Capture strategy. Top: The original HDR+ method captures short exposures before the shutter press, six in this example. Bottom: HDR+ with Bracketing captures five short exposures before the shutter press and one long exposure after the shutter press.

For Night Sight, the capture strategy isn’t constrained by the viewfinder — because all frames are captured after the shutter press while the viewfinder is stopped, this mode easily accommodates capturing longer exposure frames. In this case, we capture three long exposures to further reduce noise.

Capture strategy for Night Sight. Top: The original Night Sight captured 15 short exposure frames. Bottom: Night Sight with bracketing captures 12 short and 3 long exposures.

The Merging Algorithm
When merging bracketed shots, we choose one of the short frames as the reference frame to avoid potentially clipped highlights and motion blur. All other frames are aligned to this frame before they are merged. This introduces a challenge — for complex scene motion or occluded regions, it is impossible to find exactly matching regions and a naïve merge algorithm would produce ghosting artifacts in these cases.

Left: Ghosting artifacts are visible around the silhouette of a moving person, when deghosting is disabled.
Right: Robust merging produces a clean image.

To address this, we designed a new spatial merge algorithm, similar to the one used for Super Res Zoom, that decides per pixel whether image content should be merged or not. This deghosting is more complicated for frames with different exposures. Long exposure frames have different noise characteristics, clipped highlights, and different amounts of motion blur, which makes comparisons with the short exposure reference frame more difficult. In addition, ghosting artifacts are more visible in bracketed shots, because noise that would otherwise mask these errors is reduced. Despite those challenges, our algorithm is as robust to these issues as the original HDR+ and Super Res Zoom and doesn’t produce ghosting artifacts. At the same time, it merges images 40% faster than its predecessors. Because it merges RAW images early in the photographic pipeline, we were able to achieve all of those benefits while keeping the rest of processing and the signature HDR+ look unchanged. Furthermore, users who prefer to use computational RAW images can take advantage of those image quality and performance improvements.
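
To make the idea of a per-pixel merge decision concrete, here is a deliberately simplified sketch. It is not the algorithm that ships in the Camera app: it assumes frames are already aligned and normalized to the same brightness, works on flat lists of pixel values, and uses a hypothetical linear weighting rule with a single noise level (`noise_sigma`) and a made-up `tolerance` parameter.

```python
def robust_merge(reference, alternates, noise_sigma, tolerance=3.0):
    """Merge aligned frames into the reference frame, one pixel at a time.

    reference:  list of pixel values from the short-exposure reference frame
    alternates: list of frames (same length), assumed aligned and exposure-normalized
    """
    merged = []
    for i, ref_value in enumerate(reference):
        total, weight_sum = ref_value, 1.0  # the reference always contributes fully
        for frame in alternates:
            difference = abs(frame[i] - ref_value)
            # Weight drops from 1 to 0 as the difference exceeds what noise alone explains,
            # so mismatched content (motion, occlusion) is left out of the merge.
            weight = max(0.0, 1.0 - difference / (tolerance * noise_sigma))
            total += weight * frame[i]
            weight_sum += weight
        merged.append(total / weight_sum)
    return merged

# Pixel 0 agrees across frames and is averaged; pixel 1 differs strongly (a moving object)
# and falls back to the reference value instead of producing a ghost.
print(robust_merge([10.0, 10.0], [[10.0, 50.0]], noise_sigma=1.0))
```

A production merge would also need to account for the different noise characteristics, clipped highlights, and motion blur of the long exposure described above, none of which this sketch models.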

Bracketing on Pixel
HDR+ with Bracketing is available to users of Pixel 4a (5G) and 5 in the default camera, as well as in Night Sight and Portrait modes. For users of Pixel 4 and 4a, the Google Camera app supports bracketing in Night Sight mode. No user interaction is needed to activate HDR+ with Bracketing: depending on the dynamic range of the scene and the presence of motion, HDR+ with Bracketing chooses the best exposures to maximize image quality.

Acknowledgements
HDR+ with Bracketing is the result of a collaboration across several teams at Google. The project would not have been possible without the joint efforts of Sam Hasinoff, Dillon Sharlet, Kiran Murthy, Mike Milne, Andy Radin, Nicholas Wilson, Navin Sarma, Gabriel Nava, Emily To, Sushil Nath, Alexander Schiffhauer, Isaac Reynolds, Bill Strathearn, Marius Renn, Alex Hong, Jose Ricardo Lima, Bob Hung, Ying Chen Lou, Joy Hsu, Blade Chiu, David Massoud, Jean Hsu, Ellie Yang, and Marc Levoy.

Evolving Reinforcement Learning Algorithms

Posted by John D. Co-Reyes, Research Intern and Yingjie Miao, Senior Software Engineer, Google Research

A long-term, overarching goal of research into reinforcement learning (RL) is to design a single general purpose learning algorithm that can solve a wide array of problems. However, because the RL algorithm taxonomy is quite large, and designing new RL algorithms requires extensive tuning and validation, this goal is a daunting one. A possible solution would be to devise a meta-learning method that could design new RL algorithms that generalize to a wide variety of tasks automatically.

In recent years, AutoML has shown great success in automating the design of machine learning components, such as neural networks architectures and model update rules. One example is Neural Architecture Search (NAS), which has been used to develop better neural network architectures for image classification and efficient architectures for running on phones and hardware accelerators. In addition to NAS, AutoML-Zero shows that it’s even possible to learn the entire algorithm from scratch using basic mathematical operations. One common theme in these approaches is that the neural network architecture or the entire algorithm is represented by a graph, and a separate algorithm is used to optimize the graph for certain objectives.

These earlier approaches were designed for supervised learning, in which the overall algorithm is more straightforward. But in RL, there are more components of the algorithm that could be potential targets for design automation (e.g., neural network architectures for agent networks, strategies for sampling from the replay buffer, overall formulation of the loss function), and it is not always clear what the best model update procedure would be to integrate these components. Prior efforts to automate RL algorithm discovery have focused primarily on model update rules. These approaches learn the optimizer or RL update procedure itself and commonly represent the update rule with a neural network such as an RNN or CNN, which can be efficiently optimized with gradient-based methods. However, these learned rules are not interpretable or generalizable, because the learned weights are opaque and domain specific.

In our paper “Evolving Reinforcement Learning Algorithms”, accepted at ICLR 2021, we show that it’s possible to learn new, analytically interpretable and generalizable RL algorithms by using a graph representation and applying optimization techniques from the AutoML community. In particular, we represent the loss function, which is used to optimize an agent’s parameters over its experience, as a computational graph, and use Regularized Evolution to evolve a population of the computational graphs over a set of simple training environments. This results in increasingly better RL algorithms, and the discovered algorithms generalize to more complex environments, even those with visual observations like Atari games.

RL Algorithm as a Computational Graph
Inspired by ideas from NAS, which searches over the space of graphs representing neural network architectures, we meta-learn RL algorithms by representing the loss function of an RL algorithm as a computational graph. In this case, we use a directed acyclic graph for the loss function, with nodes representing inputs, operators, parameters and outputs. For example, in the computational graph for DQN, input nodes include data from the replay buffer, operator nodes include neural network operators and basic math operators, and the output node represents the loss, which will be minimized with gradient descent.

There are a few benefits of such a representation. This representation is expressive enough to define existing algorithms but also new, undiscovered algorithms. It is also interpretable. This graph representation can be analyzed in the same way as human designed RL algorithms, making it more interpretable than approaches that use black box function approximators for the entire RL update procedure. If researchers can understand why a learned algorithm is better, then they can both modify the internal components of the algorithm to improve it and transfer the beneficial components to other problems. Finally, the representation supports general algorithms that can solve a wide variety of problems.

Example computation graph for DQN which computes the squared Bellman error.
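
As a rough illustration of what such a graph looks like in code, the sketch below builds the squared Bellman error for DQN out of input, parameter, and operator nodes. The `Node` class and helper functions are hypothetical and written for this example only; they are not the PyGlove-based representation used in the paper, and the "networks" are toy lookup tables rather than neural networks.

```python
class Node:
    """A node in a directed acyclic graph; `fn` combines the values of its parent nodes."""
    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents

    def evaluate(self, inputs):
        return self.fn(inputs, *(p.evaluate(inputs) for p in self.parents))

def input_node(name):
    return Node(lambda inputs: inputs[name])          # reads data such as a replay-buffer sample

def param_node(value):
    return Node(lambda inputs: value)                 # a constant such as the discount factor

def op_node(op, *parents):
    return Node(lambda inputs, *args: op(*args), *parents)

# Toy stand-ins for network operators: Q-values live in nested dictionaries.
def q_value(q_table, state, action):
    return q_table[state][action]

def max_q(q_table, state):
    return max(q_table[state].values())

def build_dqn_loss_graph(gamma=0.9):
    q, q_target = input_node("q"), input_node("q_target")
    state, action, reward, next_state = (
        input_node(k) for k in ("state", "action", "reward", "next_state"))
    q_sa = op_node(q_value, q, state, action)
    bootstrap = op_node(max_q, q_target, next_state)
    target = op_node(lambda r, g, m: r + g * m, reward, param_node(gamma), bootstrap)
    delta = op_node(lambda a, b: a - b, q_sa, target)
    return op_node(lambda d: d * d, delta)            # output node: squared Bellman error

table = {"s0": {"left": 1.0, "right": 2.0}, "s1": {"left": 0.0, "right": 3.0}}
sample = {"q": table, "q_target": table, "state": "s0", "action": "right",
          "reward": 1.0, "next_state": "s1"}
loss = build_dqn_loss_graph().evaluate(sample)
print(loss)
```

Because the loss is just a graph of small named operations, a search procedure can swap, add, or remove nodes, and a person can still read the result as a formula.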

We implemented this representation using the PyGlove library, which conveniently turns the graph into a search space that can be optimized with regularized evolution.

Evolving RL Algorithms
We use an evolution-based approach to optimize the RL algorithms of interest. First, we initialize a population of training agents with randomized graphs. This population of agents is trained in parallel over a set of training environments. The agents first train on a hurdle environment, an easy environment such as CartPole that is intended to quickly weed out poorly performing programs.

If an agent cannot solve the hurdle environment, the training is stopped early with a score of zero. Otherwise the training proceeds to more difficult environments (e.g., Lunar Lander, simple MiniGrid environments, etc.). The algorithm performance is evaluated and used to update the population, where more promising algorithms are further mutated. To reduce the search space, we use a functional equivalence checker which will skip over newly proposed algorithms if they are functionally the same as previously examined algorithms. This loop continues as new mutated candidate algorithms are trained and evaluated. At the end of training, we select the best algorithm and evaluate its performance over a set of unseen test environments.
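
The loop described above can be sketched on a toy problem. The code below follows the general shape of regularized (aging) evolution: sample a small tournament, mutate the best contender, and retire the oldest member. It also includes a hurdle check and an equivalence check based on probing candidate outputs. Everything here is an assumption made for illustration: the candidates are three integers defining a tiny function rather than RL loss graphs, and the toy scoring functions, parameter names, and thresholds are invented.

```python
import collections
import random

def regularized_evolution(initial, mutate, hurdle, full_score, fingerprint,
                          cycles=300, tournament_size=5, hurdle_threshold=0.5, seed=0):
    rng = random.Random(seed)
    seen, history = set(), []
    population = collections.deque()

    def evaluate(candidate):
        if hurdle(candidate) < hurdle_threshold:
            return 0.0                      # failed the hurdle: stop early with a score of zero
        return full_score(candidate)        # otherwise run the more expensive evaluation

    for candidate in initial:
        seen.add(fingerprint(candidate))
        entry = (evaluate(candidate), candidate)
        population.append(entry)
        history.append(entry)

    for _ in range(cycles):
        contenders = rng.sample(list(population), min(tournament_size, len(population)))
        parent = max(contenders, key=lambda e: e[0])[1]
        child = mutate(parent, rng)
        key = fingerprint(child)
        if key in seen:
            continue                        # functionally equivalent to an examined candidate
        seen.add(key)
        entry = (evaluate(child), child)
        population.append(entry)
        history.append(entry)
        population.popleft()                # regularization: remove the oldest, not the worst
    return max(history, key=lambda e: e[0])

# Toy search space: candidate (a, b, c) computes a*x + b*x + c; the goal is 3*x - 2.
PROBES = (-1.0, 0.0, 1.0, 2.0)
def run(candidate, x):
    a, b, c = candidate
    return a * x + b * x + c
def goal(x):
    return 3 * x - 2
def toy_fingerprint(candidate):
    return tuple(run(candidate, x) for x in PROBES)   # (1, 2, 0) and (2, 1, 0) look identical
def toy_hurdle(candidate):
    return 1.0 if abs(run(candidate, 0.0) - goal(0.0)) <= 5 else 0.0
def toy_score(candidate):
    error = sum((run(candidate, x) - goal(x)) ** 2 for x in PROBES)
    return 1.0 / (1.0 + error)
def toy_mutate(candidate, rng):
    changed = list(candidate)
    changed[rng.randrange(3)] += rng.choice((-1, 1))
    return tuple(changed)

best_score, best = regularized_evolution(
    [(0, 0, 0), (1, 0, 0), (0, 0, 1), (0, 1, -1)],
    toy_mutate, toy_hurdle, toy_score, toy_fingerprint)
print(best_score, best)
```

Removing the oldest member rather than the worst one keeps the population turning over, so a candidate that scored well by luck cannot dominate the search indefinitely.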

The population size in the experiments was around 300 agents, and we observed the evolution of good candidate loss functions after 20-50 thousand mutations, requiring about three days of training. We were able to train on CPUs because the training environments were simple, controlling for the computational and energy cost of training. To further control the cost of training, we seeded the initial population with human-designed RL algorithms such as DQN.

Overview of meta-learning method. Newly proposed algorithms must first perform well on a hurdle environment before being trained on a set of harder environments. Algorithm performance is used to update a population where better performing algorithms are further mutated into new algorithms. At the end of training, the best performing algorithm is evaluated on test environments.

Learned Algorithms
We highlight two discovered algorithms that exhibit good generalization performance. The first is DQNReg, which builds on DQN by adding a weighted penalty on the Q-values to the normal squared Bellman error. The second learned loss function, DQNClipped, is more complex, although its dominating term has a simple form — the max of the Q-value and the squared Bellman error (modulo a constant). Both algorithms can be viewed as a way to regularize the Q-values. While DQNReg adds a soft constraint, DQNClipped can be interpreted as a kind of constrained optimization that will minimize the Q-values if they become too large. We show that this learned constraint kicks in during the early stage of training when overestimating the Q-values is a potential issue. Once this constraint is satisfied, the loss will instead minimize the original squared Bellman error.
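
The two losses can be compared side by side on a single transition. This is a sketch based only on the verbal description above: the penalty weight of 0.1 is a placeholder value, and the clipped variant shows just the dominating term with an unspecified constant (set to zero here), not the full learned loss function.

```python
def bellman_error(q_sa, reward, gamma, max_next_q_target):
    """Difference between the current Q-value and its bootstrapped target."""
    return q_sa - (reward + gamma * max_next_q_target)

def dqn_loss(q_sa, reward, gamma, max_next_q_target):
    return bellman_error(q_sa, reward, gamma, max_next_q_target) ** 2

def dqnreg_loss(q_sa, reward, gamma, max_next_q_target, weight=0.1):
    # Soft constraint: a weighted penalty on the Q-value is added to the squared Bellman error.
    # `weight` is an illustrative placeholder value.
    return weight * q_sa + dqn_loss(q_sa, reward, gamma, max_next_q_target)

def dqnclipped_dominant_term(q_sa, reward, gamma, max_next_q_target, constant=0.0):
    # Only the dominating term: when the Q-value exceeds the squared Bellman error (plus a
    # constant), the loss pushes the Q-value itself down; otherwise it behaves like DQN.
    return max(q_sa, dqn_loss(q_sa, reward, gamma, max_next_q_target) + constant)

# Moderate Q-value with a sizable Bellman error: both losses track the Bellman error.
print(dqnreg_loss(2.0, 1.0, 0.9, 3.0), dqnclipped_dominant_term(2.0, 1.0, 0.9, 3.0))
# Large, self-consistent Q-value (zero Bellman error): plain DQN sees nothing to fix,
# while the other two still discourage the large value.
print(dqn_loss(10.0, 1.0, 0.9, 10.0), dqnreg_loss(10.0, 1.0, 0.9, 10.0),
      dqnclipped_dominant_term(10.0, 1.0, 0.9, 10.0))
```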

A closer analysis shows that while baselines like DQN commonly overestimate Q-values, our learned algorithms address this issue in different ways. DQNReg underestimates the Q-values, while DQNClipped has similar behavior to Double DQN (DDQN) in that it slowly approaches the ground truth without overestimating it.

It’s worth pointing out that these two algorithms consistently emerge when the evolution is seeded with DQN. Learning from scratch, the method rediscovers the TD algorithm. For completeness, we release a dataset of the top 1000 performing algorithms discovered during evolution. Curious readers could further investigate the properties of these learned loss functions.

Overestimated values are generally a problem in value-based RL. Our method learns algorithms that have found a way to regularize the Q-values and thus reduce overestimation.

Learned Algorithms Generalization Performance
Normally in RL, generalization refers to a trained policy generalizing across tasks. However, in this work we’re interested in algorithmic generalization performance, which means how well an algorithm works over a set of environments. On a set of classical control environments, the learned algorithms can match baselines on the dense reward tasks (CartPole, Acrobot, LunarLander) and outperform DQN on the sparser reward task, MountainCar.

Performance of learned algorithms versus baselines on classical control environments.

On a set of sparse reward MiniGrid environments, which test a variety of different tasks, we see that DQNReg greatly outperforms baselines on both the training and test environments, in terms of sample efficiency and final performance. In fact, the effect is even more pronounced on the test environments, which vary in size, configuration, and existence of new obstacles, such as lava.

Training environment performance versus training steps as measured by episode return over 10 training seeds. DQNReg can match or outperform baselines in sample efficiency and final performance.
DQNReg can greatly outperform baselines on unseen test environments.

We visualize the performance of normal DDQN vs. the learned algorithm DQNReg on a few MiniGrid environments. The starting location, wall configuration, and object configuration of these environments are randomized at each reset, which requires the agent to generalize instead of simply memorizing the environment. While DDQN often struggles to learn any meaningful behavior, DQNReg can learn the optimal behavior efficiently.

DDQN
DQNReg (Learned) 

We observe improved performance on image-based Atari environments as well, even though training was on non-image-based environments. This suggests that meta-training on a set of cheap but diverse training environments with a generalizable algorithm representation could enable radical algorithmic generalization.

Env          DQN       DDQN      PPO       DQNReg
Asteroid     1364.5    734.7     2097.5    2390.4
Bowling      50.4      68.1      40.1      80.5
Boxing       88.0      91.6      94.6      100.0
RoadRunner   39544.0   44127.0   35466.0   65516.0
Performance of learned algorithm, DQNReg, against baselines on several Atari games. Performance is evaluated over 200 test episodes every 1 million steps.

Conclusion
In this post, we’ve discussed learning new interpretable RL algorithms by representing their loss functions as computational graphs and evolving a population of agents over this representation. The computational graph formulation allows researchers to both build upon human-designed algorithms and study the learned algorithms using the same mathematical toolset as the existing algorithms. We analyzed a few of the learned algorithms and can interpret them as a form of entropy regularization to prevent value overestimation. These learned algorithms can outperform baselines and generalize to unseen environments. The top performing algorithms are available for further analytical study.

We hope that future work will extend to more varied RL settings such as actor critic algorithms or offline RL. Furthermore we hope that this work can lead to machine assisted algorithm development where computational meta-learning can help researchers find new directions to pursue and incorporate learned algorithms into their own work.

Acknowledgements
We thank our co-authors Daiyi Peng, Esteban Real, Sergey Levine, Quoc V. Le, Honglak Lee, and Aleksandra Faust. We also thank Luke Metz for helpful early discussions and feedback on the paper, Hanjun Dai for early discussions on related research ideas, Xingyou Song, Krzysztof Choromanski, and Kevin Wu for helping with infrastructure, and Jongwook Choi for helping with environment selection. Finally we thank Tom Small for designing animations for this post.

A whale of a tale about responsibility and AI

A couple of years ago, Google AI for Social Good’s Bioacoustics team created an ML model that helps the scientific community detect the presence of humpback whale sounds using an acoustic recording. This tool, developed in partnership with the National Oceanic and Atmospheric Administration, helps biologists study whale behaviors, patterns, population and potential human interactions.

We realized other researchers could use this model for their work, too: it could help them better understand the oceans and protect key biodiversity areas. We wanted to freely share this model, but struggled with a big dilemma: On one hand, it could help ocean scientists. On the other, though, we worried about whale poachers or other bad actors. What if they used our shared knowledge in a way we didn’t intend?

We decided to consult with experts in the field in order to help us responsibly open source this machine learning model. We worked with Google’s Responsible Innovation team to use our AI Principles, a guide to responsibly developing technology, to make a decision.

The team gave us the guidance we needed to open source a machine learning model that could be socially beneficial and was built and tested for safety, while also upholding high standards of scientific excellence for the marine biologists and researchers worldwide. 

On Earth Day — and every day — putting the AI Principles into practice is important to the communities we serve, on land and in the sea. 

Curious about diving deeper? You can use AI to explore thousands of hours of humpback whale songs and make your own discoveries with our Pattern Radio, and see our collaboration with the National Oceanic and Atmospheric Administration of the United States as well as our work with Fisheries and Oceans Canada (DFO) to apply machine learning to protect killer whales in the Salish Sea.

How we’re minimizing AI’s carbon footprint

A photograph of a textbook about computer architecture.

The book that led to my visit to Google.

When I first visited Google back in 2002, I was a computer science professor at UC Berkeley. My colleague John Hennessy and I were updating our textbook on computer architecture, and Larry Page, who rode a hot-rodded electric scooter at the time, agreed to show me how his then three-year-old company designed its computing for Search. I remember the setup was lean yet powerful: just 6,000 low-cost PC servers and 12,000 PC disks answering 70 million queries around the world, every day. It was my first real look at how Google built its computer systems from the ground up, optimizing for efficiency at every level.

When I joined the company in 2016, it was with the goal of helping research how to maximize the efficiency of computer systems built specifically for artificial intelligence. Last year, Google set an ambitious goal of operating on 24/7 carbon-free energy, everywhere, by the end of the decade. But at the same time, machine learning systems are quickly becoming larger and more capable. What will be the environmental impact of those systems — and how can we neutralize that impact going forward? 

Today, we’re publishing a detailed analysis that addresses both of those questions. It’s an account of the energy and carbon costs of training six state-of-the-art ML models, including five of our own. (Training a model is like building infrastructure: You spend the energy to train the model once, after which it’s used and reused many times, possibly by hundreds of millions of people.) To our knowledge, it’s the most thorough evaluation of its kind yet published. And while we had reason to believe our systems were efficient, we were encouraged by just how efficient they turned out to be.

For instance, we found that developing the Evolved Transformer model, a more efficient version of the popular Transformer architecture for ML, emitted nearly 100 times less carbon dioxide equivalent than a widely cited estimate. Of the roughly 12.7 terawatt-hours of electricity that Google uses every year, less than 1/200th of a percent of it was spent training our most computationally demanding models.  

What’s more, our analysis found that there already exist many ways to develop and train ML systems even more efficiently: Specially designed models, processors and data centers can dramatically reduce energy requirements, while the right selection of energy sources can go a long way to reduce the carbon that’s emitted during training. In fact, the right combination of model, processor, data center and energy source can reduce the carbon footprint of training an ML system by 1000 times. 

There’s no one easy trick for achieving a reduction that large, so let’s unpack that figure.  Minimizing a system’s carbon footprint is a two-part problem: First you want to minimize the energy the system consumes, then you have to supply that energy from the cleanest source possible.

Our analysis took a closer look at GShard and Switch Transformer, two models recently developed at Google Research. They’re the largest models we’ve ever created, but they both use a technique called sparse activation that enables them to only use a small fraction of their total architecture for a given task. It’s a bit like how your brain uses a small fraction of its 100 billion neurons to help you read this sentence. The result is that these sparse models consume less than one tenth the energy that you’d expect of similarly sized dense models — without sacrificing accuracy.

But to minimize ML’s energy use, you need more than just efficient models — you also need efficient processors and data centers to train and serve them. Google’s Tensor Processing Units (TPUs) are specifically designed for machine learning, which makes them up to five times more efficient than off-the-shelf processors. And the cloud computing data centers that house those TPUs are up to twice as efficient as typical enterprise data centers. 

Once you’ve minimized your energy requirements, you have to think about where that energy originates. The electricity a data center consumes is determined by the grid where it’s located. And depending on what resources were used to generate the electricity on that grid, this may emit carbon. 

The carbon intensity of grids varies greatly across regions, so it really matters where models are trained. For instance, the mix of energy supplying Google’s Iowa data center produces 0.080 kg of CO2e per kilowatt hour of electricity, when combining the electricity supplied by the grid and produced by Google’s wind farms in Iowa. That’s 5.4 times less than the U.S. average.
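
To see what the Iowa figure means in practice, the short calculation below multiplies energy use by carbon intensity. The U.S. average is derived from the two numbers quoted above (0.080 kg per kilowatt hour and the 5.4× ratio), so it is approximate, and the 10,000 kWh training job is a made-up example rather than a measurement of any real model.

```python
IOWA_KG_CO2E_PER_KWH = 0.080          # quoted above for Google's Iowa data center
US_AVERAGE_RATIO = 5.4                # "5.4 times less than the U.S. average"
US_AVERAGE_KG_CO2E_PER_KWH = IOWA_KG_CO2E_PER_KWH * US_AVERAGE_RATIO  # roughly 0.432

def emissions_kg(energy_kwh, kg_co2e_per_kwh):
    """Carbon footprint = energy consumed x carbon intensity of the electricity."""
    return energy_kwh * kg_co2e_per_kwh

hypothetical_job_kwh = 10_000         # illustrative training run, not a real measurement
in_iowa = emissions_kg(hypothetical_job_kwh, IOWA_KG_CO2E_PER_KWH)
on_average_grid = emissions_kg(hypothetical_job_kwh, US_AVERAGE_KG_CO2E_PER_KWH)
print(f"Iowa: {in_iowa:.0f} kg CO2e, U.S. average grid: {on_average_grid:.0f} kg CO2e")
```

The same job emits about 800 kg of CO2e in Iowa versus about 4,320 kg on an average U.S. grid, before any savings from the model, processor, or data center are counted.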

Any one of these four factors — models, chips, data centers and energy sources — can have a sizable effect on the costs associated with developing an ML system. But their cumulative impact can be enormous.

When John and I updated our textbook with what we’d learned on our visit to Google back in 2002, we wrote that “reducing the power per PC [server]” presented “a major opportunity for the future.” Nearly 20 years later, Google has found many opportunities to streamline its systems — but plenty remain to be seized. As a result of our analysis, we’ve already begun shifting where we train our computationally intensive ML models. We’re optimizing data center efficiency by shifting compute tasks to times when low-carbon power sources are most plentiful. Our Oklahoma data center, in addition to receiving its energy from cleaner sources, will house many of our next generation of TPUs, which are even more efficient than their predecessors. And sparse activation is just one example of the algorithmic ingenuity Google is using to design ML models that work smarter, not harder.

MaX-DeepLab: Dual-Path Transformers for End-to-End Panoptic Segmentation

Posted by Huiyu Wang, Student Researcher and Liang-Chieh Chen, Research Scientist, Google Research

Panoptic segmentation is a computer vision task that unifies semantic segmentation (assigning a class label to each pixel) and instance segmentation (detecting and segmenting each object instance). A core task for real-world applications, panoptic segmentation predicts a set of non-overlapping masks along with their corresponding class labels (i.e., category of object, like “car”, “traffic light”, “road”, etc.) and is generally accomplished using multiple surrogate sub-tasks that approximate (e.g., by using box detection methods) the goals of panoptic segmentation.

An example image and its panoptic segmentation masks from the Cityscapes dataset.
Previous methods approximate panoptic segmentation with a tree of surrogate sub-tasks.

Each surrogate sub-task in this proxy tree introduces extra manually-designed modules, such as anchor design rules, box assignment rules, non-maximum suppression (NMS), thing-stuff merging, etc. Although there are good solutions to individual surrogate sub-tasks and modules, undesired artifacts are introduced when these sub-tasks come together in a pipeline for panoptic segmentation, especially in challenging conditions (e.g., two people with similar bounding boxes will trigger NMS, resulting in a missing mask).

Previous efforts, such as DETR, attempted to solve some of these issues by simplifying the box detection sub-task into an end-to-end operation, which is more computationally efficient and results in fewer undesired artifacts. However, the training process still relies heavily on box detection, which does not align with the mask-based definition of panoptic segmentation. Another line of work completely removes boxes from the pipeline, which has the benefit of removing an entire surrogate sub-task along with its associated modules and artifacts. For example, Axial-DeepLab predicts pixel-wise offsets to predefined instance centers, but the surrogate sub-task it uses encounters challenges with highly deformable objects, which have a large variety of shapes (e.g., a cat), or nearby objects with close centers in the image plane, e.g. the image below of a dog seated in a chair.

When the centers of the dog and the chair are close to each other, Axial-DeepLab merges them into one object.

In “MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers”, to be presented at CVPR 2021, we propose the first fully end-to-end approach for the panoptic segmentation pipeline, directly predicting class-labeled masks by extending the Transformer architecture to this computer vision task. Dubbed MaX-DeepLab for extending Axial-DeepLab with a Mask Xformer, our method employs a dual-path architecture that introduces a global memory path, allowing for direct communication with any convolution layer. As a result, MaX-DeepLab shows a significant 7.1% panoptic quality (PQ) gain in the box-free regime on the challenging COCO dataset, closing the gap between box-based and box-free methods for the first time. MaX-DeepLab achieves the state-of-the-art 51.3% PQ on the COCO test-dev set, without test time augmentation.

MaX-DeepLab is fully end-to-end: It predicts panoptic segmentation masks directly from images.

End-to-End Panoptic Segmentation
Inspired by DETR, our model directly predicts a set of non-overlapping masks and their corresponding semantic labels, with output masks and classes that are optimized with a PQ-style objective. Specifically, inspired by the evaluation metric, PQ, which is defined as the recognition quality (whether or not the predicted class is correct) times the segmentation quality (whether the predicted mask is correct), we define a similarity metric between two class-labeled masks in the exact same way. The model is directly trained by maximizing this similarity between ground truth masks and predicted masks via one-to-one matching. This direct modeling of panoptic segmentation enables end-to-end training and inference, removing the hand-coded priors that are necessary in existing box-based and box-free methods.

MaX-DeepLab directly predicts N masks and N classes with a CNN and a mask transformer.
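The PQ-style similarity and one-to-one matching described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the training code: the recognition term is taken to be the predicted probability of the ground truth class, the segmentation term is taken to be a Dice overlap between soft masks, and the matching is done by brute force over permutations (a practical implementation would typically use the Hungarian algorithm). All names and the toy 4-pixel masks are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Mask = Sequence[float]                      # flattened soft mask, values in [0, 1]
Labeled = Tuple[Mask, str]                  # ground truth: (mask, class name)
Predicted = Tuple[Mask, Dict[str, float]]   # prediction: (mask, class probabilities)


def dice(a: Mask, b: Mask) -> float:
    """Dice overlap between two soft masks (assumed segmentation-quality term)."""
    inter = sum(x * y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2.0 * inter / total if total > 0 else 0.0


def similarity(gt: Labeled, pred: Predicted) -> float:
    """Recognition term times segmentation term, mirroring PQ = RQ x SQ."""
    gt_mask, gt_class = gt
    pred_mask, class_probs = pred
    return class_probs.get(gt_class, 0.0) * dice(gt_mask, pred_mask)


def best_one_to_one_matching(gts: List[Labeled],
                             preds: List[Predicted]) -> Tuple[Tuple[int, ...], float]:
    """Brute-force matching; assignment[i] is the prediction matched to gts[i].

    Requires len(preds) >= len(gts); only suitable for tiny examples.
    """
    best_total, best = -1.0, ()
    for perm in permutations(range(len(preds)), len(gts)):
        total = sum(similarity(gts[i], preds[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total, best = total, perm
    return best, best_total


# Toy 4-pixel image with a dog (left half) and a chair (right half).
ground_truth = [([1, 1, 0, 0], "dog"), ([0, 0, 1, 1], "chair")]
predictions = [
    ([0.0, 0.0, 0.9, 0.8], {"chair": 0.7, "dog": 0.1}),
    ([0.9, 1.0, 0.1, 0.0], {"dog": 0.8, "chair": 0.1}),
    ([0.0, 0.0, 0.0, 0.0], {"dog": 0.1, "chair": 0.1}),  # unused mask slot
]
assignment, total = best_one_to_one_matching(ground_truth, predictions)
```

Training would then push the matched pairs toward higher similarity, while unmatched predictions (the third slot here) are free to represent nothing.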

Dual-Path Transformer
Instead of stacking a traditional transformer on top of a convolutional neural network (CNN), we propose a dual-path framework for combining CNNs with transformers. Specifically, we enable any CNN layer to read and write to global memory by using a dual-path transformer block. This proposed block adopts all four types of attention between the CNN-path and the memory-path, and can be inserted anywhere in a CNN, enabling communication with the global memory at any layer. MaX-DeepLab also employs a stacked-hourglass-style decoder that aggregates multi-scale features into a high resolution output. The output is then multiplied with the global memory feature, to form the mask set prediction. The classes for the masks are predicted with another branch of the mask transformer.

An overview of the dual-path transformer architecture.
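As a rough illustration of the final step described above, the sketch below multiplies per-pixel decoder features with per-mask memory features and normalizes across masks. The softmax over masks, the feature sizes, and the function name are assumptions made for this example; they are one plausible way to obtain non-overlapping masks, not a description of the released model.

```python
import math
from typing import List


def predict_masks(pixel_features: List[List[float]],
                  memory_features: List[List[float]]) -> List[List[float]]:
    """Return soft mask memberships with shape [num_pixels][num_masks].

    Each pixel feature is compared with every memory feature by a dot
    product, and a softmax across masks makes the memberships of each
    pixel sum to one.
    """
    masks = []
    for f in pixel_features:
        logits = [sum(a * b for a, b in zip(f, g)) for g in memory_features]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        norm = sum(exps)
        masks.append([e / norm for e in exps])
    return masks


# Three pixels with 2-D features and two memory slots (hypothetical values).
pixel_features = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.1]]
memory_features = [[1.0, 0.0], [0.0, 1.0]]
soft_masks = predict_masks(pixel_features, memory_features)
# Taking the arg-max over masks gives each pixel exactly one mask id.
mask_ids = [row.index(max(row)) for row in soft_masks]
```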

Results
We evaluate MaX-DeepLab on one of the most challenging panoptic segmentation datasets, COCO, against the state-of-the-art box-free (Axial-DeepLab) and box-based (DetectoRS) methods. MaX-DeepLab, without test time augmentation, achieves the state-of-the-art result of 51.3% PQ on the test-dev set.

Comparison on COCO test-dev set.

This result surpasses Axial-DeepLab by 7.1% PQ in the box-free regime and DetectoRS by 1.7% PQ, bridging the gap between box-based and box-free methods for the first time. For a consistent comparison with DETR, we also evaluated a lightweight version of MaX-DeepLab that matches the number of parameters and computations of DETR. The lightweight MaX-DeepLab outperforms DETR by 3.3% PQ on the val set and 3.0% PQ on the test-dev set. In addition, we performed extensive ablation studies and analyses on our end-to-end formulation, model scaling, dual-path architectures, and loss functions. We also found that the extra-long training schedule of DETR is not necessary for MaX-DeepLab.

As an example in the figure below, MaX-DeepLab correctly segments a dog sitting on a chair. Axial-DeepLab relies on a surrogate sub-task of regressing object center offsets. It fails because the centers of the dog and the chair are close to each other. DetectoRS classifies object bounding boxes, instead of masks, as a surrogate sub-task. It filters out the chair mask because the chair bounding box has a low confidence.

A case study for MaX-DeepLab and state-of-the-art box-free and box-based methods.

Another example shows how MaX-DeepLab correctly segments images with challenging conditions.

MaX-DeepLab correctly segments the overlapping zebras. This case is also challenging for other methods since the zebras have similar bounding boxes and nearby object centers. (credit & license)

Conclusion
We have shown for the first time that panoptic segmentation can be trained end-to-end. MaX-DeepLab directly predicts masks and classes with a mask transformer, removing the need for many hand-designed priors such as object bounding boxes, thing-stuff merging, etc. Equipped with a PQ-style loss and a dual-path transformer, MaX-DeepLab achieves the state-of-the-art result on the challenging COCO dataset, closing the gap between box-based and box-free methods.

Acknowledgements
We are thankful to our co-authors, Yukun Zhu, Hartwig Adam, and Alan Yuille. We also thank Maxwell Collins, Sergey Ioffe, Jiquan Ngiam, Siyuan Qiao, Chen Wei, Jieneng Chen, and the Mobile Vision team for the support and valuable discussions.

Multi-Task Robotic Reinforcement Learning at Scale

Posted by Karol Hausman, Senior Research Scientist and Yevgen Chebotar, Research Scientist, Robotics at Google

For general-purpose robots to be most useful, they would need to be able to perform a range of tasks, such as cleaning, maintenance and delivery. But training even a single task (e.g., grasping) using offline reinforcement learning (RL), a trial and error learning method where the agent trains on previously collected data, can take thousands of robot-hours, in addition to the significant engineering needed to enable autonomous operation of a large-scale robotic system. Thus, the computational costs of building general-purpose everyday robots using current robot learning methods become prohibitive as the number of tasks grows.

Multi-task data collection across multiple robots where different robots collect data for different tasks.

In other large-scale machine learning domains, such as natural language processing and computer vision, a number of strategies have been applied to amortize the effort of learning over multiple skills. For example, pre-training on large natural language datasets can enable few- or zero-shot learning of multiple tasks, such as question answering and sentiment analysis. However, because robots collect their own data, robotic skill learning presents a unique set of opportunities and challenges. Automating this process is a large engineering endeavour, and effectively reusing past robotic data collected by different robots remains an open problem.

Today we present two new advances for robotic RL at scale: MT-Opt, a new multi-task RL system for automated data collection and multi-task RL training, and Actionable Models, which leverages the acquired data for goal-conditioned RL. MT-Opt introduces a scalable data-collection mechanism that is used to collect over 800,000 episodes of various tasks on real robots and demonstrates a successful application of multi-task RL that yields ~3x average improvement over baseline. Additionally, it enables robots to master new tasks quickly through use of its extensive multi-task dataset (new task fine-tuning in <1 day of data collection). Actionable Models enables learning in the absence of specific tasks and rewards by training an implicit model of the world that is also an actionable robotic policy. This drastically increases the number of tasks the robot can perform (via visual goal specification) and enables more efficient learning of downstream tasks.

Large-Scale Multi-Task Data Collection System
The cornerstone for both MT-Opt and Actionable Models is the volume and quality of training data. To collect diverse, multi-task data at scale, users need a way to specify tasks, decide for which tasks to collect the data, and finally, manage and balance the resulting dataset. To that end, we first create a scalable and intuitive multi-task success detector using data from all of the chosen tasks. The multi-task success detector is trained using supervised learning to detect the outcome of a given task, and it allows users to quickly define new tasks and their rewards. When this success detector is being applied to collect data, it is periodically updated to accommodate distribution shifts caused by various real-world factors, such as varying lighting conditions, changing background surroundings, and novel states that the robots discover.

Second, we simultaneously collect data for multiple distinct tasks across multiple robots by using solutions to easier tasks to effectively bootstrap learning of more complex tasks. This allows training of a policy for the harder tasks and improves the data collected for them. As such, the amount of per-task data and the number of successful episodes for each task grows over time. To further improve the performance, we focus data collection on underperforming tasks, rather than collecting data uniformly across tasks.
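One simple way to focus collection on underperforming tasks is to sample the next task with a probability that grows as its success rate drops. The sketch below is a hypothetical scheduling rule written for illustration; the task names, the `1 - success_rate` weighting, and the `floor` parameter are assumptions, not the rule used by the system described here.

```python
import random
from typing import Dict


def collection_weights(success_rates: Dict[str, float],
                       floor: float = 0.05) -> Dict[str, float]:
    """Normalized sampling weights that favor tasks with low success rates.

    The floor keeps well-solved tasks from being starved of new data.
    """
    raw = {task: max(1.0 - rate, floor) for task, rate in success_rates.items()}
    total = sum(raw.values())
    return {task: weight / total for task, weight in raw.items()}


def pick_next_task(success_rates: Dict[str, float], rng: random.Random) -> str:
    """Sample the task that the next robot episode should attempt."""
    weights = collection_weights(success_rates)
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]


# Hypothetical running success rates for three tasks.
success_rates = {"lift-any": 0.9, "place-on-rack": 0.3, "cover-with-towel": 0.1}
weights = collection_weights(success_rates)
next_task = pick_next_task(success_rates, random.Random(0))
```

With these numbers the towel task receives roughly nine times the collection budget of the generic lifting task, compared with an equal split under uniform sampling.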

This system collected 9600 robot hours of data (from 57 continuous data collection days on seven robots). However, while this data collection strategy was effective at collecting data for a large number of tasks, the success rate and data volume were imbalanced across tasks.

Learning with MT-Opt
We address the data collection imbalance by transferring data across tasks and re-balancing the per-task data. The robots generate episodes that are labelled as success or failure for each task and are then copied and shared across other tasks. The balanced batch of episodes is then sent to our multi-task RL training pipeline to train the MT-Opt policy.

Data sharing and task re-balancing strategy used by MT-Opt. The robots generate episodes which then get labelled as success or failure for the current task and are then shared across other tasks.
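The sharing and re-balancing steps can be sketched as follows. This is the simplest "share everything" variant, written under assumptions: the per-task success detectors are stand-in lambdas, episodes are small dictionaries, and each task contributes an equal number of successes and failures to a batch. None of these details are specified in this post.

```python
import random
from typing import Callable, Dict, List, Tuple

Episode = Dict[str, bool]                 # hypothetical episode summary
Detector = Callable[[Episode], bool]      # hypothetical per-task success detector
Labelled = Tuple[Episode, bool]


def share_across_tasks(episodes: List[Episode],
                       detectors: Dict[str, Detector]) -> Dict[str, List[Labelled]]:
    """Relabel every episode as success or failure for every task."""
    shared: Dict[str, List[Labelled]] = {task: [] for task in detectors}
    for episode in episodes:
        for task, detect in detectors.items():
            shared[task].append((episode, detect(episode)))
    return shared


def balanced_batch(shared: Dict[str, List[Labelled]], per_task: int,
                   rng: random.Random) -> List[Tuple[str, Episode, bool]]:
    """Draw the same number of samples per task, half successes, half failures."""
    batch = []
    for task, labelled in shared.items():
        successes = [item for item in labelled if item[1]]
        failures = [item for item in labelled if not item[1]]
        n_success = per_task // 2
        for pool, count in ((successes, n_success), (failures, per_task - n_success)):
            if pool:  # sample with replacement so rare outcomes are not lost
                batch += [(task, ep, ok) for ep, ok in rng.choices(pool, k=count)]
    return batch


episodes = [
    {"lifted": True, "on_rack": False},
    {"lifted": True, "on_rack": True},
    {"lifted": False, "on_rack": False},
    {"lifted": True, "on_rack": False},
]
detectors = {"lift": lambda ep: ep["lifted"], "place": lambda ep: ep["on_rack"]}
shared = share_across_tasks(episodes, detectors)
batch = balanced_batch(shared, per_task=4, rng=random.Random(0))
```

Even though only one of the four episodes succeeds at placing, the batch contains as many placing successes as lifting successes, which is the point of re-balancing.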

MT-Opt uses Q-learning, a popular RL method that learns a function that estimates the future sum of rewards, called the Q-function. The learned policy then picks the action that maximizes this learned Q-function. For multi-task policy training, we specify the task as an extra input to a large Q-learning network (inspired by our previous work on large-scale single-task learning with QT-Opt) and then train all of the tasks simultaneously with offline RL using the entire multi-task dataset. In this way, MT-Opt is able to train on a wide variety of skills that include picking specific objects, placing them into various fixtures, aligning items on a rack, rearranging and covering objects with towels, etc.
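To illustrate the idea of conditioning a Q-function on the task, here is a tabular stand-in for the large Q-network, trained offline on a fixed dataset. The two toy tasks, the three-cell world, and the hyperparameters are hypothetical; the real system uses a neural network over images and continuous actions.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

ACTIONS = (-1, +1)  # move left or right along a three-cell line
# (task, state, action, reward, next_state, done)
Transition = Tuple[str, int, int, float, int, bool]
QTable = DefaultDict[Tuple[str, int, int], float]


def train_offline(dataset: List[Transition], gamma: float = 0.9,
                  lr: float = 0.5, sweeps: int = 50) -> QTable:
    """Q-learning over a fixed dataset, with the task as an extra Q input."""
    q: QTable = defaultdict(float)
    for _ in range(sweeps):
        for task, s, a, r, s2, done in dataset:
            future = 0.0 if done else gamma * max(q[(task, s2, b)] for b in ACTIONS)
            q[(task, s, a)] += lr * (r + future - q[(task, s, a)])
    return q


def greedy_action(q: QTable, task: str, state: int) -> int:
    """The policy picks the action that maximizes the learned Q-function."""
    return max(ACTIONS, key=lambda a: q[(task, state, a)])


# The same two physical transitions, labelled with rewards for both tasks.
dataset: List[Transition] = [
    ("go_left", 1, -1, 1.0, 0, True),
    ("go_left", 1, +1, 0.0, 2, True),
    ("go_right", 1, -1, 0.0, 0, True),
    ("go_right", 1, +1, 1.0, 2, True),
]
q = train_offline(dataset)
```

Because the task is part of the key, one table (or one network) can hold different values for the same state and action, so `greedy_action(q, "go_left", 1)` and `greedy_action(q, "go_right", 1)` disagree.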

Compared to single-task baselines, MT-Opt performs similarly on the tasks that have the most data and significantly improves performance on underrepresented tasks. So, for a generic lifting task, which has the most supporting data, MT-Opt achieved an 89% success rate (compared to 88% for QT-Opt) and achieved a 50% average success rate across rare tasks, compared to 1% with a single-task QT-Opt baseline and 18% using a naïve, multi-task QT-Opt baseline. Using MT-Opt not only enables zero-shot generalization to new but similar tasks, but also can quickly (in about 1 day of data collection on seven robots) be fine-tuned to new, previously unseen tasks. For example, when applied to an unseen towel-covering task, the system achieved a zero-shot success rate of 92% for towel-picking and 79% for object-covering, which wasn’t present in the original dataset.

Example tasks that MT-Opt is able to learn, such as instance and indiscriminate grasping, chasing, placing, aligning and rearranging.

Towel-covering task that was not present in the original dataset. We fine-tune MT-Opt on this novel task in 1 day to achieve a high (>90%) success rate.

Learning with Actionable Models
While supplying a rigid definition of tasks facilitates autonomous data collection for MT-Opt, it limits the number of learnable behaviors to a fixed set. To enable learning a wider range of tasks from the same data, we use goal-conditioned learning, i.e., learning to reach given goal configurations of a scene in front of the robot, which we specify with goal images. In contrast to explicit model-based methods that learn predictive models of future world observations, or approaches that employ online data collection, this approach learns goal-conditioned policies via offline model-free RL.

To learn to reach any goal state, we perform hindsight relabeling of all trajectories and sub-sequences in our collected dataset and train a goal-conditioned Q-function in a fully offline manner (in contrast to learning online using a fixed set of success examples as in recursive classification). One challenge in this setting is the distributional shift caused by learning only from “positive” hindsight relabeled examples. We address this by employing a conservative strategy to minimize Q-values of unseen actions using artificial negative actions. Furthermore, to enable reaching temporally extended goals, we introduce a technique for chaining goals across multiple episodes.

Actionable Models relabel sub-sequences with all intermediate goals and regularize Q-values with artificial negative actions.
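A minimal sketch of the relabeling step, assuming symbolic states and actions in place of images and robot commands. Every later state in a trajectory is treated as a goal for every earlier step, and each relabeled step is paired with one action that was not taken, whose Q-value target is zero. Sampling the negative action uniformly is a simplification chosen for this example.

```python
import random
from typing import List, NamedTuple, Sequence, Tuple


class GoalTransition(NamedTuple):
    state: str
    action: str
    goal: str
    next_state: str
    reward: float
    done: bool


def hindsight_relabel(states: Sequence[str],
                      actions: Sequence[str]) -> List[GoalTransition]:
    """Relabel all sub-sequences: any later state serves as a reached goal."""
    assert len(states) == len(actions) + 1
    relabeled = []
    for goal_index in range(1, len(states)):
        for t in range(goal_index):
            reached = (t + 1 == goal_index)
            relabeled.append(GoalTransition(
                states[t], actions[t], states[goal_index], states[t + 1],
                1.0 if reached else 0.0, reached))
    return relabeled


def artificial_negatives(transitions: Sequence[GoalTransition],
                         action_space: Sequence[str],
                         rng: random.Random) -> List[Tuple[str, str, str, float]]:
    """Pair each (state, goal) with an unseen action and a Q-value target of 0."""
    negatives = []
    for tr in transitions:
        unseen = [a for a in action_space if a != tr.action]
        if unseen:
            negatives.append((tr.state, rng.choice(unseen), tr.goal, 0.0))
    return negatives


# One hypothetical episode with four states and three actions.
states = ["s0", "s1", "s2", "s3"]
actions = ["push", "lift", "place"]
positives = hindsight_relabel(states, actions)
negatives = artificial_negatives(positives, ["push", "lift", "place", "wait"],
                                 random.Random(0))
```

A single episode of three steps yields six goal-conditioned training examples, three of which end exactly at their goal.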

Training with Actionable Models allows the system to learn a large repertoire of visually indicated skills, such as object grasping, container placing and object rearrangement. The model is also able to generalize to novel objects and visual objectives not seen in the training data, which demonstrates its ability to learn general functional knowledge about the world. We also show that downstream reinforcement learning tasks can be learned more efficiently by either fine-tuning a pre-trained goal-conditioned model or through a goal-reaching auxiliary objective during training.

Example tasks (specified by goal-images) that our Actionable Model is able to learn.

Conclusion
The results of both MT-Opt and Actionable Models indicate that it is possible to collect and then learn many distinct tasks from large diverse real-robot datasets within a single model, effectively amortizing the cost of learning across many skills. We see this as an important step towards general robot learning systems that can be further scaled up to perform many useful services and serve as a starting point for learning downstream tasks.

This post is based on two papers, “MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale” and “Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills,” with additional information and videos on the project websites for MT-Opt and Actionable Models.

Acknowledgements
This research was conducted by Dmitry Kalashnikov, Jake Varley, Yevgen Chebotar, Ben Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, Yao Lu, Alex Irpan, Ben Eysenbach, Ryan Julian and Ted Xiao. We’d like to give special thanks to Josh Weaver, Noah Brown, Khem Holden, Linda Luu and Brandon Kinman for their robot operation support; Anthony Brohan for help with distributed learning and testing infrastructure; Tom Small for help with videos and project media; Julian Ibarz, Kanishka Rao, Vikas Sindhwani and Vincent Vanhoucke for their support; Tuna Toksoz and Garrett Peake for improving the bin reset mechanisms; Satoshi Kataoka, Michael Ahn, and Ken Oslund for help with the underlying control stack, and the rest of the Robotics at Google team for their overall support and encouragement. All the above contributions were incredibly enabling for this research.

Presenting the iGibson Challenge on Interactive and Social Navigation

Posted by Anthony Francis, Software Engineer and Alexander Toshev, Staff Research Scientist, Google Research

Computer vision has significantly advanced over the past decade thanks to large-scale benchmarks, such as ImageNet for image classification or COCO for object detection, which provide vast datasets and criteria for evaluating models. However, these traditional benchmarks evaluate passive tasks in which the emphasis is on perception alone, whereas more recent computer vision research has tackled active tasks, which require both perception and action (often called “embodied AI”).

The First Embodied AI Workshop, co-organized by Google at CVPR 2020, hosted several benchmark challenges for active tasks, including the Stanford and Google organized Sim2Real Challenge with iGibson, which provided a real-world setup to test navigation policies trained in photo-realistic simulation environments. An open-source setup in the challenge enabled the community to train policies in simulation, which could then be run in repeatable real world navigation experiments, enabling the evaluation of the “sim-to-real gap” — the difference between simulation and the real world. Many research teams submitted solutions during the pandemic, which were run safely by challenge organizers on real robots, with winners presenting their results virtually at the workshop.

This year, Stanford and Google are proud to announce a new version of the iGibson Challenge on Interactive and Social Navigation, one of the 10 active visual challenges affiliated with the Second Embodied AI Workshop at CVPR 2021. This year’s Embodied AI Workshop is co-organized by Google and nine other research organizations, and explores issues such as simulation, sim-to-real transfer, visual navigation, semantic mapping and change detection, object rearrangement and restoration, auditory navigation, and following instructions for navigation and interaction tasks. In addition, this year’s interactive and social iGibson challenge explores interactive navigation and social navigation — how robots can learn to interact with people and objects in their environments — by combining the iGibson simulator, the Google Scanned Objects Dataset, and simulated pedestrians within realistic human environments.

New Challenges in Navigation
Active perception tasks are challenging, as they require both perception and actions in response. For example, point navigation involves navigating through mapped space, such as driving robots over kilometers in human-friendly buildings, while recognizing and avoiding obstacles. Similarly, object navigation involves looking for objects in buildings, requiring domain invariant representations and object search behaviors. Additionally, visual language instruction navigation involves navigating through buildings based on visual images and commands in natural language. These problems become even harder in a real-world environment, where robots must be able to handle a variety of physical and social interactions that are much more dynamic and challenging to solve. In this year’s iGibson Challenge, we focus on two of those settings:

  • Interactive Navigation: In a cluttered environment, an agent navigating to a goal must physically interact with objects to succeed. For example, an agent should recognize that a shoe can be pushed aside, but that an end table should not be moved and a sofa cannot be moved.
  • Social Navigation: In a crowded environment in which people are also moving about, an agent navigating to a goal must move politely around the people present with as little disruption as possible.

New Features of the iGibson 2021 Dataset
To facilitate research into techniques that address these problems, the iGibson Challenge 2021 dataset provides simulated interactive scenes for training. The dataset includes eight fully interactive scenes derived from real-world apartments, and another seven scenes held back for testing and evaluation.

iGibson provides eight fully interactive scenes derived from real-world apartments.

To enable interactive navigation, these scenes are populated with small objects drawn from the Google Scanned Objects Dataset, a dataset of common household objects scanned in 3D for use in robot simulation and computer vision research, licensed under a Creative Commons license to give researchers the freedom to use them in their research.

The Google Scanned Objects Dataset contains 3D models of many common objects.

The challenge is implemented in Stanford’s open-source iGibson simulation platform, a fast, interactive, photorealistic robotic simulator with physics based on Bullet. For this year’s challenge, iGibson has been expanded with fully interactive environments and pedestrian behaviors based on the ORCA crowd simulation algorithm.

iGibson environments include ORCA crowd simulations and movable objects.

Participating in the Challenge
The iGibson Challenge has launched and its leaderboard is open in the Dev phase, in which participants are encouraged to submit robotic control policies to the development leaderboard, where they will be tested on the Interactive and Social Navigation challenges on our holdout dataset. The Test phase opens for teams to submit final solutions on May 16th and closes on May 31st, with the winner demo scheduled for June 20th, 2021. For more details on participating, please check out the iGibson Challenge Page.

Acknowledgements
We’d like to thank our colleagues at the Stanford Vision and Learning Lab (SVL) for working with us to advance the state of interactive and social robot navigation, including Chengshu Li, Claudia Pérez D’Arpino, Fei Xia, Jaewoo Jang, Roberto Martin-Martin and Silvio Savarese. At Google, we would like to thank Aleksandra Faust, Anelia Angelova, Carolina Parada, Edward Lee, Jie Tan, Krista Reyman and the rest of our collaborators on mobile robotics. We would also like to thank our co-organizers on the Embodied AI Workshop, including AI2, Facebook, Georgia Tech, Intel, MIT, SFU, Stanford, UC Berkeley, and University of Washington.

Monster Mash: A Sketch-Based Tool for Casual 3D Modeling and Animation

Posted by Cassidy Curtis, Visual Designer and David Salesin, Principal Scientist, Google Research

3D computer animation is a time-consuming and highly technical medium — to complete even a single animated scene requires numerous steps, like modeling, rigging and animating, each of which is itself a sub-discipline that can take years to master. Because of its complexity, 3D animation is generally practiced by teams of skilled specialists and is inaccessible to almost everyone else, despite decades of advances in technology and tools. With the recent development of tools that facilitate game character creation and game balance, a natural question arises: is it possible to democratize the 3D animation process so it’s accessible to everyone?

To explore this concept, we start with the observation that most forms of artistic expression have a casual mode: a classical guitarist might jam without any written music, a trained actor could ad-lib a line or two while rehearsing, and an oil painter can jot down a quick gesture drawing. What these casual modes have in common is that they allow an artist to express a complete thought quickly and intuitively without fear of making a mistake. This turns out to be essential to the creative process — when each sketch is nearly effortless, it is possible to iteratively explore the space of possibilities far more effectively.

In this post, we describe Monster Mash, an open source tool presented at SIGGRAPH Asia 2020 that allows experts and amateurs alike to create rich, expressive, deformable 3D models from scratch — and to animate them — all in a casual mode, without ever having to leave the 2D plane. With Monster Mash, the user sketches out a character, and the software automatically converts it to a soft, deformable 3D model that the user can immediately animate by grabbing parts of it and moving them around in real time. There is also an online demo, where you can try it out for yourself.

Creating a walk cycle using Monster Mash. Step 1: Draw a character. Step 2: Animate it.

Creating a 2D Sketch
The insight that makes this casual sketching approach possible is that many 3D models, particularly those of organic forms, can be described by an ordered set of overlapping 2D regions. This abstraction makes the complex task of 3D modeling much easier: the user creates 2D regions by drawing their outlines, then the algorithm creates a 3D model by stitching the regions together and inflating them. The result is a simple and intuitive user interface for sketching 3D figures.

For example, suppose the user wants to create a 3D model of an elephant. The first step is to draw the body as a closed stroke (a). Then the user adds strokes to depict other body parts such as legs (b). Drawing those additional strokes as open curves provides a hint to the system that they are meant to be smoothly connected with the regions they overlap. The user can also specify that some new parts should go behind the existing ones by drawing them with the right mouse button (c), and mark other parts as symmetrical by double-clicking on them (d). The result is an ordered list of 2D regions.

Steps in creating a 2D sketch of an elephant.
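The ordered list of 2D regions that results from these steps can be pictured as a small data structure. The sketch below is hypothetical: the class and field names, and the convention of storing regions from back to front, are assumptions made for illustration and do not reflect the Monster Mash source code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Region:
    name: str
    outline: List[Point]
    closed: bool             # closed stroke: standalone part; open: merge smoothly
    symmetric: bool = False  # marked by double-clicking in the interface


@dataclass
class Sketch:
    regions: List[Region] = field(default_factory=list)  # ordered back to front

    def add(self, region: Region, behind: bool = False) -> None:
        """Left-button strokes go in front; right-button strokes go behind."""
        if behind:
            self.regions.insert(0, region)
        else:
            self.regions.append(region)


elephant = Sketch()
elephant.add(Region("body", [(0, 0), (4, 0), (4, 2), (0, 2)], closed=True))
elephant.add(Region("near_leg", [(1, 0.5), (1, -1), (2, -1), (2, 0.5)], closed=False))
elephant.add(Region("far_leg", [(2.5, 0.5), (2.5, -1), (3.5, -1), (3.5, 0.5)],
                    closed=False), behind=True)
elephant.regions[1].symmetric = True  # e.g. mirror the near leg
layer_order = [region.name for region in elephant.regions]
```

Keeping the depth order in the list itself means later stages only need to walk the list to know which part should end up in front of which.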

Stitching and Inflation
To understand how a 3D model is created from these 2D regions, let’s look more closely at one part of the elephant. First, the system identifies where the leg must be connected to the body (a) by finding the segment (red) that completes the open curve. The system cuts the body’s front surface along that segment, and then stitches the front of the leg together with the body (b). It then inflates the model into 3D by solving a modified form of Poisson’s equation to produce a surface with a rounded cross-section (c). The resulting model (d) is smooth and well-shaped, but because all of the 3D parts are rooted in the drawing plane, they may intersect each other, resulting in a somewhat odd-looking “elephant”. These intersections will be resolved by the deformation system.

Illustration of the details of the stitching and inflation process. The schematic illustrations (b, c) are cross-sections viewed from the elephant’s front.
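To give a feel for inflation, the sketch below solves a plain one-dimensional Poisson problem across a single cross-section of a region and then takes a square root to round the profile. The post does not spell out the modified equation, so the right-hand side of 2, the zero boundary values, and the square-root step are assumptions chosen because together they produce a semicircular cross-section.

```python
import math
from typing import List


def solve_poisson_1d(width: float, n: int, rhs: float = 2.0) -> List[float]:
    """Solve -u'' = rhs on (0, width) with u = 0 at both ends.

    Uses n interior grid points and the Thomas algorithm for the
    tridiagonal system -u[i-1] + 2*u[i] - u[i+1] = rhs * h**2.
    """
    h = width / (n + 1)
    d = rhs * h * h
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = -1.0 / 2.0
    d_prime[0] = d / 2.0
    for i in range(1, n):
        denom = 2.0 + c_prime[i - 1]          # b - a * c'[i-1] with a = -1, b = 2
        c_prime[i] = -1.0 / denom
        d_prime[i] = (d + d_prime[i - 1]) / denom
    u = [0.0] * n
    u[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        u[i] = d_prime[i] - c_prime[i] * u[i + 1]
    return u


def inflate_cross_section(width: float, n: int) -> List[float]:
    """Heights above the drawing plane; the square root rounds the parabola."""
    return [math.sqrt(max(value, 0.0)) for value in solve_poisson_1d(width, n)]


u = solve_poisson_1d(4.0, 3)
heights = inflate_cross_section(4.0, 3)
```

For a cross-section of width 4 the solution is the parabola x(4 - x), and its square root is a semicircle of radius 2, so the inflated part is as tall as half its width.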

Layered Deformation
At this point we just have a static model — we need to give the user an easy way to pose the model, and also separate the intersecting parts somehow. Monster Mash’s layered deformation system, based on the well-known smooth deformation method as-rigid-as-possible (ARAP), solves both of these problems at once. What’s novel about our layered “ARAP-L” approach is that it combines deformation and other constraints into a single optimization framework, allowing these processes to run in parallel at interactive speed, so that the user can manipulate the model in real time.

The framework incorporates a set of layering and equality constraints, which move body parts along the z axis to prevent them from visibly intersecting each other. These constraints are applied only at the silhouettes of overlapping parts, and are dynamically updated each frame.

In steps (d) through (h) above, ARAP-L transforms a model from one with intersecting 3D parts to one with the depth ordering specified by the user. The layering constraints force the leg’s silhouette to stay in front of the body (green), and the body’s silhouette to stay behind the leg (yellow). Equality constraints (red) seal together the loose boundaries between the leg and the body.
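The effect of the layering and equality constraints on depth can be imitated with a simple iterative projection, shown below. This is not the ARAP-L solver, which folds the constraints into a single optimization together with the deformation; it is a stand-alone sketch in which larger z is assumed to mean closer to the viewer, and the vertex indices and the margin are hypothetical.

```python
from typing import List, Sequence, Tuple


def enforce_depth_constraints(z: Sequence[float],
                              layering: Sequence[Tuple[int, int]],
                              equality: Sequence[Tuple[int, int]],
                              margin: float = 0.1,
                              iterations: int = 20) -> List[float]:
    """Adjust vertex depths until the constraints are (approximately) satisfied.

    layering: (front, back) pairs; the front vertex must end up at least
              `margin` closer to the viewer than the back vertex.
    equality: (i, j) pairs of boundary vertices that must share one depth.
    """
    depth = list(z)
    for _ in range(iterations):
        for front, back in layering:
            violation = margin - (depth[front] - depth[back])
            if violation > 0:
                # Split the correction so both parts move equally.
                depth[front] += violation / 2
                depth[back] -= violation / 2
        for i, j in equality:
            depth[i] = depth[j] = (depth[i] + depth[j]) / 2
    return depth


# Vertices 0 (leg silhouette) and 1 (body) start at the same depth and
# intersect; vertices 2 and 3 lie on the seam between the leg and the body.
depths = enforce_depth_constraints([0.0, 0.0, 0.3, 0.5],
                                   layering=[(0, 1)], equality=[(2, 3)])
```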

Meanwhile, in a separate thread of the framework, we satisfy point constraints to make the model follow user-defined control points (described in the section below) in the xy-plane. This ARAP-L method allows us to combine modeling, rigging, deformation, and animation all into a single process that is much more approachable to the non-specialist user.

The model deforms to match the point constraints (red dots) while the layering constraints prevent the parts from visibly intersecting.

Animation
To pose the model, the user can create control points anywhere on the model’s surface and move them. The deformation system converges over multiple frames, which gives the model’s movement a soft and floppy quality, allowing the user to intuitively grasp its dynamic properties — an essential prerequisite for kinesthetic learning.

Because the effect of deformations converges over multiple frames, our system lends 3D models a soft and dynamic quality.

To create animation, the system records the user’s movements in real time. The user can animate one control point, then play back that movement while recording additional control points. In this way, the user can build up a complex action like a walk by layering animation, one body part at a time. At every stage of the animation process, the only task required of the user is to move points around in 2D, a low-risk workflow meant to encourage experimentation and play.

Conclusion
We believe this new way of creating animation is intuitive and can thus help democratize the field of computer animation, encouraging novices who would normally be unable to try it on their own as well as experts who often require fast iteration under tight deadlines. Here you can see a few of the animated characters that have been created using Monster Mash. Most of these were created in a matter of minutes.

A selection of animated characters created using Monster Mash. The original hand-drawn outline used to create each 3D model is visible as an inset above each character.

All of the code for Monster Mash is available as open source, and you can watch our presentation and read our paper from SIGGRAPH Asia 2020 to learn more. We hope this software will make creating 3D animations more broadly accessible. Try out the online demo and see for yourself!

Acknowledgements
Monster Mash is the result of a collaboration between Google Research, Czech Technical University in Prague, ETH Zürich, and the University of Washington. Key contributors include Marek Dvorožňák, Daniel Sýkora, Cassidy Curtis, Brian Curless, Olga Sorkine-Hornung, and David Salesin. We are also grateful to Hélène Leroux, Neth Nom, David Murphy, Samuel Leather, Pavla Sýkorová, and Jakub Javora for participating in the early interactive sessions.

Announcing the 2021 Research Scholar Program Recipients

Posted by Negar Saei, Program Manager, University Relations

In March 2020 we introduced the Research Scholar Program, an effort focused on developing collaborations with new professors and encouraging the formation of long-term relationships with the academic community. In November we opened the inaugural call for proposals for this program, which was received with enthusiastic interest from faculty who are working on cutting edge research across many research areas in computer science, including machine learning, human computer interaction, health research, systems and more.

Today, we are pleased to announce that in this first year of the program we have granted 77 awards, which included 86 principal investigators representing 15+ countries and over 50 universities. Of the 86 award recipients, 43% identify as members of a historically marginalized group within technology. Please see the full list of 2021 recipients on our web page, as well as in the list below.

We offer our congratulations to this year’s recipients, and look forward to seeing what they achieve!

Algorithms and Optimization
Alexandros Psomas, Purdue University
Auction Theory Beyond Independent, Quasi-Linear Bidders
Julian Shun, Massachusetts Institute of Technology
Scalable Parallel Subgraph Finding and Peeling Algorithms
Mary Wootters, Stanford University
The Role of Redundancy in Algorithm Design
Pravesh K. Kothari, Carnegie Mellon University
Efficient Algorithms for Robust Machine Learning
Sepehr Assadi, Rutgers University
Graph Clustering at Scale via Improved Massively Parallel Algorithms

Augmented Reality and Virtual Reality
Srinath Sridhar, Brown University
Perception and Generation of Interactive Objects

Geo
Miriam E. Marlier, University of California, Los Angeles
Mapping California’s Compound Climate Hazards in Google Earth Engine
Suining He, The University of Connecticut
Fairness-Aware and Cross-Modality Traffic Learning and Predictive Modeling for Urban Smart Mobility Systems

Human Computer Interaction
Arvind Satyanarayan, Massachusetts Institute of Technology
Generating Semantically Rich Natural Language Captions for Data Visualizations to Promote Accessibility
Dina El-Zanfaly, Carnegie Mellon University
In-the-making: An intelligence mediated collaboration system for creative practices
Katharina Reinecke, University of Washington
Providing Science-Backed Answers to Health-related Questions in Google Search
Misha Sra, University of California, Santa Barbara
Hands-free Game Controller for Quadriplegic Individuals
Mohsen Mosleh, University of Exeter Business School
Effective Strategies to Debunk False Claims on Social Media: A large-scale digital field experiments approach
Tanushree Mitra, University of Washington
Supporting Scalable Value-Sensitive Fact-Checking through Human-AI Intelligence

Health Research
Catarina Barata, Instituto Superior Técnico, Universidade de Lisboa
DeepMutation – A CNN Model To Predict Genetic Mutations In Melanoma Patients
Emma Pierson, Cornell Tech, the Jacobs Institute, Technion-Israel Institute of Technology, and Cornell University
Using cell phone mobility data to reduce inequality and improve public health
Jasmine Jones, Berea College
Reachout: Co-Designing Social Connection Technologies for Isolated Young Adults
Mojtaba Golzan, University of Technology Sydney, Jack Phu, University of New South Wales
Autonomous Grading of Dynamic Blood Vessel Markers in the Eye using Deep Learning
Serena Yeung, Stanford University
Artificial Intelligence Analysis of Surgical Technique in the Operating Room

Machine Learning and Data Mining
Aravindan Vijayaraghavan, Northwestern University, Sivaraman Balakrishnan, Carnegie Mellon University
Principled Approaches for Learning with Test-time Robustness
Cho-Jui Hsieh, University of California, Los Angeles
Scalability and Tunability for Neural Network Optimizers
Golnoosh Farnadi, University of Montreal, HEC Montreal/MILA
Addressing Algorithmic Fairness in Decision-focused Deep Learning
Harrie Oosterhuis, Radboud University
Search and Recommendation Systems that Learn from Diverse User Preferences
Jimmy Ba, University of Toronto
Model-based Reinforcement Learning with Causal World Models
Nadav Cohen, Tel-Aviv University
A Dynamical Theory of Deep Learning
Nihar Shah, Carnegie Mellon University
Addressing Unfairness in Distributed Human Decisions
Nima Fazeli, University of Michigan
Semi-Implicit Methods for Deformable Object Manipulation
Qingyao Ai, University of Utah
Metric-agnostic Ranking Optimization
Stefanie Jegelka, Massachusetts Institute of Technology
Generalization of Graph Neural Networks under Distribution Shifts
Virginia Smith, Carnegie Mellon University
A Multi-Task Approach for Trustworthy Federated Learning

Mobile
Aruna Balasubramanian, State University of New York – Stony Brook
AccessWear: Ubiquitous Accessibility using Wearables
Tingjun Chen, Duke University
Machine Learning- and Optical-enabled Mobile Millimeter-Wave Networks

Machine Perception
Amir Patel, University of Cape Town
WildPose: 3D Animal Biomechanics in the Field using Multi-Sensor Data Fusion
Angjoo Kanazawa, University of California, Berkeley
Practical Volumetric Capture of People and Scenes
Emanuele Rodolà, Sapienza University of Rome
Fair Geometry: Toward Algorithmic Debiasing in Geometric Deep Learning
Minchen Wei, The Hong Kong Polytechnic University
Accurate Capture of Perceived Object Colors for Smart Phone Cameras
Mohsen Ali, Information Technology University of the Punjab, Pakistan, Izza Aftab, Information Technology University of the Punjab, Pakistan
Is Economics From Afar Domain Generalizable?
Vineeth N Balasubramanian, Indian Institute of Technology Hyderabad
Bridging Perspectives of Explainability and Adversarial Robustness
Xin Yu, University of Technology Sydney, Linchao Zhu, University of Technology Sydney
Sign Language Translation in the Wild

Networking
Aurojit Panda, New York University
Bertha: Network APIs for the Programmable Network Era
Cristina Klippel Dominicini, Instituto Federal do Espirito Santo
Polynomial Key-based Architecture for Source Routing in Network Fabrics
Noa Zilberman, University of Oxford
Exposing Vulnerabilities in Programmable Network Devices
Rachit Agarwal, Cornell University
Designing Datacenter Transport for Terabit Ethernet

Natural Language Processing
Danqi Chen, Princeton University
Improving Training and Inference Efficiency of NLP Models
Derry Tanti Wijaya, Boston University, Anietie Andy, University of Pennsylvania
Exploring the evolution of racial biases over time through framing analysis
Eunsol Choi, University of Texas at Austin
Answering Information Seeking Questions In The Wild
Kai-Wei Chang, University of California, Los Angeles
Certified Robustness against Language Differences in Cross-Lingual Transfer
Mohohlo Samuel Tsoeu, University of Cape Town
Corpora collection and complete natural language processing of isiXhosa, Sesotho and South African Sign languages
Natalia Diaz Rodriguez, University of Granada (Spain) + ENSTA, Institut Polytechnique Paris, Inria. Lorenzo Baraldi, University of Modena and Reggio Emilia
SignNet: Towards democratizing content accessibility for the deaf by aligning multi-modal sign representations

Other Research Areas
John Dickerson, University of Maryland – College Park, Nicholas Mattei, Tulane University
Fairness and Diversity in Graduate Admissions
Mor Nitzan, Hebrew University
Learning representations of tissue design principles from single-cell data
Nikolai Matni, University of Pennsylvania
Robust Learning for Safe Control

Privacy
Foteini Baldimtsi, George Mason University
Improved Single-Use Anonymous Credentials with Private Metabit
Yu-Xiang Wang, University of California, Santa Barbara
Stronger, Better and More Accessible Differential Privacy with autodp

Quantum Computing
Ashok Ajoy, University of California, Berkeley
Accelerating NMR spectroscopy with a Quantum Computer
John Nichol, University of Rochester
Coherent spin-photon coupling
Jordi Tura i Brugués, Leiden University
RAGECLIQ – Randomness Generation with Certification via Limited Quantum Devices
Nathan Wiebe, University of Toronto
New Frameworks for Quantum Simulation and Machine Learning
Philipp Hauke, University of Trento
ProGauge: Protecting Gauge Symmetry in Quantum Hardware
Shruti Puri, Yale University
Surface Code Co-Design for Practical Fault-Tolerant Quantum Computing

Structured Data, Extraction, Semantic Graph, and Database Management
Abolfazl Asudeh, University of Illinois, Chicago
An end-to-end system for detecting cherry-picked trendlines
Eugene Wu, Columbia University
Interactive training data debugging for ML analytics
Jingbo Shang, University of California, San Diego
Structuring Massive Text Corpora via Extremely Weak Supervision

Security
Chitchanok Chuengsatiansup, The University of Adelaide, Markus Wagner, The University of Adelaide
Automatic Post-Quantum Cryptographic Code Generation and Optimization
Elette Boyle, IDC Herzliya, Israel
Cheaper Private Set Intersection via Advances in “Silent OT”
Joseph Bonneau, New York University
Zeroizing keys in secure messaging implementations
Yu Feng, University of California, Santa Barbara, Yuan Tian, University of Virginia
Exploit Generation Using Reinforcement Learning

Software Engineering and Programming Languages
Kelly Blincoe, University of Auckland
Towards more inclusive software engineering practices to retain women in software engineering
Fredrik Kjolstad, Stanford University
Sparse Tensor Algebra Compilation to Domain-Specific Architectures
Milos Gligoric, University of Texas at Austin
Adaptive Regression Test Selection
Sarah E. Chasins, University of California, Berkeley
If you break it, you fix it: Synthesizing program transformations so that library maintainers can make breaking changes

Systems
Adwait Jog, College of William & Mary
Enabling Efficient Sharing of Emerging GPUs
Heiner Litz, University of California, Santa Cruz
Software Prefetching Irregular Memory Access Patterns
Malte Schwarzkopf, Brown University
Privacy-Compliant Web Services by Construction
Mehdi Saligane, University of Michigan
Autonomous generation of Open Source Analog & Mixed Signal IC
Nathan Beckmann, Carnegie Mellon University
Making Data Access Faster and Cheaper with Smarter Flash Caches
Yanjing Li, University of Chicago
Resilient Accelerators for Deep Learning Training Tasks


How fact checkers and Google.org are fighting misinformation

Misinformation can have dramatic consequences on people’s lives, undermining their ability to find reliable information on everything from elections to vaccinations, and the pandemic has only exacerbated the problem at a time when accurate information can save lives. To help fight the rise in misinformation, Full Fact, a nonprofit that provides tools and resources to fact checkers, turned to Google.org for help. Today, ahead of International Fact Checking Day, we’re sharing the impact of this work.

Every day, millions of claims, like where to vote and COVID-19 vaccination rates, are made across a multitude of platforms and media. It was becoming increasingly difficult for fact checkers to identify the most important claims to investigate.

“We’re not just fighting an epidemic; we’re fighting an infodemic. Fake news spreads faster and more easily than this virus and is just as dangerous.”
Tedros Adhanom, Director General of the World Health Organization

Last year, Google.org provided Full Fact with $2 million and seven Googlers from the Google.org Fellowship, a pro-bono program that matches teams of Googlers with nonprofits for up to six months to work full-time on technical projects. The Fellows helped Full Fact build AI tools that help fact checkers detect claims made by key politicians, then group them by topic and match them with similar claims from across the press, social networks and even radio, using speech-to-text technology. Over the past year, Full Fact boosted the number of claims they could process by 1000x, detecting and clustering over 100,000 claims per day. That’s more than 36.5 million total claims per year!

The AI-powered tools empower fact checkers to be more efficient, so that they can spend more time actually checking and debunking claims rather than identifying which claims to check. Using a BERT-based machine learning model, the technology now works across four languages (English, French, Portuguese and Spanish). And Full Fact’s work has expanded to South Africa, Nigeria and Kenya with their partner Africa Check, and to Argentina with Chequeado. In total in 2020, Full Fact’s fact checks appeared 237 million times across the internet.

Graphic showing the following impact statistics: 1000x increase in detected claims, fact checks appeared 237 million times in search results, the technology works across 4 languages, and 50K claims were detected per day in the UK election.

If you’re interested in learning more about how you can use Google to fact check and spot misinformation, check out some of our tips and tricks. Now more than ever, we need to empower citizens to find reliable, authoritative information, and we’re excited about the impact that Full Fact and its partners have had in making the internet a safer place for everyone.
