Introducing Model Search: An Open Source Platform for Finding Optimal ML Models

Posted by Hanna Mazzawi, Research Engineer and Xavi Gonzalvo, Research Scientist, Google Research

The success of a neural network (NN) often depends on how well it can generalize to various tasks. However, designing NNs that can generalize well is challenging because the research community’s understanding of how a neural network generalizes is currently somewhat limited: What does the appropriate neural network look like for a given problem? How deep should it be? Which types of layers should be used? Would LSTMs be enough or would Transformer layers be better? Or maybe a combination of the two? Would ensembling or distillation boost performance? These tricky questions are made even more challenging when considering machine learning (ML) domains where there may exist better intuition and deeper understanding than others.

In recent years, AutoML algorithms have emerged [e.g., 1, 2, 3] to help researchers find the right neural network automatically without the need for manual experimentation. Techniques like neural architecture search (NAS), use algorithms, like reinforcement learning (RL), evolutionary algorithms, and combinatorial search, to build a neural network out of a given search space. With the proper setup, these techniques have demonstrated they are capable of delivering results that are better than the manually designed counterparts. But more often than not, these algorithms are compute heavy, and need thousands of models to train before converging. Moreover, they explore search spaces that are domain specific and incorporate substantial prior human knowledge that does not transfer well across domains. As an example, in image classification, the traditional NAS searches for two good building blocks (convolutional and downsampling blocks), that it arranges following traditional conventions to create the full network.

To overcome these shortcomings and to extend access to AutoML solutions to the broader research community, we are excited to announce the open source release of Model Search, a platform that helps researchers develop the best ML models, efficiently and automatically. Instead of focusing on a specific domain, Model Search is domain agnostic, flexible and is capable of finding the appropriate architecture that best fits a given dataset and problem, while minimizing coding time, effort and compute resources. It is built on Tensorflow, and can run either on a single machine or in a distributed setting.

Overview
The Model Search system consists of multiple trainers, a search algorithm, a transfer learning algorithm and a database to store the various evaluated models. The system runs both training and evaluation experiments for various ML models (different architectures and training techniques) in an adaptive, yet asynchronous, fashion. While each trainer conducts experiments independently, all trainers share the knowledge gained from their experiments. At the beginning of every cycle, the search algorithm looks up all the completed trials and uses beam search to decide what to try next. It then invokes mutation over one of the best architectures found thus far and assigns the resulting model back to a trainer.

Model Search schematic illustrating the distributed search and ensembling. Each trainer runs independently to train and evaluate a given model. The results are shared with the search algorithm, which it stores. The search algorithm then invokes mutation over one of the best architectures and then sends the new model back to a trainer for the next iteration. S is the set of training and validation examples and A are all the candidates used during training and search.

The system builds a neural network model from a set of predefined blocks, each of which represents a known micro-architecture, like LSTM, ResNet or Transformer layers. By using blocks of pre-existing architectural components, Model Search is able to leverage existing best knowledge from NAS research across domains. This approach is also more efficient, because it explores structures, not their more fundamental and detailed components, therefore reducing the scale of the search space.

Neural network micro architecture blocks that work well, e.g., a ResNet Block.

Because the Model Search framework is built on Tensorflow, blocks can implement any function that takes a tensor as an input. For example, imagine that one wants to introduce a new search space built with a selection of micro architectures. The framework will take the newly defined blocks and incorporate them into the search process so that algorithms can build the best possible neural network from the components provided. The blocks provided can even be fully defined neural networks that are already known to work for the problem of interest. In that case, Model Search can be configured to simply act as a powerful ensembling machine.

The search algorithms implemented in Model Search are adaptive, greedy and incremental, which makes them converge faster than RL algorithms. They do however imitate the “explore & exploit” nature of RL algorithms by separating the search for a good candidate (explore step), and boosting accuracy by ensembling good candidates that were discovered (exploit step). The main search algorithm adaptively modifies one of the top k performing experiments (where k can be specified by the user) after applying random changes to the architecture or the training technique (e.g., making the architecture deeper).

An example of an evolution of a network over many experiments. Each color represents a different type of architecture block. The final network is formed via mutations of high performing candidate networks, in this case adding depth.

To further improve efficiency and accuracy, transfer learning is enabled between various internal experiments. Model Search does this in two ways — via knowledge distillation or weight sharing. Knowledge distillation allows one to improve candidates’ accuracies by adding a loss term that matches the high performing models’ predictions in addition to the ground truth. Weight sharing, on the other hand, bootstraps some of the parameters (after applying mutation) in the network from previously trained candidates by copying suitable weights from previously trained models and randomly initializing the remaining ones. This enables faster training, which allows opportunities to discover more (and better) architectures.

Experimental Results
Model Search improves upon production models with minimal iterations. In a recent paper, we demonstrated the capabilities of Model Search in the speech domain by discovering a model for keyword spotting and language identification. Over fewer than 200 iterations, the resulting model slightly improved upon internal state-of-the-art production models designed by experts in accuracy using ~130K fewer trainable parameters (184K compared to 315K parameters).

Model accuracy given iteration in our system compared to the previous production model for keyword spotting, a similar graph can be found for language identification in the linked paper.

We also applied Model Search to find an architecture suitable for image classification on the heavily explored CIFAR-10 imaging dataset. Using a set known convolution blocks, including convolutions, resnet blocks (i.e., two convolutions and a skip connection), NAS-A cells, fully connected layers, etc., we observed that we were able to quickly reach a benchmark accuracy of 91.83 in 209 trials (i.e., exploring only 209 models). In comparison, previous top performers reached the same threshold accuracy in 5807 trials for the NasNet algorithm (RL), and 1160 for PNAS (RL + Progressive).

Conclusion
We hope the Model Search code will provide researchers with a flexible, domain-agnostic framework for ML model discovery. By building upon previous knowledge for a given domain, we believe that this framework is powerful enough to build models with the state-of-the-art performance on well studied problems when provided with a search space composed of standard building blocks.

Acknowledgements
Special thanks to all code contributors to the open sourcing and the paper: Eugen Ehotaj, Scotty Yak, Malaika Handa, James Preiss, Pai Zhu, Aleks Kracun, Prashant Sridhar, Niranjan Subrahmanya, Ignacio Lopez Moreno, Hyun Jin Park, and Patrick Violette.

Read More

Marian Croak’s vision for responsible AI at Google

Dr. Marian Croak has spent decades working on groundbreaking technology, with over 200 patents in areas such as Voice over IP, which laid the foundation for the calls we all use to get things done and stay in touch during the pandemic. For the past six years she’s been a VP at Google working on everything from site reliability engineering to bringing public Wi-Fi to India’s railroads.

Now, she’s taking on a new project: making sure Google develops artificial intelligence responsibly and that it has a positive impact. To do this, Marian has created and will lead a new center of expertise on responsible AI within Google Research.

I sat down (virtually) with Marian to talk about her new role and her vision for responsible AI at Google. You can watch parts of our conversation in the video above, or read on for a few key points she discussed.

Technology should be designed with people in mind. 

“My graduate studies were in both quantitative analysis and social psychology. I did my dissertation on looking at societal factors that influence inter-group bias as well as altruistic behavior. And so I’ve always approached engineering with that kind of mindset, looking at the impact of what we’re doing on users in general. […] What I believe very, very strongly is that any technology that we’re designing should have a positive impact on society.”

Responsible AI research requires input from many different teams.

“I’m excited to be able to galvanize the brilliant talent that we have at Google working on this. We have to make sure we have the frameworks and the software and the best practices designed by the researchers and the applied engineers […] so we can proudly say that our systems are behaving in responsible ways. The research that’s going on needs to inform that work, the work we’re doing with engineering better solutions, and it needs to be shared with the outside world as well. I am thrilled to support teams doing both pure research as well as applied research — both are valuable and absolutely necessary to ensure technology has a positive impact on the world.’’

This area is new, and there are still growing pains.

“This field, the field of responsible AI and ethics, is new. Most institutions have only developed principles, and they’re very high-level, abstract principles, in the last five years. There’s a lot of dissension, a lot of conflict in terms of trying to standardize on normative definitions of these principles. Whose definition of fairness, or safety, are we going to use? There’s quite a lot of conflict right now within the field, and it can be polarizing at times. And what I’d like to do is have people have the conversation in a more diplomatic way, perhaps, than we’re having it now, so we can truly advance this field.”

Compromise can be tough, but the result is worth it.  

“If you look at the work we did on VoIP, it required such a huge organizational and business shift in the company I was working for. We had to bring teams together that were very contentious — people who had domain expertise in the internet and could move in a fast and furious way, along with others who were very methodical and disciplined in their approach. Huge conflicts! But over time it settled, and we were able to really make a huge difference in terms of being able to scale VoIP in a way that allowed it to handle billions and billions of calls in a very robust and resilient way. So it was more than worth it.”

(Photo credit: Phobymo)

Read More

Mastering Atari with Discrete World Models

Posted by Danijar Hafner, Student Researcher, Google Research

Deep reinforcement learning (RL) enables artificial agents to improve their decisions over time. Traditional model-free approaches learn which of the actions are successful in different situations by interacting with the environment through a large amount of trial and error. In contrast, recent advances in deep RL have enabled model-based approaches to learn accurate world models from image inputs and use them for planning. World models can learn from fewer interactions, facilitate generalization from offline data, enable forward-looking exploration, and allow reusing knowledge across multiple tasks.

Despite their intriguing benefits, existing world models (such as SimPLe) have not been accurate enough to compete with the top model-free approaches on the most competitive reinforcement learning benchmarks — to date, the well-established Atari benchmark requires model-free algorithms, such as DQN, IQN, and Rainbow, to reach human-level performance. As a result, many researchers have focused instead on developing task-specific planning methods, such as VPN and MuZero, which learn by predicting sums of expected task rewards. However, these methods are specific to individual tasks and it is unclear how well they would generalize to new tasks or learn from unsupervised datasets. Similar to the recent breakthrough of unsupervised representation learning in computer vision [1, 2], world models aim to learn patterns in the environment that are more general than any particular task to later solve tasks more efficiently.

Today, in collaboration with DeepMind and the University of Toronto, we introduce DreamerV2, the first RL agent based on a world model to achieve human-level performance on the Atari benchmark. It constitutes the second generation of the Dreamer agent that learns behaviors purely within the latent space of a world model trained from pixels. DreamerV2 relies exclusively on general information from the images and accurately predicts future task rewards even when its representations were not influenced by those rewards. Using a single GPU, DreamerV2 outperforms top model-free algorithms with the same compute and sample budget.

Gamer normalized median score across the 55 Atari games after 200 million steps. DreamerV2 substantially outperforms previous world models. Moreover, it exceeds top model-free agents within the same compute and sample budget.
Behaviors learned by DreamerV2 for some of the 55 Atari games. These videos show images from the environment. Video predictions are shown below in the blog post.

An Abstract Model of the World
Just like its predecessor, DreamerV2 learns a world model and uses it to train actor-critic behaviors purely from predicted trajectories. The world model automatically learns to compute compact representations of its images that discover useful concepts, such as object positions, and learns how these concepts change in response to different actions. This lets the agent generate abstractions of its images that ignore irrelevant details and enables massively parallel predictions on a single GPU. During 200 million environment steps, DreamerV2 predicts 468 billion compact states for learning its behavior.

DreamerV2 builds upon the Recurrent State-Space Model (RSSM) that we introduced for PlaNet and was also used for DreamerV1. During training, an encoder turns each image into a stochastic representation that is incorporated into the recurrent state of the world model. Because the representations are stochastic, they do not have access to perfect information about the images and instead extract only what is necessary to make predictions, making the agent robust to unseen images. From each state, a decoder reconstructs the corresponding image to learn general representations. Moreover, a small reward network is trained to rank outcomes during planning. To enable planning without generating images, a predictor learns to guess the stochastic representations without access to the images from which they were computed.

Learning process of the world model used by DreamerV2. The world model maintains recurrent states (h1–h3) that receive actions (a1–a2) and incorporate information about the images (x1–x3) via stochastic representations (z1–z3). A predictor guesses the representations as (ẑ1–ẑ3) without access to the images from which they were generated.

Importantly, DreamerV2 introduces two new techniques to RSSM that lead to a substantially more accurate world model for learning successful policies. The first technique is to represent each image with multiple categorical variables instead of the Gaussian variables used by PlaNet, DreamerV1, and many more world models in the literature [1, 2, 3, 4, 5]. This leads the world model to reason about the world in terms of discrete concepts and enables more accurate predictions of future representations.

The encoder turns each image into 32 distributions over 32 classes each, the meanings of which are determined automatically as the world model learns. The one-hot vectors sampled from these distributions are concatenated to a sparse representation that is passed on to the recurrent state. To backpropagate through the samples, we use straight-through gradients that are easy to implement using automatic differentiation. Representing images with categorical variables allows the predictor to accurately learn the distribution over the one-hot vectors of the possible next images. In contrast, earlier world models that use Gaussian predictors cannot accurately match the distribution over multiple Gaussian representations for the possible next images.

Multiple categoricals that represent possible next images can be accurately predicted by a categorical predictor, whereas a Gaussian predictor is not flexible enough to accurately predict multiple possible Gaussian representations.

The second new technique of DreamerV2 is KL balancing. Many previous world models use the ELBO objective that encourages accurate reconstructions while keeping the stochastic representations (posteriors) close to their predictions (priors) to regularize the amount of information extracted from each image and facilitate generalization. Because the objective is optimized end-to-end, the stochastic representations and their predictions can be made more similar by bringing either of the two towards the other. However, bringing the representations towards their predictions can be problematic when the predictor is not yet accurate. KL balancing lets the predictions move faster toward the representations than vice versa. This results in more accurate predictions, a key to successful planning.

Long-term video predictions of the world model for holdout sequences. Each model receives 5 frames as input (not shown) and then predicts 45 steps forward given only actions. The video predictions are only used to gain insights into the quality of the world model. During planning, only compact representations are predicted, not images.

Measuring Atari Performance
DreamerV2 is the first world model that enables learning successful behaviors with human-level performance on the well-established and competitive Atari benchmark. We select the 55 games that many previous studies have in common and recommend this set of games for future work. Following the standard evaluation protocol, the agents are allowed 200M environment interactions using an action repeat of 4 and sticky actions (25% chance that an action is ignored and the previous action is repeated instead). We compare to the top model-free agents IQN and Rainbow, as well as to the well-known C51 and DQN agents implemented in the Dopamine framework.

Different standards exist for aggregating the scores across the 55 games. Ideally, a new algorithm would perform better under all conditions. For all four aggregation methods, DreamerV2 indeed outperforms all compared model-free algorithms while using the same computational budget.

DreamerV2 outperforms the top model-free agents according to four methods for aggregating scores across the 55 Atari games. We introduce and recommend the Clipped Record Mean (right-most plot) as an informative and robust performance metric.

The first three aggregation methods were previously proposed in the literature. We identify important drawbacks in each and recommend a new aggregation method, the clipped record mean to overcome their drawbacks.

  • Gamer Median. Most commonly, scores for each game are normalized by the performance of a human gamer that was assessed for the DQN paper and the median of the normalized scores of all games is reported. Unfortunately, the median ignores the scores of many simpler and harder games.
  • Gamer Mean. The mean takes the scores for all games into account but is mainly influenced by a small number of games where the human gamer performed poorly. This makes it easy for an algorithm to achieve large normalized scores on some games (e.g., James Bond, Video Pinball) that then dominate the mean.
  • Record Mean. Prior work recommends normalization based on the human world record instead, but such a metric is still overly influenced by a small number of games where it is easy for the artificial agents to outscore the human record.
  • Clipped Record Mean. We introduce a new metric that normalizes scores by the world record and clips them to not exceed the record. This yields an informative and robust metric that takes the performance on all games into account to an approximately equal amount.

While many current algorithms exceed the human gamer baseline, they are still quite far behind the human world record. As shown in the right-most plot above, DreamerV2 leads by achieving 25% of the human record on average across games. Clipping the scores at the record line lets us focus our efforts on developing methods that come closer to the human world record on all of the games rather than exceeding it on just a few games.

What matters and what doesn’t
To gain insights into the important components of DreamerV2, we conduct an extensive ablation study. Importantly, we find that categorical representations offer a clear advantage over Gaussian representations despite the fact that Gaussians have been used extensively in prior works. KL balancing provides an even more substantial advantage over the KL regularizer used by most generative models.

By preventing the image reconstruction or reward prediction gradients from shaping the model states, we study their importance for learning successful representations. We find that DreamerV2 relies completely on universal information from the high-dimensional input images and its representations enable accurate reward predictions even when they were not trained using information about the reward. This mirrors the success of unsupervised representation learning in the computer vision community.

Atari performance for various ablations of DreamerV2 (Clipped Record Mean). Categorical representations, KL balancing, and learning about the images are crucial for the success of DreamerV2. Using reward information, that is specific to narrow tasks, offers no additional benefits for learning the world model.

Conclusion
We show how to learn a powerful world model to achieve human-level performance on the competitive Atari benchmark and outperform the top model-free agents. This result demonstrates that world models are a powerful approach for achieving high performance on reinforcement learning problems and are ready to use for practitioners and researchers. We see this as an indication that the success of unsupervised representation learning in computer vision [1, 2] is now starting to be realized in reinforcement learning in the form of world models. An unofficial implementation of DreamerV2 is available on Github and provides a productive starting point for future research projects. We see world models that leverage large offline datasets, long-term memory, hierarchical planning, and directed exploration as exciting avenues for future research.

Acknowledgements
This project is a collaboration with Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. We further thank everybody on the Brain Team and beyond who commented on our paper draft and provided feedback at any point throughout the project.

Read More

Rearranging the Visual World

Posted by Andy Zeng and Pete Florence, Research Scientists, Robotics at Google

Rearranging objects (such as organizing books on a bookshelf, moving utensils on a dinner table, or pushing piles of coffee beans) is a fundamental skill that can enable robots to physically interact with our diverse and unstructured world. While easy for people, accomplishing such tasks remains an open research challenge for embodied machine learning (ML) systems, as it requires both high-level and low-level perceptual reasoning. For example, when stacking a pile of books, one might consider where the books should be stacked, and in which order, while ensuring that the edges of the books align with each other to form a neat pile.

Across many application areas in ML, simple differences in model architecture can exhibit vastly different generalization properties. Therefore, one might ask whether there are certain deep network architectures that favor simple underlying elements of the rearrangement problem. Convolutional architectures, for example, are common in computer vision as they encode translational invariance, yielding the same response even if an image is shifted, while Transformer architectures are common in language processing because they exploit self-attention to capture long-range contextual dependencies. In robotics applications, one common architectural element is to use object-centric representations such as poses, keypoints, or object descriptors inside learned models, but these representations require additional training data (often manually annotated) and struggle to describe difficult scenarios such as deformables (e.g., playdough), fluids (honey), or piles of stuff (chopped onions).

Today, we present the Transporter Network, a simple model architecture for learning vision-based rearrangement tasks, which appeared as a publication and plenary talk during CoRL 2020. Transporter Nets use a novel approach to 3D spatial understanding that avoids reliance on object-centric representations, making them general for vision-based manipulation but far more sample efficient than benchmarked end-to-end alternatives. As a consequence, they are fast and practical to train on real robots. We are also releasing an accompanying open-source implementation of Transporter Nets together with Ravens, our new simulated benchmark suite of ten vision-based manipulation tasks.

Transporter Networks: Rearranging the Visual World for Robotic Manipulation
The key idea behind the Transporter Network architecture is that one can formulate the rearrangement problem as learning how to move a chunk of 3D space. Rather than relying on an explicit definition of objects (which is bound to struggle at capturing all edge cases), 3D space is a much broader definition for what could serve as the atomic units being rearranged, and can broadly encompass an object, part of an object, or multiple objects, etc. Transporter Nets leverage this structure by capturing a deep representation of the 3D visual world, then overlaying parts of it on itself to imagine various possible rearrangements of 3D space. It then chooses the rearrangements that best match those it has seen during training (e.g., from expert demonstrations), and uses them to parameterize robot actions. This formulation allows Transporter Nets to generalize to unseen objects and enables them to better exploit geometric symmetries in the data, so that they can extrapolate to new scene configurations. Transporter Nets are applicable to a wide variety of rearrangement tasks for robotic manipulation, expanding beyond our earlier models, such as affordance-based manipulation and TossingBot, that focus only on grasping and tossing.

Transporter Nets capture a deep representation of the visual world, then overlay parts of it on itself to imagine various possible rearrangements of 3D space to find the best one and inform robot actions.

Ravens Benchmark
To evaluate the performance of Transporter Nets in a consistent environment for fair comparisons to baselines and ablations, we developed Ravens, a benchmark suite of ten simulated vision-based rearrangement tasks. Ravens features a Gym API with a built-in stochastic oracle to evaluate the sample efficiency of imitation learning methods. Ravens avoids assumptions that cannot transfer to a real setup: observation data contains only RGB-D images and camera parameters; actions are end effector poses (transposed into joint positions with inverse kinematics).

Experiments on these ten tasks show that Transporter Nets are orders of magnitude more sample efficient than other end-to-end methods, and are capable of achieving over 90% success on many tasks with just 100 demonstrations, while the baselines struggle to generalize with the same amount of data. In practice, this makes collecting enough demonstrations a more viable option for training these models on real robots (which we show examples of below).

Our new Ravens benchmark includes ten simulated vision-based manipulation tasks, including pushing and pick-and-place, for which experiments show that Transporter Nets are orders of magnitude more sample efficient than other end-to-end methods. Ravens features a Gym API with a built-in stochastic oracle to evaluate the sample efficiency of imitation learning methods.

Our new Ravens benchmark includes ten simulated vision-based manipulation tasks, including pushing and pick-and-place, for which experiments show that Transporter Nets are orders of magnitude more sample efficient than other end-to-end methods. Ravens features a Gym API with a built-in stochastic oracle to evaluate the sample efficiency of imitation learning methods.

Highlights
Given 10 example demonstrations, Transporter Nets can learn pick and place tasks such as stacking plates (surprisingly easy to misplace!), multimodal tasks like aligning any corner of a box to a marker on the tabletop, or building a pyramid of blocks.

By leveraging closed-loop visual feedback, Transporter Nets have the capacity to learn various multi-step sequential tasks with a modest number of demonstrations: such as moving disks for Tower of Hanoi, palletizing boxes, or assembling kits of new objects not seen during training. These tasks have considerably “long horizons”, meaning that to solve the task the model must correctly sequence many individual choices. Policies also tend to learn emergent recovery behaviors.

One surprising thing about these results was that beyond just perception, the models were starting to learn behaviors that resemble high-level planning. For example, to solve Towers of Hanoi, the models have to pick which disk to move next, which requires recognizing the state of the board based on the current visible disks and their positions. With a box-palletizing task, the models must locate the empty spaces of the pallet, and identify how new boxes can fit into those voids. Such behaviors are exciting because they suggest that with all the baked-in invariances, the model can focus its capacity on learning the more high-level patterns in manipulation.

Transporter Nets can also learn tasks that use any motion primitive defined by two end effector poses, such as pushing piles of small objects into a target set, or reconfiguring a deformable rope to connect the two end-points of a 3-sided square. This suggests that rigid spatial displacements can serve as useful priors for nonrigid ones.

Conclusion
Transporter Nets bring a promising approach to learning vision-based manipulation, but are not without limitations. For example, they can be susceptible to noisy 3D data, we have only demonstrated them for sparse waypoint-based control with motion primitives, and it remains unclear how to extend them beyond spatial action spaces to force or torque-based actions. But overall, we are excited about this direction of work, and we hope that it provides inspiration for extensions beyond the applications we’ve discussed. For more details, please check out our paper.

Acknowledgements
This research was done by Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Vikas Sindhwani, and Johnny Lee, with special thanks to Ken Goldberg, Razvan Surdulescu, Daniel Seita, Ayzaan Wahid, Vincent Vanhoucke, Anelia Angelova, Kendra Byrne, for helpful feedback on writing; Sean Snyder, Jonathan Vela, Larry Bisares, Michael Villanueva, Brandon Hurd for operations and hardware support; Robert Baruch for software infrastructure, Jared Braun for UI contributions; Erwin Coumans for PyBullet advice; Laura Graesser for video narration.

Read More

3D Scene Understanding with TensorFlow 3D

Posted by Alireza Fathi, Research Scientist and Rui Huang, AI Resident, Google Research

The growing ubiquity of 3D sensors (e.g., Lidar, depth sensing cameras and radar) over the last few years has created a need for scene understanding technology that can process the data these devices capture. Such technology can enable machine learning (ML) systems that use these sensors, like autonomous cars and robots, to navigate and operate in the real world, and can create an improved augmented reality experience on mobile devices. The field of computer vision has recently begun making good progress in 3D scene understanding, including models for mobile 3D object detection, transparent object detection, and more, but entry to the field can be challenging due to the limited availability tools and resources that can be applied to 3D data.

In order to further improve 3D scene understanding and reduce barriers to entry for interested researchers, we are releasing TensorFlow 3D (TF 3D), a highly modular and efficient library that is designed to bring 3D deep learning capabilities into TensorFlow. TF 3D provides a set of popular operations, loss functions, data processing tools, models and metrics that enables the broader research community to develop, train and deploy state-of-the-art 3D scene understanding models.

TF 3D contains training and evaluation pipelines for state-of-the-art 3D semantic segmentation, 3D object detection and 3D instance segmentation, with support for distributed training. It also enables other potential applications like 3D object shape prediction, point cloud registration and point cloud densification. In addition, it offers a unified dataset specification and configuration for training and evaluation of the standard 3D scene understanding datasets. It currently supports the Waymo Open, ScanNet, and Rio datasets. However, users can freely convert other popular datasets, such as NuScenes and Kitti, into a similar format and use them in the pre-existing or custom created pipelines, and can leverage TF 3D for a wide variety of 3D deep learning research and applications, from quickly prototyping and trying new ideas to deploying a real-time inference system.

An example output of the 3D object detection model in TF 3D on a frame from Waymo Open Dataset is shown on the left. An example output of the 3D instance segmentation model on a scene from ScanNet dataset is shown on the right.

Here, we will present the efficient and configurable sparse convolutional backbone that is provided in TF 3D, which is the key to achieving state-of-the-art results on various 3D scene understanding tasks. Furthermore, we will go over each of the three pipelines that TF 3D currently supports: 3D semantic segmentation, 3D object detection and 3D instance segmentation.

3D Sparse Convolutional Network
The 3D data captured by sensors often consists of a scene that contains a set of objects of interest (e.g. cars, pedestrians, etc.) surrounded mostly by open space, which is of limited (or no) interest. As such, 3D data is inherently sparse. In such an environment, standard implementation of convolutions would be computationally intensive and consume a large amount of memory. So, in TF 3D we use submanifold sparse convolution and pooling operations, which are designed to process 3D sparse data more efficiently. Sparse convolutional models are core to the state-of-the-art methods applied in most outdoor self-driving (e.g. Waymo, NuScenes) and indoor benchmarks (e.g. ScanNet).

We also use various CUDA techniques to speed up the computation (e.g., hashing, partitioning / caching the filter in shared memory, and using bit operations). Experiments on the Waymo Open dataset shows that this implementation is around 20x faster than a well-designed implementation with pre-existing TensorFlow operations.

TF 3D then uses the 3D submanifold sparse U-Net architecture to extract a feature for each voxel. The U-Net architecture has proven to be effective by letting the network extract both coarse and fine features and combining them to make the predictions. The U-Net network consists of three modules, an encoder, a bottleneck, and a decoder, each of which consists of a number of sparse convolution blocks with possible pooling or un-pooling operations.

A 3D sparse voxel U-Net architecture. Note that a horizontal arrow takes in the voxel features and applies a submanifold sparse convolution to it. An arrow that is moving down performs a submanifold sparse pooling. An arrow that is moving up will gather back the pooled features, concatenate them with the features coming from the horizontal arrow, and perform a submanifold sparse convolution on the concatenated features.

The sparse convolutional network described above is the backbone for the 3D scene understanding pipelines that are offered in TF 3D. Each of the models described below uses this backbone network to extract features for the sparse voxels, and then adds one or multiple additional prediction heads to infer the task of interest. The user can configure the U-Net network by changing the number of encoder / decoder layers and the number of convolutions in each layer, and by modifying the convolution filter sizes, which enables a wide range of speed / accuracy tradeoffs to be explored through the different backbone configurations

3D Semantic Segmentation
The 3D semantic segmentation model has only one output head for predicting the per-voxel semantic scores, which are mapped back to points to predict a semantic label per point.

3D semantic segmentation of an indoor scene from ScanNet dataset.

3D Instance Segmentation
In 3D instance segmentation, in addition to predicting semantics, the goal is to group the voxels that belong to the same object together. The 3D instance segmentation algorithm used in TF 3D is based on our previous work on 2D image segmentation using deep metric learning. The model predicts a per-voxel instance embedding vector as well as a semantic score for each voxel. The instance embedding vectors map the voxels to an embedding space where voxels that correspond to the same object instance are close together, while those that correspond to different objects are far apart. In this case, the input is a point cloud instead of an image, and it uses a 3D sparse network instead of a 2D image network. At inference time, a greedy algorithm picks one instance seed at a time, and uses the distance between the voxel embeddings to group them into segments.

3D Object Detection
The 3D object detection model predicts per-voxel size, center, and rotation matrices and the object semantic scores. At inference time, a box proposal mechanism is used to reduce the hundreds of thousands of per-voxel box predictions into a few accurate box proposals, and then at training time, box prediction and classification losses are applied to per-voxel predictions. We apply a Huber loss on the distance between predicted and the ground-truth box corners. Since the function that estimates the box corners from its size, center and rotation matrix is differentiable, the loss will automatically propagate back to those predicted object properties. We use a dynamic box classification loss that classifies a box that strongly overlaps with the ground-truth as positive and classifies the non-overlapping boxes as negative.

Our 3D object detection results on ScanNet dataset.

In our recent paper, “DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes”, we describe in detail the single-stage weakly supervised learning algorithm used for object detection in TF 3D. In addition, in a follow up work, we extended the 3D object detection model to leverage temporal information by proposing a sparse LSTM-based multi-frame model. We go on to show that this temporal model outperforms the frame-by-frame approach by 7.5% in the Waymo Open dataset.

The 3D object detection and shape prediction model introduced in the DOPS paper. A 3D sparse U-Net is used to extract a feature vector for each voxel. The object detection module uses these features to propose 3D boxes and semantic scores. At the same time, the other branch of the network predicts a shape embedding that is used to output a mesh for each object.

Ready to Get Started?
We’ve certainly found this codebase to be useful for our 3D computer vision projects, and we hope that you will as well. Contributions to the codebase are welcome and please stay tuned for our own further updates to the framework. To get started please visit our github repository.

Acknowledgements
The release of the TensorFlow 3D codebase and model has been the result of widespread collaboration among Google researchers with feedback and testing from product groups. In particular we want to highlight the core contributions by Alireza Fathi and Rui Huang (work performed while at Google), with special additional thanks to Guangda Lai, Abhijit Kundu, Pei Sun, Thomas Funkhouser, David Ross, Caroline Pantofaru, Johanna Wald, Angela Dai and Matthias Niessner.

Read More

Uncovering Unknown Unknowns in Machine Learning

Posted by Lora Aroyo and Praveen Paritosh, Research Scientists, Google Research

The performance of machine learning (ML) models depends both on the learning algorithms, as well as the data used for training and evaluation. The role of the algorithms is well studied and the focus of a multitude of challenges, such as SQuAD, GLUE, ImageNet, and many others. In addition, there have been efforts to also improve the data, including a series of workshops addressing issues for ML evaluation. In contrast, research and challenges that focus on the data used for evaluation of ML models are not commonplace. Furthermore, many evaluation datasets contain items that are easy to evaluate, e.g., photos with a subject that is easy to identify, and thus they miss the natural ambiguity of real world context. The absence of ambiguous real-world examples in evaluation undermines the ability to reliably test machine learning performance, which makes ML models prone to develop “weak spots”, i.e., classes of examples that are difficult or impossible for a model to accurately evaluate, because that class of examples is missing from the evaluation set.

To address the problem of identifying these weaknesses in ML models, we recently launched the Crowdsourcing Adverse Test Sets for Machine Learning (CATS4ML) Data Challenge at HCOMP 2020 (open until 30 April, 2021 to researchers and developers worldwide). The goal of the challenge is to raise the bar in ML evaluation sets and to find as many examples as possible that are confusing or otherwise problematic for algorithms to process. CATS4ML relies on people’s abilities and intuition to spot new data examples about which machine learning is confident, but actually misclassifies.

What are ML “Weak Spots”?
There are two categories of weak spots: known unknowns and unknown unknowns. Known unknowns are examples for which a model is unsure about the correct classification. The research community continues to study this in a field known as active learning, and has found the solution to be, in very general terms, to interactively solicit new labels from people on uncertain examples. For example, if a model is not certain whether or not the subject of a photo is a cat, a person is asked to verify; but if the system is certain, a person is not asked. While there is room for improvement in this area, what is comforting is that the confidence of the model is correlated with its performance, i.e., one can see what the model doesn’t know.

Unknown unknowns, on the other hand, are examples where a model is confident about its answer, but is actually wrong. Efforts to proactively discover unknown unknowns (e.g., Attenberg 2015 and Crawford 2019) have helped uncover a multitude of unintended machine behaviours. In contrast to such approaches for the discovery of unknown unknowns, generative adversarial networks (GANs) generate unknown unknowns for image recognition models in the form of optical illusions for computers that cause deep learning models to make mistakes beyond human perception. While GANs uncover model exploits in the event of an intentional manipulation, real-world examples can better highlight a model’s failures in its day-to-day performance. These real-world examples are the unknown unknowns of interest to CATS4ML — the challenge aims to gather unmanipulated examples that humans can reliably interpret but on which many ML models would confidently disagree.

Example illustrating how optical illusions for computers caused by adversarial noise help discover machine manipulated unknown unknowns for ML models (based on Brown 2018).

First Edition of CATS4ML Data Challenge: Open Images Dataset
The CATS4ML Data Challenge focuses on visual recognition, using images and labels from the Open Images Dataset. The target images for the challenge are selected from the Open Images Dataset along with a set of 24 target labels from the same dataset. The challenge participants are invited to invent new and creative ways to explore this existing publicly available dataset and, focussed on a list of pre-selected target labels, discover examples of unknown unknowns for ML models.

Examples from the Open Images Dataset as possible unknown unknowns for ML models.

CATS4ML is a complementary effort to FAIR’s recently introduced DynaBench research platform for dynamic data collection. Where DynaBench tackles issues with static benchmarks using ML models with humans in the loop, CATS4ML focuses on improving evaluation datasets for ML by encouraging the exploration of existing ML benchmarks for adverse examples that can be unknown unknowns. The results will help detect and avoid future errors, and also will give insights to model explainability.

In this way, CATS4ML aims to raise greater awareness of the problem by providing dataset resources that developers can use to uncover the weak spots of their algorithms. This will also inform researchers on how to create benchmark datasets for machine learning that are more balanced, diverse and socially aware.

Get Involved
We invite the global community of ML researchers and practitioners to join us in the effort of discovering interesting, difficult examples from the Open Images Dataset. Register on the challenge website, download the target images and labeled data, contribute the images you discover and join the competition for the winning participant!

To score points in this competition, participants should submit a set of image-label pairs that will be confirmed by human-in-the-loop raters, whose votes should be in disagreement with the average machine score for the label over a number of machine learning models.

An example of how a submitted image can score points. The same image can score as a false positive (Left) and as a false negative (Right) with two different labels. In both cases the human verification is in disagreement with the machine score. Participants score on submitted image-label pairs, which means that one and the same image can be an example of an ML unknown unknown for different labels.

The challenge is open until 30 April, 2021 to researchers and developers worldwide. To learn more about CATS4ML and how to join, please review these slides and visit the challenge website.

Acknowledgements
The release of the CATS4ML Data Challenge has been possible thanks to the hard work of a lot of people including, but not limited to, the following (in alphabetical order of last name): Osman Aka, Ken Burke, Tulsee Doshi, Mig Gerard, Victor Gomes, Shahab Kamali, Igor Karpov, Devi Krishna, Daphne Luong, Carey Radebaugh, Jamie Taylor, Nithum Thain, Kenny Wibowo, Ka Wong, and Tong Zhou.

Read More

TracIn — A Simple Method to Estimate Training Data Influence

Posted by Frederick Liu and Garima Pruthi, Software Engineers, Google Research

The quality of a machine learning (ML) model’s training data can have a significant impact on its performance. One measure of data quality is the notion of influence, i.e., the degree to which a given training example affects the model and its predictive performance. And while influence is a well-known concept to ML researchers, the complexity behind deep learning models, coupled with their growing size, features and datasets, have made the quantification of influence difficult.

A few methods have been proposed recently to quantify influence. Some rely on changes in accuracy when retraining with one or several data points dropped, and some use established statistical methods, e.g., influence functions that estimate the impact of perturbing input points or representer methods that decompose a prediction into an importance weighted combination of training examples. Still other approaches require use of additional estimators, such as data valuation using reinforcement learning. Though these approaches are theoretically sound, their use in products has been limited by the resources needed to run them at scale or the additional burdens they place on training.

In “Estimating Training Data Influence by Tracing Gradient Descent”, published as a spotlight paper at NeurIPS 2020, we proposed TracIn, a simple scalable approach to tackle this challenge. The idea behind TracIn is straightforward — trace the training process to capture changes in prediction as individual training examples are visited. TracIn is effective in finding mislabeled examples and outliers from a variety of datasets, and is useful in explaining predictions in terms of training examples (as opposed to features) by assigning an influence score to each training example.

The Ideas Underlying TracIn
Deep learning algorithms are typically trained using an algorithm called stochastic gradient descent (SGD), or a variant of it. SGD operates by making multiple passes over the data and making modifications to the model parameters that locally reduce the loss (i.e., the model’s objective) with each pass. An example of this is demonstrated for an image classification task in the figure below, where the model’s task is to predict the subject of the test image on the left (“zucchini”). As the model progresses through training, it is exposed to various training examples that affect the loss on the test image, where the loss is a function both of the prediction score and the actual label — the higher the prediction score for zucchini, the lower the loss.

Estimating training data influence of the images on the right by tracing the loss change of the zucchini in the seatbelt image during training.

Suppose that the test example is known at training time and that the training process visited each training example one at a time. During the training, visiting a specific training example would change the model’s parameters, and that change would then modify the prediction/loss on the test example. If one could trace the training example through the process, then the change in loss or prediction on the test example could be attributed to the training example in question, where the influence of a training example would be the cumulative attribution across visits to the training example.

There are two types of relevant training examples. Those that reduce loss, like the images of zucchinis above, are called proponents, while those that increase loss, like the images of seatbelts, are called opponents. In the example above, the image labeled “sunglasses” is also a proponent, because it has a seatbelt in the image, but is labeled as “sunglasses,” driving the model to better distinguish between zucchini and seatbelts.

In practice, the test example is unknown at training time, a limitation that can be overcome by using the checkpoints output by the learning algorithm as a sketch of the training process. Another challenge is that the learning algorithm typically visits several points at once, not individually, which requires a method to disentangle the relative contributions of each training example. This can be done by applying pointwise loss gradients. Together, these two strategies capture the TracIn method, which can be reduced to the simple form of the dot product of loss gradients of the test and training examples, weighted by the learning rate, and summed across checkpoints.

The simple expression for TracIn influence. The dot product of loss gradients of training example (z) and test example (z’) is weighted by learning rate (ηi) at different checkpoints and summed up.

Alternatively, one could instead examine the influence on the prediction score, which would be useful if the test example has no label. This form simply requires the substitution of the loss gradient at the test example with the prediction gradient.

Computing Top Influence Examples
We illustrate the utility of TracIn by first calculating the loss gradient vector for some training data and a test example for a specific classification — an image of a chameleon — and then leveraging a standard k-nearest neighbors library to retrieve the top proponents and opponents. The top opponents indicate the chameleon’s ability to blend in! For comparison, we also show the k nearest neighbors with embeddings from the penultimate layer. Proponents are images that are not only similar, but also belong to the same class, and opponents are similar images but in a different class. Note that there isn’t an explicit enforcement on whether proponents or opponents belong to the same class.

Top row: Top proponents and opponents of influence vectors. Bottom row: Most similar and dissimilar examples of embedding vectors from the penultimate layer.

Clustering
The simplistic breakdown of the loss of the test example into training example influences given by TracIn also suggests that the loss (or prediction) from any gradient descent based neural model can be expressed as a sum of similarities in the space of gradients. Recent work has demonstrated that this functional form is similar to that of a kernel, implying that this gradient similarity described here can be applied to other similarity tasks, like clustering.

In this case, TracIn can be used as a similarity function within a clustering algorithm. To bound the similarity metric so that it can be converted to a distance measure (1 – similarity), we normalize the gradient vectors to have unit norm. Below, we apply TracIn clustering on images of zucchini to obtain finer clusters.

Finer clusters within Zucchini images using TracIn similarity. Each row is a cluster with zucchini in similar forms within the cluster: cross-sectionally sliced zucchini (top), piles of zucchinis (middle), and zucchinis on pizzas (bottom).

Identifying Outliers with Self-Influence
Finally, we can also use TracIn to identify outliers that exhibit a high self-influence, i.e., the influence of a training point on its own prediction. This happens either when the example is mislabeled or rare, both of which make it difficult for the model to generalize over the example. Below are some examples with high self-influence.

Mislabeled examples. Assigned labels are striked out, correct labels are at bottom.
Left: A rare oscilloscope example with just the oscillations, and no instrument in the image gets high self-influence. Right: Other common oscilloscope images have the scope with knobs and wires. These have a low self-influence.

Applications
Having no requirement other than being trained using SGD (or related variants), TracIn is task-independent and applicable to a variety of models. For example, we have used TracIn to study training data for a deep learning model used to parse queries to the Google Assistant, queries of the kind “set my alarm for 7AM”. We were intrigued to see that the top opponent for the query “disable my alarm” with an alarm active on the device, was “disable my timer”, also with an alarm active on the device. This suggests that Assistant users often interchange the words “timer” and “alarm”. TracIn helped us interpret the Assistant data.

More examples can be found in the paper, including a regression task on structured data and a number of text classification tasks.

Conclusion
TracIn is a simple, easy-to-implement, scalable way to compute the influence of training data examples on individual predictions or to find rare and mislabeled training examples. For implementation references of the method, you can find a link to code examples for images from the github linked in the paper.

Acknowledgements
The NeurIPS paper was jointly co-authored with Satyen Kale and Mukund Sundararajan (corresponding author). A special thanks to Binbin Xiong for providing various conceptual and implementation insights. We also thank Qiqi Yan and Salem Haykal for numerous discussions. Images throughout this post sourced from Getty Images.

Read More

Machine Learning for Computer Architecture

Posted by Amir Yazdanbakhsh, Research Scientist, Google Research

One of the key contributors to recent machine learning (ML) advancements is the development of custom accelerators, such as Google TPUs and Edge TPUs, which significantly increase available compute power unlocking various capabilities such as AlphaGo, RankBrain, WaveNets, and Conversational Agents. This increase can lead to improved performance in neural network training and inference, enabling new possibilities in a broad range of applications, such as vision, language, understanding, and self-driving cars.

To sustain these advances, the hardware accelerator ecosystem must continue to innovate in architecture design and acclimate to rapidly evolving ML models and applications. This requires the evaluation of many different accelerator design points, each of which may not only improve the compute power, but also unravel a new capability. These design points are generally parameterized by a variety of hardware and software factors (e.g., memory capacity, number of compute units at different levels, parallelism, interconnection networks, pipelining, software mapping, etc.). This is a daunting optimization task, due to the fact that the search space is exponentially large1 while the objective function (e.g., lower latency and/or higher energy efficiency) is computationally expensive to evaluate through simulations or synthesis, making identification of feasible accelerator configurations challenging .

In “Apollo: Transferable Architecture Exploration”, we present the progress of our research on ML-driven design of custom accelerators. While recent work has demonstrated promising results in leveraging ML to improve the low-level floorplanning process (in which the hardware components are spatially laid out and connected in silicon), in this work we focus on blending ML into the high-level system specification and architectural design stage, a pivotal contributing factor to the overall performance of the chip in which the design elements that control the high-level functionality are established. Our research shows how ML algorithms can facilitate architecture exploration and suggest high-performing architectures across a range of deep neural networks, with domains spanning image classification, object detection, OCR and semantic segmentation.

Architecture Search Space and Workloads
The objective in architecture exploration is to discover a set of feasible accelerator parameters for a set of workloads, such that a desired objective function (e.g., the weighted average of runtime) is minimized under an optional set of user-defined constraints. However, the manifold of architecture search generally contains many points for which there is no feasible mapping from software to hardware. Some of these design points are known a priori and can be bypassed by formulating them as optimization constraints by the user (e.g., in the case of an area budget2 constraint, the total memory size must not pass over a predefined limit). However, due to the interplay of the architecture and compiler and the complexity of the search space, some of the constraints may not be properly formulated into the optimization, and so the compiler may not find a feasible software mapping for the target hardware. These infeasible points are not easy to formulate in the optimization problem, and are generally unknown until the whole compiler pass is performed. As such, one of main challenges for architecture exploration is to effectively sidestep the infeasible points for efficient exploration of the search space with a minimum number of cycle-accurate architecture simulations.

The following figure shows the overall architecture search space of a target ML accelerator. The accelerator contains a 2D array of processing elements (PE), each of which performs a set of arithmetic computations in a single instruction multiple data (SIMD) manner. The main architectural components of each PE are processing cores that include multiple compute lanes for SIMD operations. Each PE has shared memory (PE Memory) across all their compute cores, which is mainly used to store model activations, partial results, and outputs, while individual cores feature memory that is mainly used for storing model parameters. Each core has multiple compute lanes with multi-way multiply-accumulate (MAC) units. The results of model computations at each cycle are either stored back in the PE memory for further computation or are offloaded back into the DRAM.

Overview of the template-based ML accelerator used for architecture exploration.

Optimization Strategies
In this study, we explored four optimization strategies in the context of architecture exploration:

  1. Random: Samples the architecture search space uniformly at random.
  2. Vizier: Uses Bayesian optimization for the exploration in the search space in which the evaluation of the objective function is expensive (e.g. hardware simulation which can take hours to complete). Using a collection of sampled points from the search space, the Bayesian optimization forms a surrogate function, usually represented by a Gaussian process, that approximates the manifold of the search space. Guided by the value of the surrogate function, the Bayesian optimization algorithm decides, in an exploration and exploitation trade-off, whether to sample more from the promising regions in the manifold (exploitation) or sample more from the unseen regions in the search space (exploration). Then, the optimization algorithm uses these newly sampled points and further updates the surrogate function to better model the target search space. Vizier uses expected improvement as its core acquisition function.
  3. Evolutionary: Performs evolutionary search using a population of k individuals, where the genome of each individual corresponds to a sequence of discretized accelerator configurations. New individuals are generated by selecting for each individual two parents from the population using tournament selecting, recombining their genomes with some crossover rate, and mutating the recombined genome with some probability.
  4. Population-based black-box optimization (P3BO): Uses an ensemble of optimization methods, including evolutionary and model-based, which has been shown to increase sample-efficiency and robustness. The sampled data are exchanged between optimization methods in the ensemble, and optimizers are weighted by their performance history to generate new configurations. In our study, we use a variant of P3BO in which the hyper-parameters of the optimizers are dynamically updated using evolutionary search.

Accelerator Search Space Embeddings
To better visualize the effectiveness of each optimization strategy in navigating the accelerator search space, we use t-distributed stochastic neighbor embedding (t-SNE) to map the explored configurations into a two-dimensional space across the optimization horizon. The objective (reward) for all the experiments is defined as the throughput (inference/second) per accelerator area. In the figures below, the x and y axes indicate the t-SNE components (embedding 1 and embedding 2) of the embedding space. The star and circular markers show the infeasible (zero reward) and feasible design points, respectively, with the size of the feasible points corresponding to their reward.

As expected, the random strategy searches the space in a uniformly distributed way and eventually finds very few feasible points in the design space.

Visualization presenting the t-SNE of the explored design points (~4K) by random optimization strategy (max reward = 0.96). The maximum reward points (red cross markers) are highlighted at the last frame of the animation.

Compared to the random sampling approach, the Vizier default optimization strategy strikes a good balance between exploring the search space and finding the design points with higher rewards (1.14 vs. 0.96). However, this approach tends to get stuck in infeasible regions and, while it does find a few points with the maximum reward (indicated by the red cross markers), it finds few feasible points during the last iterations of exploration.

As above, with the Vizier (default) optimization strategy (max reward = 1.14). The maximum reward points (red cross markers) are highlighted at the last frame of the animation.

The evolutionary optimization strategy, on the other hand, finds feasible solutions very early in the optimization and assemble clusters of feasible points around them. As such, this approach mostly navigates the feasible regions (the green circles) and efficiently sidesteps the infeasible points. In addition, the evolutionary search is able to find more design options with maximum reward (the red crosses). This diversity in the solutions with high reward provides flexibility to the designer in exploring various architectures with different design trade-offs.

As above, with the evolutionary optimization strategy (max reward = 1.10). The maximum reward points (red cross markers) are highlighted at the last frame of the animation.

Finally, the population-based optimization method (P3BO) explores the design space in a more targeted way (regions with high reward points) in order to find optimal solutions. The P3BO strategy finds design points with the highest reward in search spaces with tighter constraints (e.g., a larger number of infeasible points), showing its effectiveness in navigating search spaces with large numbers of infeasible points.

As above, with the P3BO optimization strategy (max reward = 1.13). The maximum reward points (red cross markers) are highlighted at the last frame of the animation.

Architecture Exploration under Different Design Constraints
We also studied the benefits of each optimization strategy under different area budget constraints, 6.8 mm2, 5.8 mm2 and 4.8 mm2. The following violin plots show the full distribution of the maximum achievable reward at the end of optimization (after ten runs each with 4K trials) across the studied optimization strategies. The wider sections represent a higher probability of observing feasible architecture configurations at a particular given reward. This implies that we favor the optimization algorithm that yields increased width at the points with higher reward (higher performance).

The two top-performing optimization strategies for architecture exploration are evolutionary and P3BO, both in terms of delivering solutions with high reward and robustness across multiple runs. Looking into different design constraints, we observe that as one tightens the area budget constraint, the P3BO optimization strategy yields more high performing solutions. For example, when the area budget constraint is set to 5.8 mm2, P3BO finds design points with a reward (throughput / accelerator area) of 1.25 outperforming all the other optimization strategies. The same trend is observed when the area budget constraint is set to 4.8 mm2, a slightly better reward is found with more robustness (less variability) across multiple runs.

Violin plot showing the full distribution of the maximum achievable reward in ten runs across the optimization strategies after 4K trial evaluations under an area budget of 6.8 mm2. The P3BO and Evolutionary algorithm yield larger numbers of high-performing designs (wider sections). The x and y axes indicate the studied optimization algorithms and the geometric mean of speedup (reward) over the baseline accelerator, respectively.
As above, under an area budget of 5.8 mm2.
As above, under an area budget of 4.8 mm2.

Conclusion
While Apollo presents the first step towards better understanding of accelerator design space and building more efficient hardware, inventing accelerators with new capabilities is still an uncharted territory and a new frontier. We believe that this research is an exciting path forward to further explore ML-driven techniques for architecture design and co-optimization (e.g., compiler, mapping, and scheduling) across the computing stack to invent efficient accelerators with new capabilities for the next generation of applications.

Acknowledgments
This work was performed by Amir Yazdanbakhsh, Christof Angermueller, and Berkin Akin . We would like to also thank Milad Hashemi, Kevin Swersky, James Laudon, Herman Schmit, Cliff Young, Yanqi Zhou, Albin Jones, Satrajit Chatterjee, Ravi Narayanaswami, Ray (I-Jui) Sung, Suyog Gupta, Kiran Seshadri, Suvinay Subramanian, Matthew Denton, and the Vizier team for their help and support.


1 In our target accelerator, the total number of design points is around 5 x 108

2 The chip area is approximately the sum of total hardware components on the chip, including on-chip storage, processing engines, controllers, I/O pins, and etc.  

Read More

Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning

Posted by Mohammad Babaeizadeh, Research Engineer and Dumitru Erhan, Research Scientist, Google Research

Model-free reinforcement learning has been successfully demonstrated across a range of domains, including robotics, control, playing games and autonomous vehicles. These systems learn by simple trial and error and thus require a vast number of attempts at a given task before solving it. In contrast, model-based reinforcement learning (MBRL) learns a model of the environment (often referred to as a world model or a dynamics model) that enables the agent to predict the outcomes of potential actions, which reduces the amount of environment interaction needed to solve a task.

In principle, all that is strictly necessary for planning is to predict future rewards, which could then be used to select near-optimal future actions. Nevertheless, many recent methods, such as Dreamer, PlaNet, and SimPLe, additionally leverage the training signal of predicting future images. But is predicting future images actually necessary, or helpful? What benefit do visual MBRL algorithms actually derive from also predicting future images? The computational and representational cost of predicting entire images is considerable, so understanding whether this is actually useful is of profound importance for MBRL research.

In “Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning”, we demonstrate that predicting future images provides a substantial benefit, and is in fact a key ingredient in training successful visual MBRL agents. We developed a new open-source library, called the World Models Library, which enabled us to rigorously evaluate various world model designs to determine the relative impact of image prediction on returned rewards for each.

World Models Library
The World Models Library, designed specifically for visual MBRL training and evaluation, enables the empirical study of the effects of each design decision on the final performance of an agent across multiple tasks on a large scale. The library introduces a platform-agnostic visual MBRL simulation loop and the APIs to seamlessly define new world-models, planners and tasks or to pick and choose from the existing catalog, which includes agents (e.g., PlaNet), video models (e.g., SV2P), and a variety of DeepMind Control tasks and planners, such as CEM and MPPI.

Using the library, developers can study the effect of a varying factor in MBRL, such as the model design or representation space, on the performance of the agent on a suite of tasks. The library supports the training of the agents from scratch, or on a pre-collected set of trajectories, as well as evaluation of a pre-trained agent on a given task. The models, planning algorithms and the tasks can be easily mixed and matched to any desired combination.

To provide the greatest flexibility for users, the library is built using the NumPy interface, which enables different components to be implemented in either TensorFlow, Pytorch or JAX. Please look at this colab for a quick introduction.

Impact of Image Prediction
Using the World Models Library, we trained multiple world models with different levels of image prediction. All of these models use the same input (previously observed images) to predict an image and a reward, but they differ on what percentage of the image they predict. As the number of image pixels predicted by the agent increases, the agent performance as measured by the true reward generally improves.

The input to the model is fixed (previous observed images), but the fraction of the image predicted varies. As can be seen in the graph on the right, increasing the number of predicted pixels significantly improves the performance of the model.

Interestingly, the correlation between reward prediction accuracy and agent performance is not as strong, and in some cases a more accurate reward prediction can even result in lower agent performance. At the same time, there is a strong correlation between image reconstruction error and the performance of the agent.

Correlation between accuracy of image/reward prediction (x-axis) and task performance (y-axis). This graph clearly demonstrates a stronger correlation between image prediction accuracy and task performance.

This phenomenon is directly related to exploration, i.e., when the agent attempts more risky and potentially less rewarding actions in order to collect more information about the unknown options in the environment. This can be shown by testing and comparing models in an offline setup (i.e., learning policies from pre-collected datasets, as opposed to online RL, which learns policies by interacting with an environment). An offline setup ensures that there is no exploration and all of the models are trained on the same data. We observed that models that fit the data better usually perform better in the offline setup, and surprisingly, these may not be the same models that perform the best when learning and exploring from scratch.

Scores achieved by different visual MBRL models across different tasks. The top and bottom half of the graph visualizes the achieved score when trained in the online and offline settings for each task, respectively. Each color is a different model. It is common for a poorly-performing model in the online setting to achieve high scores when trained on pre-collected data (the offline setting) and vice versa.

Conclusion
We have empirically demonstrated that predicting images can substantially improve task performance over models that only predict the expected reward. We have also shown that the accuracy of image prediction strongly correlates with the final task performance of these models. These findings can be used for better model design and can be particularly useful for any future setting where the input space is high-dimensional and collecting data is expensive.

If you’d like to develop your own models and experiments, head to our repository and colab where you’ll find instructions on how to reproduce this work and use or extend the World Models Library.

Acknowledgement:
We would like to give special recognition to multiple researchers in the Google Brain team and co-authors of the paper: Mohammad Taghi Saffar, Danijar Hafner, Harini Kannan, Chelsea Finn and Sergey Levine.

Read More

Learning to Reason Over Tables from Less Data

Posted by Julian Eisenschlos AI Resident, Google Research, Zürich

The task of recognizing textual entailment, also known as natural language inference, consists of determining whether a piece of text (a premise), can be implied or contradicted (or neither) by another piece of text (the hypothesis). While this problem is often considered an important test for the reasoning skills of machine learning (ML) systems and has been studied in depth for plain text inputs, much less effort has been put into applying such models to structured data, such as websites, tables, databases, etc. Yet, recognizing textual entailment is especially relevant whenever the contents of a table need to be accurately summarized and presented to a user, and is essential for high fidelity question answering systems and virtual assistants.

In “Understanding tables with intermediate pre-training“, published in Findings of EMNLP 2020, we introduce the first pre-training tasks customized for table parsing, enabling models to learn better, faster and from less data. We build upon our earlier TAPAS model, which was an extension of the BERT bi-directional Transformer model with special embeddings to find answers in tables. Applying our new pre-training objectives to TAPAS yields a new state of the art on multiple datasets involving tables. On TabFact, for example, it reduces the gap between model and human performance by ~50%. We also systematically benchmark methods of selecting relevant input for higher efficiency, achieving 4x gains in speed and memory, while retaining 92% of the results. All the models for different tasks and sizes are released on GitHub repo, where you can try them out yourself in a colab Notebook.

Textual Entailment
The task of textual entailment is more challenging when applied to tabular data than plain text. Consider, for example, a table from Wikipedia with some sentences derived from its associated table content. Assessing if the content of the table entails or contradicts the sentence may require looking over multiple columns and rows, and possibly performing simple numeric computations, like averaging, summing, differencing, etc.

A table together with some statements from TabFact. The content of the table can be used to support or contradict the statements.

Following the methods used by TAPAS, we encode the content of a statement and a table together, pass them through a Transformer model, and obtain a single number with the probability that the statement is entailed or refuted by the table.

The TAPAS model architecture uses a BERT model to encode the statement and the flattened table, read row by row. Special embeddings are used to encode the table structure. The vector output of the first token is used to predict the probability of entailment.

Because the only information in the training examples is a binary value (i.e., “correct” or “incorrect”), training a model to understand whether a statement is entailed or not is challenging and highlights the difficulty in achieving generalization in deep learning, especially when the provided training signal is scarce. Seeing isolated entailed or refuted examples, a model can easily pick-up on spurious patterns in the data to make a prediction, for example the presence of the word “tie” in “Greg Norman and Billy Mayfair tie in rank”, instead of truly comparing their ranks, which is what is needed to successfully apply the model beyond the original training data.

Pre-training Tasks
Pre-training tasks can be used to “warm-up” models by providing them with large amounts of readily available unlabeled data. However, pre-training typically includes primarily plain text and not tabular data. In fact, TAPAS was originally pre-trained using a simple masked language modelling objective that was not designed for tabular data applications. In order to improve the model performance on tabular data, we introduce two novel pretraining binary-classification tasks called counterfactual and synthetic, which can be applied as a second stage of pre-training (often called intermediate pre-training).

In the counterfactual task, we source sentences from Wikipedia that mention an entity (person, place or thing) that also appears in a given table. Then, 50% of the time, we modify the statement by swapping the entity for another alternative. To make sure the statement is realistic, we choose a replacement among the entities in the same column in the table. The model is trained to recognize whether the statement was modified or not. This pre-training task includes millions of such examples, and although the reasoning about them is not complex, they typically will still sound natural.

For the synthetic task, we follow a method similar to semantic parsing in which we generate statements using a simple set of grammar rules that require the model to understand basic mathematical operations, such as sums and averages (e.g., “the sum of earnings”), or to understand how to filter the elements in the table using some condition (e.g.,”the country is Australia”). Although these statements are artificial, they help improve the numerical and logical reasoning skills of the model.

Example instances for the two novel pre-training tasks. Counterfactual examples swap entities mentioned in a sentence that accompanies the input table for a plausible alternative. Synthetic statements use grammar rules to create new sentences that require combining the information of the table in complex ways.

Results
We evaluate the success of the counterfactual and synthetic pre-training objectives on the TabFact dataset by comparing to the baseline TAPAS model and to two prior models that have exhibited success in the textual entailment domain, LogicalFactChecker (LFC) and Structure Aware Transformer (SAT). The baseline TAPAS model exhibits improved performance relative to LFC and SAT, but the pre-trained model (TAPAS+CS) performs significantly better, achieving a new state of the art.

We also apply TAPAS+CS to question answering tasks on the SQA dataset, which requires that the model find answers from the content of tables in a dialog setting. The inclusion of CS objectives improves the previous best performance by more than 4 points, demonstrating that this approach also generalizes performance beyond just textual entailment.

Results on TabFact (left) and SQA (right). Using the synthetic and counterfactual datasets, we achieve new state-of-the-art results in both tasks by a large margin.

Data and Compute Efficiency
Another aspect of the counterfactual and synthetic pre-training tasks is that since the models are already tuned for binary classification, they can be applied without any fine-tuning to TabFact. We explore what happens to each of the models when trained only on a subset (or even none) of the data. Without looking at a single example, the TAPAS+CS model is competitive with a strong baseline Table-Bert, and when only 10% of the data are included, the results are comparable to the previous state-of-the-art.

Dev accuracy on TabFact relative to the fraction of the training data used.

A general concern when trying to use large models such as this to operate on tables, is that their high computational requirements makes it difficult for them to parse very large tables. To address this, we investigate whether one can heuristically select subsets of the input to pass through the model in order to optimize its computational efficiency.

We conducted a systematic study of different approaches to filter the input and discovered that simple methods that select for word overlap between a full column and the subject statement give the best results. By dynamically selecting which tokens of the input to include, we can use fewer resources or work on larger inputs at the same cost. The challenge is doing so without losing important information and hurting accuracy. 

For instance, the models discussed above all use sequences of 512 tokens, which is around the normal limit for a transformer model (although recent efficiency methods like the Reformer or Performer are proving effective in scaling the input size). The column selection methods we propose here can allow for faster training while still achieving high accuracy on TabFact. For 256 input tokens we get a very small drop in accuracy, but the model can now be pre-trained, fine-tuned and make predictions up to two times faster. With 128 tokens the model still outperforms the previous state-of-the-art model, with an even more significant speed-up — 4x faster across the board.

Accuracy on TabFact using different sequence lengths, by shortening the input with our column selection method.

Using both the column selection method we proposed and the novel pre-training tasks, we can create table parsing models that need fewer data and less compute power to obtain better results.

We have made available the new models and pre-training techniques at our GitHub repo, where you can try it out yourself in colab. In order to make this approach more accessible, we also shared models of varying sizes all the way down to “tiny”. It is our hope that these results will help spur development of table reasoning among the broader research community.

Acknowledgements
This work was carried out by Julian Martin Eisenschlos, Syrine Krichene and Thomas Müller from our Language Team in Zürich. We would like to thank Jordan Boyd-Graber, Yasemin Altun, Emily Pitler, Benjamin Boerschinger, Srini Narayanan, Slav Petrov, William Cohen and Jonathan Herzig for their useful comments and suggestions.

Read More