Improving Sparse Training with RigL

Improving Sparse Training with RigL

Posted by Utku Evci and Pablo Samuel Castro, Research Engineers, Google Research, Montreal

Modern deep neural network architectures are often highly redundant [1, 2, 3], making it possible to remove a significant fraction of connections without harming performance. The sparse neural networks that result have been shown to be more parameter and compute efficient compared to dense networks, and, in many cases, can significantly decrease wall clock inference times.

By far the most popular method for training sparse neural networks is pruning, (dense-to-sparse training) which usually requires first training a dense model, and then “sparsifying” it by cutting out the connections with negligible weights. However, this process has two limitations.

  1. The size of the largest trainable sparse model is limited by that of the largest trainable dense model. Even if sparse models are more parameter efficient, one cannot use pruning to train models that are larger and more accurate than the largest possible dense models.
  2. Pruning is inefficient, meaning that large amounts of computation must be performed for parameters that are zero valued or that will be zero during inference. Additionally, it remains unknown if the performance of the current best pruning algorithms are an upper bound on the quality of sparse models.

Training sparse networks from scratch, on the other hand, is efficient, however often achieves inferior performance compared to pruning.

In “Rigging the Lottery: Making All Tickets Winners”, presented at ICML 2020, we introduce RigL, an algorithm for training sparse neural networks that uses a fixed parameter count and computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. The algorithm identifies which neurons should be active during training, which helps the optimization process to utilize the most relevant connections and results in better sparse solutions. An example of this is shown below, where, during the training of a multilayer perceptron (MLP) network on MNIST, our sparse network trained with RigL learns to focus on the center of the images, discarding the uninformative pixels from the edges. A Tensorflow implementation of our method along with three other baselines (SET, SNFS, SNIP) can be found at github.com/google-research/rigl.

Left: Average MNIST image. Right: Evolution of the connectivity of the input throughout the training of a 98% sparse, 2-layer MLP on MNIST. Training starts from a random sparse mask, where each input pixel has roughly six outgoing connections. Connections that originate from the edges do not exhibit meaningful gradients and are therefore replaced by more informative connections that originate from the center pixels.

RigL Overview
The RigL method starts with a network initialized with a random sparse topology. At regularly spaced intervals we remove a fraction of the connections with the smallest weight magnitudes. Such a strategy has been shown to have very little effect on the loss. RigL then activates new connections using instantaneous gradient information, i.e., without using past gradient information. After updating the connectivity, training continues with the updated network until the next scheduled update. Next, the system activates connections with large gradients, since these connections are expected to decrease the loss most quickly.

RigL begins with a random sparse initialization of the network. It then trains the network and trims out those connections with weak activations. Based on the gradients calculated for the new configuration, it grows new connections and trains again, repeating the cycle.

Evaluating Performance
By changing the connectivity of the neurons dynamically during training, RigL helps optimize to find better solutions. To demonstrate this, we restart training from a bad solution that exhibits poor accuracy and show that RigL’s mask updates help the optimization achieve better loss compared to static training, in which connectivity of the sparse network remains the same.

Training loss of RigL and Static methods starting from the same static sparse solution, shown together with their final test accuracies.

The figure below summarizes the performance of various methods on training an 80% sparse ResNet-50 architecture. We compare RigL with two recent sparse training methods, SET and SNFS and three baseline training methods: Static, Small-Dense and Pruning. Two of these methods (SNFS and Pruning) require dense resources as they need to either train a large network or store the gradients of it. Overall, we observe that the performance of all methods improves with additional training time; thus, for each method we run extended training with up to 5x the training steps of the original 100 epochs.

As noted in a number of studies [4, 5, 6, 7], training a network with fixed sparsity from scratch (Static) leads to inferior performance compared to solutions found by pruning. Training a small, dense network (Small-Dense) with the same number of parameters gets better results than Static, but fails to match the performance of dynamic sparse models. Similarly, SET improves the performance over Small-Dense, but saturates at around 75% accuracy, revealing the limits of growing new connections randomly. Methods that use gradient information to grow new connections (RigL and SNFS) obtain higher accuracy in general, but RigL achieves the highest accuracy, while also consistently requiring fewer FLOPs (and memory footprint) than the other methods.

Performance of sparse training methods on training an 80% sparse ResNet-50 architecture with uniform sparsity distribution. Points at each curve correspond to the individual training runs with increasing training length. The number of FLOPs required to train a standard dense ResNet-50 along with its performance is indicated with a dashed red line. RigL matches the standard ResNet-50 performance, even though it is 5x smaller in size.

Observing the trend between extended training and performance, we compare the results using longer training runs. Within the interval considered (i.e., 1x-100x) RigL’s performance constantly improves with additional training. RigL achieves state of art performance of 68.07% Top-1 accuracy at training with a 99% sparse ResNet-50 architecture. Similarly extended training of a 90% sparse MobileNet-v1 architecture with RigL achieves 70.55% Top-1 accuracy. Obtaining the same results with fewer training iterations is an exciting future research direction.

Effect of training time on RigL accuracy at training 99% sparse ResNet-50 (left) and 90% sparse MobileNets-v1 (right) architectures.

Other experiments include image classification on CIFAR-10 datasets and character-based language modelling using RNNs with the WikiText-103 dataset and can be found in the full paper.

Future Work
RigL is useful in three different scenarios:

  1. Improving the accuracy of sparse models intended for deployment.
  2. Improving the accuracy of large sparse models that can only be trained for a limited number of iterations.
  3. Combining with sparse primitives to enable training of extremely large sparse models which otherwise would not be possible.

The third scenario is unexplored due to the lack of hardware and software support for sparsity. Nonetheless, work continues [8, 9, 10] to improve the performance of sparse networks on current hardware and new types of hardware accelerators are expected to have better support for parameter sparsity [11, 12]. We hope RigL provides the tools to take advantage of, and motivation for, such advances.

AcknowledgementsWe would like to thank Eleni Triantafillou, Hugo Larochelle, Bart van Merrienboer, Fabian Pedregosa, Joan Puigcerver, Danny Tarlow, Nicolas Le Roux, Karen Simonyan for giving feedback on the preprint of the paper; Namhoon Lee for helping us verify and debug our SNIP implementation; Chris Jones for helping us discover and solve the distributed training bug; and Tom Small for creating the visualization of the algorithm.

Read More

Imitation Learning in the Low-Data Regime

Imitation Learning in the Low-Data Regime

Posted by Robert Dadashi, Research Software Engineer, and Léonard Hussenot, Student Researcher, Google Research

Reinforcement Learning (RL) is a paradigm for using trial-and-error to train agents to sequentially make decisions in complex environments, which has had great success in a number of domains, including games, robotics manipulation and chip design. Agents typically aim at maximizing the sum of the reward they collect in an environment, which can be based on a variety of parameters, including speed, curiosity, aesthetics and more. However, designing a specific RL reward function is a challenge since it can be hard to specify or too sparse. In such cases, imitation learning (IL) methods offer an alternative as they learn how to solve a task from expert demonstrations, rather than a carefully designed reward function. However, state-of-the-art IL methods rely on adversarial training, which uses min/max optimization procedures, making them algorithmically unstable and difficult to deploy.

In “Primal Wasserstein Imitation Learning” (PWIL), we introduce a new IL method, based on the primal form of the Wasserstein distance, also known as the earth mover’s distance, which does not rely on adversarial training. Using the MuJoCo suite of tasks, we demonstrate the efficacy of the PWIL method by imitating a simulated expert with a limited number of demonstrations (even a single example) and limited interactions with the environment.

Left: Demonstration of the algorithmic Humanoid “expert”, trained on the true reward of the task (which relates to speed). Right: Agent trained using PWIL on the expert demonstration.

Adversarial Imitation Learning
State-of-the-art adversarial IL methods operate similarly to generative adversarial networks (GANs) in which a generator (the policy) is trained to maximize the confusion of a discriminator (the reward) that itself is trained to differentiate between the agent’s state-action pairs and the expert’s. Adversarial IL methods boil down to a distribution matching problem, i.e., the problem of minimizing a distance between probability distributions in a metric space. However, just as GANs, adversarial IL methods rely on a min/max optimization problem and hence come with a number of training stability challenges.

Imitation Learning as Distribution Matching
The PWIL method is based on the formulation of IL as a distribution matching problem, in this case, the Wasserstein distance. The first step consists of inferring from the demonstrations a state-action distribution of the expert, the collection of relationships between the actions taken by the expert and the corresponding state of the environment. The goal is then to minimize the distance between the agent’s and the expert’s state-action distributions, through interactions with the environment. In contrast, PWIL is a non-adversarial method, enabling it to bypass the min/max optimization problem and directly minimize the Wasserstein distance between the agent’s and the expert’s state-action pair distributions.

Primal Wasserstein Imitation Learning
Computing the exact Wasserstein distance can be restrictive since one must wait until the end of a trajectory of the agent to calculate it, meaning that the rewards can be computed only when the agent is done interacting with the environment. To avoid this restriction, we use an upper bound on the distance instead, from which we can define a reward that we optimize using RL. We show that by doing so, we indeed recover expert behaviour and minimize the Wasserstein distance between the agent and the expert on a number of locomotion tasks of the MuJoCo simulator. While adversarial IL methods use a reward function from a neural network that must be optimized and re-estimated continuously as the agent interacts with the environment, PWIL defines a reward function offline from demonstrations, which does not change and is based on substantially fewer hyperparameters than adversarial IL approaches.

Training curves for PWIL on Humanoid. In green, the Wasserstein distance to the state-action distribution of the expert. In blue, the return (the sum of rewards collected) by the agent.

A Measure of Similarity for the True Imitation Learning Setting
As in numerous challenges in ML, a number of IL methods are evaluated on synthetic tasks, where one usually has access to the underlying reward function of the task and can measure similarity between the expert’s and the agent’s behaviour in terms of performance, which is the expected sum of rewards. A byproduct of PWIL is the creation of a metric that can compare expert behavior to an agent’s behavior for any IL method, without access to the true reward of the task. In this sense, we can use the Wasserstein distance in the true IL setting, not only on synthetic tasks.

Conclusion
In environments where interacting is costly (e.g., a real robot or a complex simulator), PWIL is a prime candidate not only because it can recover expert behaviour, but also because the reward function it defines is easy to tune and is defined without interactions with the environment. This opens multiple opportunities for future exploration, including deployment to real systems, extending PWIL to the setup where we have only access to demonstration states (rather than states and actions), and finally applying PWIL to visual based observations.

Acknowledgements
We thank our co-authors, Matthieu Geist and Olivier Pietquin; as well as Zafarali Ahmed, Adrien Ali Taïga, Gabriel Dulac-Arnold, Johan Ferret, Alexis Jacq and Saurabh Kumar for their feedback on the manuscript.

Read More

The Technology Behind our Recent Improvements in Flood Forecasting

The Technology Behind our Recent Improvements in Flood Forecasting

Posted by Sella Nevo, Senior Software Engineer, Google Research, Tel Aviv

Flooding is the most common natural disaster on the planet, affecting the lives of hundreds of millions of people around the globe and causing around $10 billion in damages each year. Building on our work in previous years, earlier this week we announced some of our recent efforts to improve flood forecasting in India and Bangladesh, expanding coverage to more than 250 million people, and providing unprecedented lead time, accuracy and clarity.

To enable these breakthroughs, we have devised a new approach for inundation modeling, called a morphological inundation model, which combines physics-based modeling with machine learning (ML) to create more accurate and scalable inundation models in real-world settings. Additionally, our new alert-targeting model allows identifying areas at risk of flooding at unprecedented scale using end-to-end machine learning models and data that is publicly available globally. In this post, we also describe developments for the next generation of flood forecasting systems, called HydroNets (presented at ICLR AI for Earth Sciences and EGU this year), which is a new architecture specially built for hydrologic modeling across multiple basins, while still optimizing for accuracy at each location.

Forecasting Water Levels
The first step in a flood forecasting system is to identify whether a river is expected to flood. Hydrologic models (or gauge-to-gauge models) have long been used by governments and disaster management agencies to improve the accuracy and extend the lead time of their forecasts. These models receive inputs like precipitation or upstream gauge measurements of water level (i.e., the absolute elevation of the water above sea level) and output a forecast for the water level (or discharge) in the river at some time in the future.

The hydrologic model component of the flood forecasting system described in this week’s Keyword post doubled the lead time of flood alerts for areas covering more than 75 million people. These models not only increase lead time, but also provide unprecedented accuracy, achieving an R2 score of more than 99% across all basins we cover, and predicting the water level within a 15 cm error bound more than 90% of the time. Once a river is predicted to reach flood level, the next step in generating actionable warnings is to convert the river level forecast into a prediction for how the floodplain will be affected.

Morphological Inundation Modeling
In prior work, we developed high quality elevation maps based on satellite imagery, and ran physics-based models to simulate water flow across these digital terrains, which allowed warnings with unprecedented resolution and accuracy in data-scarce regions. In collaboration with our satellite partners, Airbus, Maxar and Planet, we have now expanded the elevation maps to cover hundreds of millions of square kilometers. However, in order to scale up the coverage to such a large area while still retaining high accuracy, we had to re-invent how we develop inundation models.

Inundation modeling estimates what areas will be flooded and how deep the water will be. This visualization conceptually shows how inundation could be simulated, how risk levels could be defined (represented by red and white colors), and how the model could be used to identify areas that should be warned (green dots).

Inundation modeling at scale suffers from three significant challenges. Due to the large areas involved and the resolution required for such models, they necessarily have high computational complexity. In addition, most global elevation maps don’t include riverbed bathymetry, which is important for accurate modeling. Finally, the errors in existing data, which may include gauge measurement errors, missing features in the elevation maps, and the like, need to be understood and corrected. Correcting such problems may require collecting additional high-quality data or fixing erroneous data manually, neither of which scale well.

Our new approach to inundation modeling, which we call a morphological model, addresses these issues by using several innovative tricks. Instead of modeling the complex behaviors of water flow in real time, we compute modifications to the morphology of the elevation map that allow one to simulate the inundation using simple physical principles, such as those describing hydrostatic systems.

First, we train a pure-ML model (devoid of physics-based information) to estimate the one-dimensional river profile from gauge measurements. The model takes as input the water level at a specific point on the river (the stream gauge) and outputs the river profile, which is the water level at all points in the river. We assume that if the gauge increases, the water level increases monotonically, i.e., the water level at other points in the river increases as well. We also assume that the absolute elevation of the river profile decreases downstream (i.e., the river flows downhill).

We then use this learned model and some heuristics to edit the elevation map to approximately “cancel out” the pressure gradient that would exist if that region were flooded. This new synthetic elevation map provides the foundation on which we model the flood behavior using a simple flood-fill algorithm. Finally, we match the resulting flooded map to the satellite-based flood extent with the original stream gauge measurement.

This approach abandons some of the realistic constraints of classical physics-based models, but in data scarce regions where existing methods currently struggle, its flexibility allows the model to automatically learn the correct bathymetry and fix various errors to which physics-based models are sensitive. This morphological model improves accuracy by 3%, which can significantly improve forecasts for large areas, while also allowing for much more rapid model development by reducing the need for manual modeling and correction.

Alert targeting
Many people reside in areas that are not covered by the morphological inundation models, yet access to accurate predictions are still urgently needed. To reach this population and to increase the impact of our flood forecasting models, we designed an end-to-end ML-based approach, using almost exclusively data that is globally publicly available, such as stream gauge measurements, public satellite imagery, and low resolution elevation maps. We train the model to use the data it is receiving to directly infer the inundation map in real time.

A direct ML approach from real-time measurements to inundation.

This approach works well “out of the box” when the model only needs to forecast an event that is within the range of events previously observed. Extrapolating to more extreme conditions is much more challenging. Nevertheless, proper use of existing elevation maps and real-time measurements can enable alerts that are more accurate than presently available for those in areas not covered by the more detailed morphological inundation models. Because this model is highly scalable, we were able to launch it across India after only a few months of work, and we hope to roll it out to many more countries soon.

Improving Water Levels Forecasting
In an effort to continue improving flood forecasting, we have developed HydroNets — a specialized deep neural network architecture built specifically for water levels forecasting — which allows us utilize some exciting recent advances in ML-based hydrology in a real-world operational setting. Two prominent features distinguish it from standard hydrologic models. First, it is able to differentiate between model components that generalize well between sites, such as the modeling of rainfall-runoff processes, and those that are specific to a given site, like the rating curve, which converts a predicted discharge volume into an expected water level. This enables the model to generalize well to different sites, while still fine-tuning its performance to each location. Second, HydroNets takes into account the structure of the river network being modeled, by training a large architecture that is actually a web of smaller neural networks, each representing a different location along the river. This allows neural networks that are modeling upstream sites to pass information encoded in embeddings to models of downstream sites, so that every model can know everything it needs without a drastic increase in parameters.

The animation below illustrates the structure and flow of information in HydroNets. The output from the modeling of upstream sub-basins is combined into a single representation of a given basin state. It is then processed by the shared model component, which is informed by all basins in the network, and passed on to the label prediction model, which calculates the water level (and the loss function). The output from this iteration of the network is then passed on to inform downstream models, and so on.

An illustration of the HydroNets architecture.

We’re incredibly excited about this progress, and are working hard on improving our systems further.

Acknowledgements
This work is a collaboration between many research teams at Google, and is part of our AI for Social Good efforts. We’d also like to thank our Geo and Policy teams, as well as Google.org.

Read More

KeyPose: Estimating the 3D Pose of Transparent Objects from Stereo

KeyPose: Estimating the 3D Pose of Transparent Objects from Stereo

Posted by Kurt Konolige, Software Engineer, Robotics at Google

Estimating the position and orientation of 3D objects is one of the core problems in computer vision applications that involve object-level perception, such as augmented reality and robotic manipulation. In these applications, it is important to know the 3D position of objects in the world, either to directly affect them, or to place simulated objects correctly around them. While there has been much research on this topic using machine learning (ML) techniques, especially Deep Nets, most have relied on the use of depth sensing devices, such as the Kinect, which give direct measurements of the distance to an object. For objects that are shiny or transparent, direct depth sensing does not work well. For example, the figure below includes a number of objects (left), two of which are transparent stars. A depth device does not find good depth values for the stars, and gives a very poor reconstruction of the actual 3D points (right).

Left: RGB image of transparent objects.  Right: A four-panel image showing the reconstructed depth for the scene on the left.The top row includes depth images and the bottom row presents the 3D point cloud. The left panels were reconstructed using a depth camera and the right panels are output from the ClearGrasp model.  Note that although ClearGrasp inpaints the depth of the stars, it mistakes the actual depth of the rightmost one.

One solution to this problem, such as that proposed by ClearGrasp, is to use a deep neural network to inpaint the corrupted depth map of the transparent objects. Given a single RGB-D image of transparent objects, ClearGrasp uses deep convolutional networks to infer surface normals, masks of transparent surfaces, and occlusion boundaries, which it uses to refine the initial depth estimates for all transparent surfaces in the scene (far right in the figure above). This approach is very promising, and allows scenes with transparent objects to be processed by pose-estimation methods that rely on depth.  But inpainting can be tricky, especially when trained completely with synthetic images, and can still result in errors in depth.

In “KeyPose: Multi-View 3D Labeling and Keypoint Estimation for Transparent Objects”, presented at CVPR 2020 in collaboration with the Stanford AI Lab, we describe an ML system that estimates the depth of transparent objects by directly predicting 3D keypoints. To train the system we gather a large real-world dataset of images of transparent objects in a semi-automated way, and efficiently label their pose using 3D keypoints selected by hand. We then train deep models (called KeyPose) to estimate the 3D keypoints end-to-end from monocular or stereo images, without explicitly computing depth. The models work on objects both seen and unseen during training, for both individual objects and categories of objects. While KeyPose can work with monocular images, the extra information available from stereo images allows it to improve its results by a factor of two over monocular image input, with typical errors from 5 mm to 10 mm, depending on the objects. It substantially improves over state-of-the-art in pose estimation for these objects, even when competing methods are provided with ground truth depth. We are releasing the dataset of keypoint-labeled transparent objects for use by the research community.

Real-World Transparent Object Dataset with 3D Keypoint Labels
To facilitate gathering large quantities of real-world images, we set up a robotic data-gathering system in which a robot arm moves through a trajectory while taking video with two devices, a stereo camera and the Kinect Azure depth camera.

Automated image sequence capture using a robot arm with a stereo camera and an Azure Kinect device.

The AprilTags on the target enable accurate tracing of the pose of the cameras. By hand-labelling only a few images in each video with 2D keypoints, we can extract 3D keypoints for all frames of the video using multi-view geometry, thus increasing the labelling efficiency by a factor of 100.

We captured imagery for 15 different transparent objects in five categories, using 10 different background textures and four different poses for each object, yielding a total of 600 video sequences comprising 48k stereo and depth images. We also captured the same images with an opaque version of the object, to provide accurate ground truth depth images. All the images are labelled with 3D keypoints. We are releasing this dataset of real-world images publicly, complementing the synthetic ClearGrasp dataset with which it shares similar objects.

KeyPose Algorithm Using Early Fusion Stereo
The idea of using stereo images directly for keypoint estimation was developed independently for this project; it has also appeared recently in the context of hand-tracking. The diagram below shows the basic idea: the two images from a stereo camera are cropped around the object and fed to the KeyPose network, which predicts a sparse set of 3D keypoints that represent the 3D pose of the object. The network is trained using supervision from the labelled 3D keypoints.

One of the key aspects of stereo KeyPose is the use of early fusion to intermix the stereo images, and allow the network to implicitly compute disparity, in contrast to late fusion, in which keypoints are predicted for each image separately, and then combined. As shown in the diagram below, the output of KeyPose is a 2D keypoint heatmap in the image plane along with a disparity (i.e., inverse depth) heatmap for each keypoint. The combination of these two heatmaps yields the 3D coordinate of the keypoint, for each keypoint.

Keypose system diagram. Stereo images are passed to a CNN model to produce a probability heatmap for each keypoint.  This heatmap yields 2D image coordinates U,V for the keypoint.  The CNN model also produces a disparity (inverse depth) heatmap for each keypoint, which when combined with the U,V coordinates, gives a 3D position (X,Y,Z).

When compared to late fusion or to monocular input, early fusion stereo typically is twice as accurate.

Results
The images below show qualitative results of KeyPose on individual objects. On the left is one of the original stereo images; in the middle are the predicted 3D keypoints projected onto the image. On the right, we visualize points from a 3D model of the bottle, placed at the pose determined by the predicted 3D keypoints. The network is efficient and accurate, predicting keypoints with an MAE of 5.2 mm for the bottle and 10.1 mm for the mug using just 5 ms on a standard GPU.

The following table shows results for KeyPose on category-level estimation. The test set used a background texture not seen by the training set. Note that the MAE varies from 5.8 mm to 9.9 mm, showing the accuracy of the method.

Quantitative comparison of KeyPose with the state-of-the-art DenseFusion system, on category-level data. We provide DenseFusion with two versions of depth, one from the transparent objects, and one from opaque objects. <2cm is the percent of estimates with errors less than 2 cm. MAE is the mean absolute error of the keypoints, in mm.

For a complete accounting of quantitative results, as well as, ablation studies, please see the paper and supplementary materials and the KeyPose website.

Conclusion
This work shows that it is possible to accurately estimate the 3D pose of transparent objects from RGB images without reliance on depth images. It validates the use of stereo images as input to an early fusion deep net, where the network is trained to extract sparse 3D keypoints directly from the stereo pair. We hope the availability of an extensive, labelled dataset of transparent objects will help to advance the field. Finally, while we used semi-automatic methods to efficiently label the dataset, we hope to employ self-supervision methods in future work to do away with manual labelling.

Acknowledgements
I want to thank my co-authors, Xingyu Liu of Stanford University, and Rico Jonschkowski and Anelia Angelova; as well the many who helped us through discussions during the project and paper writing, including Andy Zheng, Shuran Song, Vincent Vanhoucke, Pete Florence, and Jonathan Tompson.

Read More

Partnering with NSF on human-AI collaboration

Today, in partnership with the U.S. National Science Foundation (NSF), we are announcing a National AI Research Institute for Human-AI Interaction and Collaboration. Google will provide $5 million in funding to support the Institute. We will also offer AI expertise, research collaborations, and Cloud support for Institute researchers and educators as they advance knowledge and progress in the field of AI.

Studies have shown that humans and AI systems operating together can make smarter decisions than either acting alone. In the past few years we’ve seen the increasing use of AI to support people and their decision making in sectors like medicine, education, transportation and agriculture. Conversely, people also support AI systems and their decision making through training data and model design, testing and operation, and continued feedback and iteration on system performance. People and AI systems shape each other, and in order to realize the full potential of AI for social benefit, positive and productive human-AI interaction and collaboration is critical.

Google has been working in this area over the last several years, publishing hundreds of research papers in human-computer interaction and visualization; bringing industry and academic experts together at events like the PAIR Symposium and top research conferences; designing tools like Facets, the What-If Tool, and the Language Interpretability Tool to better understand datasets and models; creating the Model Card Toolkit for model transparency and a People + AI guidebookto support human-centered AI development; building better human-AI interfaces in our products like smart predictions in Gboard and auto-captioning with the Live Transcribe app; and enabling anyone to help make AI-powered products more useful through efforts like Crowdsource and Translate Community.

The Institute we are announcing with NSF will support interdisciplinary research on a variety of modes of interaction between people and AI—like speech, written language, visuals and gestures—and how to make these interactions more effective. Importantly, the research, tools and techniques from the Institute will be developed with human-centered principles in mind: social benefit, inclusive design, safety and robustness, privacy, and high standards of scientific excellence, consistent with the Google AI Principles. Research projects in the Institute will engage a diverse set of experts, educate the next generation and promote workforce development, and broaden participation from underrepresented groups and institutions across the country. All research outcomes will be published to advance knowledge and progress in the field.

U.S. universities and research institutions, individually and in collaboration, are welcome to apply. We are proud to partner with NSF in our ongoing efforts to promote innovation and technology leadership, and look forward to supporting many brilliant and creative ideas.

Read More

A big step for flood forecasts in India and Bangladesh

A big step for flood forecasts in India and Bangladesh

For several years, the Google Flood Forecasting Initiative has been working with governments to develop systems that predict when and where flooding will occur—and keep people safe and informed. 

Much of this work is centered on India, where floods are a serious risk for hundreds of millions of people. Today, we’re providing an update on how we’re expanding and improving these efforts, as well as a new partnership we’ve formed with the International Federation of the Red Cross and Red Crescent Societies.

Expanding our forecasting reach

In recent months, we’ve been expanding our forecasting models and services in partnership with the Indian Central Water Commission. In June, just in time for the monsoon season, we reached an important milestone: our systems now extend to the whole of India, with Google technology being used to improve the targeting of every alert the government sends. This means we can help better protect more than 200 million people across more than 250,000 square kilometers—more than 20 times our coverage last year. To date, we’ve sent out around 30 million notifications to people in flood-affected areas. 

In addition to expanding in India, we’ve partnered with the Bangladesh Water Development Board to bring our warnings and services to Bangladesh, which experiences more flooding than any other country in the world. We currently cover more than 40 million people in Bangladesh, and we’re working to extend this to the whole country. 

Flood forecasting map

Coverage areas of our current operational flood forecasting systems. In these areas we use our models to help government alerts reach the right people. In some areas we have also increased lead time and spatial accuracy.

Better protection for vulnerable communities

In collaboration with Yale, we’ve been visiting flood-affected areas and doing research to better understand what information people need, how they use it to protect themselves, and what we can do to make that information more accessible. One survey we conducted found that 65 percent of people who receive flood warnings before the flooding begins take action to protect themselves or their assets (such as evacuating or moving their belongings). But we’ve also found there’s a lot more we could be doing to help—including getting alerts to people faster, and providing additional information about the severity of floods.

pasted image 0 (8).png

Checking how our flood warnings match conditions on the ground. This photo was taken during a field survey in Bihar during monsoon 2019.

This year, we’ve launched a new forecasting model that will allow us to double the lead time of many of our alerts—providing more notice to governments and giving tens of millions of people an extra day or so to prepare. 

We’re providing people with information about flood depth: when and how much flood waters are likely to rise. And in areas where we can produce depth maps throughout the floodplain, we’re sharing information about depth in the user’s village or area.

We’ve also overhauled the way our alerts look and function to make sure they’re useful and accessible for everyone. We now provide the information in different formats, so that people can both read their alerts and see them presented visually; we’ve added support for Hindi, Bengali and seven other local languages; we’ve made the alert more localized and accurate; and we now allow for easy changes to language or location.

The technology behind some of these improvements is explained in more detail at the Google AI Blog.  

flood forecasting alerts.png

Alerts for flood forecasting

Partnering for greater impact 

In addition to improving our alerts, Google.org has started a collaboration with the International Federation of Red Cross and Red Crescent Societies. This partnership aims to build local networks that can get disaster alert information to people who wouldn’t otherwise receive smartphone alerts directly. 

Of course, for all the progress we’ve made with alert technology, there are still a lot of challenges to overcome. With the flood season still in full swing in India and Bangladesh, COVID-19 has delayed critical infrastructure work, added to the immense pressure on first responders and medical authorities, and disrupted the in-person networks that many people still rely on for advance notice when a flood is on the way.

There’s much more work ahead to strengthen the systems that so many vulnerable people rely on—and expand them to reach more people in flood-affected areas. Along with our partners around the world, we will continue developing, maintaining and improving technologies and digital tools to help protect communities and save lives.

Read More

Using Machine Learning to Detect Deficient Coverage in Colonoscopy Screenings

Using Machine Learning to Detect Deficient Coverage in Colonoscopy Screenings

Posted by Daniel Freedman and Ehud Rivlin, Research Scientists, Google Health

Colorectal cancer (CRC) is a global health problem and the second deadliest cancer in the United States, resulting in an estimated 900K deaths per year. While deadly, CRC can be prevented by removing small precancerous lesions in the colon, called polyps, before they become cancerous. In fact, it is estimated that a 1% increase in the adenoma detection rate (ADR, defined as the fraction of procedures in which a physician discovers at least one polyp) can lead to a 6% decrease in the rate of interval CRCs (a CRC that is diagnosed within 60 months of a negative colonoscopy).

Colonoscopy is considered the gold standard procedure for the detection and removal of polyps. Unfortunately, the literature indicates that endoscopists miss on average 22%-28% of polyps during colonoscopies; furthermore, 20% to 24% of polyps that have the potential to become cancerous (adenomas) are missed. Two major factors that may cause an endoscopist to miss a polyp are (1) the polyp appears in the field of view, but the endoscopist misses it, perhaps due to its small size or flat shape; and (2) the polyp does not appear in the field of view, as the endoscopist has not fully covered the relevant area during the procedure.

In “Detecting Deficient Coverage in Colonoscopies”, we introduce the Colonoscopy Coverage Deficiency via Depth algorithm, or C2D2, a machine learning-based approach to improving colonoscopy coverage. The C2D2 algorithm performs a local 3D reconstruction of the colon as images are captured during the procedure, and on that basis, identifies which areas of the colon were covered and which remained outside of the field of view. C2D2 can then indicate in real time whether a particular area of the colon has suffered from deficient coverage so the endoscopist can return to that area. Our work proposes a novel approach to compute coverage in real time, for which 3D reconstruction is done using a calibration-free, unsupervised learning method, and evaluate it in a large scale way.

The C2D2 Algorithm
When considering colon coverage, it is important to estimate the coverage fraction — what percentage of the relevant regions were covered by a complete procedure. While a retrospective analysis is useful for the physician and could provide general guidance for future procedures, it is more useful to have real-time estimation of coverage fraction, on a segment by segment basis, i.e. knowledge of what fraction of the current segment has been covered while traversing the colon. The helpfulness of such functionality is clear: during the procedure itself, a physician may be alerted to segments with deficient coverage, and can immediately return to review these areas. Higher coverage will result in a higher proportion of polyps being seen.

The C2D2 algorithm is designed to compute such a segment-by-segment coverage in two phases: computing depth maps for each frame of the colonoscopy video, followed by computation of coverage based on these depth maps.

C2D2 computes a depth image from a single RGB image. Then, based on the computed depth images for a video sequence, C2D2 calculates local coverage, so it can detect where the coverage has been deficient and a second look is required.

Depth map creation consists of both depth estimation as well as pose estimation — the localization of where the endoscope is in space, as well as the direction it is pointing. In addition to the detection of deficient coverage, depth and pose estimation are useful for a variety of other interesting tasks. For example, depth can be used for improved detection of flat polyps, while pose estimation can be used for relocalizing areas of the colon (including polyps) that the endoscopist wishes to revisit, and both together can be used for visualization and navigation.

Top row: RGB image, from which the depth is computed. Bottom row: Depth image as computed by C2D2. Yellow is deeper, blue is shallower. Note that the “tunnel” structure is captured, as well as the Haustral ridges.

In order to compute coverage fractions from these depth maps, we trained C2D2 on two sources of data: synthetic sequences and real sequences. We generated the synthetic videos using a graphical model of a colon. For each synthetic video, ground truth coverage is available in the form of a number between 0 (completely uncovered) and 1 (completely covered). For real sequences, we analyzed de-identified colonoscopy videos, for which ground truth coverage is unavailable.

Performance on Synthetic Videos
When using synthetic videos, the availability of ground truth coverage enables the direct measurement of C2D2’s performance. We quantify this using the mean absolute error (MAE), which indicates how much the algorithm’s prediction differs, on average, from the ground truth. We find that C2D2’s MAE = 0.075; meaning that, on average, the prediction of C2D2 is within 7.5% of the ground truth. By contrast, a group of physicians given the same task achieved MAE = 0.177, i.e., within 17.7% of the ground truth. Thus, the C2D2 attained an accuracy rate 2.4 times higher on synthetic sequences.

Performance on Real Videos
Of course, what matters most is performance on videos of real colonoscopies. The challenge in this case is the absence of ground truth labelling: we don’t know what the actual coverage is. Additionally, one cannot use labels provided by experts directly as they are not always accurate, due to the challenges described earlier. However, C2D2 can still perform inference on real colonoscopy videos. Indeed, the learning pipeline is designed to perform equally well on synthetic and real colonoscopy videos.

To verify performance on real sequences, we used a variant of a technique common in the generative modelling literature, which involves providing video sequences to human experts along with C2D2’s coverage scores for those sequences. We then ask the experts to assess whether C2D2’s score is correct. The idea is that while it is difficult for experts to assign a score directly, the task of verifying a given score is considerably easier. (This is similar to the fact that verifying a proposed solution to an algorithmic problem is generally much easier than computing that solution.) Using this methodology, experts verified C2D2’s score 93% of the time. And in a more qualitative sense, C2D2’s output seems to pass the “eyeball test”, see the figure below.

Coverage on real colonoscopy sequences. Top row: Frames from a well covered sequence — the entire “tunnel” down the lumen may be seen; C2D2 coverage = 0.931. Middle row: A partially covered sequence — the bottom may be seen, but the top is not as visible; C2D2 coverage = 0.427. Bottom row: A poorly covered sequence, much of what is seen is the wall; C2D2 coverage = 0.227.

Next steps
By alerting physicians to missed regions of the colon wall, C2D2 promises to lead to the discovery of more adenomas, thereby increasing the ADR and concomitantly decreasing the rate of interval CRC. This would be of tremendous benefit to patients.

In addition to this work that addresses colonoscopy coverage, we are concurrently conducting research to improve polyp detection by combining C2D2 with an automatic, real-time polyp detection algorithm. This study adds to the mounting evidence that physicians may use machine learning methods to augment their efforts, especially during procedures, to improve the quality of care for patients.

Acknowledgements
This research was conducted by Daniel Freedman, Yochai Blau, Liran Katzir, Amit Aides, Ilan Shimshoni, Danny Veikherman, Tomer Golany, Ariel Gordon, Greg Corrado, Yossi Matias, and Ehud Rivlin, with support from Verily. We would like to thank all of our team members and collaborators who worked on this project with us, including: Nadav Rabani, Chen Barshai, Nia Stoykova, David Ben-Shimol, Jesse Lachter, and Ori Segol, 3D-Systems and many others. We’d also like to thank Yossi Matias for support and guidance. The research was conducted by teams from Google Health and Google Research, Israel.

Read More

Scaling Up Fundamental Quantum Chemistry Simulations on Quantum Hardware

Scaling Up Fundamental Quantum Chemistry Simulations on Quantum Hardware

Posted by Nicholas Rubin and Charles Neill, Research Scientists, Google AI Quantum

Accurate computational prediction of chemical processes from the quantum mechanical laws that govern them is a tool that can unlock new frontiers in chemistry, improving a wide variety of industries. Unfortunately, the exact solution of quantum chemical equations for all but the smallest systems remains out of reach for modern classical computers, due to the exponential scaling in the number and statistics of quantum variables. However, by using a quantum computer, which by its very nature takes advantage of unique quantum mechanical properties to handle calculations intractable to its classical counterpart, simulations of complex chemical processes can be achieved. While today’s quantum computers are powerful enough for a clear computational advantage at some tasks, it is an open question whether such devices can be used to accelerate our current quantum chemistry simulation techniques.

In “Hartree-Fock on a Superconducting Qubit Quantum Computer”, appearing today in Science, the Google AI Quantum team explores this complex question by performing the largest chemical simulation performed on a quantum computer to date. In our experiment, we used a noise-robust variational quantum eigensolver (VQE) to directly simulate a chemical mechanism via a quantum algorithm. Though the calculation focused on the Hartree-Fock approximation of a real chemical system, it was twice as large as previous chemistry calculations on a quantum computer, and contained ten times as many quantum gate operations. Importantly, we validate that algorithms being developed for currently available quantum computers can achieve the precision required for experimental predictions, revealing pathways towards realistic simulations of quantum chemical systems. Furthermore, we have released the code for the experiment, which uses OpenFermion, our open source repository for quantum computations of chemistry.

Google’s Sycamore processor mounted in a cryostat, recently used to demonstrate quantum supremacy and the largest quantum chemistry simulation on a quantum computer. Photo Credit: Rocco Ceselin

Developing an Error Robust Quantum Algorithm for Chemistry
There are a number of ways to use a quantum computer to simulate the ground state energy of a molecular system. In this work we focused on a quantum algorithm “building block”, or circuit primitive, and perfect its performance through a VQE (more on that later). In the classical setting this circuit primitive is equivalent to the Hartree-Fock model and is an important circuit component of an algorithm we previously developed for optimal chemistry simulations. This allows us to focus on scaling up without incurring exponential simulation costs to validate our device. Therefore, robust error mitigation on this component is crucial for accurate simulations when scaling to the “beyond classical” regime.

Errors in quantum computation emerge from interactions of the quantum circuitry with the environment, causing erroneous logic operations — even minor temperature fluctuations can cause qubit errors. Algorithms for simulating chemistry on near-term quantum devices must account for these errors with low overhead, both in terms of the number of qubits or additional quantum resources, such as implementing a quantum error correcting code. The most popular method to account for errors (and why we used it for our experiment) is to use a VQE. For our experiment, we selected the VQE we developed a few years ago, which treats the quantum processor like an neural network and attempts to optimize a quantum circuit’s parameters to account for noisy quantum logic by minimizing a cost function. Just like how classical neural networks can tolerate imperfections in data by optimization, a VQE dynamically adjusts quantum circuit parameters to account for errors that occur during the quantum computation.

Enabling High Accuracy with Sycamore
The experiment was run on the Sycamore processor that was recently used to demonstrate quantum supremacy. Though our experiment required fewer qubits, even higher quantum gate fidelity was needed to resolve chemical bonding. This led to the development of new, targeted calibration techniques that optimally amplify errors so they can be diagnosed and corrected.

Energy predictions of molecular geometries by the Hartree-Fock model simulated on 10 qubits of the Sycamore processor.

Errors in the quantum computation can originate from a variety of sources in the quantum hardware stack. Sycamore has 54-qubits and consists of over 140 individually tunable elements, each controlled with high-speed, analog electrical pulses. Achieving precise control over the whole device requires fine tuning more than 2,000 control parameters, and even small errors in these parameters can quickly add up to large errors in the total computation.

To accurately control the device, we use an automated framework that maps the control problem onto a graph with thousands of nodes, each of which represent a physics experiment to determine a single unknown parameter. Traversing this graph takes us from basic priors about the device to a high fidelity quantum processor, and can be done in less than a day. Ultimately, these techniques along with the algorithmic error mitigation enabled orders of magnitude reduction in the errors.

Left: The energy of a linear chain of Hydrogen atoms as the bond distance between each atom is increased. The solid line is the Hartree-Fock simulation with a classical computer while the points are computed with the Sycamore processor. Right: Two accuracy metrics (infidelity and mean absolute error) for each point computed with Sycamore. “Raw” is the non-error-mitigated data from Sycamore. “+PS” is data from a type of error mitigation correcting the number of electrons. “+Purification” is a type of error mitigation correcting for the right kind of state. “+VQE” is the combination of all the error mitigation along with variational relaxation of the circuit parameters. Experiments on H8, H10, and H12 show similar performance improvements upon error mitigation.

Pathways Forward
We hope that this experiment serves as a blueprint for how to run chemistry calculations on quantum processors, and as a jumping off point on the path to physical simulation advantage. One exciting prospect is that it is known how to modify the quantum circuits used in this experiment in a simple way such that they are no longer efficiently simulable, which would determine new directions for improved quantum algorithms and applications. We hope that the results from this experiment can be used to explore this regime by the broader research community. To run these experiments, you can find the code here.

Read More

Axial-DeepLab: Long-Range Modeling in All Layers for Panoptic Segmentation

Axial-DeepLab: Long-Range Modeling in All Layers for Panoptic Segmentation

Posted by Huiyu Wang, Student Researcher, and Yukun Zhu, Software Engineer, Google Research

The success of convolutional neural networks (CNNs) mainly comes from two properties of convolution: translation equivariance and locality. Translation equivariance, although not exact, ensures that the model functions well for objects at different positions in an image or for images of different sizes. Locality ensures efficient computation, but at the cost of making the modeling of long-range spatial relations challenging for panoptic segmentation of large images. For example, segmenting a large object requires modeling the shape of it, which could potentially cover a very large pixel area, and context that could be helpful for segmenting the object may come from farther away. In such cases, the inability to inform the model from context far from the convolution kernel could negatively impact the performance.

A rich set of literature has discussed approaches to solving the limitation of locality and enabling long-range interactions in CNNs. Some employ atrous convolutions, or image pyramids, which expand the receptive field somewhat, but it is still limited to a small local region. Another line of work adopts self-attention mechanisms, e.g., non-local neural networks, which allow the receptive field to cover the entire input image, as opposed to local convolutions. Unfortunately, such approaches are computationally expensive, especially for large inputs. Recent works enable building fully attentional models, but at a cost of applying local constraints to non-local neural networks. These restrictions limit the model receptive field, which is harmful to tasks such as segmentation, especially on high-resolution inputs.

In our recent ECCV 2020 paper, “Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation”, we propose to adopt axial-attention (or criss-cross attention), which recovers large receptive field in fully attentional models. The core idea is to separate 2D attention into two steps that apply 1D attention in the height and width axes sequentially. The efficiency of this approach enables attention over large regions, allowing models that learn long-range, or even global, interactions. Additionally, we propose a novel formulation for self-attention modules, which is more sensitive to the position of relevant context in a large receptive field with marginal costs. We evaluate our position-sensitive axial-attention method on panoptic segmentation by applying it to Panoptic-DeepLab, a simple and efficient method for panoptic segmentation. The effectiveness of our model is demonstrated on ImageNet, COCO, and Cityscapes. Axial-DeepLab achieves state-of-the-art results on panoptic segmentation and semantic segmentation, outperforming Panoptic-DeepLab by a large margin.

Axial-Attention Architecture
Axial-DeepLab consists of an Axial-ResNet backbone and Panoptic-DeepLab output heads, which produce panoptic segmentation results. Our Axial-ResNet is built on a ResNet architecture, in which all the 3×3 local convolutions in the ResNet bottleneck blocks are replaced by our proposed global position-sensitive axial-attention, thus enabling both a large receptive field and precise positional information.

An axial-attention block consists of two position-sensitive axial-attention layers operating along height- and width-axis sequentially.

The Axial-DeepLab height axial attention layer provides 1-dimensional self-attention globally, propagating information within individual columns — it does not transfer information between columns. The second 1D attention layer operating in the horizontal direction allows one to capture both column-wise and row-wise information. This separation reduces the complexity of self-attention from quadratic (2D) to linear (1D), which enables using a much larger (65×65 vs. previously 3×3) or even global context in all layers for long-range modeling in panoptic segmentation.

A message can be passed globally with two hops.

Note that a message or feature vector at (x1, y1) can always be passed globally on a 2D lattice to any position (x2, y2), with one hop on the height-axis (x1, y1 →x1, y2), followed by another hop on the width axis (x1, y2 → x2, y2). In this way, we are able to model 2D long-range relations in a single residual block. This axial-attention design also reduces the complexity from quadratic to linear and enables global receptive fields in all layers of a model.

Position-Sensitive Self-Attention
Additionally, we propose a position-sensitive formulation for self-attention. Previous self-attention formulations enabled a given pixel A to aggregate long-range context B, but provided no information about where in the receptive field the context originated. For example, perhaps the feature at pixel A represents the eye of a cat, and the context B might be the nose and another eye. In this case, the aggregated feature at pixel A would be a nose and two eyes, regardless of the geometric structure of a face. This could cause a false indication of the presence of a face when the two eyes are on the bottom-left of an image and the nose is on the top-right. A recently proposed solution is to impose a positional bias on where in the receptive field the context can originate. This bias depends on the feature at A only, (an eye), but not the feature at B, which contains important contextual information.

In this work, we let this bias also depend on the context feature at B (i.e., the nose and another eye). This change enables a more accurate positional bias when a pixel and the context informing it are far away from one another and thus contains different information about the bias. In addition, when pixel A aggregates the context feature B, we also include a feature that indicates the relative position from A to B. This change enables A to know precisely where B originated. These two changes make self-attention position-sensitive, especially in the situation of long-range modeling.

Results
We have tested Axial-DeepLab on COCO, and Cityscapes for panoptic segmentation. Improvements over the state-of-the-art Panoptic-DeepLab for each dataset can be seen in the table below. In particular, our Axial-DeepLab outperforms Panoptic-DeepLab by 2.8% Panoptic Quality (PQ) on the COCO test-dev set. Our single-scale small model performs better than multi-scale Panoptic-DeepLab while improving computational efficiency by 27x and using only 1/4 the number of parameters. We also show state-of-the-art results on Cityscapes. Moreover, we find that the performance increases as the block receptive field increases from 5 × 5 to 65 × 65. Our model is also more robust to out-of-distribution scales, on which the model was not trained.

Model     COCO     Citiscapes
Panoptic-DeepLab     39.7     65.3
Axial-DeepLab (ours)     43.4 (+3.7)     66.5 (+1.2)
Single scale comparison with Panoptic-DeepLab on validation sets

Besides our main results on panoptic segmentation, our full axial-attention model, Axial-ResNet, also performs better than the previous best stand-alone self-attention model on ImageNet.

Model     Params     M-Adds     Top-1
ResNet-50     25.6M     4.1B     76.9
Stand-Alone     18.0M     3.6B     77.6
Full Axial-Attention (ours)     12.5M     3.3B     78.1
Full Axial-Attention also works well on ImageNet.

Conclusion
We have proposed and demonstrated the effectiveness of position-sensitive axial-attention on image classification and panoptic segmentation. On ImageNet, our Axial-ResNet, formed by stacking axial-attention blocks, achieves state-of-the-art results among stand-alone self-attention models. We further convert Axial-ResNet to Axial-DeepLab for bottom-up panoptic segmentation, and also show state-of-the-art performance on several benchmarks, including COCO, and Cityscapes. We hope our promising results could establish that axial-attention is an effective building block for modern computer vision models.

Acknowledgements
This post reflects the work of the authors as well as Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. We also thank Niki Parmar for discussion and support; Ashish Vaswani, Xuhui Jia, Raviteja Vemulapalli, Zhuoran Shen for their insightful comments and suggestions; Maxwell Collins and Blake Hechtman for technical support.

Read More

An Analysis of Online Datasets Using Dataset Search (Published, in Part, as a Dataset)

An Analysis of Online Datasets Using Dataset Search (Published, in Part, as a Dataset)

Posted by Natasha Noy, Research Scientist; and Omar Benjelloun, Software Engineer, Google Research

There are tens of millions of datasets on the web, with content ranging from sensor data and government records, to results of scientific experiments and business reports. Indeed, there are datasets for almost anything one can imagine, be it diets of emperor penguins or where remote workers live. More than two years ago, we undertook an effort to design a search engine that would provide a single entry point to these millions of datasets and thousands of repositories. The result is Dataset Search, which we launched in beta in 2018 and fully launched in January 2020. In addition to facilitating access to data, Dataset Search reconciles and indexes datasets using the metadata descriptions that come directly from the dataset web pages using schema.org structure.

As of today, the complete Dataset Search corpus contains more than 31 million datasets from more than 4,600 internet domains. About half of these datasets come from .com domains, but .org and governmental domains are also well represented. The graph below shows the growth of the corpus over the last two years, and while we still don’t know what fraction of datasets on the web are currently in Dataset Search, the number continues to grow steadily.

Growth in the number of datasets indexed by Dataset Search

To better understand the breadth and utility of the datasets made available through Dataset Search, we published “Google Dataset Search by the Numbers”, accepted at the 2020 International Semantic Web Conference. Here we provide an overview of the available datasets, present metrics and insights originating from their analysis, and suggest best practices for publishing future scientific datasets. In order to enable other researchers to build analysis and tools using the metadata, we are also making a subset of the data publicly available.

A Range of Dataset Topics
In order to determine the distribution of topics covered by the datasets, we infer the research category based on dataset titles and descriptions, as well as other text on the dataset Web pages. The two most common topics are geosciences and social sciences, which account for roughly 45% of the datasets. Biology is a close third at ~15%, followed by a roughly even distribution for other topics, including computer science, agriculture, and chemistry, among others.

Distribution of dataset topics

In our initial efforts to launch Dataset Search, we reached out to specific communities, which was key to bootstrapping widespread use of the corpus. Initially, we focused on geosciences and social sciences, but since then, we have allowed the corpus to grow organically. We were surprised to see that the fields associated with the communities we reached out to early on are still dominating the corpus. While their early involvement certainly contributes to their prevalence, there may be other factors involved, such as differences in culture across communities. For instance, geosciences have been particularly successful in making their data findable, accessible, interoperable, and reusable (FAIR), a core component to reducing barriers for access.

Making Data Easily Citable and Reusable
There is a growing consensus among researchers across scientific disciplines that it is important to make datasets available, to publish details relevant to their use, and to cite them when they are used. Many funding agencies and academic publishers require proper publication and citation of data.

Peer-reviewed journals such as Nature Scientific Data are dedicated to publishing valuable datasets, and efforts such as DataCite provide digital object identifiers (DOIs) for them. Resolution services (e.g., identifiers.org) also provide persistent, de-referenceable identifiers, allowing for easy citation, which is key to making datasets widely available in scientific discourse. Unfortunately, we found that only about 11% of the datasets in the corpus (or ~3M) have DOIs. We chose this subset from the dataset corpus to be included in our open-source release. From this collection, about 2.3M datasets come from two sites, datacite.org and figshare.com:

Domain Datasets with DOIs
figshare.com 1,301K
datacite.org 1,070K
narcis.nl 118K
openaire.eu 100K
datadiscoverystudio.org 72K
osti.gov 63K
zenodo.org 50K
researchgate.net 41K
da-ra.de 40K

Publishers can specify access requirements for a dataset via schema.org metadata properties, including details of the license and information indicating whether or not the dataset is accessible for free. Only 34% of datasets specify license information, but when no license is specified, users cannot make any assumptions on whether or not they are allowed to reuse the data. Thus, adding licensing information, and, ideally, adding as open a license as possible, will greatly improve the reusability of the data.

Among the datasets that did specify a license, we were able to recognize a known license in 72% of cases. Those licenses include Open Government licenses for the UK and Canada, Creative Commons licenses, and several Public Domain licenses (e.g., Public Domain Mark 1.0). We found 89.5% of these datasets to either be accessible for free or use a license that allows redistribution, or both. And of these open datasets, 5.6M (91%) allow commercial reuse.

Another critical component of data reusability is providing downloadable data, yet only 44% of datasets specify download information in their metadata. A possible explanation for this surprisingly low value is that webmasters (or dataset-hosting platforms) fear that exposing the data download link through schema.org metadata may lead search engines or other applications to give their users direct access to download the data, thus “stealing” traffic from their website. Another concern may be that data needs the proper context to be used appropriately (e.g., methodology, footnotes, and license information), and providers feel that only their web pages can give the complete picture. In Dataset Search, we do not show download links as part of dataset metadata so that users must go to the publisher’s website to download the data, where they will see the full context for the dataset.

What Do Users Access?
Finally, we examine how Dataset Search is being used. Overall, 2.1M unique datasets from 2.6K domains appeared in the top 100 Dataset Search results over 14 days in May 2020. We find that the distribution of topics being queried is different from that of the corpus as a whole. For instance, geoscience takes up a much smaller fraction, and conversely, biology and medicine represent a larger fraction relative to their share of the corpus. This result is likely explained by the timing of our analysis, as it was performed during the first weeks of the COVID-19 pandemic.

Distribution of topics covered by datasets that appear in search results

Best Practices for Publishing Scientific Datasets
Based on our analysis, we have identified a set of best practices that can improve how datasets are discovered, reused and cited.

  • Discoverability
    Dataset metadata should be on pages that are accessible to web crawlers and that provide metadata in machine-readable formats in order to improve discoverability.

  • Persistence
    Publishing metadata on sites that are likely to be more persistent than personal web pages will facilitate data reuse and citation. Indeed, during our analysis of Dataset Search, we noted a very high rate of turnover — many URLs that hosted a dataset one day did not have it a few weeks or months later. Data repositories, such as Figshare, Zenodo, DataDryad, Kaggle Datasets and many others, are a good way to ensure dataset persistence. Many of these repositories have agreements with libraries to preserve data in perpetuity.

  • Provenance
    With datasets often published in multiple repositories, it would be useful for repositories to describe the provenance information more explicitly in the metadata. The provenance information helps users understand who collected the data, where the primary source of the dataset is, or how it might have changed.

  • Licensing
    Datasets should include licensing information, ideally in a machine-readable format. Our analysis indicates that when dataset providers select a license, they tend to choose a fairly open one. So, encouraging and enabling scientists to choose licenses for their data will result in many more datasets being openly available.

  • Assigning persistent identifiers (such as DOIs)
    DOIs are critical for long-term tracking and useability. Not only do these identifiers allow for much easier citation of datasets and version tracking, they are also dereferenceable: if a dataset moves, the identifier can point to a different location.

Releasing Metadata for Datasets with Persistent Identifiers
As part of the announcement today, we are also releasing a subset of our corpus for others to use. It contains the metadata for more than three million datasets that have DOIs and other types of persistent identifiers –- these are the datasets that are the most easily citable. Researchers can use this metadata to perform deeper analysis or to build their own applications using this data. For example, much of the growth of DOI usage appears to have been within the last decade. How does this timeframe relate to the datasets covered in the corpus? Is the DOI usage distribution uniform across datasets, or are there significant differences between research communities?

We will update the dataset on a regular basis. Finally, we hope that focusing this data release on datasets with persistent citable identifiers will encourage more data providers to describe their datasets in more detail and to make them more easily citable.

In conclusion, we hope that having data more discoverable through tools such as Google’s Dataset Search will encourage scientists to share their data more broadly and do it in a way that makes data truly FAIR.

Acknowledgments
This post reflects the work of the entire Dataset Search team. We are grateful to Shiyu Chen, Dimitris Paparas, Katrina Sostek, Yale Cong, Marc Najork, and Chris Gorgolewski for their contributions. We would also like to thank Hal Varian for suggesting this analysis and for many helpful ideas.

Read More