SmartReply for YouTube Creators

SmartReply for YouTube Creators

Posted by Rami Al-Rfou, Research Scientist, Google Research

It has been more than 4 years since SmartReply was launched, and since then, it has expanded to more users with the Gmail launch and Android Messages and to more devices with Android Wear. Developers now use SmartReply to respond to reviews within the Play Developer Console and can set up their own versions using APIs offered within MLKit and TFLite. With each launch there has been a unique challenge in modeling and serving that required customizing SmartReply for the task requirements.

We are now excited to share an updated SmartReply built for YouTube and implemented in YouTube Studio that helps creators engage more easily with their viewers. This model learns comment and reply representation through a computationally efficient dilated self-attention network, and represents the first cross-lingual and character byte-based SmartReply model. SmartReply for YouTube is currently available for English and Spanish creators, and this approach simplifies the process of extending the SmartReply feature to many more languages in the future.

YouTube creators receive a large volume of responses to their videos. Moreover, the community of creators and viewers on YouTube is diverse, as reflected by the creativity of their comments, discussions and videos. In comparison to emails, which tend to be long and dominated by formal language, YouTube comments reveal complex patterns of language switching, abbreviated words, slang, inconsistent usage of punctuation, and heavy utilization of emoji. Following is a sample of comments that illustrate this challenge:

Deep Retrieval
The initial release of SmartReply for Inbox encoded input emails word-by-word with a recurrent neural network, and then decoded potential replies with yet another word-level recurrent neural network. Despite the expressivity of this approach, it was computationally expensive. Instead, we found that one can achieve the same ends by designing a system that searches through a predefined list of suggestions for the most appropriate response.

This retrieval system encoded the message and its suggestion independently. First, the text was preprocessed to extract words and short phrases. This preprocessing included, but was not limited to, language identification, tokenization, and normalization. Two neural networks then simultaneously and independently encoded the message and the suggestion. This factorization allowed one to pre-compute the suggestion encodings and then search through the set of suggestions using an efficient maximum inner product search data structure. This deep retrieval approach enabled us to expand SmartReply to Gmail and since then, it has been the foundation for several SmartReply systems including the current YouTube system.

Beyond Words
The previous SmartReply systems described above relied on word level preprocessing that is well tuned for a limited number of languages and narrow genres of writing. Such systems face significant challenges in the YouTube case, where a typical comment might include heterogeneous content, like emoji, ASCII art, language switching, etc. In light of this, and taking inspiration from our recent work on byte and character language modeling, we decided to encode the text without any preprocessing. This approach is supported by research demonstrating that a deep Transformer network is able to model words and phrases from the ground up just by feeding it text as a sequence of characters or bytes, with comparable quality to word-based models.

Although initial results were promising, especially for processing comments with emoji or typos, the inference speed was too slow for production due to the fact that character sequences are longer than word equivalents and the computational complexity of self-attention layers grows quadratically as a function of sequence length. We found that shrinking the sequence length by applying temporal reduction layers at each layer of the network, similar to the dilation technique applied in WaveNet, provides a good trade-off between computation and quality.

The figure below presents a dual encoder network that encodes both the comment and the reply to maximize the mutual information between their latent representations by training the network with a contrastive objective. The encoding starts with feeding the transformer a sequence of bytes after they have been embedded. The input for each subsequent layer will be reduced by dropping a percentage of characters at equal offsets. After applying several transformer layers the sequence length is greatly truncated, significantly reducing the computational complexity. This sequence compression scheme could be substituted by other operators such as average pooling, though we did not notice any gains from more sophisticated methods, and therefore, opted to use dilation for simplicity.

A dual encoder network that maximizes the mutual information between the comments and their replies through a contrastive objective. Each encoder is fed a sequence of bytes and is implemented as a computationally efficient dilated transformer network.

A Model to Learn Them All
Instead of training a separate model for each language, we opted to train a single cross-lingual model for all supported languages. This allows the support of mixed-language usage in the comments, and enables the model to utilize the learning of common elements in one language for understanding another, such as emoji and numbers. Moreover, having a single model simplifies the logistics of maintenance and updates. While the model has been rolled out to English and Spanish, the flexibility inherent in this approach will enable it to be expanded to other languages in the future.

Inspecting the encodings of a multilingual set of suggestions produced by the model reveals that the model clusters appropriate replies, regardless of the language to which they belong. This cross-lingual capability emerged without exposing the model during training to any parallel corpus. We demonstrate in the figure below for three languages how the replies are clustered by their meaning when the model is probed with an input comment. For example, the English comment “This is a great video,” is surrounded by appropriate replies, such as “Thanks!” Moreover, inspection of the nearest replies in other languages reveal them also to be appropriate and similar in meaning to the English reply. The 2D projection also shows several other cross-lingual clusters that consist of replies of similar meaning. This clustering demonstrates how the model can support a rich cross-lingual user experience in the supported languages.

A 2D projection of the model encodings when presented with a hypothetical comment and a small list of potential replies. The neighborhood surrounding English comments (black color) consists of appropriate replies in English and their counterparts in Spanish and Arabic. Note that the network learned to align English replies with their translations without access to any parallel corpus.

When to Suggest?
Our goal is to help creators, so we have to make sure that SmartReply only makes suggestions when it is very likely to be useful. Ideally, suggestions would only be displayed when it is likely that the creator would reply to the comment and when the model has a high chance of providing a sensible and specific response. To accomplish this, we trained auxiliary models to identify which comments should trigger the SmartReply feature.

Conclusion
We’ve launched YouTube SmartReply, starting with English and Spanish comments, the first cross-lingual and character byte-based SmartReply. YouTube is a global product with a diverse user base that generates heterogeneous content. Consequently, it is important that we continuously improve comments for this global audience, and SmartReply represents a strong step in this direction.

Acknowledgements
SmartReply for YouTube creators was developed by Golnaz Farhadi, Ezequiel Baril, Cheng Lee, Claire Yuan, Coty Morrison‎, Joe Simunic‎, Rachel Bransom‎, Rajvi Mehta, Jorge Gonzalez‎, Mark Williams, Uma Roy and many more. We are grateful for the leadership support from Nikhil Dandekar, Eileen Long, Siobhan Quinn, Yun-hsuan Sung, Rachel Bernstein, and Ray Kurzweil.

SpineNet: A Novel Architecture for Object Detection Discovered with Neural Architecture Search

SpineNet: A Novel Architecture for Object Detection Discovered with Neural Architecture Search

Posted by Xianzhi Du, Software Engineer and Jaeyoun Kim, Technical Program Manager, Google Research

Convolutional neural networks created for image tasks typically encode an input image into a sequence of intermediate features that capture the semantics of an image (from local to global), where each subsequent layer has a lower spatial dimension. However, this scale-decreased model may not be able to deliver strong features for multi-scale visual recognition tasks where recognition and localization are both important (e.g., object detection and segmentation). Several works including FPN and DeepLabv3+ propose multi-scale encoder-decoder architectures to address this issue, where a scale-decreased network (e.g., a ResNet) is taken as the encoder (commonly referred to as a backbone model). A decoder network is then applied to the backbone to recover the spatial information.

While this architecture has yielded improved success for image recognition and localization tasks, it still relies on a scale-decreased backbone that throws away spatial information by down-sampling, which the decoder then must attempt to recover. What if one were to design an alternate backbone model that avoids this loss of spatial information, and is thus inherently well-suited for simultaneous image recognition and localization?

In our recent CVPR 2020 paper “SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization”, we propose a meta architecture called a scale-permuted model that enables two major improvements on backbone architecture design. First, the spatial resolution of intermediate feature maps should be able to increase or decrease anytime so that the model can retain spatial information as it grows deeper. Second, the connections between feature maps should be able to go across feature scales to facilitate multi-scale feature fusion. We then use neural architecture search (NAS) with a novel search space design that includes these features to discover an effective scale-permuted model. We demonstrate that this model is successful in multi-scale visual recognition tasks, outperforming networks with standard, scale-reduced backbones. To facilitate continued work in this space, we have open sourced the SpineNet code to the Tensorflow TPU GitHub repository in Tensorflow 1 and TensorFlow Model Garden GitHub repository in Tensorflow 2.

A scale-decreased backbone is shown on the left and a scale-permuted backbone is shown on the right. Each rectangle represents a building block. Colors and shapes represent different spatial resolutions and feature dimensions. Arrows represent connections among building blocks.

Design of SpineNet Architecture
In order to efficiently design the architecture for SpineNet, and avoid a time-intensive manual search of what is optimal, we leverage NAS to determine an optimal architecture. The backbone model is learned on the object detection task using the COCO dataset, which requires simultaneous recognition and localization. During architecture search, we learn three things:

  • Scale permutations: The orderings of network building blocks are important because each block can only be built from those that already exist (i.e., with a “lower ordering”). We define the search space of scale permutations by rearranging intermediate and output blocks, respectively.
  • Cross-scale connections: We define two input connections for each block in the search space. The parent blocks can be any block with a lower ordering or a block from the stem network.
  • Block adjustments (optional): We allow the block to adjust its scale level and type.
The architecture search process from a scale-decreased backbone to a scale-permuted backbone.

Taking the ResNet-50 backbone as the seed for the NAS search, we first learn scale-permutation and cross-scale connections. All candidate models in the search space have roughly the same computation as ResNet-50 since we just permute the ordering of feature blocks to obtain candidate models. The learned scale-permuted model outperforms ResNet-50-FPN by +2.9% average precision (AP) in the object detection task. The efficiency can be further improved (-10% FLOPs) by adding search options to adjust scale and type (e.g., residual block or bottleneck block, used in the ResNet model family) of each candidate feature block.

We name the learned 49-layer scale-permuted backbone architecture SpineNet-49. SpineNet-49 can be further scaled up to SpineNet-96/143/190 by repeating blocks two, three, or four times and increasing the feature dimension. An architecture comparison between ResNet-50-FPN and the final SpineNet-49 is shown below.

The architecture comparison between a ResNet backbone (left) and the SpineNet backbone (right) derived from it using NAS.

Performance
We demonstrate the performance of SpineNet models through comparison with ResNet-FPN. Using similar building blocks, SpineNet models outperform their ResNet-FPN counterparts by ~3% AP at various scales while using 10-20% fewer FLOPs. In particular, our largest model, SpineNet-190, achieves 52.1% AP on COCO for a single model without multi-scale testing during inference, significantly outperforming prior detectors. SpineNet also transfers to classification tasks, achieving 5% top-1 accuracy improvement on the challenging iNaturalist fine-grained dataset.

Performance comparisons of SpineNet models and ResNet-FPN models adopting the RetinaNet detection framework on COCO bounding box detection.
Performance comparisons of SpineNet models and ResNet models on ImageNet classification and iNaturalist fine-grained image classification.

Conclusion
In this work, we identify that the conventional scale-decreased model, even with a decoder network, is not effective for simultaneous recognition and localization. We propose the scale-permuted model, a new meta-architecture, to address the issue. To prove the effectiveness of scale-permuted models, we learn SpineNet by Neural Architecture Search in object detection and demonstrate it can be used directly in image classification. In the future, we hope the scale-permuted model will become the meta-architecture design of backbones across many visual tasks beyond detection and classification.

Acknowledgements
Special thanks to the co-authors of the paper: Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, and Xiaodan Song. We also would like to acknowledge Yeqing Li, Youlong Cheng, Jing Li, Jianwei Xie, Russell Power, Hongkun Yu, Chad Richards, Liang-Chieh Chen, Anelia Angelova, and the larger Google Brain Team for their help.

Leveraging Temporal Context for Object Detection

Leveraging Temporal Context for Object Detection

Posted by Sara Beery, Student Researcher, and Jonathan Huang, Research Scientist, Google Research

Ecological monitoring helps researchers to understand the dynamics of global ecosystems, quantify biodiversity, and measure the effects of climate change and human activity, including the efficacy of conservation and remediation efforts. In order to monitor effectively, ecologists need high-quality data, often expending significant efforts to place monitoring sensors, such as static cameras, in the field. While it is increasingly cost effective to build and operate networks of such sensors, the manual data analysis of global biodiversity data remains a bottleneck to accurate, global, real-time ecological monitoring. While there are ways to automate this analysis via machine learning, the data from static cameras, widely used to monitor the world around us for purposes ranging from mountain pass road conditions to ecosystem phenology, still pose a strong challenge for traditional computer vision systems — due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and sometimes are irregular due to the use of a motion trigger.

In order to perform well in this setting, computer vision models must be robust to objects of interest that are often off-center, out of focus, poorly lit, or at a variety of scales. In addition, a static camera will always take images of the same scene unless it is moved, which causes the data from any one camera to be highly repetitive. Without sufficient data variability, machine learning models may learn to focus on correlations in the background, leading to poor generalization to novel deployments. The machine learning and ecological communities have been working together through venues like LILA BC and Wildlife Insights to curate expert-labeled training data from many research groups, each of which may operate anywhere from one to hundreds of camera traps, in order to increase data variability. This process of data collection and annotation is slow, and is confounded by the need to have diverse, representative data across geographic regions and taxonomic groups.

What’s in this image? Objects in images from static cameras can be very challenging to detect and categorize. Here, a foggy morning has made it very difficult to see a herd of wildebeest walking along the crest of a hill. [Image from Snapshot Serengeti]

In Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection, we present a complementary approach that increases global scalability by improving generalization to novel camera deployments algorithmically. This new object detection architecture leverages contextual clues across time for each camera deployment in a network, improving recognition of objects in novel camera deployments without relying on additional training data from a large number of cameras. Echoing the approach a person might use when faced with challenging images, Context R-CNN leverages up to a month’s worth of images from the same camera for context to determine what objects might be present and identify them. Using this method, the model outperforms a single-frame Faster R-CNN baseline by significant margins across multiple domains, including wildlife camera traps. We have open sourced the code and models for this work as part of the TF Object Detection API to make it easy to train and test Context R-CNN models on new static camera datasets.

Here, we can see how additional examples from the same scene help experts determine that the object is an animal and not background. Context such as the shape & size of the object, its attachment to a herd, and habitual grazing at certain times of day help determine that the species is a wildebeest. Useful examples occur throughout the month.

The Context R-CNN Model
Context R-CNN is designed to take advantage of the high degree of correlation within images taken by a static camera to boost performance on challenging data and improve generalization to new camera deployments without additional human data labeling. It is an adaptation of Faster R-CNN, a popular two-stage object detection architecture. To extract context for a camera, we first use a frozen feature extractor to build up a contextual memory bank from images across a large time horizon (up to a month or more). Next, objects are detected in each image using Context R-CNN which aggregates relevant context from the memory bank to help detect objects under challenging conditions (such as the heavy fog obscuring the wildebeests in our previous example). This aggregation is performed using attention, which is robust to the sparse and irregular sampling rates often seen in static monitoring cameras.

High-level architecture diagram, showing how Context R-CNN incorporates long-term context within the Faster R-CNN model architecture.

The first stage of Faster R-CNN proposes potential objects, and the second stage categorizes each proposal as either background or one of the target classes. In Context R-CNN, we take the proposed objects from the first stage of Faster R-CNN, and for each one we use similarity-based attention to determine how relevant each of the features in our memory bank (M) is to the current object, and construct a per-object context feature by taking a relevance-weighted sum over M and adding it back to the original object features. Then each object, now with added contextual information, is finally categorized using the second stage of Faster R-CNN.

Context R-CNN is able to leverage context (spanning up to 1 month) to correctly categorize the challenging wildebeest example we saw above. The green values are the corresponding attention weights for each boxed object.
Compared to a Faster R-CNN baseline (left), Context R-CNN (right) is able to capture challenging objects such as an elephant occluded by a tree, two poorly-lit impala, and a vervet monkey leaving the frame. [Images from Snapshot Serengeti]

Results
We have tested Context R-CNN on Snapshot Serengeti (SS) and Caltech Camera Traps (CCT), both ecological datasets of animal species in camera traps but from highly different geographic regions (Tanzania vs. the Southwestern United States). Improvements over the Faster R-CNN baseline for each dataset can be seen in the table below. Notably, we see a 47.5% relative increase in mean average precision (mAP) on SS, and a 34.3% relative mAP increase on CCT. We also compare Context R-CNN to S3D (a 3D convolution based baseline) and see performance improve from 44.7% mAP to 55.9% mAP (a 25.1% relative increase). Finally, we find that the performance increases as the contextual time horizon increases, from a minute of context to a month.

Comparison to a single frame Faster R-CNN baseline, showing both mean average precision (mAP) and average recall (AR) detection metrics.

Ongoing and Future Work
We are working to implement Context R-CNN within the Wildlife Insights platform, to facilitate large-scale, global ecological monitoring via camera traps. We also host competitions such as the yearly iWildCam species identification competition at the CVPR Fine-Grained Visual Recognition Workshop to help bring these challenges to the attention of the computer vision community. The challenges seen in automatic species identification in static cameras are shared by numerous applications of static cameras outside of the ecological monitoring domain, as well as other static sensors used to monitor biodiversity, such as audio and sonar devices. Our method is general, and we anticipate the per-sensor context approach taken by Context R-CNN would be beneficial for any static sensor.

Acknowledgements
This post reflects the work of the authors as well as the following group of core contributors: Vivek Rathod, Guanhang Wu, Ronny Votel. We are also grateful to Zhichao Lu, David Ross, Tanya Birch and the Wildlife Insights AI team, and Pietro Perona and the Caltech Computational Vision Lab.

Sensing Force-Based Gestures on the Pixel 4

Sensing Force-Based Gestures on the Pixel 4

Posted by Philip Quinn and Wenxin Feng, Research Scientists, Android UX

Touch input has traditionally focussed on two-dimensional finger pointing. Beyond tapping and swiping gestures, long pressing has been the main alternative path for interaction. However, a long press is sensed with a time-based threshold where a user’s finger must remain stationary for 400–500 ms. By its nature, a time-based threshold has negative effects for usability and discoverability as the lack of immediate feedback disconnects the user’s action from the system’s response. Fortunately, fingers are dynamic input devices that can express more than just location: when a user touches a surface, their finger can also express some level of force, which can be used as an alternative to a time-based threshold.

While a variety of force-based interactions have been pursued, sensing touch force requires dedicated hardware sensors that are expensive to design and integrate. Further, research indicates that touch force is difficult for people to control, and so most practical force-based interactions focus on discrete levels of force (e.g., a soft vs. firm touch) — which do not require the full capabilities of a hardware force sensor.

For a recent update to the Pixel 4, we developed a method for sensing force gestures that allowed us to deliver a more expressive touch interaction experience By studying how the human finger interacts with touch sensors, we designed the experience to complement and support the long-press interactions that apps already have, but with a more natural gesture. In this post we describe the core principles of touch sensing and finger interaction, how we designed a machine learning algorithm to recognise press gestures from touch sensor data, and how we integrated it into the user experience for Pixel devices.

Touch Sensor Technology and Finger Biomechanics
A capacitive touch sensor is constructed from two conductive electrodes (a drive electrode and a sense electrode) that are separated by a non-conductive dielectric (e.g., glass). The two electrodes form a tiny capacitor (a cell) that can hold some charge. When a finger (or another conductive object) approaches this cell, it ‘steals’ some of the charge, which can be measured as a drop in capacitance. Importantly, the finger doesn’t have to come into contact with the electrodes (which are protected under another layer of glass) as the amount of charge stolen is inversely proportional to the distance between the finger and the electrodes.

Left: A finger interacts with a touch sensor cell by ‘stealing’ charge from the projected field around two electrodes. Right: A capacitive touch sensor is constructed from rows and columns of electrodes, separated by a dielectric. The electrodes overlap at cells, where capacitance is measured.

The cells are arranged as a matrix over the display of a device, but with a much lower density than the display pixels. For instance, the Pixel 4 has a 2280 × 1080 pixel display, but a 32 × 15 cell touch sensor. When scanned at a high resolution (at least 120 Hz), readings from these cells form a video of the finger’s interaction.

Slowed touch sensor recordings of a user tapping (left), pressing (middle), and scrolling (right).

Capacitive touch sensors don’t respond to changes in force per se, but are tuned to be highly sensitive to changes in distance within a couple of millimeters above the display. That is, a finger contact on the display glass should saturate the sensor near its centre, but will retain a high dynamic range around the perimeter of the finger’s contact (where the finger curls up).

When a user’s finger presses against a surface, its soft tissue deforms and spreads out. The nature of this spread depends on the size and shape of the user’s finger, and its angle to the screen. At a high level, we can observe a couple of key features in this spread (shown in the figures): it is asymmetric around the initial contact point, and the overall centre of mass shifts along the axis of the finger. This is also a dynamic change that occurs over some period of time, which differentiates it from contacts that have a long duration or a large area.

Touch sensor signals are saturated around the centre of the finger’s contact, but fall off at the edges. This allows us to sense small deformations in the finger’s contact shape caused by changes in the finger’s force.

However, the differences between users (and fingers) makes it difficult to encode these observations with heuristic rules. We therefore designed a machine learning solution that would allow us to learn these features and their variances directly from user interaction samples.

Machine Learning for Touch Interaction
We approached the analysis of these touch signals as a gesture classification problem. That is, rather than trying to predict an abstract parameter, such as force or contact spread, we wanted to sense a press gesture — as if engaging a button or a switch. This allowed us to connect the classification to a well-defined user experience, and allowed users to perform the gesture during training at a comfortable force and posture.

Any classification model we designed had to operate within users’ high expectations for touch experiences. In particular, touch interaction is extremely latency-sensitive and demands real-time feedback. Users expect applications to be responsive to their finger movements as they make them, and application developers expect the system to deliver timely information about the gestures a user is performing. This means that classification of a press gesture needs to occur in real-time, and be able to trigger an interaction at the moment the finger’s force reaches its apex.

We therefore designed a neural network that combined convolutional (CNN) and recurrent (RNN) components. The CNN could attend to the spatial features we observed in the signal, while the RNN could attend to their temporal development. The RNN also helps provide a consistent runtime experience: each frame is processed by the network as it is received from the touch sensor, and the RNN state vectors are preserved between frames (rather than processing them in batches). The network was intentionally kept simple to minimise on-device inference costs when running concurrently with other applications (taking approximately 50 µs of processing per frame and less than 1 MB of memory using TensorFlow Lite).

An overview of the classification model’s architecture.

The model was trained on a dataset of press gestures and other common touch interactions (tapping, scrolling, dragging, and long-pressing without force). As the model would be evaluated after each frame, we designed a loss function that temporally shaped the label probability distribution of each sample, and applied a time-increasing weight to errors. This ensured that the output probabilities were temporally smooth and converged towards the correct gesture classification.

User Experience Integration
Our UX research found that it was hard for users to discover force-based interactions, and that users frequently confused a force press with a long press because of the difficulty in coordinating the amount of force they were applying with the duration of their contact. Rather than creating a new interaction modality based on force, we therefore focussed on improving the user experience of long press interactions by accelerating them with force in a unified press gesture. A press gesture has the same outcome as a long press gesture, whose time threshold remains effective, but provides a stronger connection between the outcome and the user’s action when force is used.

A user long pressing (left) and firmly pressing (right) on a launcher icon.

This also means that users can take advantage of this gesture without developers needing to update their apps. Applications that use Android’s GestureDetector or View APIs will automatically get these press signals through their existing long-press handlers. Developers that implement custom long-press detection logic can receive these press signals through the MotionEvent classification API introduced in Android Q.

Through this integration of machine-learning algorithms and careful interaction design, we were able to deliver a more expressive touch experience for Pixel users. We plan to continue researching and developing these capabilities to refine the touch experience on Pixel, and explore new forms of touch interaction.

Acknowledgements
This project is a collaborative effort between the Android UX, Pixel software, and Android framework teams.

Presenting a Challenge and Workshop in Efficient Open-Domain Question Answering

Presenting a Challenge and Workshop in Efficient Open-Domain Question Answering

Posted by Eunsol Choi, Visiting Faculty Researcher and Tom Kwiatkowski, Research Scientist, Google Research

One of the primary goals of natural language processing is to build systems that can answer a user’s questions. To do this, computers need to be able to understand questions, represent world knowledge, and reason their way to answers. Traditionally, answers have been retrieved from a collection of documents or a knowledge graph. For example, to answer the question, “When was the declaration of independence officially signed?” a system might first find the most relevant article from Wikipedia, and then locate a sentence containing the answer, “August 2, 1776”. However, more recent approaches, like T5, have also shown that neural models, trained on large amounts of web-text, can also answer questions directly, without retrieving documents or facts from a knowledge graph. This has led to significant debate about how knowledge should be stored for use by our question answering systems — in human readable text and structured formats, or in the learned parameters of a neural network.

Today, we are proud to announce the EfficientQA competition and workshop at NeurIPS 2020, organized in cooperation with Princeton University and the University of Washington. The goal is to develop an end-to-end question answering system that contains all of the knowledge required to answer open-domain questions. There are no constraints on how the knowledge is stored — it could be in documents, databases, the parameters of a neural network, or any other form — but entries will be evaluated based on the number of bytes used to access this knowledge, including code, corpora, and model parameters. There will also be an unconstrained track, in which the goal is to achieve the best possible question answering performance regardless of system size. To build small, yet robust systems, participants will have to explore new methods of knowledge representation and reasoning.

An illustration of how the memory budget changes as a neural network and retrieval corpus grow and shrink. It is possible that successful systems will also use other resources such as a knowledge graph.

Competition Overview
The competition will be evaluated using the open-domain variant of the Natural Questions dataset. We will also provide further human evaluation of all the top performing entries to account for the fact that there are many correct ways to answer a question, not all of which will be covered by any set of reference answers. For example, for the question “What type of car is a Jeep considered?” both “off-road vehicles” and “crossover SUVs” are valid answers.

The competition is divided between four separate tracks: best performing system under 500 Mb; best performing system under 6 Gb; smallest system to get at least 25% accuracy; and the best performing system with no constraints. The winners of each of these tracks will be invited to present their work during the competition track at NeurIPS 2020, which will be hosted virtually. We will also put each of the winning systems up against human trivia experts (the 2017 NeurIPS Human-Computer competition featured Jeopardy! and Who Wants to Be a Millionaire champions) in a real-time contest at the virtual conference.

Participation
To participate, go to the competition site where you will find the data and evaluation code available for download, as well as dates and instructions on how to participate, and a sign-up form for updates. Along with our academic collaborators, we have provided some example systems to help you get started.

We believe that the field of natural language processing will benefit from a greater exploration and comparison of small system question answering options. We hope that by encouraging the development of very small systems, this competition will pave the way for on-device question answering.

Acknowledgements
Creating this challenge and workshop has been a large team effort including Adam Roberts, Colin Raffel, Chris Alberti, Jordan Boyd-Graber, Jennimaria Palomaki, Kenton Lee, Kelvin Guu, and Michael Collins from Google; as well as Sewon Min and Hannaneh Hajishirzi from the University of Washington; and Danqi Chen from Princeton University.

RepNet: Counting Repetitions in Videos

RepNet: Counting Repetitions in Videos

Posted by Debidatta Dwibedi, Research Scientist, Robotics at Google

Repeating processes ranging from natural cycles, such as phases of the moon or heartbeats and breathing, to artificial repetitive processes, like those found on manufacturing lines or in traffic patterns, are commonplace in our daily lives. Beyond just their prevalence, repeating processes are of interest to researchers for the variety of insights one can tease out of them. It may be that there is an underlying cause behind something that happens multiple times, or there may be gradual changes in a scene that may be useful for understanding. Sometimes, repeating processes provide us with unambiguous “action units”, semantically meaningful segments that make up an action. For example, if a person is chopping an onion, the action unit is the manipulation action that is repeated to produce additional slices. These units may be indicative of more complex activity and may allow us to analyze more such actions automatically at a finer time-scale without having a person annotate these units. For the above reasons, perceptual systems that aim to observe and understand our world for an extended period of time will benefit from a system that understands general repetitions.

In “Counting Out Time: Class Agnostic Video Repetition Counting in the Wild”, we present RepNet, a single model that can understand a broad range of repeating processes, ranging from people exercising or using tools, to animals running and birds flapping their wings, pendulums swinging, and a wide variety of others. In contrast to our previous work, which used cycle-consistency constraints across different videos of the same action to understand them at a fine-grained level, in this work we present a system that can recognize repetitions within a single video. Along with this model, we are releasing a dataset to benchmark class-agnostic counting in videos and a Colab notebook to run RepNet.

RepNet
RepNet is a model that takes as input a video that contains periodic action of a variety of classes (including those unseen during training) and returns the period of repetitions found therein. In the past the problem of repetition counting has been addressed by directly comparing pixel intensities in frames, but real world videos have camera motion, occlusion by objects in the field, drastic scale difference and changes in form, which necessitates learning of features invariant to such noise. To accomplish this we train a machine learning model in an end-to-end manner to directly estimate the period of the repetitions. The model consists of three parts: a frame encoder, an intermediate representation, called a temporal self-similarity matrix (which we will describe below), and a period predictor.

First, the frame encoder uses the ResNet architecture as a per-frame model to generate embeddings of each frame of the video The ResNet architecture was chosen since it has been successful for a number of image and video tasks. Passing each frame of a video through a ResNet-based encoder yields a sequence of embeddings.

At this point we calculate a temporal self-similarity matrix (TSM) by comparing each frame’s embedding with every other frame in the video, returning a matrix that is easy for subsequent modules to analyze for counting repetitions. This process surfaces self-similarities in the stream of video frames that enable period estimation, as demonstrated in the video below.

Demonstration of how the TSM processes images of the Earth’s day-night cycle.

For each frame, we then use Transformers to predict the period of repetition and the periodicity (i.e., whether or not a frame is part of the periodic process) directly from the sequence of similarities in the TSM. Once we have the period, we obtain the per-frame count by dividing the number of frames captured in a periodic segment by the period length. We sum this up to predict the number of repetitions in the video.

Overview of the RepNet model.

Temporal Self-Similarity Matrix
The example of the TSM from the day-night cycle, shown above, is derived from an idealized scenario with fixed period repetitions. TSMs from real videos often reveal fascinating structures in the world, as demonstrated in the three examples below. Jumping jacks are close to the ideal periodic action with a fixed period, while in contrast, the period of a bouncing ball declines as the ball loses energy through repeated bounces. The video of someone mixing concrete demonstrates repetitive action that is preceded and followed by a period without motion. These three behaviors are clearly distinguished in the learned TSM, which requires that the model pay attention to fine changes in the scene.

Jumping Jacks (constant period; video from Kinetics), Bouncing ball (decreasing period; Kinetics), Mixing concrete (aperiodic segments present in video; PERTUBE dataset).

One advantage of using the TSM as an intermediate layer in RepNet is that the subsequent processing by the transformers is done in the self-similarity space and not in the feature space. This encourages generalization to unseen classes. For example, the TSMs produced by actions as different as jumping jacks or swimming are similar as long as the action was repeated at a similar pace. This allows us to train on some classes and yet expect generalization to unseen classes.

Data
One way to train the above model would be to collect a large dataset of videos that capture repetitive activities and label them with the repetition count. The challenge in this is two-fold. First, it requires one to examine a large number of videos to identify those with repeated actions. Following that, each video must be annotated with the number of times an action was repeated. While for certain tasks annotators can skip frames (for example, to classify a video as showing jumping jacks), they still need to see the entire video in order to count how many jumping jacks were performed.

We overcome this challenge by introducing a process for synthetic data generation that produces videos with repetitions using videos that may not contain repeating actions at all. This is accomplished by randomly selecting a segment of the video to repeat an arbitrary number of times, bookended by the original video context.

Our synthetic data generation pipeline that produces videos with repetitions from any video.

While this process generates a video that resembles a natural-looking video with repeating processes, it is still too simple for deep learning methods, which can learn to cheat by looking for artifacts, instead of learning to recognize repetitions. To address this, we perform extreme data augmentation, which we call camera motion augmentation. In this method, we modify the video to simulate a camera that smoothly moves around using 2D affine motion as the video progresses.

Left: An example of a synthetic repeating video generated from a random video. Right: An example of a video with camera motion augmentation, which is tougher for the model, but results in better generalization to real repeating videos (both from Kinetics).

Evaluation
Even though we can train a model on synthetic repeating videos, the resulting models must be able to generalize to real video of repeating processes. In order to evaluate the performance of the trained models on real videos, we collect a dataset of ~9000 videos from the Kinetics dataset. These videos span many action classes and capture diverse scenes, arising from the diversity of data seen on Youtube. We annotate these videos with the count of the action being repeated in the video. To encourage further research in this field, we are releasing the count annotations for this dataset, which we call Countix.

Applications
A class-agnostic counting model has many useful applications. RepNet serves as a single model that can count repetitions from many different domains:

RepNet can count repeated activities from a range of domains, such as slicing onions (left; video from Kinetics dataset), Earth’s diurnal cycle (middle; Himawari satellite data), or even a cheetah in motion (right; video from imgur.com).

RepNet could be used to estimate heartbeat rates from echocardiogram videos even though it has not seen such videos in training:

Predicted heart rates: 45 bpm (left) and 75 bpm (right). True heart rates 46-50 bpm and 78-79 bpm, respectively. RepNet’s prediction of the heart rate across different devices is encouragingly close to the rate measured by the device. (Source for left and right)

RepNet can also be used to monitor repeating activities for any changes in speed. Below we show how the Such changes in speed can also be used in other settings for quality or process control.

In this video, we see RepNet counting accelerating cellular oscillations observed under a laser microscope even though it has never seen such a video during training, (from Nature article).
Left: Person performing a “mountain climber” exercise. Right: The 1D projection of the RepNet embeddings using principal component analysis, capturing the moment that the person changes their speed during the exercise. (Video from Kinetics)

Release
We are releasing Countix annotations for the community to work on the problem of repetition counting. We are also releasing a Colab notebook for running RepNet. Using this you can run RepNet on your videos or even using your webcam to detect periodic activities in video and count repetitions automatically in videos.

Acknowledgements
This is joint work with Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Special thanks to Tom Small for designing the visual explanation of TSM. The authors thank Anelia Angelova, Relja Arandjelović, Sourish Chaudhuri, Aishwarya Gomatam, Meghana Thotakuri, and Vincent Vanhoucke for their help with this project.

Improving Speech Representations and Personalized Models Using Self-Supervision

Improving Speech Representations and Personalized Models Using Self-Supervision

Posted by Joel Shor, Software Engineer and Oran Lang, Software Engineer, Google Research, Israel

There are many tasks within speech processing that are easier to solve by having large amounts of data. For example automatic speech recognition (ASR) translates spoken audio into text. In contrast, “non-semantic” tasks focus on the aspects of human speech other than its meaning, encompassing “paralinguistic” tasks, like speech emotion recognition, as well as other kinds of tasks, such as speaker identification, language identification, and certain kinds of voice-based medical diagnoses. In training systems to accomplish these tasks, one common approach is to utilize the largest datasets possible to help ensure good results. However, machine learning techniques that directly rely on massive datasets are often less successful when trained on small datasets.

One way to bridge the performance gap between large and small datasets is to train a representation model on a large dataset, then transfer it to a setting with less data. Representations can improve performance in two ways: they can make it possible to train small models by transforming high-dimensional data (like images and audio) to a lower dimension, and the representation model can also be used as pre-training. In addition, if the representation model is small enough to be run or trained on-device, it can improve performance in a privacy-preserving way by giving users the benefits of a personalized model where the raw data never leaves their device. While representation learning is commonly used in the text domain (e.g. BERT and ALBERT) and in the images domain (e.g. Inception layers and SimCLR), such approaches are underutilized in the speech domain.

Bottom:A large speech dataset is used to train a model, which is then rolled out to other environments. Top Left: On-device personalization — personalized, on-device models combine security and privacy. Top Middle: Small model on embeddings — general-use representations transform high-dimensional, few-example datasets to a lower dimension without sacrificing accuracy; smaller models train faster and are regularized. Top Right: Full model fine-tuning — large datasets can use the embedding model as pre-training to improve performance

Unambiguously improving generally-useful representations, for non-semantic speech tasks in particular, is difficult without a standard benchmark to compare “speech representation usefulness.” While the T5 framework systematically evaluates text embeddings and the Visual Task Adaptation Benchmark (VTAB) standardizes image embedding evaluation, both leading to progress in representation learning in those respective fields, there has been no such benchmark for non-semantic speech embeddings.

In “Towards Learning a Universal Non-Semantic Representation of Speech“, we make three contributions to representation learning for speech-related applications. First, we present a NOn-Semantic Speech (NOSS) benchmark for comparing speech representations, which includes diverse datasets and benchmark tasks, such as speech emotion recognition, language identification, and speaker identification. These datasets are available in the “audio” section of TensorFlow Datasets. Second, we create and open-source TRIpLet Loss network (TRILL), a new model that is small enough to be executed and fine-tuned on-device, while still outperforming other representations. Third, we perform a large-scale study comparing different representations, and open-source the code used to compute the performance on new representations.

A New Benchmark for Speech Embeddings
For a benchmark to usefully guide model development, it must contain tasks that ought to have similar solutions and exclude those that are significantly different. Previous work either dealt with the variety of possible speech-based tasks independently, or lumped semantic and non-semantic tasks together. Our work improves performance on non-semantic speech tasks, in part, by focusing on neural network architectures that perform well specifically on this subset of speech tasks.

The tasks were selected for the NOSS benchmark on the basis of their 1) diversity — they need to cover a range of use-cases; 2) complexity — they should be challenging; and 3) availability, with particular emphasis on those tasks that are open-source. We combined six datasets of different sizes and tasks.

Datasets for downstream benchmark tasks. *VoxCeleb results in our study were computed using a subset of the dataset that was filtered according to internal policy.

We also introduce three additional intra-speaker tasks to test performance in the personalization scenario. In some datasets with k speakers, we can create k different tasks consisting of training and testing on just a single speaker. Overall performance is averaged across speakers. These three additional intra-speaker tasks measure the ability of an embedding to adapt to a particular speaker, as would be necessary for personalized, on-device models, which are becoming more important as computation moves to smart phones and the internet of things.

To help enable researchers to compare speech embeddings, we’ve added the six datasets in our benchmark to TensorFlow Datasets (in the “audio” section) and open sourced the evaluation framework.

TRILL: A New State of the Art in Non-semantic Speech Classification
Learning an embedding from one dataset and applying it to other tasks is not as common in speech as in other modalities. However, transfer learning, the more general technique of using data from one task to help another (not necessarily with embeddings), has some compelling applications, such as personalizing speech recognizers and voice imitation text-to-speech from few samples. There have been many previously proposed representations of speech, but most of these have been trained on a smaller and less diverse data, have been tested primarily on speech recognition, or both.

To create a data-derived representation of speech that was useful across environments and tasks, we started with AudioSet, a large and diverse dataset that includes about 2500 hours of speech. We then trained an embedding model on a simple, self-supervised criteria derived from previous work on metric learning — embeddings from the same audio should be closer in embedding space than embeddings from different audio. Like BERT and the other text embeddings, the self-supervised loss function doesn’t require labels and only relies on the structure of the data itself. This form of self-supervision is the most appropriate for non-semantic speech, since non-semantic phenomena are more stable in time than ASR and other sub-second speech characteristics. This simple, self-supervised criteria captures a large number of acoustic properties that are leveraged in downstream tasks.

TRILL loss: Embeddings from the same audio are closer in embedding space than embeddings from different audio.

TRILL architecture is based on MobileNet, making it fast enough to run on mobile devices. To achieve high accuracy on this small architecture, we distilled the embedding from a larger ResNet50 model without performance degradation.

Benchmark Results
We compared the performance of TRILL against other deep learning representations that are not focused on speech recognition and were trained on similarly diverse datasets. In addition, we compared TRILL to the popular OpenSMILE feature extractor, which uses pre-deep learning techniques (e.g., a fourier transform coefficients, “pitch tracking” using a time-series of pitch measurements, etc.), and randomly initialized networks, which have been shown to be strong baselines. To aggregate the performance across tasks that have different performance characteristics, we first train a small number of simple models, for a given task and embedding. The best result is chosen. Then, to understand the effect that a particular embedding has across all tasks, we calculate a linear regression on the observed accuracies, with both the model and task as the explanatory variables. The effect a model has on the accuracy is the coefficient associated with the model in the regression. For a given task, when changing from one model to another, the resulting change in accuracy is expected to be the difference in y-values in the figure below.

Effect of model on accuracy.

TRILL outperforms the other representations in our study. Factors that contribute to TRILL’s success are the diversity of the training dataset, the large context window of the network, and the generality of the TRILL training loss that broadly preserves acoustic characteristics instead of prematurely focusing on certain aspects. Note that representations from intermediate network layers are often more generally useful. The intermediate representations are larger, have finer temporal granularity, and in the case of the classification networks they retain more general information that isn’t as specific to the classes on which they were trained.

Another benefit of a generally-useful model is that it can be used to initialize a model on a new task. When the sample size of a new task is small, fine-tuning an existing model may lead to better results than training the model from scratch. We achieved a new state-of-the-art result on three out of six benchmark tasks using this technique, despite doing no dataset-specific hyperparameter tuning.

To compare our new representation, we also tested it on the mask sub-challenge of the Interspeech 2020 Computational Paralinguistics Challenge (ComParE). In this challenge, models must predict whether a speaker is wearing a mask, which would affect their speech. The mask effects are sometimes subtle, and audio clips are only one second long. A linear model on TRILL outperformed the best baseline model, which was a fusion of many models on different kinds of features including traditional spectral and deep-learned features.

Summary
The code to evaluate NOSS is available on GitHub, the datasets are on TensorFlow Datasets, and the TRILL models are available on AI Hub.

The NOn-Semantic Speech benchmark helps researchers create speech embeddings that are useful in a wide range of contexts, including for personalization and small-dataset problems. We provide the TRILL model to the research community as a baseline embedding to surpass.

Acknowledgements
The core team behind this work includes Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Felix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, and Yinnon Haviv. We’d also like to thank Avinatan Hassidim and Yossi Matias for technical guidance.

Using Selective Attention in Reinforcement Learning Agents

Using Selective Attention in Reinforcement Learning Agents

Posted by Yujin Tang, Research Software Engineer and David Ha, Staff Research Scientist, Google Research, Tokyo

Inattentional blindness is the psychological phenomenon that causes one to miss things in plain sight, and is a consequence of the selective attention that enables you to remain focused on important parts of the world without distraction from irrelevant details. It is believed that this selective attention mechanism enables people to condense broad sensory information into a form that is compact enough to be used for future decision making. While this may seem to be a limitation, such “bottlenecks” observed in nature can also inspire the design of machine learning systems that hope to mimic the success and efficiency of biological organisms. For example, while most methods presented in the deep reinforcement learning (RL) literature allow an agent to access the entire visual input, and even incorporating modules for predicting future sequences of visual inputs, perhaps reducing an agent’s access to its visual inputs via an attention constraint could be beneficial to an agent’s performance?

In our recent GECCO 2020 paper, “Neuroevolution of Self-Interpretable Agents” (AttentionAgent), we investigate the properties of such agents that employ a self-attention bottleneck. We show that not only are they able to solve challenging vision-based tasks from pixel inputs with 1000x fewer learnable parameters compared to conventional methods, they are also better at generalization to unseen modifications of their tasks, simply due to its ability to “not see details” that can confuse it. Furthermore, looking at where the agent is focusing its attention provides visual interpretability to its decision making process. The following diagram illustrates how the agent learned to deal with its attention bottleneck:

AttentionAgent learned to attend to task critical regions in its visual inputs. In a car driving task (CarRacing, top row), the agent mostly attends to the road borders, but shifts its focus to the turns before it changes heading directions. In a fireball dodging game (DoomTakeCover, bottom row), the agent focuses on fireballs and enemy monsters. Left: Visual inputs to the agent. Center: Agent’s attention overlaid on the visual inputs, the white patches indicate where the agent focuses its attention. Right: Visual cues based on which the agent makes decisions.

Agent with Artificial Attention
While there have been several works that explore how constraints such as sparsity may play a role in actually shaping the abilities of reinforcement learning agents, AttentionAgent takes inspiration from concepts related to inattentional blindness — when the brain is involved in effort-demanding tasks, it assigns most of its attention capacity only to task-relevant elements and is temporarily blind to other signals. To achieve this, we segment the input image into several patches and then rely on a modified self-attention architecture to simulate voting between patches to elect a subset to be considered important. The patches of interest are elected at each time step and, once determined, AttentionAgent makes decisions solely on these patches, ignoring the rest.

In addition to extracting key factors from visual inputs, the ability to contextualize these factors as they change in time is just as crucial. For example, a batter in the game of baseball must use visual signals to continuously keep track of the baseball’s location in order to predict its position and be able to hit it. In AttentionAgent, a long short-term memory (LSTM) model accepts information from the important patches and generates an action at each time step. LSTM keeps track of the changes in the input sequence, and can thus utilize the information to track how critical factors evolve over time.

It is conventional to optimize a neural network with backpropagation. However, because AttentionAgent contains non-differentiable operations for the generation of important patches, like sorting and slicing, it is not straightforward to apply such techniques for training. We therefore turn to derivative-free optimization algorithms to overcome this difficulty.

Overview of our method and illustration of data processing flow in AttentionAgent. Top: Input transformation — A sliding window segments an input image into smaller patches, and then “flattens” them for future processing. Middle: Patch election — The modified self-attention module holds votes between patches to generate a patch importance vector. Bottom: Action generation — AttentionAgent picks the patches of the highest importance, extracts corresponding features and makes decisions based on them.

Generalization to Unseen Modifications of the Environment
We demonstrate that Attention Agent learned to attend to a variety of regions in the input images. Visualization of the important patches provides a peek into how the agent is making decisions, illustrating that most selections make sense and are consistent with human intuition, and is a powerful tool for analyzing and debugging an agent in development. Furthermore, since the agent learned to ignore information non-critical to the core task, it can generalize to tasks where small environmental modifications are applied.

Here, we show that restricting the agent’s decision-making controller’s access to important patches only while ignoring the rest of the scene can result in better generalization, simply due to how the agent is restricted from “seeing things” that can confuse it. Our agent is trained to survive in the VizDoom TakeCover environment only, but it can also survive in unseen settings with higher walls, different floor textures, or when confronted with a distracting sign.

DoomTakeCover Generalization: The AttentionAgent is trained in the environment with no modifications (left). It is able to adapt to changes in the environment, such as a higher wall (middle, left), a different floor texture (middle, right), or floating text (right).

When one learns to drive during a sunny day, one also can transfer those skills (to some extent) to driving at night, on a rainy day, in a different car, or in the presence of bird droppings on the windshield. AttentionAgent is not only able to solve CarRacing-v0, it can also achieve similar performance in unseen conditions, such as brighter or darker scenery, or having its vision modified by artifacts such as side bars or background blobs, while requiring 1000x fewer parameters than conventional methods that fail to generalize.

CarRacing Generalization: No modification (left); color perturbation (middle, left); vertical bars on left and right (middle, right); added red blob (right).

Limitations and Future Work
While AttentionAgent is able to cope with various modifications of the environment, there are limitations to this approach, and much more work to be done to further enhance the generalization capabilities of the agent. For example, AttentionAgent does not generalize to cases where dramatic background changes are involved. The agent trained on the original car racing environment with the green grass background fails to generalize when the background is replaced with distracting YouTube videos. When we take this one step further and replace the background with pure uniform noise, we observe that the agent’s attention module breaks down and attends only to random patches of noise, rather than to the road-related patches. If we train an agent from scratch in the noisy background environment, it manages to get around the track, although the performance is mediocre. Interestingly, the agent still attends only to the noise, rather than to the road, it appears to have learned to drive by estimating where the lane is based on the number of selected patches on the left and right of the screen.

AttentionAgent fails to generalize to drastically modified environments. Left: The background suddenly becomes a cat (Creative Commons video). Middle: The background suddenly becomes an arcade game (Creative Commons video). Right: AttentionAgent learned to drive on pure noise background by avoiding noise patches.

The simplistic method we use to extract information from important patches may be inadequate for more complicated tasks. How we can learn more meaningful features, and perhaps even extract symbolic information from the visual input will be an exciting future direction. In addition to open sourcing the code to the research community, we have also released CarRacingExtension, a suite of car racing tasks that involve various environmental modifications, as testbeds and benchmark for ML researchers who are interested in agent generalizations.

Acknowledgements
This research was conducted by Yujin Tang, Duong Nguyen, and David Ha. We would like to thank Yingtao Tian, Lana Sinapayen, Shixin Luo, Krzysztof Choromanski, Sherjil Ozair, Ben Poole, Kai Arulkumaran, Eric Jang, Brian Cheung, Kory Mathewson, Ankur Handa, and Jeff Dean for valuable discussions.

Machine Learning-based Damage Assessment for Disaster Relief

Machine Learning-based Damage Assessment for Disaster Relief

Posted by Joseph Xu, Senior Software Engineer and Pranav Khaitan, Engineering Lead, Google Research

Natural disasters, such as earthquakes, hurricanes, and floods, affect large areas and millions of people, but responding to such disasters is a massive logistical challenge. Crisis responders, including governments, NGOs, and UN organizations, need fast access to comprehensive and accurate assessments in the aftermath of disasters to plan how best to allocate limited resources.To this end, very high resolution (VHR) satellite imagery, with up to 0.3 meter resolution, is becoming an increasingly important tool for crisis response, giving responders an unprecedented breadth of visual information about how terrain, infrastructure, and populations are changed by disasters.

However, intensive manual labor is still required to extract operationally-relevant information — collapsed buildings, cracks in bridges, where people have set up temporary shelters — from the raw satellite imagery. As an example, for the 2010 Haiti earthquake, analysts manually examined over 90,000 buildings in the Port-au-Prince area alone, rating the damage each one incurred on a five point scale. Many of these manual analyses take teams of experts many weeks to complete, whereas they are most needed within 48-72 hours after the disaster, when the most urgent decisions are made.

To help mitigate the impact of such disasters, we present “Building Damage Detection in Satellite Imagery Using Convolutional Neural Networks“, which details a machine learning (ML) approach to automatically process satellite data to generate building damage assessments. Developed in partnership with the United Nations World Food Program (WFP) Innovation Accelerator, we believe this work has the potential to drastically reduce the time and effort required for crisis workers to produce damage assessment reports. In turn, this would reduce the turnaround times needed to deliver timely disaster aid to the most severely affected areas, while increasing the overall coverage of such critical services.

The Approach
The automatic damage assessment process is split into two steps: building detection and damage classification. In the building detection step, our approach uses an object detection model to draw bounding boxes around each building in the image. We then extract pre-disaster and post-disaster images centered on each detected building and use a classification model to determine whether the building is damaged.

The classification model consists of a convolutional neural network to which is input two 161 pixel x 161 pixel RGB images, corresponding to a 50 m x 50 m ground footprint, centered on a given building. One image is from before the disaster event, and the other image is from after the disaster event. The model analyzes differences in the two images and outputs a score from 0.0 to 1.0, where 0.0 means the building was not damaged, and 1.0 means the building was damaged.

Because the before and after images are taken on different dates, at different times of day, and in some cases by different satellites altogether, there can be a host of different problems that arise. For example, the brightness, contrast, color saturation, and lighting conditions of the images may differ significantly, and the pixels in the image may be misaligned.

To correct for differences in color and illumination, we use histogram equalization to normalize the colors in the before and after images. We also make the model more robust to insignificant color differences by using standard data augmentation techniques, such as randomly perturbing the contrast and saturation of the images, during training.

Training Data
One of the main challenges of this work is assembling a training data set. Data availability in this application is inherently limited because there are only a handful of disasters that have high resolution satellite images and an even smaller number that have existing damage assessments. For labels, we use publicly available damage assessments manually generated by humanitarian organizations operating in this space, such as UNOSAT and REACH. We obtain the original satellite images on which the manual assessments are performed and then use Google Earth Engine to spatially join the damage assessment labels with the satellite images in order to produce the final training examples. All images used to train the model were sourced from commercially available sources.

Examples of individual image patches that capture before and after images of damaged and undamaged buildings from different disasters.

Results
We evaluated this technology for 3 major past earthquakes: the 2010 earthquake in Haiti (magnitude 7.0), the 2017 event in Mexico City (magnitude 7.1), and the series of earthquakes occuring in Indonesia in 2018 (magnitudes 5.9 – 7.5). For each event, we trained the model on buildings in one part of the region affected by the quake and tested it on buildings in another part of the region. We used human expert damage assessments performed by UNOSAT and REACH as the ground truth for evaluation. We measure the model’s quality using both true accuracy (compared to expert assessment) and the area under the ROC curve (AUROC), which captures the trade-off between the model’s true positive and false positive rates of detection, and is a common way to measure quality when the number of positive and negative examples in the test dataset is imbalanced. An AUROC value of 0.5 means that the model’s predictions are random, while a value of 1.0 means the model is perfectly accurate. According to crisis responder feedback, 70% accuracy is the threshold needed for making high-level decisions in the first 72 hours after the disaster.

Area under the
Event Accuracy ROC curve
2010 Haiti earthquake 77% 0.83
2017 Mexico City earthquake 71% 0.79
2018 Indonesia earthquake 78% 0.86
Evaluation of model predictions against human expert assessments (higher is better).
Example model predictions from the 2010 Haiti earthquake. Prediction values closer to 1.0 means the model is more confident that the building is damaged. Values closer to 0.0 means the building is not damaged. A threshold value of 0.5 is typically used to distinguish between damaged/undamaged predictions, but this can be tuned to make the predictions more or less sensitive.

Future Work
While the current model works reasonably well when trained and tested on buildings from the same regions (e.g., same city or country), the ultimate goal is to have a model that can accurately assess building damage for disasters that happen anywhere in the world, and not just those that look similar to the ones the model has been trained on. This is challenging because the variety of the available training data for past disasters is inherently limited to a handful of events that occurred in a few geographic locations. Generalizing to future disasters that will likely occur in new locations is therefore still a challenge for our model and is the focus of ongoing work. We envision a system that can be interactively trained, validated, and deployed by expert analysts so that important aid distribution decisions are always verified by experienced crisis responders. Our hope is that this technology can help communities get the aid that they need in times of most critical need in a timely fashion.

Acknowledgements
This post reflects the work of our co-authors Wenhan Lu and Zebo Li. We would also like to thank Maolin Zuo for his contributions to the project. In tackling this problem, we have had a very productive partnership with the United Nations World Food Programme (WFP) Innovation Accelerator, an organization that identifies, funds, and supports startups and innovative projects to disrupt world hunger.

A competition to identify bird calls using machine learning

A competition to identify bird calls using machine learning

Do you hear the birds chirping outside your window? There are more than 10,000 bird species in the world, and they can be found in nearly every environment, from untouched rainforests to suburbs and cities. Birds play an essential role in nature. They are high up in the food chain and integrate changes occurring at low levels. As such, birds are excellent indicators of deteriorating habitat quality and environmental pollution. However, it’s often easier to hear birds than see them. With proper sound detection and classification, researchers could automatically intuit factors about an area’s quality of life based on a changing bird population.

There are already many projects underway to extensively monitor birds by recording natural soundscapes over long periods. However the analysis of these datasets is often done manually, is painstakingly slow, and results are incomplete. Data science may be able to assist, so researchers have turned to large crowdsourced databases of vocal recordings of birds to train AI models.

To fully take advantage of these extensive and information-rich sound archives, researchers need good machine listeners to reliably extract as much information as possible to aid data-driven conservation.

In partnership with the Cornell Lab of Ornithology, Google’s bioacoustics team—part of ourAI for Social Good initiative—is announcing a competition to use machine learning to identify bird calls. In this competition, data scientists will identify a wide variety of bird vocalizations in soundscape recordings. Training audio comes from the Xeno-Canto project, a crowd-sourced collection of thousands of hours of bird sounds from around the world. We’re offering $25,000 in prizes for the best entries, and hosting the competition on Kaggle, the world’s largest data science competition community with more than 4 million members from 194 countries. The competition kicks off today and will last until September 2—check out the competition page for more details.

If successful, winners of this competition will help researchers better understand changes in habitat quality, levels of pollution, and the effectiveness of restoration efforts. The eventual conservation outcomes could greatly improve the quality of life for many living organisms—birds and human beings included.

Attribution for image at the top of the post: Red-winged Blackbird © Drew Weber / Macaulay Library at the Cornell Lab of Ornithology (ML227768151)

Read More