Live HDR+ and Dual Exposure Controls on Pixel 4 and 4a

Posted by Jiawen Chen and Sam Hasinoff, Software Engineers, Google Research

High dynamic range (HDR) imaging is a method for capturing scenes with a wide range of brightness, from deep shadows to bright highlights. On Pixel phones, the engine behind HDR imaging is HDR+ burst photography, which involves capturing a rapid burst of deliberately underexposed images, combining them, and rendering them in a way that preserves detail across the range of tones. Until recently, one challenge with HDR+ was that it could not be computed in real time (i.e., at 30 frames per second), which prevented the viewfinder from matching the final result. For example, bright white skies in the viewfinder might appear blue in the HDR+ result.

Starting with Pixel 4 and 4a, we have improved the viewfinder using a machine-learning-based approximation to HDR+, which we call Live HDR+. This provides a real-time preview of the final result, making HDR imaging more predictable. We also created dual exposure controls, which generalize the classic “exposure compensation” slider into two controls for separately adjusting the rendition of shadows and highlights. Together, Live HDR+ and dual exposure controls provide HDR imaging with real-time creative control.

Live HDR+ on Pixel 4 and 4a helps the user compose their shot with a WYSIWYG viewfinder that closely resembles the final result. You can see individual images here. Photos courtesy of Florian Kainz.

The HDR+ Look
When the user presses the shutter in the Pixel camera app, it captures 3-15 underexposed images. These images are aligned and merged to reduce noise in the shadows, producing a 14-bit intermediate “linear RGB image” with pixel values proportional to the scene brightness. What gives HDR+ images their signature look is the “tone mapping” of this image, reducing the range to 8 bits and making it suitable for display.

Consider the backlit photo of a motorcyclist, below. While the linear RGB image contains detail in both the dark motorcycle and bright sky, the dynamic range is too high to see it. The simplest method to reveal more detail is to apply a “global curve”, remapping all pixels with a particular brightness to some new value. However, for an HDR scene with details in both shadows and highlights, no single curve is satisfactory.

Different ways to tone-map a linear RGB image. (a) The original, “un-tone-mapped” image. (b) Global curve optimizing for the sky. (c) Global curve optimizing for the subject. (d) HDR+, which preserves details everywhere. In the 2D histogram, brighter areas indicate where more pixels of a given input brightness are mapped to the same output. The overlapping shapes show that the relationship cannot be modeled using a single curve. Photo courtesy of Nicholas Wilson.

In contrast to applying a single curve, HDR+ uses a local tone mapping algorithm to ensure that the final result contains detail everywhere, while keeping edges and textures looking natural. Effectively, this applies a different curve to different regions, depending on factors such as overall brightness, local texture, and amount of noise. Unfortunately, HDR+ is too slow to run live in the viewfinder, requiring an alternative approach for Live HDR+.

Local Curve Approximation for Live HDR+
Using a single tone curve does not produce a satisfying result for the entire image — but how about for a small region? Consider the small red patch in the figure below. Although the patch includes both shadows and highlights, the relationship between input and output brightness follows a smooth curve. Furthermore, the curve varies gradually. For the blue patch, shifted ten pixels to the right, both the image content and curve are similar. But while the curve approximation works well for small patches, it breaks down for larger patches. For the larger yellow patch, the input/output relationship is more complicated, and not well approximated by a single curve.

(a) Input and HDR+ result. (b) The effect of HDR+ on a small patch (red) is approximately a smooth curve. (c) The relationship is nearly identical for the nearby blue patch. (d) However, if the patch is too big, a single curve will no longer provide a good fit.

To address this challenge, we divide the input image into “tiles” of size roughly equal to the red patch in the figure above, and approximate HDR+ using a curve for each tile. Since these curves vary gradually, blending between curves is a good way to approximate the optimal curve at any pixel. To render a pixel we apply the curves from each of the four nearest tiles, then blend the results according to the distances to the respective tile centers.
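To make the blending step concrete, here is a minimal NumPy sketch of the idea, assuming each tile’s curve is stored as a simple 1D lookup table. The tile size, table resolution, and function names are illustrative; the shipping implementation runs as an optimized GPU routine.

```python
# Illustrative sketch only: blend per-tile tone curves bilinearly at each pixel.
import numpy as np

def apply_curve(curve, value):
    """Look up a value in [0, 1] in a single tone curve stored as a 1D LUT."""
    idx = np.clip(value * (len(curve) - 1), 0, len(curve) - 1)
    lo = int(np.floor(idx))
    hi = min(lo + 1, len(curve) - 1)
    frac = idx - lo
    return (1 - frac) * curve[lo] + frac * curve[hi]

def render_pixel(curves, tile_size, x, y, value):
    """Tone map one pixel by blending the curves of its four nearest tiles."""
    tiles_y, tiles_x, _ = curves.shape
    # Continuous position of the pixel in "tile center" coordinates.
    ty = np.clip(y / tile_size - 0.5, 0, tiles_y - 1)
    tx = np.clip(x / tile_size - 0.5, 0, tiles_x - 1)
    y0, x0 = int(np.floor(ty)), int(np.floor(tx))
    y1, x1 = min(y0 + 1, tiles_y - 1), min(x0 + 1, tiles_x - 1)
    wy, wx = ty - y0, tx - x0
    # Apply each neighboring tile's curve, then blend by distance to tile centers.
    v00 = apply_curve(curves[y0, x0], value)
    v01 = apply_curve(curves[y0, x1], value)
    v10 = apply_curve(curves[y1, x0], value)
    v11 = apply_curve(curves[y1, x1], value)
    top = (1 - wx) * v00 + wx * v01
    bottom = (1 - wx) * v10 + wx * v11
    return (1 - wy) * top + wy * bottom
```

Because each output pixel depends only on its own value and four small lookup tables, the same computation maps naturally onto the GPU, which is what makes a 30 fps viewfinder feasible.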

Unlike HDR+, this algorithm is particularly well suited to GPUs: because the tone mapping of each pixel can be computed independently, the algorithm is highly parallelizable. Moreover, the representation is memory-efficient: only a small number of tiles is needed to represent HDR+ local tone mapping for the viewfinder.

To compute local curves, we use a machine learning algorithm called HDRnet, a deep neural network that predicts, from a linear image, per-tile curves that approximate the HDR+ look of that image. It’s also fast, due to its compact architecture and the way that low-resolution input images can be used to predict the curves for the high-resolution viewfinder. We train HDRnet on thousands of images to ensure it works well on all kinds of scenes.

HDRnet vs. HDR+ on a challenging scene with extreme brights and darks. The results are very similar at viewfinder resolution. Photo courtesy of Nicholas Wilson.

Dual Exposure Controls
HDR+ is designed to produce pleasing HDR images automatically, without the need for manual controls or post-processing. But sometimes the HDR+ rendition may not match the photographer’s artistic vision. While image editing tools are a partial remedy, HDR images can be challenging to edit, because some decisions are effectively baked into the final JPG. To maximize latitude for editing, it’s possible to save RAW images for each shot (an option in the app). However, this process takes the photographer out of the moment and requires expertise with RAW editing tools as well as additional storage.

Another approach to artistic control is to provide it live in the viewfinder. Many photographers are familiar with the exposure compensation slider, which brightens or darkens the image. But overall brightness is not expressive enough for HDR photography: at a minimum, two controls are needed to adjust the highlights and the shadows separately.

To address this, we introduce dual exposure controls. When the user taps on the Live HDR+ viewfinder, two sliders appear. The “Brightness” slider works like traditional exposure compensation, changing the overall exposure. This slider is used to recover more detail in bright skies, or intentionally blow out the background and make the subject more visible. The “Shadows” slider affects only dark areas — it operates by changing the tone mapping, not the exposure. This slider is most useful for high-contrast scenes, letting the user boost shadows to reveal details, or suppress them to create a silhouette.
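The distinction between the two sliders can be sketched roughly as follows. The gamma-style shadow adjustment below is purely an assumption made for illustration, not the camera’s actual tone-mapping operator.

```python
# Rough illustration of the two controls; not the Pixel camera's actual math.
import numpy as np

def brightness(linear_image, ev):
    """'Brightness': a global exposure gain, in stops, applied to linear values."""
    return np.clip(linear_image * (2.0 ** ev), 0.0, 1.0)

def shadows(tone_mapped_image, amount):
    """'Shadows': a tone-mapping change that mostly affects the dark end.

    A positive amount lifts shadows, a negative amount deepens them, while
    values near 1.0 (highlights) are left almost untouched. The gamma form is
    a stand-in chosen only to illustrate the idea.
    """
    gamma = 2.0 ** (-amount)
    return np.clip(tone_mapped_image, 0.0, 1.0) ** gamma
```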

Screen capture of dual exposure controls in action on an outdoor HDR scene with HDR+ results below. You can see individual images here. Photos courtesy of Florian Kainz.

Here are some of the dramatic renditions we were able to achieve using dual exposure controls.

Different renditions using Dual Exposure Controls. You can see individual images here. Photo credits: Jiawen Chen, Florian Kainz, Alexander Schiffhauer.

Dual exposure controls give you the flexibility to capture dramatically different versions of the same subject. They are not limited to tough HDR scenes, so don’t be afraid to experiment with different subjects and lighting. You may be surprised at how much these sliders will change how you shoot!

Acknowledgements
Live HDR+ and Dual Exposure Controls are the result of a collaboration between Google Research, Android, Hardware, and UX Design teams. Key contributors include: Francois Bleibel, Sean Callanan, Yulun Chang, Eric Chen, Michelle Chen, Kourosh Derakshan, Ryan Geiss, Zhijun He, Joy Hsu, Liz Koh, Marc Levoy, Chia-Kai Liang, Diane Liang, Timothy Lin, Gaurav Malik, Hossein Mohtasham, Nandini Mukherjee, Sushil Nath, Gabriel Nava, Karl Rasche, YiChang Shih, Daniel Solomon, Gary Sun, Kelly Tsai, Sung-fang Tsai, Ted Tsai, Ruben Velarde, Lida Wang, Tianfan Xue, Junlan Yang.

Introducing the Model Card Toolkit for Easier Model Transparency Reporting

Posted by Huanming Fang and Hui Miao, Software Engineers, Google Research

Machine learning (ML) model transparency is important across a wide variety of domains that impact people’s lives, from healthcare to personal finance to employment. The information needed by downstream users will vary, as will the details that developers need in order to decide whether or not a model is appropriate for their use case. This desire for transparency led us to develop Model Cards, a structured framework for reporting on ML model provenance, usage, and ethics-informed evaluation. Model Cards give a detailed overview of a model’s suggested uses and limitations that can benefit developers, regulators, and downstream users alike.

Over the past year, we’ve launched Model Cards publicly and worked to create Model Cards for open-source models released by teams across Google. For example, the MediaPipe team creates state-of-the-art computer vision models for a number of common tasks, and has included Model Cards for each of their open-source models in their GitHub repository. Creating Model Cards like these takes substantial time and effort, often requiring a detailed evaluation and analysis of both data and model performance. In many cases, one needs to additionally evaluate how a model performs on different subsets of data, noting any areas where the model underperforms. Further, Model Card creators may want to report on the model’s intended uses and limitations, as well as any ethical considerations potential users might find useful, compiling and presenting the information in a format that’s accessible and understandable.

To streamline the creation of Model Cards for all ML practitioners, we are sharing the Model Card Toolkit (MCT), a collection of tools that support developers in compiling the information that goes into a Model Card and that aid in the creation of interfaces that will be useful for different audiences. To demonstrate how the MCT can be used in practice, we have also released a Colab tutorial that builds a Model Card for a simple classification model trained on the UCI Census Income dataset.

Introducing the MCT
To help Model Card creators organize model information, we provide a JSON schema that specifies the fields to include in the Model Card. Using the model provenance information stored with ML Metadata (MLMD), the MCT automatically populates the JSON with relevant information, such as class distributions in the data and model performance statistics. We also provide a ModelCard data API to represent an instance of the JSON schema and visualize it as a Model Card. The Model Card creator can choose which metrics and graphs to display in the final Model Card, including metrics that highlight areas where the model’s performance might deviate from its overall performance.
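In practice, the workflow looks roughly like the sketch below, based on the Colab tutorial. The method and field names reflect the toolkit at the time of writing and are best treated as assumptions, since they may change in later releases.

```python
# Rough sketch of the MCT workflow; exact API names may differ by version.
from model_card_toolkit import ModelCardToolkit

mct = ModelCardToolkit(output_dir="model_card_assets")

# Scaffold a ModelCard; when an MLMD store is configured, fields such as
# dataset statistics and evaluation metrics can be pre-populated automatically.
model_card = mct.scaffold_assets()

# Supplement the auto-populated fields with human-authored context.
model_card.model_details.name = "Census Income Classifier"
model_card.model_details.overview = (
    "A simple classifier trained on the UCI Census Income dataset."
)

# Write the updated card back and render it with the provided UI template.
mct.update_model_card_json(model_card)
html = mct.export_format()
```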

Once the MCT has populated the Model Card with key metrics and graphs, the Model Card creator can supplement this with information regarding the model’s intended usage, limitations, trade-offs, and any other ethical considerations that would otherwise be unknown to people using the model. If a model underperforms for certain slices of data, the limitations section would be another place to acknowledge this, along with suggested mitigation strategies to help developers address these issues. This type of information is critical in helping developers decide whether or not a model is suitable for their use case, and helps Model Card creators provide context so that their models are used appropriately. Right now, we’re providing one UI template to visualize the Model Card, but you can create different templates in HTML should you want to visualize the information in other formats.

Currently, the MCT is available to anyone using TensorFlow Extended (TFX) in open source or on Google Cloud Platform. Users who are not serving their ML models via TFX can still leverage the JSON schema and the methods to visualize via the HTML template.

Here is an example of the completed Model Card from the Colab tutorial, which leverages the MCT and the provided UI template.

Conclusion
Currently, the MCT includes a standard template for reporting on ML models broadly, but we’re continuing to create UI templates for more specific applications of ML. If you’d like to join the conversation about what fields are important and how best to leverage the MCT for different use cases, you can get started here or with the Colab tutorial. Let us know how you’ve leveraged the MCT for your use case by emailing us at model-cards@google.com. You can learn more about Google’s efforts to promote responsible AI in the TensorFlow ecosystem on our TensorFlow Responsible AI page.

Acknowledgements
Huanming Fang, Hui Miao, Karan Shukla, Dan Nanas, Catherina Xu, Christina Greer, Tulsee Doshi, Tiffany Deng, Margaret Mitchell, Timnit Gebru, Andrew Zaldivar, Mahima Pushkarna, Meena Natarajan, Roy Kim, Parker Barnes, Tom Murray, Susanna Ricco, Lucy Vasserman, and Simone Wu

Announcing ScaNN: Efficient Vector Similarity Search

Posted by Philip Sun, Software Engineer, Google Research

Suppose one wants to search through a large dataset of literary works using queries that require an exact match of title, author, or other easily machine-indexable criteria. Such a task would be well suited for a relational database using a language such as SQL. However, if one wants to support more abstract queries, such as “Civil War poem,” it is no longer possible to rely on naive similarity metrics such as the number of words in common between two phrases. For example, the query “science fiction” is more related to “future” than it is to “earth science” despite the former having zero, and the latter having one, word in common with the query.

Machine learning (ML) has greatly improved computers’ abilities to understand language semantics and therefore answer these abstract queries. Modern ML models can transform inputs such as text and images into embeddings, high dimensional vectors trained such that more similar inputs cluster closer together. For a given query, we can therefore compute its embedding, and find the literary works whose embeddings are closest to the query’s. In this manner, ML has transformed an abstract and previously difficult-to-specify task into a rigorous mathematical one. However, a computational challenge remains: for a given query embedding, how does one quickly find the nearest dataset embeddings? The set of embeddings is often too large for exhaustive search and its high dimensionality makes pruning difficult.

In our ICML 2020 paper, “Accelerating Large-Scale Inference with Anisotropic Vector Quantization,” we address this problem by focusing on how to compress the dataset vectors to enable fast approximate distance computations, and propose a new compression technique that significantly boosts accuracy compared to prior works. This technique is utilized in our recently open-sourced vector similarity search library (ScaNN), and enables us to outperform other vector similarity search libraries by a factor of two, as measured on ann-benchmarks.com.

The Importance of Vector Similarity Search
Embedding-based search is a technique that is effective at answering queries that rely on semantic understanding rather than simple indexable properties. In this technique, machine learning models are trained to map the queries and database items to a common vector embedding space, such that the distance between embeddings carries semantic meaning, i.e., similar items are closer together.

The two-tower neural network model, illustrated above, is a specific type of embedding-based search where queries and database items are mapped to the embedding space by two respective neural networks. In this example the model responds to natural-language queries for a hypothetical literary database.

To answer a query with this approach, the system must first map the query to the embedding space. It then must find, among all database embeddings, the ones closest to the query; this is the nearest neighbor search problem. One of the most common ways to define the query-database embedding similarity is by their inner product; this type of nearest neighbor search is known as maximum inner-product search (MIPS).

Because the database size can easily be in the millions or even billions, MIPS is often the computational bottleneck to inference speed, and exhaustive search is impractical. This necessitates the use of approximate MIPS algorithms that exchange some accuracy for a significant speedup over brute-force search.
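For reference, exact MIPS by brute force is easy to express; the sketch below (with random placeholder embeddings) computes the inner product against every database item, which is exactly the linear cost that approximate methods are designed to avoid.

```python
# Exact (brute-force) MIPS over random placeholder embeddings.
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(100_000, 128)).astype(np.float32)  # database embeddings
query = rng.normal(size=(128,)).astype(np.float32)             # query embedding

scores = database @ query              # inner product with every database item
top_k = np.argsort(-scores)[:10]       # indices of the 10 highest-scoring items
```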

A New Quantization Approach for MIPS
Several state-of-the-art solutions for MIPS are based on compressing the database items so that an approximation of their inner product can be computed in a fraction of the time taken by brute-force. This compression is commonly done with learned quantization, where a codebook of vectors is trained from the database and is used to approximately represent the database elements.

Previous vector quantization schemes quantized database elements with the aim of minimizing the average distance between each vector x and its quantized form x̃. While this is a useful metric, optimizing for this is not equivalent to optimizing nearest-neighbor search accuracy. The key idea behind our paper is that encodings with higher average distance may actually result in superior MIPS accuracy.

The intuition for our result is illustrated below. Suppose we have two database embeddings x1 and x2, and must quantize each to one of two centers: c1 or c2. Our goal is to quantize each xi to x̃i such that the inner product <q, x̃i> is as similar to the original inner product <q, xi> as possible. This can be visualized as making the magnitude of the projection of x̃i onto q as similar as possible to the projection of xi onto q. In the traditional approach to quantization (left), we would pick the closest center for each xi, which leads to an incorrect relative ranking of the two points: <q, x̃1> is greater than <q, x̃2>, even though <q, x1> is less than <q, x2>! If we instead assign x1 to c1 and x2 to c2, we get the correct ranking. This is illustrated in the figure below.

The goal is to quantize each xi to x̃i = c1 or x̃i = c2. Traditional quantization (left) results in the incorrect ordering of x1 and x2 for this query. Even though our approach (right) chooses centers farther away from the data points, this in fact leads to lower inner product error and higher accuracy.

It turns out that direction matters as well as magnitude: even though c1 is farther from x1 than c2, c1 is offset from x1 in a direction almost entirely orthogonal to x1, while c2’s offset is parallel (for x2, the same situation applies but flipped). Error in the parallel direction is much more harmful in the MIPS problem because it disproportionately impacts high inner products, which by definition are the ones that MIPS is trying to estimate accurately.

Based on this intuition, we more heavily penalize quantization error that is parallel to the original vector. We refer to our novel quantization technique as anisotropic vector quantization due to the directional dependence of its loss function. The ability of this technique to trade increased quantization error of lower inner products in exchange for superior accuracy for high inner products is the key innovation and the source of its performance gains.
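The sketch below illustrates the assignment rule implied by this loss (it is not the ScaNN implementation): the quantization residual is split into components parallel and orthogonal to the data point, the parallel component is weighted more heavily, and the chosen center is the one minimizing this weighted error rather than the plain Euclidean distance. The weight eta is a hypothetical knob used only for the illustration.

```python
# Illustrative anisotropic assignment; eta is a made-up weighting parameter.
import numpy as np

def anisotropic_loss(x, center, eta=4.0):
    residual = x - center
    x_unit = x / np.linalg.norm(x)
    parallel = np.dot(residual, x_unit) * x_unit   # error along the direction of x
    orthogonal = residual - parallel               # remaining error
    return eta * np.sum(parallel ** 2) + np.sum(orthogonal ** 2)

def assign_center(x, centers, eta=4.0):
    """Pick the center with the lowest anisotropic loss; it need not be the nearest."""
    losses = [anisotropic_loss(x, c, eta) for c in centers]
    return int(np.argmin(losses))
```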

In the above diagrams, ellipses denote contours of equal loss. In anisotropic vector quantization, error parallel to the original data point x is penalized more.

Anisotropic Vector Quantization in ScaNN
Anisotropic vector quantization allows ScaNN to better estimate inner products that are likely to be in the top-k MIPS results and therefore achieve higher accuracy. On the glove-100-angular benchmark from ann-benchmarks.com, ScaNN outperformed eleven other carefully tuned vector similarity search libraries, handling roughly twice as many queries per second for a given accuracy as the next-fastest library.*

Recall@k is a commonly used metric for nearest neighbor search accuracy, which measures the proportion of the true nearest k neighbors that are present in an algorithm’s returned k neighbors. ScaNN (upper purple line) consistently achieves superior performance across various points of the speed-accuracy trade-off.

ScaNN is open-source software and you can try it yourself on GitHub. The library can be installed directly via pip and has interfaces for both TensorFlow and NumPy inputs. Please see the GitHub repository for further instructions on installing and configuring ScaNN.
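A typical usage pattern looks roughly like the following, adapted from the repository’s example at the time of writing; the placeholder data and the tree/scoring parameters are illustrative and need tuning per dataset.

```python
# Rough usage sketch adapted from the ScaNN README; parameters are illustrative.
import numpy as np
import scann

# Placeholder embeddings standing in for a real database and queries.
dataset = np.random.rand(100_000, 128).astype(np.float32)
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)
queries = np.random.rand(5, 128).astype(np.float32)

searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=100_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)  # anisotropic quantization
    .reorder(100)
    .build()
)

neighbors, distances = searcher.search_batched(queries)  # top-10 per query
```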

Conclusion
By modifying the vector quantization objective to align with the goals of MIPS, we achieve state-of-the-art performance on nearest neighbor search benchmarks, a key indicator of embedding-based search performance. Although anisotropic vector quantization is an important technique, we believe it is just one example of the performance gains achievable by optimizing algorithms for the end goal of improving search accuracy rather than an intermediate goal such as compression distortion.

Acknowledgements
This post reflects the work of the entire ScaNN team: David Simcha, Erik Lindgren, Felix Chern, Nathan Cordeiro, Ruiqi Guo, Sanjiv Kumar, and Zonglin Li. We’d also like to thank Dan Holtmann-Rice, Dave Dopson, and Felix Yu.



* ScaNN performs similarly well on the other datasets of ann-benchmarks.com, but the website currently shows outdated, lower numbers. See this pull request for more representative performance figures on other datasets.

Closing data gaps with Lacuna Fund

Machine learning has shown enormous promise for social good, whether in helping respond to global health pandemics or reach citizens before natural disasters hit. But even as machine learning technology becomes increasingly accessible, social innovators still face significant barriers in their efforts to use this technology to unlock new solutions. From languages to health and agriculture, there is a lack of relevant, labeled data to represent and address the challenges that face much of the world’s population.

To help close this gap, Google.org is making a $2.5 million grant alongside The Rockefeller Foundation, Canada’s International Development Research Centre (IDRC) and Germany’s GIZ FAIR Forward to launch Lacuna Fund, the world’s first collaborative nonprofit effort to directly address this missing data. The Fund aims to unlock the power of machine learning by providing data scientists, researchers, and social entrepreneurs in low- and middle-income communities around the world with resources to produce labeled datasets that address urgent problems.

Labeled data is a particular type of data that is useful in training machine learning models: it provides the “ground truth” that a model learns from in order to make predictions about cases it hasn’t yet seen. To create a labeled dataset, each example is systematically “tagged” by knowledgeable humans with one or more of the concepts or entities it represents. For example, a researcher might label short videos of insects with their type; images of fungi with whether or not they are harmful to plants around them; or passages of Swahili text with the parts of speech that each word represents. In turn, these datasets could enable biologists to track insect migration; farmers to accurately identify threats to their crops; and Swahili speakers to use an automated text messaging service to get vital health information.

Guided by committees of domain and machine learning experts and facilitated by Meridian Institute, the Fund will provide resources and support to produce new labeled datasets, as well as augment or update existing ones to be more representative, relevant and sustainable. The Fund’s initial work will focus on agriculture and underrepresented languages, but we welcome additional collaborators and anticipate the fund will grow in the years to come. And our work is bigger than just individual datasets: Lacuna Fund will focus explicitly on growing the capacity of local organizations to be data collectors, curators and owners. While following best practices for responsible collection, publication and use, we endeavor to make all datasets as broadly available as possible.

Thanks in part to the rise of cloud computing, in particular services like Cloud AutoML and libraries like TensorFlow, AI is increasingly able to help address society’s most pressing issues. Yet we’ve seen firsthand in our work on the Google AI Impact Challenge the gap between the potential of AI and the ability to successfully implement it. The need for data is quickly becoming one of the most salient barriers to progress. It’s our hope that the Fund provides not only a way for social sector organizations to fund high-impact, immediately applicable data collection and labeling, but also a foundation from which changemakers can build a better future.

Image at top: A team from AI Challenge Grantee Wadhwani Institute for Artificial Intelligence in India is working with local farmers to manage pest damage to crops.

Using AI to identify the aggressiveness of prostate cancer

Prostate cancer diagnoses are common, with 1 in 9 men developing prostate cancer in their lifetime. A cancer diagnosis relies on specialized doctors, called pathologists, looking at biological tissue samples under the microscope for signs of abnormality in the cells. The difficulty and subjectivity of pathology diagnoses led us to develop an artificial intelligence (AI) system that can identify the aggressiveness of prostate cancer.

Since many prostate tumors are non-aggressive, doctors first obtain small samples (biopsies) to better understand the tumor for the initial cancer diagnosis. If signs of tumor aggressiveness are found, radiation or invasive surgery to remove the whole prostate may be recommended. Because these treatments can have painful side effects, understanding tumor aggressiveness is important to avoid unnecessary treatment.

Grading the biopsies

One of the most crucial factors in this process is to “grade” any cancer in the sample for how abnormal it looks, through a process called Gleason grading. Gleason grading involves first matching each cancerous region to one of three Gleason patterns, followed by assigning an overall “grade group” based on the relative amounts of each Gleason pattern in the whole sample. Gleason grading is a challenging task that relies on subjective visual inspection and estimation, resulting in pathologists disagreeing on the right grade for a tumor as much as 50 percent of the time. To explore whether AI could assist in this grading, we previously developed an algorithm that Gleason grades large samples (i.e. surgically-removed prostates) with high accuracy, a step that confirms the original diagnosis and informs patient prognosis.

Our research

In our recent work, “Development and Validation of a Deep Learning Algorithm for Gleason Grading of Prostate Cancer from Biopsy Specimens”, published in JAMA Oncology, we explored whether an AI system could accurately Gleason grade smaller prostate samples (biopsies). Biopsies are done early in prostate cancer care to establish the initial cancer diagnosis and determine patient treatment, and so are more commonly performed than surgeries. However, biopsies can be more difficult to grade than surgical samples due to the smaller amount of tissue and unintended changes to the sample from the tissue extraction and preparation process. The AI system we developed first “grades” each region of the biopsy, and then summarizes the region-level classifications into an overall biopsy-level score.

Gleason grading

The first stage of the deep learning system Gleason grades every region in a biopsy. In this biopsy, green indicates Gleason pattern 3 while yellow indicates Gleason pattern 4.

Our results 

Given the complexity of Gleason grading, we worked with six experienced expert pathologists to evaluate the AI system. These experts, who have specialized training in prostate cancer and an average of 25 years of experience, determined the Gleason grades of 498 tumor samples. Highlighting how difficult Gleason grading is, a cohort of 19 “general” pathologists (without specialist training in prostate cancer) achieved an average accuracy of 58 percent on these samples. By contrast, our AI system’s accuracy was substantially higher at 72 percent. Finally, some prostate cancers have ambiguous appearances, resulting in disagreements even amongst experts. Taking this uncertainty into account, the deep learning system’s agreement rate with experts was comparable to the agreement rate between the experts themselves.

Cancer pathology workflow

Potential cancer pathology workflow augmented with AI-based assistive tools: a tumor sample is first collected and digitized using a high-magnification scanner. Next, the AI system provides a grade group for each sample.

These promising results indicate that the deep learning system has the potential to support expert-level diagnoses and expand access to high-quality cancer care. To evaluate if it could improve the accuracy and consistency of prostate cancer diagnoses, this technology needs to be validated as an assistive tool in further clinical studies and on larger and more diverse patient groups. However, we believe that AI-based tools could help pathologists in their work, particularly in situations where specialist expertise is limited.

Our research advancements in both prostate and breast cancer were the result of collaborations with the Naval Medical Center San Diego and support from Verily. Our appreciation also goes to several institutions that provided access to de-identified data, and many pathologists who provided advice or reviewed prostate cancer samples. We look forward to future research and investigation into how our technology can be best validated, designed and used to improve patient care and cancer outcomes.

Improving Holistic Scene Understanding with Panoptic-DeepLab

Posted by Bowen Cheng, Student Researcher, and Liang-Chieh Chen, Research Scientist, Google Research

Real-world computer vision applications, such as self-driving cars and robotics, rely on two core tasks — instance segmentation and semantic segmentation. Instance segmentation identifies the class and extent of individual “things” in an image (i.e., countable objects such as people, animals, cars, etc.) and assigns unique identifiers to each (e.g., car_1 and car_2). This is complemented by semantic segmentation, which labels all pixels in an image, including the “things” that are present as well as the surrounding “stuff” (e.g., amorphous regions of similar texture or material, such as grass, sky or road). This latter task, however, does not differentiate between pixels of the same class that belong to different instances of that class.

Panoptic segmentation represents the unification of these two approaches with the goal of assigning a unique value to every pixel in an image that encodes both semantic label and instance ID. Most existing panoptic segmentation algorithms are based on Mask R-CNN, which treats semantic and instance segmentation separately. The instance segmentation step identifies objects in an image, but it often produces object instance masks that overlap one another. To settle the conflict between overlapping instance masks, one commonly employs a heuristic that resolves the discrepancy either based on the mask with a higher confidence score or by use of a pre-defined pairwise relationship between categories (e.g., a tie should always be worn on a person’s front). Additionally, the discrepancies between semantic and instance segmentation results are sorted out by favoring the instance predictions. While these methods generally produce good results, they also introduce heavy latency, which makes it challenging to apply them in real-time applications.

Driven by the need for a real-time panoptic segmentation model, we propose “Panoptic-DeepLab: a simple, fast and strong system for panoptic segmentation”, accepted to CVPR 2020. In this work, we extend the commonly used modern semantic segmentation model, DeepLab, to perform panoptic segmentation using only a small number of additional parameters and a marginal amount of additional computation. The resulting model, Panoptic-DeepLab, produces semantic and instance segmentation in parallel and without overlap, avoiding the need for the manually designed heuristics adopted by other methods. Additionally, we develop a computationally efficient operation that merges the semantic and instance segmentation results, enabling near real-time end-to-end panoptic segmentation prediction. Unlike methods based on Mask R-CNN, Panoptic-DeepLab does not generate bounding box predictions and requires only three loss functions during training, significantly fewer than current state-of-the-art methods, such as UPSNet, which can have up to eight. Finally, Panoptic-DeepLab has demonstrated state-of-the-art performance on several academic datasets.

Panoptic segmentation results obtained by Panoptic-DeepLab. Left: Video frames used as input to the panoptic segmentation model. Right: Results overlaid on video frames. Each object instance has a unique label, e.g., car_1, car_2, etc.

Overview
Panoptic-DeepLab is simple both conceptually and architecturally. At a high level, it predicts three outputs. The first is semantic segmentation, in which it assigns a semantic class (e.g., car or grass) to each pixel. However, it does not differentiate between multiple instances of the same class. So, for example, if one car is partly behind another, the pixels associated with both would have the same associated class and would be indistinguishable from one another. This is addressed by the other two outputs of the model: a prediction of the center of mass of each instance, and an instance center regression, where the model learns to regress each instance pixel to its center of mass. This latter step ensures that the model associates pixels of a given class with the appropriate instance. The class-agnostic instance segmentation, obtained by grouping predicted foreground pixels to their closest predicted instance centers, is then fused with semantic segmentation by a majority-vote rule to generate the final panoptic segmentation.
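The merge step can be sketched as follows in NumPy. This is a simplified illustration, not the released implementation: the class * 1000 + instance panoptic encoding and the input names are assumptions made for the example.

```python
# Simplified sketch of fusing semantic predictions with instance centers/offsets.
import numpy as np

def merge_panoptic(semantic, centers, offsets, thing_classes, label_divisor=1000):
    """semantic: (H, W) class ids; centers: list of (y, x) predicted instance
    centers; offsets: (H, W, 2) per-pixel regression toward the instance center;
    thing_classes: ids of countable classes. Returns class * label_divisor + id."""
    h, w = semantic.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel claims its instance center is.
    pred_y = ys + offsets[..., 0]
    pred_x = xs + offsets[..., 1]

    panoptic = semantic.astype(np.int64) * label_divisor   # "stuff" keeps instance id 0
    is_thing = np.isin(semantic, thing_classes)
    if len(centers) == 0 or not is_thing.any():
        return panoptic

    centers = np.asarray(centers, dtype=np.float32)                  # (K, 2)
    d2 = (pred_y[..., None] - centers[:, 0]) ** 2 + \
         (pred_x[..., None] - centers[:, 1]) ** 2                    # (H, W, K)
    instance_id = np.argmin(d2, axis=-1) + 1      # class-agnostic instance masks

    # Majority vote: each instance takes the most common semantic class among
    # its pixels, then the panoptic id combines that class with the instance id.
    for k in range(1, len(centers) + 1):
        mask = (instance_id == k) & is_thing
        if mask.any():
            cls = np.bincount(semantic[mask]).argmax()
            panoptic[mask] = cls * label_divisor + k
    return panoptic
```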

Overview of Panoptic-DeepLab. Semantic segmentation associates pixels in the image with general classes, while the class-agnostic instance segmentation step identifies the pixels associated with an individual object, regardless of the class. Taken together one gets the final panoptic segmentation image.

Neural Network Design
Panoptic-DeepLab consists of four components: (1) an encoder backbone pre-trained on ImageNet, shared by both the semantic segmentation and instance segmentation branches of the architecture; (2) atrous spatial pyramid pooling (ASPP) modules, similar to that used by DeepLab, which are deployed independently in each branch in order to perform segmentation at a range of spatial scales; (3) similarly decoupled decoder modules specific to each segmentation task; and (4) task-specific prediction heads.

The encoder backbone (1), which has been pre-trained on ImageNet, extracts feature maps that are shared by both the semantic segmentation and instance segmentation branches of the architecture. Typically, the feature map is generated by the backbone model using a standard convolution, which reduces the resolution of the output map to 1/32nd that of the input image and is too coarse for accurate image segmentation. In order to preserve the details of object boundaries, we instead employ atrous convolution, which better retains important features like edges, to generate a feature map with a resolution of 1/16th the original. This is then followed by two ASPP modules (2), one for each branch, which capture multi-scale information for segmentation.
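The effect of atrous convolution on resolution is easy to see in a couple of lines of TensorFlow; the shapes below are illustrative, not the actual backbone configuration.

```python
# Dilated (atrous) convolution enlarges the receptive field without shrinking
# the feature map, unlike a strided convolution.
import tensorflow as tf

x = tf.random.normal([1, 65, 65, 256])
strided = tf.keras.layers.Conv2D(256, 3, strides=2, padding="same")(x)        # halves resolution
atrous = tf.keras.layers.Conv2D(256, 3, dilation_rate=2, padding="same")(x)   # keeps resolution
print(strided.shape, atrous.shape)  # (1, 33, 33, 256) (1, 65, 65, 256)
```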

The light-weight decoder modules (3) follow those used in the most recent DeepLab version (DeepLabV3+), but with two modifications. First, we reintroduce an additional low-level feature map (1/8th scale) to the decoder, which helps to preserve spatial information from the original image (e.g., object boundaries) that can be significantly degraded in the final feature map output by the backbone. Second, instead of using the typical 3 × 3 kernel, the decoder employs a 5 × 5 depthwise-separable convolution, which yields somewhat better performance at only a minimal cost in additional overhead.

The two prediction heads (4) are tailored to their task. The semantic segmentation head employs a weighted version of the standard bootstrapped cross entropy loss function, which weights each pixel differently and has proven to be more effective for segmentation of small-scale objects. The instance segmentation head is trained to predict the offsets between the center of mass of an object instance and the surrounding pixels, without knowledge of the object class, forming the class-agnostic instance masks.

Results
To demonstrate the effectiveness of Panoptic-DeepLab, we conduct experiments on three popular academic datasets: Cityscapes, Mapillary Vistas, and COCO. With a simple architecture, Panoptic-DeepLab ranks first in Cityscapes for all three tasks (semantic, instance and panoptic segmentation) without any task-specific fine-tuning. Additionally, Panoptic-DeepLab won the Best Result, Best Paper, and Most Innovative awards on the Mapillary Panoptic Segmentation track at the ICCV 2019 Joint COCO and Mapillary Recognition Challenge Workshop. It outperforms the 2018 winner by a healthy margin of 1.5%. Finally, Panoptic-DeepLab sets new state-of-the-art bottom-up (i.e., box-free) panoptic segmentation results on the COCO dataset, and is also comparable to other methods based on Mask R-CNN.

Accuracy (PQ) vs. Speed (GPU inference time) across three datasets.

Conclusion
With a simple architecture and only three training loss functions, Panoptic-DeepLab achieves state-of-the-art performance while being faster than other methods based on Mask R-CNN. To summarize, we develop the first single-shot panoptic segmentation model that attains state-of-the-art performance on several public benchmarks, and delivers near real-time end-to-end inference speed. We hope our simple and effective Panoptic-DeepLab can establish a solid baseline and further benefit the research community.

Acknowledgements
We would like to thank Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Florian Schroff, as well as the Google Mobile Vision team, for their support and valuable discussions.

Exploring Faster Screening with Fewer Tests via Bayesian Group Testing

Posted by Marco Cuturi and Jean-Philippe Vert, Research Scientists, Google Research, Brain Team

How does one find a needle in a haystack? At the onset of World War II, that question took on a very concrete form when doctors wondered how to efficiently detect diseases among those who had been drafted into the war effort. Inspired by this challenge, Robert Dorfman, a young statistician at the time (later to become a Harvard professor of economics), proposed in a seminal paper a 2-stage approach to detect infected individuals, whereby individual blood samples are first pooled in groups of four before being tested for the presence or absence of a pathogen. If a group is negative, then it is safe to assume that everyone in the group is free of the pathogen. In that case, the reduction in the number of required tests is substantial: an entire group of four people has been cleared with a single test. On the other hand, if a group tests positive, which is expected to happen rarely if the pathogen’s prevalence is small, at least one or more people within that group must be positive; therefore, a few more tests to determine the infected individuals are needed.

Left: Sixteen individual tests are required to screen 16 people — only one person’s test is positive, while 15 return negative. Right: Following Dorfman’s procedure, samples are pooled into four groups of four individuals, and tests are executed on the pooled samples. Because only the second group tests positive, 12 individuals are cleared and only those four belonging to the positive group need to be retested. This approach requires only eight tests, instead of the 16 needed for an exhaustive testing campaign.
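The arithmetic of the example above is easy to reproduce with a small simulation of Dorfman’s two-stage procedure, here assuming perfectly reliable tests:

```python
# Dorfman's two-stage procedure with noiseless tests.
import numpy as np

def dorfman_tests(status, group_size=4):
    """Number of tests used on a binary infection vector."""
    n_tests = 0
    for start in range(0, len(status), group_size):
        group = status[start:start + group_size]
        n_tests += 1                    # one pooled test per group
        if group.any():                 # positive pool: retest each member individually
            n_tests += len(group)
    return n_tests

status = np.zeros(16, dtype=bool)
status[5] = True                        # one infected person, in the second group
print(dorfman_tests(status))            # 4 pooled tests + 4 retests = 8, not 16
```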

Dorfman’s proposal triggered many follow-up works with connections to several areas in computer science, such as information theory, combinatorics or compressive sensing, and several variants of his approach have been proposed, notably those leveraging binary splitting or side knowledge on individual infection probability rates. The field has grown to the extent that several sub-problems are recognized and deserving of an entire literature on their own. Some algorithms are tailored for the noiseless case in which tests are perfectly reliable, whereas some consider instead the more realistic case where tests are noisy and may produce false negatives or positives. Finally, some strategies are adaptive, proposing groups based on test results already observed (including Dorfman’s, since it proposes to re-test individuals that appeared in positive groups), whereas others stick to a non-adaptive setting in which groups are known beforehand or drawn at random.

In “Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design”, we present an approach to group testing that can operate in a noisy setting (i.e., where tests can be mistaken) and that adaptively decides, by looking at past results, which groups to test next, with the goal of converging on a reliable detection as quickly, and with as few tests, as possible. Large-scale simulations suggest that this approach may result in significant improvements over both adaptive and non-adaptive baselines, and is far more efficient than individual tests when disease prevalence is low. As such, this approach is particularly well suited for situations that require large numbers of tests to be conducted with limited resources, as may be the case for pandemics, such as the one corresponding to the spread of COVID-19. We have open-sourced the code to the community through our GitHub repo.

Noisy and Adaptive Group Testing in a Non-Asymptotic Regime
A group testing strategy is an algorithm that is tasked with guessing who, among a list of n people, carries a particular pathogen. To do so, the strategy provides instructions for pooling individuals into groups. Assuming a laboratory can execute k tests at a time, the strategy will form a k×n pooling matrix that defines these groups. Once the tests are carried out, the results are used to decide whether sufficient information has been gathered to determine who is or is not infected, and if not, how to form new groups for another round of testing.
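In the noiseless case, applying such a pooling matrix is just a thresholded matrix-vector product over the unknown infection state, as in this tiny illustration:

```python
# Noiseless pooled outcomes from a k-by-n pooling matrix.
import numpy as np

pooling = np.zeros((4, 16), dtype=int)          # k = 4 groups, n = 16 people
for g in range(4):
    pooling[g, 4 * g:4 * (g + 1)] = 1           # four disjoint groups of four

status = np.zeros(16, dtype=int)
status[5] = 1                                   # one infected person
pooled_results = (pooling @ status) > 0         # only the second group tests positive
```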

We designed a group testing approach for the realistic setting where the testing strategy can be adaptive and where tests are noisy — the probability that the test of an infected sample is positive (sensitivity) is less than 100%, as is the specificity, the probability that a non-infected sample returns negative.

Screening More People with Fewer Tests Using Bayesian Optimal Experimental Design
The strategy we propose proceeds the way a detective would investigate a case. They first form several hypotheses about who may or may not be infected, using evidence from all tests (if any) that have been carried out so far and prior information on the infection rate (a). Using these hypotheses, our detectives produce an actionable item to continue the investigation, namely a next wave of groups that may help in validating or invalidating as many hypotheses as possible (b), and then loop back to (a) until the set of plausible hypotheses is small enough to unambiguously identify the target of the search. More precisely,

  1. Given a population of n people, an infection state is a binary vector of length n that describes who is infected (marked with a 1), and who is not (marked with a 0). At a certain time, a population is in a given state (most likely a few 1’s and mostly 0’s). The goal of group testing is to identify that state using as few tests as possible. Given a prior belief on the infection rate (the disease is rare) and test results observed so far (if any), we expect that only a small share of those infection states will be plausible. Rather than evaluating the plausibility of all 2^n possible states (an extremely large number even for small n), we resort to a more efficient method to sample plausible hypotheses using a sequential Monte Carlo (SMC) sampler. Although quite costly by common standards (a few minutes using a GPU in our experimental setup), we show in this work that SMC samplers remain tractable even for large n, opening new possibilities for group testing. In short, in return for a few minutes of computations, our detectives get an extensive list of thousands of relevant hypotheses that may explain tests observed so far.
  2. Equipped with a relevant list of hypotheses, our strategy proceeds, as detectives would, by selectively gathering additional evidence. If k tests can be carried out at the next iteration, our strategy will propose to test k new groups, which are computed using the framework of Bayesian optimal experimental design. Intuitively, if k=1 and one can only propose a single new group to test, there would be clear advantage in building that group such that its test outcome is as uncertain as possible, i.e., with a probability that it returns positive as close to 50% as possible, given the current set of hypotheses. Indeed, to progress in an investigation, it is best to maximize the surprise factor (or information gain) provided by new test results, as opposed to using them to confirm further what we already hold to be very likely. To generalize that idea to a set of k>1 new groups, we score this surprise factor by computing the mutual information of these “virtual” group tests vs. the distribution of hypotheses. We also consider a more involved approach that computes the expected area under the ROC curve (AUC) one would obtain from testing these new groups using the distribution of hypotheses. The maximization of these two criteria is carried out using a greedy approach, resulting in two group selectors, GMIMAX and GAUCMAX (greedy maximization of mutual information or AUC, respectively).
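As a rough illustration of the scoring step (assumptions only, not the open-sourced implementation), the mutual information between a single candidate group’s noisy test outcome and the current set of weighted hypotheses can be computed as follows; a greedy selector would then score candidate groups with such a criterion and keep the k best.

```python
# Illustrative mutual-information score for one candidate group.
import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def group_mutual_information(particles, weights, group,
                             sensitivity=0.85, specificity=0.97):
    """particles: (S, n) binary hypotheses; weights: (S,) summing to 1;
    group: boolean mask of length n selecting who is pooled together."""
    infected_in_group = particles[:, group].any(axis=1)
    # Probability the pooled test is positive under each hypothesis.
    p_pos_given_state = np.where(infected_in_group, sensitivity, 1.0 - specificity)
    p_pos = np.dot(weights, p_pos_given_state)          # marginal over hypotheses
    # I(outcome; state) = H(outcome) - E_state[H(outcome | state)]
    return binary_entropy(p_pos) - np.dot(weights, binary_entropy(p_pos_given_state))
```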

The interaction between a laboratory (wet_lab) carrying out testing, and our strategy, composed of a sampler and a group selector, is summarized in the following drawing, which uses names of classes implemented in our open source package.

Our group testing framework describes an interaction between a testing environment, the wet_lab, whose pooled test results are used by the sampler to draw thousands of plausible hypotheses on the infection status of all individuals. These hypotheses are then used by an optimization procedure, group_selector, that figures out what groups may be the most relevant to test in order to narrow down on the true infection status. Once formed, these new groups are then tested again, closing the loop. At any point in the procedure, the hypotheses formed by the sampler can be averaged to obtain the average probability of infection for each patient. From these probabilities, a decision on whether a patient is infected or not can be made by thresholding them at a certain confidence level.

Benchmarking
We benchmarked our two strategies GMIMAX and GAUCMAX against various baselines in a wide variety of settings (infection rates, test noise levels), reporting performance as the number of tests increases. In addition to simple Dorfman strategies, the baselines we considered included a mix of non-adaptive strategies (origami assays, random designs) complemented at later stages with the so-called informative Dorfman approach. Our approaches significantly outperform the others in all settings.

We executed 5000 simulations on a sample population of 70 individuals with an infection rate of 2%. We have assumed sensitivity/specificity values of 85% / 97% for tests with groups of maximal size 10, which are representative of current PCR machines. This figure demonstrates that our approach outperforms the other baselines with as few as 24 tests (up to 8 tests used in 3 cycles), including both adaptive and non-adaptive varieties, and performs significantly better than individual tests (plotted in the sensitivity/specificity plane as a hexagon, requiring 70 tests), highlighting the savings potential offered by group testing. See preprint for other setups.

Conclusion
Screening a population for a pathogen is a fundamental problem, one that we face acutely during the current COVID-19 epidemic. Seventy years ago, Dorfman proposed a simple approach that is still adopted by various institutions today. Here, we have proposed a method to extend the basic group testing approach in several ways. Our first contribution is to adopt a probabilistic perspective, and form thousands of plausible hypotheses of infection distributions given test outcomes, rather than trust test results to be 100% reliable as Dorfman did. This perspective allows us to seamlessly incorporate additional prior knowledge on infection, such as when we suspect some individuals to be more likely than others to carry the pathogen, based for instance on contact tracing data or answers to a questionnaire. This gives our algorithms, which can be compared to detectives investigating a case, the advantage of knowing which infection hypotheses are most likely given prior beliefs and the tests carried out so far. Our second contribution is to propose algorithms that can take advantage of these hypotheses to form new groups, and therefore direct the gathering of new evidence, to narrow down as quickly as possible to the “true” infection hypothesis, and close the case with as little testing effort as possible.

Acknowledgements
We would like to thank our collaborators on this work, Olivier Teboul, in particular, for his help preparing figures, as well as Arnaud Doucet and Quentin Berthet. We also thank Kevin Murphy and Olivier Bousquet (Google) for their suggestions at the earliest stages of this project, as well as Dan Popovici for his unwavering support pushing this forward; Ignacio Anegon, Jeremie Poschmann and Laurent Tesson (INSERM) for providing us background information on RT-PCR tests and Nicolas Chopin (CREST) for giving guidance on his work to define SMCs for binary spaces.

30 years of family videos in an AI archive

My dad got his first video camera the day I was born nearly three decades ago. “Say hello to the camera!” are the first words he caught on tape, as he pointed it at a red, puffy baby (me) in a hospital bassinet. The clips got more embarrassing from there, as he continued to film through many diaper changes, temper tantrums and—worst of all—puberty.

Most of those potential blackmail tokens sat trapped on miniDV tapes or scattered across SD cards until two years ago when my dad uploaded them all to Google Drive. Theoretically, since they were now stored in the cloud, my family and I could watch them whenever we wanted. But with more than 456 hours of footage, watching it all would have been a herculean effort. You can only watch old family friends open Christmas gifts so many times. So, as an Applied AI Engineer, I got down to business and built an AI-powered searchable archive of our family videos.

If you’ve ever used Google Photos, you’ve seen the power of using AI to search and organize images and videos. The app uses machine learning to identify people and pets, as well as objects and text in images. So, if I search “pool” in the Google Photos app, it’ll show me all the pictures and videos I ever took of pools.

But for this project, I needed a couple of features Photos doesn’t (yet!) support. First, because my dad’s first camera recorded footage to miniDV tapes, those videos were uploaded as meaty, two-hour-long movies with no useful metadata. Instead, my dad would start a clip by saying, “let me put a date on the screen here…” and a little white text snippet would appear in the bottom right corner of the frame. In between shots on a single reel, he’d say: “Say goodbye, I’m going to fade out now.” I would scream, “NO, DON’T FADE OUT,” while the screen faded to black. So, my first step was to use machine learning to automatically parse the date shown on the screen, and split the single long video into shorter clips after each fade out.

In this picture, you can see the timestamp shown on screen. Using the Vision API, I could extract it to sort my videos by date.

For this, I turned to the Video Intelligence API, a Google Cloud tool that lets developers analyze videos with machine learning. It allows you to replicate many of the features found in the Google Photos app—like tagging objects in images and recognizing on-screen text—and a whole lot more. For example, the API’s shot change detection feature automatically finds the timestamps in videos where a scene changes, which allowed me to split those long videos into smaller chunks.
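The calls involved look roughly like the sketch below; the request style and field names follow the documented Python client at the time of writing and can vary by client-library version, and the bucket path is hypothetical.

```python
# Rough sketch of requesting shot-change and text detection on one video.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    request={
        "input_uri": "gs://my-bucket/family-tape-01.mp4",  # hypothetical path
        "features": [
            videointelligence.Feature.SHOT_CHANGE_DETECTION,
            videointelligence.Feature.TEXT_DETECTION,
        ],
    }
)
result = operation.result(timeout=600)

annotations = result.annotation_results[0]
for shot in annotations.shot_annotations:       # candidate points to split the reel
    print(shot.start_time_offset, shot.end_time_offset)
for text in annotations.text_annotations:       # on-screen text, e.g. the date overlay
    print(text.text)
```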

Using the label detection feature, I could search for all sorts of different events, like “bridal shower,” “wedding,” “bat and ball games” and “baby.” By searching “performance,” I was able to finally find one of my life’s proudest accomplishments on tape—a starring role singing “It’s Not Easy Being Green” in my kindergarten’s production of the Sesame Street musical.

My starring role as Kermit the Frog in my school’s Sesame Street musical. The Video Intelligence API tagged it as “performance”.  

The Video Intelligence API’s real “killer feature” for me was its ability to do audio transcription. By transcribing my videos, I was able to query clips by what people said in them. I could search for specific names (“Scott,” “Dale,” “grandma”), proper nouns (“Chuck E Cheese”, “Pokemon”), and for unique phrases. By searching “first steps,” I found a clip of my dad saying, “Here she comes… plunk. That’s the first time she’s taken major steps” alongside a video of me managing, just barely, to waddle along.

My first steps that I was able to find with the Video Intelligence API’s Transcription feature. Here, my dad says, “…this is the first time she’s taken major steps.”

In the end, machine learning helped me build exactly the kind of archive I wanted—one that let me search my family videos by memories, not timestamps.

P.S. Want to see how I built it? Check out my technical blog post or catch the video on the Cloud YouTube channel.

Google at ICML 2020

Posted by Jaqui Herman and Cat Armato, Program Managers

Machine learning is a key strategic focus at Google, with highly active groups pursuing research in virtually all aspects of the field, including deep learning and more classical algorithms, exploring theory as well as application. We utilize scalable tools and architectures to build machine learning systems that enable us to solve deep scientific and engineering challenges in areas of language, speech, translation, music, visual processing and more.

As a leader in machine learning research, Google is proud to be a Platinum Sponsor of the thirty-seventh International Conference on Machine Learning (ICML 2020), a premier annual event taking place virtually this week. With over 100 accepted publications and Googlers participating in workshops, we look forward to our continued collaboration with the larger machine learning research community.

If you’re registered for ICML 2020, we hope you’ll visit the Google virtual booth to learn more about the exciting work, creativity and fun that goes into solving some of the field’s most interesting challenges. You can also learn more about the Google research being presented at ICML 2020 in the list below (Google affiliations bolded).

ICML Expo
Google Dataset Search: Building an Open Ecosystem for Dataset Discovery
Natasha Noy

End-to-end Bayesian inference workflows in TensorFlow Probability
Colin Carroll

Publications
Population-Based Black-Box Optimization for Biological Sequence Design
Christof Angermueller, David Belanger, Andreea Gane, Zelda Mariet, David Dohan, Kevin Murphy, Lucy Colwell, D Sculley

Predictive Coding for Locally-Linear Control
Rui Shu, Tung Nguyen, Yinlam Chow, Tuan Pham, Khoat Than, Mohammad Ghavamzadeh, Stefano Ermon, Hung Bui

FedBoost: A Communication-Efficient Algorithm for Federated Learning
Jenny Hamer, Mehryar Mohri, Ananda Theertha Suresh

Faster Graph Embeddings via Coarsening
Matthew Fahrbach, Gramoz Goranci, Richard Peng, Sushant Sachdeva, Chi Wang

Revisiting Fundamentals of Experience Replay
William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, Will Dabney

Boosting for Control of Dynamical Systems
Naman Agarwal, Nataly Brukhim, Elad Hazan, Zhou Lu

Neural Clustering Processes
Ari Pakman, Yueqi Wang, Catalin Mitelut, JinHyung Lee, Liam Paninski

The Tree Ensemble Layer: Differentiability Meets Conditional Computation
Hussein Hazimeh, Natalia Ponomareva, Petros Mol, Zhenyu Tan, Rahul Mazumder

Representations for Stable Off-Policy Reinforcement Learning
Dibya Ghosh, Marc Bellemare

REALM: Retrieval-Augmented Language Model Pre-Training
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang

Context Aware Local Differential Privacy
Jayadev Acharya, Keith Bonawitz, Peter Kairouz, Daniel Ramage, Ziteng Sun

Scalable Deep Generative Modeling for Sparse Graphs
Hanjun Dai, Azade Nazi, Yujia Li, Bo Dai, Dale Schuurmans

Deep k-NN for Noisy Labels
Dara Bahri, Heinrich Jiang, Maya Gupta

Revisiting Spatial Invariance with Low-Rank Local Connectivity
Gamaleldin F. Elsayed, Prajit Ramachandran, Jonathon Shlens, Simon Kornblith

SCAFFOLD: Stochastic Controlled Averaging for Federated Learning
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

Incremental Sampling Without Replacement for Sequence Models
Kensen Shi, David Bieber, Charles Sutton

SoftSort: A Continuous Relaxation for the argsort Operator
Sebastian Prillo, Julian Martin Eisenschlos

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation (see blog post)
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, Melvin Johnson

Learning to Stop While Learning to Predict
Xinshi Chen, Hanjun Dai, Yu Li, Xin Gao, Le Song

Bandits with Adversarial Scaling
Thodoris Lykouris, Vahab Mirrokni, Renato Paes Leme

SimGANs: Simulator-Based Generative Adversarial Networks for ECG Synthesis to Improve Deep ECG Classification
Tomer Golany, Daniel Freedman, Kira Radinsky

Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization
Geoffrey Negiar, Gideon Dresdner, Alicia Yi-Ting Tsai, Laurent El Ghaoui, Francesco Locatello, Robert M. Freund, Fabian Pedregosa

Implicit differentiation of Lasso-type models for hyperparameter optimization
Quentin Bertrand, Quentin Klopfenstein, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon

Infinite attention: NNGP and NTK for deep attention networks
Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, Roman Novak

Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently
Asaf Cassel, Alon Cohen, Tomer Koren

Adversarial Learning Guarantees for Linear Hypotheses and Neural Networks
Pranjal Awasthi, Natalie Frank, Mehryar Mohri

Random Hypervolume Scalarizations for Provable Multi-Objective Black Box Optimization
Daniel Golovin, Qiuyi (Richard) Zhang

Generating Programmatic Referring Expressions via Program Synthesis
Jiani Huang, Calvin Smith, Osbert Bastani, Rishabh Singh, Aws Albarghouthi, Mayur Naik

Optimizing Long-term Social Welfare in Recommender Systems: A Constrained Matching Approach
Martin Mladenov, Elliot Creager, Omer Ben-Porat, Kevin Swersky, Richard Zemel, Craig Boutilier

AutoML-Zero: Evolving Machine Learning Algorithms From Scratch (see blog post)
Esteban Real, Chen Liang, David R. So, Quoc V. Le

How Good is the Bayes Posterior in Deep Neural Networks Really?
Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin

Which Tasks Should Be Learned Together in Multi-task Learning?
Trevor Standley, Amir R. Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, Silvio Savarese

Influence Diagram Bandits: Variational Thompson Sampling for Structured Bandit Problems
Tong Yu, Branislav Kveton, Zheng Wen, Ruiyi Zhang, Ole J. Mengshoel

Disentangling Trainability and Generalization in Deep Neural Networks
Lechao Xiao, Jeffrey Pennington, Samuel S. Schoenholz

The Many Shapley Values for Model Explanation
Mukund Sundararajan, Amir Najmi

Neural Contextual Bandits with UCB-based Exploration
Dongruo Zhou, Lihong Li, Quanquan Gu

Automatic Shortcut Removal for Self-Supervised Representation Learning
Matthias Minderer, Olivier Bachem, Neil Houlsby, Michael Tschannen

Federated Learning with Only Positive Labels
Felix X. Yu, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

How Recurrent Networks Implement Contextual Processing in Sentiment Analysis
Niru Maheswaranathan, David Sussillo

Supervised Learning: No Loss No Cry
Richard Nock, Aditya Krishna Menon

Ready Policy One: World Building Through Active Learning
Philip Ball, Jack Parker-Holder, Aldo Pacchiano, Krzysztof Choromanski, Stephen Roberts

Weakly-Supervised Disentanglement Without Compromises
Francesco Locatello, Ben Poole, Gunnar Raetsch, Bernhard Schölkopf, Olivier Bachem, Michael Tschannen

Fast Differentiable Sorting and Ranking
Mathieu Blondel, Olivier Teboul, Quentin Berthet, Josip Djolonga

Debiased Sinkhorn barycenters
Hicham Janati, Marco Cuturi, Alexandre Gramfort

Interpretable, Multidimensional, Multimodal Anomaly Detection with Negative Sampling for Detection of Device Failure
John Sipple

Accelerating Large-Scale Inference with Anisotropic Vector Quantization
Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, Sanjiv Kumar

An Optimistic Perspective on Offline Reinforcement Learning (see blog post)
Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi

The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization
Ben Adlam, Jeffrey Pennington

Private Query Release Assisted by Public Data
Raef Bassily, Albert Cheu, Shay Moran, Aleksandar Nikolov, Jonathan Ullman, Zhiwei Steven Wu

Learning and Evaluating Contextual Embedding of Source Code
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

Evaluating Machine Accuracy on ImageNet
Vaishaal Shankar, Rebecca Roelofs, Horia Mania, Alex Fang, Benjamin Recht, Ludwig Schmidt

Imputer: Sequence Modelling via Imputation and Dynamic Programming
William Chan, Chitwan Saharia, Geoffrey Hinton, Mohammad Norouzi, Navdeep Jaitly

Domain Aggregation Networks for Multi-Source Domain Adaptation
Junfeng Wen, Russell Greiner, Dale Schuurmans

Planning to Explore via Self-Supervised World Models
Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak

Context-Aware Dynamics Model for Generalization in Model-Based Reinforcement Learning
Kimin Lee, Younggyo Seo, Seunghyun Lee, Honglak Lee, Jinwoo Shin

Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search
Binghong Chen, Chengtao Li, Hanjun Dai, Le Song

On the Consistency of Top-k Surrogate Losses
Forest Yang, Sanmi Koyejo

Dual Mirror Descent for Online Allocation Problems
Haihao Lu, Santiago Balseiro, Vahab Mirrokni

Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors
Michael W. Dusenberry, Ghassen Jerfel, Yeming Wen, Yi-An Ma, Jasper Snoek, Katherine Heller, Balaji Lakshminarayanan, Dustin Tran

Batch Stationary Distribution Estimation
Junfeng Wen, Bo Dai, Lihong Li, Dale Schuurmans

Small-GAN: Speeding Up GAN Training Using Core-Sets
Samarth Sinha, Han Zhang, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Augustus Odena

Data Valuation Using Reinforcement Learning
Jinsung Yoon, Sercan Ö. Arik, Tomas Pfister

A Game Theoretic Perspective on Model-Based Reinforcement Learning
Aravind Rajeswaran, Igor Mordatch, Vikash Kumar

Encoding Musical Style with Transformer Autoencoders
Kristy Choi, Curtis Hawthorne, Ian Simon, Monica Dinculescu, Jesse Engel

The Shapley Taylor Interaction Index
Kedar Dhamdhere, Mukund Sundararajan, Ashish Agarwal

Multidimensional Shape Constraints
Maya Gupta, Erez Louidor, Olexander Mangylov, Nobu Morioka, Taman Narayan, Sen Zhao

Private Counting from Anonymous Messages: Near-Optimal Accuracy with Vanishing Communication Overhead
Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh

Learning to Score Behaviors for Guided Policy Optimization
Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Anna Choromanska, Krzysztof Choromanski, Michael I. Jordan

Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations
Florian Tramèr, Jens Behrmann, Nicholas Carlini, Nicolas Papernot, Jörn-Henrik Jacobsen

Optimizing Black-Box Metrics with Adaptive Surrogates
Qijia Jiang, Olaoluwa Adigun, Harikrishna Narasimhan, Mahdi Milani Fard, Maya Gupta

Circuit-Based Intrinsic Methods to Detect Overfitting
Sat Chatterjee, Alan Mishchenko

Automatic Reparameterisation of Probabilistic Programs
Maria I. Gorinova, Dave Moore, Matthew D. Hoffman

Stochastic Flows and Geometric Optimization on the Orthogonal Group
Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov, Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamas Sarlos, Adrian Weller, Vikas Sindhwani

Black-Box Variational Inference as a Parametric Approximation to Langevin Dynamics
Matthew Hoffman, Yi-An Ma

Concise Explanations of Neural Networks Using Adversarial Training
Prasad Chalasani, Jiefeng Chen, Amrita Roy Chowdhury, Somesh Jha, Xi Wu

p-Norm Flow Diffusion for Local Graph Clustering
Shenghao Yang, Di Wang, Kimon Fountoulakis

Empirical Study of the Benefits of Overparameterization in Learning Latent Variable Models
Rares-Darius Buhai, Yoni Halpern, Yoon Kim, Andrej Risteski, David Sontag

Robust Pricing in Dynamic Mechanism Design
Yuan Deng, Sébastien Lahaie, Vahab Mirrokni

Differentiable Product Quantization for Learning Compact Embedding Layers
Ting Chen, Lala Li, Yizhou Sun

Adaptive Region-Based Active Learning
Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, Ningshan Zhang

Countering Language Drift with Seeded Iterated Learning
Yuchen Lu, Soumye Singhal, Florian Strub, Olivier Pietquin, Aaron Courville

Does Label Smoothing Mitigate Label Noise?
Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

Acceleration Through Spectral Density Estimation
Fabian Pedregosa, Damien Scieur

Momentum Improves Normalized SGD
Ashok Cutkosky, Harsh Mehta

ConQUR: Mitigating Delusional Bias in Deep Q-Learning
Andy Su, Jayden Ooi, Tyler Lu, Dale Schuurmans, Craig Boutilier

Online Learning with Imperfect Hints
Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit

Go Wide, Then Narrow: Efficient Training of Deep Thin Networks
Denny Zhou, Mao Ye, Chen Chen, Tianjian Meng, Mingxing Tan, Xiaodan Song, Quoc Le, Qiang Liu, Dale Schuurmans

On Implicit Regularization in β-VAEs
Abhishek Kumar, Ben Poole

Is Local SGD Better than Minibatch SGD?
Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro

A Simple Framework for Contrastive Learning of Visual Representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton

Universal Average-Case Optimality of Polyak Momentum
Damien Scieur, Fabian Pedregosa

An Imitation Learning Approach for Cache Replacement
Evan Zheran Liu, Milad Hashemi, Kevin Swersky, Parthasarathy Ranganathan, Junwhan Ahn

Collapsed Amortized Variational Inference for Switching Nonlinear Dynamical Systems
Zhe Dong, Bryan A. Seybold, Kevin P. Murphy, Hung H. Bui

Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels
Lu Jiang, Di Huang, Mason Liu, Weilong Yang

Optimizing Data Usage via Differentiable Rewards
Xinyi Wang, Hieu Pham, Paul Michel, Antonios Anastasopoulos, Jaime Carbonell, Graham Neubig

Sparse Sinkhorn Attention
Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan

One Policy to Control Them All: Shared Modular Policies for Agent-Agnostic Control
Wenlong Huang, Igor Mordatch, Deepak Pathak

On Thompson Sampling with Langevin Algorithms
Eric Mazumdar, Aldo Pacchiano, Yi-An Ma, Peter L. Bartlett, Michael I. Jordan

Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection
Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, Qiang Liu

On the Global Convergence Rates of Softmax Policy Gradient Methods
Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans

Concept Bottleneck Models
Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, Percy Liang

Supervised Quantile Normalization for Low-Rank Matrix Approximation
Marco Cuturi, Olivier Teboul, Jonathan Niles-Weed, Jean-Philippe Vert

Missing Data Imputation Using Optimal Transport
Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi

Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention Over Modules
Sarthak Mittal, Alex Lamb, Anirudh Goyal, Vikram Voleti, Murray Shanahan, Guillaume Lajoie, Michael Mozer, Yoshua Bengio

Stochastic Optimization for Regularized Wasserstein Estimators
Marin Ballu, Quentin Berthet, Francis Bach

Low-Rank Bottleneck in Multi-head Attention Models
Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Jakkam Reddi, Sanjiv Kumar

Rigging the Lottery: Making All Tickets Winners
Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, Erich Elsen

Online Learning with Dependent Stochastic Feedback Graphs
Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, Ningshan Zhang

Calibration, Entropy Rates, and Memory in Language Models
Mark Braverman, Xinyi Chen, Sham Kakade, Karthik Narasimhan, Cyril Zhang, Yi Zhang

Composable Sketches for Functions of Frequencies: Beyond the Worst Case
Edith Cohen, Ofir Geri, Rasmus Pagh

Energy-Based Processes for Exchangeable Data
Mengjiao Yang, Bo Dai, Hanjun Dai, Dale Schuurmans

Near-Optimal Regret Bounds for Stochastic Shortest Path
Alon Cohen, Haim Kaplan, Yishay Mansour, Aviv Rosenberg

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization (see blog post)
Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu

The Complexity of Finding Stationary Points with Stochastic Gradient Descent
Yoel Drori, Ohad Shamir

The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks
Jakub Swiatkowski, Kevin Roth, Bas Veeling, Linh Tran, Josh Dillon, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin

Regularized Optimal Transport is Ground Cost Adversarial
François-Pierre Paty, Marco Cuturi

Workshops
New In ML
Invited Speaker: Nicolas Le Roux
Organizers: Zhen Xu, Sparkle Russell-Puleri, Zhengying Liu, Sinead A Williamson, Matthias W Seeger, Wei-Wei Tu, Samy Bengio, Isabelle Guyon

LatinX in AI
Workshop Advisor: Pablo Samuel Castro

Women in Machine Learning Un-Workshop
Invited Speaker: Doina Precup
Sponsor Expo Speaker: Jennifer Wei

Queer in AI
Invited Speaker: Shakir Mohamed

Workshop on Continual Learning
Organizers: Haytham Fayek, Arslan Chaudhry, David Lopez-Paz, Eugene Belilovsky, Jonathan Schwarz, Marc Pickett, Rahaf Aljundi, Sayna Ebrahimi, Razvan Pascanu, Puneet Dokania

5th ICML Workshop on Human Interpretability in Machine Learning (WHI)
Organizers: Kush Varshney, Adrian Weller, Alice Xiang, Amit Dhurandhar, Been Kim, Dennis Wei, Umang Bhatt

Self-supervision in Audio and Speech
Organizers: Mirco Ravanelli, Dmitriy Serdyuk, R Devon Hjelm, Bhuvana Ramabhadran, Titouan Parcollet

Workshop on eXtreme Classification: Theory and Applications
Invited Speaker: Sanjiv Kumar

Healthcare Systems, Population Health, and the Role of Health-tech
Organizers: Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov, Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamas Sarlos, Adrian Weller, Vikas Sindhwani

Theoretical Foundations of Reinforcement Learning
Program Committee: Alon Cohen, Chris Dann

Uncertainty and Robustness in Deep Learning Workshop (UDL)
Invited Speaker: Justin Gilmer

Organizers: Sharon Li, Balaji Lakshminarayanan, Dan Hendrycks, Thomas Dietterich, Jasper Snoek
Program Committee: Jeremiah Liu, Jie Ren, Rodolphe Jenatton, Zack Nado, Alexander Alemi, Florian Wenzel, Mike Dusenberry, Raphael Lopes

Beyond First Order Methods in Machine Learning Systems
Industry Panel: Jonathan Hseu

Object-Oriented Learning: Perception, Representation, and Reasoning
Invited Speakers: Thomas Kipf, Igor Mordatch

Graph Representation Learning and Beyond (GRL+)
Organizers: Michael Bronstein, Andreea Deac, William L. Hamilton, Jessica B. Hamrick, Milad Hashemi, Stefanie Jegelka, Jure Leskovec, Renjie Liao, Federico Monti, Yizhou Sun, Kevin Swersky, Petar Veličković, Rex Ying, Marinka Žitnik
Speaker: Thomas Kipf
Program Committee: Bryan Perozzi, Kevin Swersky, Milad Hashemi, Thomas Kipf, Ting Cheng

ML Interpretability for Scientific Discovery
Organizers: Subhashini Venugopalan, Michael Brenner, Scott Linderman, Been Kim
Program Committee: Akinori Mitani, Arunachalam Narayanaswamy, Avinash Varadarajan, Awa Dieng, Benjamin Sanchez-Lengeling, Bo Dai, Stephan Hoyer, Subham Sekhar Sahoo, Suhani Vora
Steering Committee: John Platt, Mukund Sundararajan, Jon Kleinberg

Negative Dependence and Submodularity for Machine Learning
Organizers: Zelda Mariet, Mike Gartrell, Michal Derezinski

7th ICML Workshop on Automated Machine Learning (AutoML)
Organizers: Charles Weill, Katharina Eggensperger, Matthias Feurer, Frank Hutter, Marius Lindauer, Joaquin Vanschoren

Federated Learning for User Privacy and Data Confidentiality
Keynote: Brendan McMahan
Program Committee: Peter Kairouz, Jakub Konecný

MLRetrospectives: A Venue for Self-Reflection in ML Research
Speaker: Margaret Mitchell

Machine Learning for Media Discovery
Speaker: Ed Chi

INNF+: Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models
Organizers: Chin-Wei Huang, David Krueger, Rianne van den Berg, George Papamakarios, Chris Cremer, Ricky Chen, Danilo Rezende

4th Lifelong Learning Workshop
Program Committee: George Tucker, Marlos C. Machado

2nd ICML Workshop on Human in the Loop Learning (HILL)
Organizers: Shanghang Zhang, Xin Wang, Fisher Yu, Jiajun Wu, Trevor Darrell

Machine Learning for Global Health
Organizers: Danielle Belgrave, Stephanie Hyland, Charles Onu, Nicholas Furnham, Ernest Mwebaze, Neil Lawrence

Committee
Social Chair: Adam White

Work performed while at Google


Grounding Natural Language Instructions to Mobile UI Actions

Posted by Yang Li, Research Scientist, Google Research

Mobile devices offer a myriad of functionalities that can assist in everyday activities. However, many of these functionalities are not easily discoverable or accessible to users, forcing users to look up how to perform a specific task — how to turn on the traffic mode in Maps or change notification settings in YouTube, for example. While searching the web for detailed instructions for these questions is an option, it is still up to the user to follow these instructions step-by-step and navigate UI details through a small touchscreen, which can be tedious and time consuming, and results in reduced accessibility. What if one could design a computational agent to turn these language instructions into actions and automatically execute them on the user’s behalf?

In “Mapping Natural Language Instructions to Mobile UI Action Sequences”, published at ACL 2020, we present the first step towards addressing the problem of automatic action sequence mapping, creating three new datasets used to train deep learning models that ground natural language instructions to executable mobile UI actions. This work lays the technical foundation for task automation on mobile devices that would alleviate the need to maneuver through UI details, which may be especially valuable for users who are visually or situationally impaired. We have also open-sourced our model code and data pipelines through our GitHub repository, in order to spur further developments among the research community.

Constructing Language Grounding Models
People often provide one another with instructions in order to coordinate joint efforts and accomplish tasks involving complex sequences of actions, for example, following a recipe to bake a cake, or having a friend walk you through setting up a home network. Building computational agents able to help with similar interactions is an important goal that requires true language grounding in the environments in which the actions take place.

The learning task addressed here is to predict a sequence of actions for a mobile platform given a set of instructions, a sequence of screens produced as the system transitions from one screen to another, as well as the set of interactive elements on those screens. Training such a model end-to-end would require paired language-action data, which is difficult to acquire at a large scale.

Instead, we deconstruct the problem into two sequential steps: an action phrase-extraction step and a grounding step.

The workflow of grounding language instructions to executable actions.

The action phrase-extraction step identifies the operation, object and argument descriptions from multi-step instructions using a Transformer model with area attention for representing each description phrase. Area attention allows the model to attend to a group of adjacent words in the instruction (a span) as a whole for decoding a description.

The action phrase extraction model takes a word sequence of a natural language instruction and outputs a sequence of spans (denoted in red boxes) that indicate the phrases describing the operation, the object and the argument of each action in the task.
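
To make area attention concrete, the snippet below sketches the basic one-dimensional form in NumPy: every contiguous span of up to a maximum width becomes one attention target, whose key is the mean of the token keys in the span and whose value is the sum of the token values. This is a simplified illustration of the mechanism rather than the model’s actual implementation.

```python
import numpy as np

def area_attention_1d(query, keys, values, max_area_width=3):
    """Basic one-dimensional area attention (illustrative sketch).

    Every contiguous span of up to `max_area_width` tokens becomes one
    attention target ("area"): its key is the mean of the token keys in
    the span, and its value is the sum of the token values.
    """
    n, d = keys.shape
    area_keys, area_values, spans = [], [], []
    for width in range(1, max_area_width + 1):
        for start in range(n - width + 1):
            end = start + width
            area_keys.append(keys[start:end].mean(axis=0))
            area_values.append(values[start:end].sum(axis=0))
            spans.append((start, end))
    area_keys = np.stack(area_keys)        # (num_areas, d)
    area_values = np.stack(area_values)    # (num_areas, d)

    # Standard scaled dot-product attention, but over areas instead of tokens.
    scores = area_keys @ query / np.sqrt(d)   # (num_areas,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ area_values           # (d,)

    best_span = spans[int(np.argmax(weights))]  # most-attended span
    return context, best_span

# Toy usage: 6 tokens with 4-dimensional random keys and values.
rng = np.random.default_rng(0)
keys = rng.normal(size=(6, 4))
values = rng.normal(size=(6, 4))
query = rng.normal(size=4)
context, best_span = area_attention_1d(query, keys, values)
print("most attended span:", best_span)
```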

Next, the grounding step matches the extracted operation and object descriptions with a UI object on the screen. Again, we use a Transformer model, but in this case, it contextually represents UI objects and grounds object descriptions to them.

The grounding model takes the extracted spans as input and grounds them to executable actions, including the object an action is applied to, given the UI screen at each step during execution.
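
At a high level, grounding reduces to scoring each candidate UI object against the extracted object description and selecting the best match. The sketch below illustrates that matching step with simple dot-product scoring over placeholder embeddings; in the actual model these representations come from a Transformer that encodes the UI objects in the context of the screen and is trained end-to-end.

```python
import numpy as np

def ground_object_description(phrase_embedding, ui_object_embeddings):
    """Pick the UI object whose contextual embedding best matches the
    extracted object-description phrase (dot-product scoring sketch)."""
    scores = ui_object_embeddings @ phrase_embedding   # (num_objects,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Toy usage: 5 UI objects on the current screen with 8-dim embeddings.
rng = np.random.default_rng(1)
ui_objects = rng.normal(size=(5, 8))                # stand-in for encoder outputs
phrase = ui_objects[3] + 0.1 * rng.normal(size=8)   # phrase resembling object 3
index, probs = ground_object_description(phrase, ui_objects)
print("grounded to UI object", index)
```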

Results
To investigate the feasibility of this task and the effectiveness of our approach, we construct three new datasets to train and evaluate our model. The first dataset includes 187 multi-step English instructions for operating Pixel phones along with their corresponding action-screen sequences, and enables assessment of full task performance on naturally occurring instructions, which is used for testing end-to-end grounding quality. For action phrase extraction training and evaluation, we obtain English “how-to” instructions that can be found abundantly on the web and annotate phrases that describe each action. To train the grounding model, we synthetically generate 295K single-step commands to UI actions, covering 178K different UI objects across 25K mobile UI screens from a public Android UI corpus.

A Transformer with area attention obtains 85.56% accuracy for predicting span sequences that completely match the ground truth. The phrase extractor and grounding model together obtain 89.21% partial and 70.59% complete accuracy for matching ground-truth action sequences on the more challenging task of mapping language instructions to executable actions end-to-end. We also evaluated alternative methods and representations of UI objects, such as using a graph convolutional network (GCN) or a feedforward network, and found that those able to represent an object in the context of its screen lead to better grounding accuracy. The new datasets, models and results provide an important first step on the challenging problem of grounding natural language instructions to mobile UI actions.
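
As a rough illustration of the difference between the two end-to-end metrics, the sketch below assumes that “complete” accuracy requires the entire predicted action sequence to equal the ground truth, while “partial” accuracy credits each correctly predicted action; this reading is an assumption made for illustration, not the paper’s exact evaluation code.

```python
from typing import Sequence

def partial_and_complete_accuracy(
    predicted: Sequence[Sequence[str]], gold: Sequence[Sequence[str]]
):
    """Assumed definitions: 'complete' requires the whole predicted action
    sequence to match the ground truth; 'partial' credits every position
    whose predicted action matches the ground-truth action."""
    complete = matched = total = 0
    for pred_seq, gold_seq in zip(predicted, gold):
        complete += int(list(pred_seq) == list(gold_seq))
        matched += sum(p == g for p, g in zip(pred_seq, gold_seq))
        total += len(gold_seq)
    return matched / total, complete / len(gold)

# Toy usage with actions encoded as strings.
pred = [["tap:settings", "toggle:wifi"], ["tap:ok"]]
gold = [["tap:settings", "toggle:bluetooth"], ["tap:ok"]]
partial, complete = partial_and_complete_accuracy(pred, gold)
print(partial, complete)  # 0.666..., 0.5
```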

Conclusion
This research, and language grounding in general, is an important step for translating multi-stage instructions into actions on a graphical user interface. Successful application of task automation to the UI domain has the potential to significantly improve accessibility, where language interfaces might help individuals who are visually impaired perform tasks with interfaces that are predicated on sight. This also matters for situational impairment when one cannot access a device easily while encumbered by tasks at hand.

Deconstructing the problem into action phrase extraction and language grounding means that progress on either step can improve full task performance, and it alleviates the need for paired language-action datasets, which are difficult to collect at scale. For example, action span extraction is related to both semantic role labeling and extraction of multiple facts from text and could benefit from innovations in span identification and multitask learning. Reinforcement learning that has been applied in previous grounding work may help improve out-of-sample prediction for grounding in UIs and improve direct grounding from hidden state representations. Although our datasets were based on Android UIs, our approach can be applied generally to instruction grounding on other user interface platforms. Lastly, our work provides a technical foundation for investigating user experiences in language-based human computer interaction.

Acknowledgements
Many thanks to my collaborators on this work at Google Research. Xin Zhou and Jiacong He contributed substantially to the data pipelines and the creation of the datasets. Yuan Zhang and Jason Baldridge provided much valuable advice for the project and contributed to the presentation of the work. Gang Li provided generous help for creating open-source datasets. Many thanks to Ashwin Kakarla, Muqthar Mohammad and Mohd Majeed for their help with the annotations.
