REALM: Integrating Retrieval into Language Representation Models

Posted by Ming-Wei Chang and Kelvin Guu, Research Scientists, Google Research

Recent advances in natural language processing have largely built upon the power of unsupervised pre-training, which trains general purpose language representation models using a large amount of text, without human annotations or labels. These pre-trained models, such as BERT and RoBERTa, have been shown to memorize a surprising amount of world knowledge, such as “the birthplace of Francesco Bartolomeo Conti”, “the developer of JDK” and “the owner of Border TV”. While the ability to encode knowledge is especially important for certain natural language processing tasks such as question answering, information retrieval and text generation, these models memorize knowledge implicitly — i.e., world knowledge is captured in an abstract way in the model weights — making it difficult to determine what knowledge has been stored and where it is kept in the model. Furthermore, the storage space, and hence the accuracy of the model, is limited by the size of the network. To capture more world knowledge, the standard practice is to train ever-larger networks, which can be prohibitively slow or expensive.

Instead, what if there was a method for pre-training that could access knowledge explicitly, e.g., by referencing an additional large external text corpus, in order to achieve accurate results without increasing the model size or complexity?  For example, a sentence found in an external document collection, “Francesco Bartolomeo Conti was born in Florence,” could be referenced by the model to determine the birthplace of the musician, rather than relying on the model’s opaque ability to access the knowledge stored in its own parameters. The ability to retrieve text containing explicit knowledge such as this would improve the efficiency of pre-training while enabling the model to perform well on knowledge-intensive tasks without using billions of parameters.

In “REALM: Retrieval-Augmented Language Model Pre-Training”, accepted at the 2020 International Conference on Machine Learning, we share a novel paradigm for language model pre-training, which augments a language representation model with a knowledge retriever, allowing REALM models to retrieve textual world knowledge explicitly from raw text documents, instead of memorizing all the knowledge in the model parameters. We have also open sourced the REALM codebase to demonstrate how one can train the retriever and the language representation jointly.

Background: Pre-training Language Representation Models
To understand how standard language representation models memorize world knowledge, one should first review how these models are pre-trained. Since the invention of BERT, the fill-in-the-blank task, called masked language modeling, has been widely used for pre-training language representation models. Given any text with certain words masked out, the task is to fill back the missing words. An example of this task looks like:

I am so thirsty. I need to __ water.

During pre-training, a model will go over a large number of examples and adjust the parameters in order to predict the missing words (answer: drink, in the above example). Interestingly, the fill-in-the-blank task makes the model memorize certain facts about the world. For example, the knowledge of Einstein’s birthplace is required to fill the missing word in the following example:

Einstein was a __-born scientist. (answer: German)

However, because the world knowledge captured by the model is stored in the model weights, it is abstract, making it difficult to understand what information is stored.

Our Proposal: Retrieval-Augmented Language Representation Model Pre-training
In contrast to standard language representation models, REALM augments the language representation model with a knowledge retriever that first retrieves another piece of text from an external document collection as the supporting knowledge — in our experiments, we use the Wikipedia text corpus — and then feeds this supporting text as well as the original text into a language representation model.

The key intuition of REALM is that a retrieval system should improve the model’s ability to fill in missing words. Therefore, a retrieval that provides more context for filling the missing words should be rewarded. If the retrieved information does not help the model make its predictions, it should be discouraged, making room for better retrievals.

How does one train a knowledge retriever, given that only unlabeled text is available during pre-training? It turns out that one can use the task of filling in missing words to train the knowledge retriever indirectly, without any human annotations. Assume the input query is:

We paid twenty __ at the Buckingham Palace gift shop.

Filling the missing word (answer: pounds) in this sentence without retrieval can be tricky, as the model would need to have implicitly stored knowledge of the country in which Buckingham Palace is located and the associated currency, as well as make the connection between the two. It would be easier for the model to fill in the missing word if it was presented with a passage that explicitly connects some of the necessary knowledge, retrieved from an external corpus.

In this example, the retriever would be rewarded for retrieving the following sentence.

Buckingham Palace is the London residence of the British monarchy.

Since the retrieval step needs to add more context, there may be multiple retrieval targets that could be helpful in filling the missing word, for example, “The official currency of the United Kingdom is the Pound.” The whole process is demonstrated in the next figure:

Computational Challenges for REALM
Scaling REALM pre-training such that models can retrieve knowledge from millions of documents is challenging. In REALM, the selection of the best document is formulated as maximum inner product search (MIPS). To perform retrieval, MIPS models need to first encode all of the documents in the collection, such that each document has a corresponding document vector. When an input arrives, it is encoded as a query vector. In MIPS, given a query, the document in the collection that has the maximum inner product value between its document vector and the query vector is retrieved, as shown in the following figure:
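
To make the retrieval step concrete, the following is a minimal NumPy sketch of brute-force MIPS over a small collection of pre-computed document vectors. The encoders are replaced by random stand-ins, and the shapes and function names are illustrative assumptions rather than the REALM implementation.

```python
import numpy as np

# Stand-ins for pre-computed document vectors (one row per document). In REALM
# these come from a learned document encoder; here they are random placeholders.
num_docs, dim = 1000, 128
rng = np.random.default_rng(0)
doc_vectors = rng.standard_normal((num_docs, dim)).astype(np.float32)

def brute_force_mips(query_vector, k=5):
    """Return indices of the k documents with the largest inner product."""
    scores = doc_vectors @ query_vector       # inner product with every document
    return np.argsort(-scores)[:k]            # highest-scoring documents first

query_vector = rng.standard_normal(dim).astype(np.float32)
print(brute_force_mips(query_vector))
```

Exhaustive scoring like this scales linearly with the number of documents, which is why REALM relies on an approximate MIPS library for collections of millions of documents, as described next.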

In REALM, we use the ScaNN package to conduct MIPS efficiently, which makes finding the maximum inner product value relatively cheap, given that the document vectors are pre-computed. However, if the model parameters are updated during training, it is typically necessary to re-encode the document vectors for the entire collection. To address this computational challenge, we structure the retriever so that the computation performed for each document can be cached and asynchronously updated. We also found that updating the document vectors every 500 training steps, instead of at every step, achieves good performance while keeping training tractable.
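
The caching strategy can be pictured as a training loop that refreshes the document index only every few hundred steps. The sketch below is a toy illustration of that schedule under stated assumptions (random stand-in encoders, synchronous refresh), not the open-sourced REALM training code.

```python
import numpy as np

num_docs, dim, num_steps, refresh_every = 1000, 128, 2000, 500
rng = np.random.default_rng(0)

def encode_documents(version):
    """Stand-in for the (expensive) document encoder; re-run when parameters change."""
    return np.random.default_rng(version).standard_normal((num_docs, dim)).astype(np.float32)

doc_index = encode_documents(version=0)              # cached document vectors

for step in range(1, num_steps + 1):
    query = rng.standard_normal(dim).astype(np.float32)
    retrieved = int(np.argmax(doc_index @ query))    # MIPS against the cached index
    # ...compute the fill-in-the-blank loss with the retrieved document and
    # ...update the encoder parameters here.
    if step % refresh_every == 0:
        # Re-encode the corpus so the index lags the encoder by at most 500 steps.
        doc_index = encode_documents(version=step)
```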

Applying REALM to Open-domain Question Answering
We evaluate the effectiveness of REALM by applying it to open-domain question answering (Open-QA), one of the most knowledge-intensive tasks in natural language processing. The goal of the task is to answer questions, such as “What is the angle of the equilateral triangle?”

In standard question answering tasks (e.g., SQuAD or Natural Questions), the supporting document is provided as part of the input, so a model only needs to look up the answer in the given document. In Open-QA, there are no given documents, so Open-QA models need to look up the knowledge by themselves — this makes Open-QA an excellent task for examining the effectiveness of REALM.

The following figure shows the results on the Open-QA version of Natural Questions. We mainly compared our results with T5, another approach that trains models without annotated supporting documents. From the figure, one can clearly see that REALM pre-training generates very powerful Open-QA models, and even outperforms the much larger T5 (11B) model by almost 4 points, using only a fraction of the parameters (300M).

Conclusion
The release of REALM has helped drive interest in developing end-to-end retrieval-augmented models, including a recent retrieval-augmented generative model. We look forward to the possibility of extending this line of work in several ways, including 1) applying REALM-like methods to new applications that require knowledge-intensive reasoning and interpretable provenance (beyond Open-QA), and 2) exploring the benefits of retrieving other forms of knowledge, such as images, knowledge graph structures, or even text in other languages. We are also excited to see what the research community does with the open source REALM codebase!

Acknowledgements
This work has been a collaborative effort involving Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.

A Simulation Suite for Tackling Applied Reinforcement Learning Challenges

Posted by Daniel J. Mankowitz, Research Scientist, DeepMind and Gabriel Dulac-Arnold, Research Scientist, Google Research

Reinforcement Learning (RL) has proven to be effective in solving numerous complex problems ranging from Go, StarCraft and Minecraft to robot locomotion and chip design. In each of these cases, a simulator is available or the real environment is quick and inexpensive to access. Yet, there are still considerable challenges to deploying RL to real-world products and systems. For example, in physical control systems, such as robotics and autonomous driving, RL controllers are trained to solve tasks like grasping objects or driving on a highway. These controllers are susceptible to effects such as sensor noise, system delays, or normal wear-and-tear that can reduce the quality of input to the controller, leading to incorrect decision-making and potentially catastrophic failures.

A physical control system: Robots learning how to grasp and sort objects using RL at the Everyday Robot Project at X. These types of systems are subject to many of the real-world challenges detailed here.

In “Challenges of Real-World Reinforcement Learning”, we identify and discuss nine different challenges that hinder the application of current RL algorithms to applied systems. We then follow up this work with an empirical investigation in which we simulated versions of these challenges on state-of-the-art RL algorithms, and benchmark the effects of each. We have open-sourced these simulated challenges in the Real-World RL (RWRL) task suite to help draw attention to these important issues, as well as accelerate research toward solving them.

The RWRL Suite
The RWRL suite is a set of simulated tasks inspired by applied reinforcement learning challenges, the goal of which is to enable fast algorithmic iteration for both researchers and practitioners, without having to run slow, expensive experiments on real systems. While there will be additional challenges in transitioning from RL algorithms trained in simulation to real-world applications, this suite intends to close some of the more fundamental, algorithmic gaps. At present, RWRL supports a subset of the DeepMind Control Suite domains, but the goal is to broaden the suite to support an even more diverse set of domains.

Easy-to-Use & Flexible
We designed the suite with two main goals in mind. (1) It should be easy to use — a user should be able to start running experiments within minutes of downloading the suite, simply by changing a few lines of code. (2) It should be flexible — a user should be able to incorporate any combination of challenges into the environment with very little effort.

A Delayed Action Example
To illustrate the ease of use of the RWRL suite, imagine a researcher or practitioner wants to implement action delays (i.e., temporal delays on actions being sent to the environment). To use the RWRL suite, simply import the rwrl module, then load an environment (e.g., cartpole) with the delay_spec argument. This optional argument is specified as a dictionary that configures whether the delay is applied to actions, observations, or rewards, and the number of timesteps by which the corresponding element is delayed (e.g., 20 timesteps). Once the environment is loaded, the effects of actions are automatically delayed without any other changes to the experiment. This makes it easy to test an RL algorithm with action delays in a range of different environments supported by the RWRL suite, as sketched below.
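
The snippet below sketches what this looks like in code. The module path, task name and delay_spec keys are our reading of the open-source interface and should be treated as assumptions; please consult the RWRL codebase for the authoritative API.

```python
import numpy as np
import realworldrl_suite.environments as rwrl  # module path assumed from the codebase

# Load cartpole with actions delayed by 20 timesteps (dictionary keys assumed).
env = rwrl.load(
    domain_name='cartpole',
    task_name='realworld_swingup',
    delay_spec=dict(enable=True, actions=20))

timestep = env.reset()
while not timestep.last():
    action = np.zeros(env.action_spec().shape)  # placeholder policy
    timestep = env.step(action)                 # the action's effect arrives 20 steps later
```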

A high-level overview of the RWRL suite. Add a challenge (e.g., action delays) to the environment with a few lines of code, run a hyperparameter sweep, and produce a graph like the one shown on the right.

A user can combine different challenges or choose from a set of predefined benchmark challenges by simply adding additional arguments to the load function, all of which are specified in the open-source RWRL suite codebase.
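
For example, a delay and a parameter perturbation might be combined as sketched below; the perturb_spec argument and its keys are illustrative assumptions, and the codebase documents the exact specification format.

```python
import realworldrl_suite.environments as rwrl  # as in the previous sketch

# Combining two challenges (argument names and keys are assumptions for
# illustration; see the RWRL codebase for the exact specification format).
env = rwrl.load(
    domain_name='cartpole',
    task_name='realworld_swingup',
    delay_spec=dict(enable=True, observations=10),       # delayed observations
    perturb_spec=dict(enable=True, param='pole_length',  # non-stationary dynamics
                      scheduler='uniform', period=1))
```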

Supported Challenges
The RWRL suite provides functionality to support experiments related to eight of the nine different challenges that make applying current RL algorithms on applied systems difficult: sample efficiency; system delays; high-dimensional state and action spaces; constraints; partial observability, stochasticity and non-stationarity; multiple objectives; real-time inference; and training from offline logs. RWRL excludes the explainability challenge, which is abstract and non-trivial to define. The supported experiments are non-exhaustive and provide researchers and practitioners with the ability to analyze the capabilities of their agent with respect to each challenge dimension. Examples of the supported challenges include:

  • System Delays
    Most real systems have delays in either sensing, actuation or reward feedback, all of which can be configured and applied to any task within the RWRL suite. The graphs below show the performance of a D4PG agent as actions (left), observations (middle) and rewards (right) are increasingly delayed.
    The effect of increasing the action (left), observation (middle) and reward (right) delays respectively on a state-of-the-art RL agent in four MuJoCo domains.

    As can be seen in the graphs, a researcher or practitioner can quickly gain insights as to which type of delay affects their agent’s performance. These delays can also be combined together to observe their combined effect.

  • Constraints
    Almost all applied systems have some form of constraints embedded into the overall objective, which is not common in most RL environments. The RWRL suite implements a series of constraints for each task, with varying difficulties, to facilitate research in constrained RL. An example of a complex local angular velocity constraint being violated is visualized in the video below.
    An example of constraint violations for cartpole. The red screen indicates that a violation has occurred on localized angular velocity.
  • Non-Stationarity
    The user can introduce non-stationarity by perturbing environment parameters. These perturbations are in contrast to the pixel level adversarial perturbations that have recently gained popularity in research on supervised deep learning. For example, in the human walker domain, the size of the head and friction of the ground can be modified throughout training to simulate changing conditions. A variety of schedulers are available in the RWRL suite (see our codebase for more details), along with multiple default parameter perturbations, which were carefully defined to handicap the learning capabilities of state-of-the-art learning algorithms.
    Non-stationary perturbations. The suite supports perturbing environment parameters across episodes such as changing head size (center) and contact friction (right).
  • Training from Offline Log Data
    In most applied systems, it is both slow and expensive to run experiments. There are often logs of data available from previous experiments that can be utilized to train a policy. However, it is often difficult to outperform the previous model in production due to the data being limited, of low variance, or of poor quality. To address this, we have generated offline datasets of the combined RWRL benchmark challenges, which we made available as part of a wider offline dataset release. More information can be found in this notebook.

Conclusion
Most systems rarely manifest only a single challenge, and we are excited to see how algorithms can deal with an environment in which there are multiple challenges combined with increasing levels of difficulty (‘Easy’, ‘Medium’ and ‘Hard’). We highly encourage the research community to try and solve these challenges, as we believe that solving them will facilitate more widespread applications of RL to products and real-world systems.

While the initial set of RWRL suite features and experiments provide a starting point for closing the gap between the current state of RL and the challenges of applied systems, there is still much work to do. The supported experiments are not exhaustive and we welcome new ideas from the wider community to better evaluate the capabilities of our RL agents. Our main goal with this suite is to highlight and encourage research on the core problems that limit the effectiveness of RL algorithms in applied products and systems and to accelerate progress towards enabling future RL applications.

Acknowledgements
We would like to thank our core contributor and co-author Nir Levine for his invaluable help. We would also like to thank our co-authors Jerry Li, Sven Gowal, Todd Hester and Cosmin Paduraru as well as Robert Dadashi, the ACME team, Dan A. Calian, Juliet Rothenberg and Timothy Mann for their contributions.

On-device Supermarket Product Recognition

Posted by Chao Chen, Software Engineer, Google Research

One of the greatest challenges faced by users who are visually impaired is identifying packaged foods, both in a grocery store and also in their kitchen cupboard at home. This is because many foods share the same packaging, such as boxes, tins, bottles and jars, and only differ in the text and imagery printed on the label. However, the ubiquity of smart mobile devices provides an opportunity to address such challenges using machine learning (ML).

In recent years, there have been significant improvements in the accuracy of on-device neural networks for various perception tasks. When coupled with the increased computing power in modern smartphones, it is now possible for many vision tasks to yield high performance while running entirely on a mobile device. The development of on-device models such as MnasNet and MobileNets (based on resource-aware architecture search) in combination with on-device indexing allows one to run a full computer vision system, such as labeled product recognition, entirely on-device, in real time.

Leveraging developments such as these, we recently released Lookout, an Android app that uses computer vision to make the physical world more accessible for users who are visually impaired. When the user aims their smartphone camera at the product, Lookout identifies it and speaks aloud the brand name and product size. To accomplish this, Lookout includes a supermarket product detection and recognition model with an on-device product index, along with MediaPipe object tracking and an optical character recognition model. The resulting architecture is efficient enough to run in real-time entirely on-device.

Why On-Device?
A completely on-device system has the benefits of low latency and no reliance on network connectivity. However, this means that for a product recognition system to be truly useful to users, it must have an on-device database with good product coverage. These requirements drive the design of the datasets used by Lookout, which consist of two million popular products chosen dynamically according to the user’s geographic location.

Traditional Solutions
Product recognition using computer vision has traditionally been solved using local image features extracted by, for example, the SIFT algorithm. These non-ML approaches provide fairly reliable matching but are storage intensive per index image (typically ranging from 10KB to 40KB per image) and are less robust to poor lighting and blur in images. Additionally, the local nature of these descriptors means that they typically do not capture more global aspects of the product’s appearance.

An alternative approach that has a number of advantages would be to use ML and run an optical character recognition (OCR) system over the query image and database images to extract the text present on the product packaging. The text on the query image can be matched to the database using N-grams to be robust to OCR errors such as spelling mistakes, misrecognitions, or failed recognition of words on the product packaging. N-grams also allow for partial matches between the query document and index documents using measures such as the Jaccard similarity coefficient, as opposed to requiring an exact match. However, with OCR, the index document size can grow very large, since one would need to store N-grams for the product packaging text along with other signals like TF-IDF. Furthermore, the reliability of the matches is a concern with the OCR+N-gram approach, since it can easily over-trigger in situations where a lot of common words are present on the packaging of two different products.
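
As a concrete illustration of this kind of matching (not Lookout's actual matcher), the sketch below computes the Jaccard similarity between character N-gram sets of two OCR'd strings; the example strings are invented.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams; small OCR errors only change a few n-grams."""
    text = text.lower().replace(" ", "")
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

query_text = "Chocolat Noir 70% 100g"    # hypothetical OCR output from the camera
index_text = "Chocolate Noir 70% 100 g"  # hypothetical text stored for an index product
print(round(jaccard(char_ngrams(query_text), char_ngrams(index_text)), 2))
```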

In contrast to both the SIFT and OCR+N-Gram methods, our neural network-based approach, which generates a global descriptor (i.e., an embedding) for each image, requires only 64 bytes, significantly reducing the storage requirements from the 10-40KB per image needed for each SIFT feature index entry, or the few KBs per image for the less reliable OCR+N-gram approach. With fewer bytes consumed for each index image, more products can be included as a part of the index, yielding more complete product coverage and a better overall user experience.

Design
The Lookout system consists of a frame cache, frame selector, detector, object tracker, embedder, index searcher, OCR, scorer and result presenter.

Product recognition pipeline internal architecture.
  • Frame cache
    The frame cache manages the lifecycle of the input camera frames in the pipeline. It efficiently delivers the data, including YUV/RGB/gray images, as requested by the other model components and manages the data life cycle to avoid duplicated conversions for the same camera frame requested by multiple components.
  • Frame selector
    When a user points the camera viewfinder towards a product, a lightweight IMU-based frame selector is run as a prefiltering stage. It selects the frames that best match a certain quality criterion (e.g., balanced image quality and latency) from the continuously incoming image stream, based on the jitter as measured by the angular rotation rate (deg/sec). This approach minimizes energy consumption by selectively processing only the high quality image frames and skipping the blurry frames.
  • Detector
    Each selected frame is then passed to a product detector model, which proposes regions of interest (a.k.a. detection bounding boxes) in the frame. The detector model architecture is a single-shot detector with an MnasNet backbone that strikes a balance between high quality and low latency.
  • Object tracker
    MediaPipe Box tracking is used to track the detected boxes in real time, and plays an important role in filling the gap between the detection of different objects and reducing the detection frequency, thus reducing energy consumption. The object tracker also maintains an object map in which each object is assigned a unique object ID at runtime; these IDs are later used by the result presenter to differentiate between objects and to avoid repeating the announcement of a single object. For each detection result, the tracker either registers a new object in the map or updates an existing object with the detection bounding box, using the Intersection over Union (IoU) between the existing object bounding boxes and the detection result (a small sketch of this matching logic follows the list below).
  • Embedder
    The regions of interest (ROIs) from the detector are sent to the embedder model, which then computes a 64-dimensional embedding. The embedder model is initially trained from a large classification model (i.e., the teacher model, based on NASNet), which spans tens of thousands of classes. An embedding layer is added in the model to project the input image into an ‘embedding space’, i.e., a vector space where two points being close means that the images they represent are visually similar (e.g., two images show the same product). Analyzing only the embeddings ensures that the model is flexible and does not need to be retrained every time it is to be expanded to new products. However, because the teacher model is too large to be used directly on-device, the embeddings it generates are used to train a smaller, mobile-friendly student model that learns to map the input images to the same points in the embedding space as the teacher network. Finally, we apply principal component analysis (PCA) to reduce the dimensionality of the embedding vectors from 256 to 64, streamlining the embeddings for storing on-device.
  • Index searcher
    The index searcher performs KNN search over a pre-built, compatible ScaNN index using a query embedding, and returns the top-ranked index documents along with their metadata, such as product name, packaging size, etc. To reduce the index lookup latency, all embeddings are grouped into clusters with k-means. At query time, the relevant clusters of data are loaded in memory for the actual distance computation. To reduce the index size without sacrificing quality, we use product quantization at indexing time.
  • OCR
    OCR is executed on the ROI for each camera frame in order to extract additional information, such as packet size, product flavor variant, etc. Whereas traditional solutions used the OCR result for index searching, here we only use it for scoring. A proper scoring algorithm informed by the OCR text assists the scorer (below) in determining the correct result and improves the precision, especially in the case where multiple products have similar packages.
  • Scorer
    The scorer takes the input from the embeddings (with index results) and the OCR module and scores each of the previously retrieved index documents (embeddings and metadata retrieved via the index searcher). The top result after scoring is used as the final recognition from the system.
  • Result presenter
    The result presenter takes in all the results above and surfaces them to the user by speaking the product name via a text-to-speech service.
Early experiments with on-device product recognition in a Swiss supermarket.
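
Below is the small sketch of the IoU-based register-or-update logic referenced in the object tracker description above; the box format, threshold and data structures are illustrative assumptions rather than Lookout's implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def register_or_update(tracked, detection, next_id, iou_threshold=0.5):
    """Update the tracked object that overlaps the detection, or register a new one."""
    for object_id, box in tracked.items():
        if iou(box, detection) >= iou_threshold:
            tracked[object_id] = detection   # update an existing object
            return object_id, next_id
    tracked[next_id] = detection             # register a new object with a fresh ID
    return next_id, next_id + 1

tracked = {}
obj_id, next_id = register_or_update(tracked, (0.1, 0.1, 0.4, 0.5), next_id=0)
```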

Conclusion/Future Work
The on-device system outlined here can be used to enable a spectrum of new in-store experiences, including the display of detailed product information (nutritional facts, allergens, etc.), customer ratings, product comparisons, smart shopping lists, price tracking, and more. We are excited to explore some of these future applications, while continuing research into advancing the quality and robustness of the underlying on-device models.

Acknowledgements
The work described here was authored by Abhanshu Sharma, Chao Chen, Lukas Mach, Matt Sharifi, Matteo Agosti, Sasa Petrovic and Tom Binder. This work wouldn’t have been possible without the support and help we received from Alec Go, Alessandro Bissacco, Cédric Deltheil, Eunyoung Kim, Haoran Qi, Jeff Gilbert and Mingxing Tan.

MediaPipe Iris: Real-time Iris Tracking & Depth Estimation

Posted by Andrey Vakunov and Dmitry Lagun, Research Engineers, Google Research

A wide range of real-world applications, including computational photography (e.g., portrait mode and glint reflections) and augmented reality effects (e.g., virtual avatars), rely on estimating eye position by tracking the iris. Once accurate iris tracking is available, we show that it is possible to determine the metric distance from the camera to the user, without the use of a dedicated depth sensor. This, in turn, can improve a variety of use cases, ranging from computational photography and virtual try-on of properly sized glasses and hats to usability enhancements that adapt the font size to the viewer’s distance.

Iris tracking is a challenging task to solve on mobile devices, due to limited computing resources, variable light conditions and the presence of occlusions, such as hair or people squinting. Often, sophisticated specialized hardware is employed, limiting the range of devices on which the solution could be applied.

FaceMesh can be adopted to drive virtual avatars (middle). By additionally employing iris tracking (right), the avatar’s liveliness is significantly enhanced.
An example of eye re-coloring enabled by MediaPipe Iris.

Today, we announce the release of MediaPipe Iris, a new machine learning model for accurate iris estimation. Building on our work on MediaPipe Face Mesh, this model is able to track landmarks for the iris, pupil and eye contours using a single RGB camera, in real time, without the need for specialized hardware. Through the use of iris landmarks, the model is also able to determine the metric distance between the subject and the camera with a relative error of less than 10%, without the use of a depth sensor. Note that iris tracking does not infer the location at which people are looking, nor does it provide any form of identity recognition. Thanks to the fact that this system is implemented in MediaPipe — an open source cross-platform framework for researchers and developers to build world-class ML solutions and applications — it can run on most modern mobile phones, desktops, laptops and even on the web.

Usability prototype for far-sighted individuals: observed font size remains constant independent of the device distance from the user.

An ML Pipeline for Iris Tracking
The first step in the pipeline leverages our previous work on 3D Face Meshes, which uses high-fidelity facial landmarks to generate a mesh of the approximate face geometry. From this mesh, we isolate the eye region in the original image for use in the iris tracking model. The problem is then divided into two parts: eye contour estimation and iris location. We designed a multi-task model consisting of a unified encoder with a separate component for each task, which allowed us to use task-specific training data.

Examples of iris (blue) and eyelid (red) tracking.

To train the model from the cropped eye region, we manually annotated ~50k images, representing a variety of illumination conditions and head poses from geographically diverse regions, as shown below.

Eye region annotated with eyelid (red) and iris (blue) contours.
Cropped eye regions form the input to the model, which predicts landmarks via separate components.

Depth-from-Iris: Depth Estimation from a Single Image
Our iris-tracking model is able to determine the metric distance of a subject to the camera with less than 10% error, without requiring any specialized hardware. This is done by relying on the fact that the horizontal iris diameter of the human eye remains roughly constant at 11.7±0.5 mm across a wide population [1, 2, 3, 4], along with some simple geometric arguments. For illustration, consider a pinhole camera model projecting onto a sensor of square pixels. The distance to a subject can be estimated from facial landmarks by using the focal length of the camera, which can be obtained using camera capture APIs or directly from the EXIF metadata of a captured image, along with other camera intrinsic parameters. Given the focal length, the distance from the subject to the camera is directly proportional to the physical size of the subject’s eye, as visualized below.

The distance of the subject (d) can be computed from the focal length (f) and the size of the iris using similar triangles.
Left: MediaPipe Iris predicting metric distance in cm on a Pixel 2 from iris tracking alone, without the use of a depth sensor. Right: Ground-truth depth.
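
The similar-triangles computation amounts to a one-line formula. The sketch below uses the roughly 11.7 mm iris diameter mentioned above; the focal length and pixel measurements are made-up illustrative values.

```python
IRIS_DIAMETER_MM = 11.7  # horizontal iris diameter, roughly constant across people

def distance_from_iris_mm(focal_length_px, iris_diameter_px):
    """Similar triangles: iris_px / focal_px = IRIS_DIAMETER_MM / distance_mm."""
    return focal_length_px * IRIS_DIAMETER_MM / iris_diameter_px

# Focal length comes from camera capture APIs or EXIF; iris size from the landmarks.
print(distance_from_iris_mm(focal_length_px=1400.0, iris_diameter_px=32.0))  # ~512 mm
```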

In order to quantify the accuracy of the method, we compared it to the depth sensor on an iPhone 11 by collecting front-facing, synchronized video and depth images on over 200 participants. We experimentally verified the error of the iPhone 11 depth sensor to be < 2% for distances up to 2 meters, using a laser ranging device. Our evaluation shows that our approach for depth estimation using iris size has a mean relative error of 4.3% and standard deviation of 2.4%. We tested our approach on participants with and without eyeglasses (not accounting for contact lenses on participants) and found that eyeglasses increase the mean relative error slightly to 4.8% (standard deviation 3.1%). We did not test this approach on participants with any eye diseases (like arcus senilis or pannus). Considering MediaPipe Iris requires no specialized hardware, these results suggest it may be possible to obtain metric depth from a single image on devices with a wide range of cost-points.

Histogram of estimation errors (left) and comparison of actual to estimated distance by iris (right).

Release of MediaPipe Iris
We are releasing the iris and depth estimation models as a cross-platform MediaPipe pipeline that can run on desktop, mobile and the web. As described in our recent Google Developer Blog post on MediaPipe on the web, we leverage WebAssembly and XNNPACK to run our Iris ML pipeline locally in the browser, without any data being sent to the cloud.

Using MediaPipe’s WASM stack, you can run the models locally in your browser! Left: Iris tracking. Right: Depth from Iris computed just from a photo with EXIF data. Iris tracking can be tried out here and iris depth measurements here.

Future Directions
We plan to extend our MediaPipe Iris model with even more stable tracking for lower error and deploy it for accessibility use cases. We strongly believe in sharing code that enables reproducible research, rapid experimentation, and development of new ideas in different areas. In our documentation and the accompanying Model Card, we detail the intended uses, limitations and model fairness to ensure that use of these models aligns with Google’s AI Principles. Note that any form of surveillance or identification is explicitly out of scope and not enabled by this technology. We hope that providing this iris perception functionality to the wider research and development community will result in an emergence of creative use cases, stimulating responsible new applications and new research avenues.

For more ML solutions from MediaPipe, please see our solutions page and Google Developer blog for the latest updates.

Acknowledgements
We would like to thank Artsiom Ablavatski, Andrei Tkachenka, Buck Bourdon, Ivan Grishchenko and Gregory Karpiak for support in model evaluation and data collection; Yury Kartynnik, Valentin Bazarevsky and Artsiom Ablavatski for developing the mesh technology; Aliaksandr Shyrokau and the annotation team for their diligence in data preparation; Vidhya Navalpakkam, Tomer Shekel and Kai Kohlhoff for their domain expertise; Fan Zhang, Esha Uboweja, Tyler Mullen, Michael Hays and Chuo-Ling Chang for help integrating the model into MediaPipe; and Matthias Grundmann, Florian Schroff and Ming Guang Yong for their continuous help in building this technology.

Live HDR+ and Dual Exposure Controls on Pixel 4 and 4a

Posted by Jiawen Chen and Sam Hasinoff, Software Engineers, Google Research

High dynamic range (HDR) imaging is a method for capturing scenes with a wide range of brightness, from deep shadows to bright highlights. On Pixel phones, the engine behind HDR imaging is HDR+ burst photography, which involves capturing a rapid burst of deliberately underexposed images, combining them, and rendering them in a way that preserves detail across the range of tones. Until recently, one challenge with HDR+ was that it could not be computed in real time (i.e., at 30 frames per second), which prevented the viewfinder from matching the final result. For example, bright white skies in the viewfinder might appear blue in the HDR+ result.

Starting with Pixel 4 and 4a, we have improved the viewfinder using a machine-learning-based approximation to HDR+, which we call Live HDR+. This provides a real-time preview of the final result, making HDR imaging more predictable. We also created dual exposure controls, which generalize the classic “exposure compensation” slider into two controls for separately adjusting the rendition of shadows and highlights. Together, Live HDR+ and dual exposure controls provide HDR imaging with real-time creative control.

Live HDR+ on Pixel 4 and 4a helps the user compose their shot with a WYSIWYG viewfinder that closely resembles the final result. You can see individual images here. Photos courtesy of Florian Kainz.

The HDR+ Look
When the user presses the shutter in the Pixel camera app, it captures 3-15 underexposed images. These images are aligned and merged to reduce noise in the shadows, producing a 14-bit intermediate “linear RGB image” with pixel values proportional to the scene brightness. What gives HDR+ images their signature look is the “tone mapping” of this image, reducing the range to 8 bits and making it suitable for display.

Consider the backlit photo of a motorcyclist, below. While the linear RGB image contains detail in both the dark motorcycle and bright sky, the dynamic range is too high to see it. The simplest method to reveal more detail is to apply a “global curve”, remapping all pixels with a particular brightness to some new value. However, for an HDR scene with details in both shadows and highlights, no single curve is satisfactory.

Different ways to tone-map a linear RGB image. (a) The original, “un-tone-mapped” image. (b) Global curve optimizing for the sky. (c) Global curve optimizing for the subject. (d) HDR+, which preserves details everywhere. In the 2D histogram, brighter areas indicate where more pixels of a given input brightness are mapped to the same output. The overlapping shapes show that the relationship cannot be modeled using a single curve. Photo courtesy of Nicholas Wilson.

In contrast to applying a single curve, HDR+ uses a local tone mapping algorithm to ensure that the final result contains detail everywhere, while keeping edges and textures looking natural. Effectively, this applies a different curve to different regions, depending on factors such as overall brightness, local texture, and amount of noise. Unfortunately, HDR+ is too slow to run live in the viewfinder, requiring an alternative approach for Live HDR+.

Local Curve Approximation for Live HDR+
Using a single tone curve does not produce a satisfying result for the entire image — but how about for a small region? Consider the small red patch in the figure below. Although the patch includes both shadows and highlights, the relationship between input and output brightness follows a smooth curve. Furthermore, the curve varies gradually. For the blue patch, shifted ten pixels to the right, both the image content and curve are similar. But while the curve approximation works well for small patches, it breaks down for larger patches. For the larger yellow patch, the input/output relationship is more complicated, and not well approximated by a single curve.

(a) Input and HDR+ result. (b) The effect of HDR+ on a small patch (red) is approximately a smooth curve. (c) The relationship is nearly identical for the nearby blue patch. (d) However, if the patch is too big, a single curve will no longer provide a good fit.

To address this challenge, we divide the input image into “tiles” of size roughly equal to the red patch in the figure above, and approximate HDR+ using a curve for each tile. Since these curves vary gradually, blending between curves is a good way to approximate the optimal curve at any pixel. To render a pixel we apply the curves from each of the four nearest tiles, then blend the results according to the distances to the respective tile centers.
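
The sketch below illustrates this per-tile blending with simple gamma curves standing in for the curves that HDRnet predicts; it is a toy, single-channel illustration of the idea rather than the production (GPU) implementation.

```python
import numpy as np

def apply_tile_curves(image, curves, tile_size):
    """Tone-map `image` (H, W) in [0, 1] by blending the four nearest per-tile curves.
    `curves[ty, tx]` is a 1D lookup table mapping input to output brightness."""
    h, w = image.shape
    ty_max, tx_max = curves.shape[0] - 1, curves.shape[1] - 1
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            # Continuous tile coordinates of this pixel relative to tile centers.
            fy = np.clip((y + 0.5) / tile_size - 0.5, 0, ty_max)
            fx = np.clip((x + 0.5) / tile_size - 0.5, 0, tx_max)
            y0, x0 = int(fy), int(fx)
            y1, x1 = min(y0 + 1, ty_max), min(x0 + 1, tx_max)
            wy, wx = fy - y0, fx - x0
            idx = int(image[y, x] * (curves.shape[2] - 1))
            # Bilinear blend of the four neighboring tiles' curve outputs.
            out[y, x] = ((1 - wy) * (1 - wx) * curves[y0, x0, idx] +
                         (1 - wy) * wx       * curves[y0, x1, idx] +
                         wy       * (1 - wx) * curves[y1, x0, idx] +
                         wy       * wx       * curves[y1, x1, idx])
    return out

# Stand-in curves: one gamma curve per tile (HDRnet predicts these in practice).
levels = np.linspace(0, 1, 256)
curves = np.array([[levels ** g for g in (0.4, 0.6, 0.8)] for _ in range(3)])
image = np.random.rand(48, 48).astype(np.float32)
print(apply_tile_curves(image, curves, tile_size=16).shape)
```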

Compared to HDR+, this algorithm is particularly well suited for GPUs. Since the tone mapping of each pixel can be computed independently, the algorithm can also be parallelized. Moreover, the representation is memory-efficient: only a small number of tiles is enough to represent HDR+ local tone mapping for the viewfinder.

To compute local curves, we use a machine learning algorithm called HDRnet, a deep neural network that predicts, from a linear image, per-tile curves that approximate the HDR+ look of that image. It’s also fast, due to its compact architecture and the way that low-resolution input images can be used to predict the curves for the high-resolution viewfinder. We train HDRnet on thousands of images to ensure it works well on all kinds of scenes.

HDRnet vs. HDR+ on a challenging scene with extreme brights and darks. The results are very similar at viewfinder resolution. Photo courtesy of Nicholas Wilson.

Dual Exposure Controls
HDR+ is designed to produce pleasing HDR images automatically, without the need for manual controls or post-processing. But sometimes the HDR+ rendition may not match the photographer’s artistic vision. While image editing tools are a partial remedy, HDR images can be challenging to edit, because some decisions are effectively baked into the final JPG. To maximize latitude for editing, it’s possible to save RAW images for each shot (an option in the app). However, this process takes the photographer out of the moment and requires expertise with RAW editing tools as well as additional storage.

Another approach to artistic control is to provide it live in the viewfinder. Many photographers are familiar with the exposure compensation slider, which brightens or darkens the image. But overall brightness is not expressive enough for HDR photography; at a minimum, two controls are needed to adjust the highlights and shadows separately.

To address this, we introduce dual exposure controls. When the user taps on the Live HDR+ viewfinder, two sliders appear. The “Brightness” slider works like traditional exposure compensation, changing the overall exposure. This slider is used to recover more detail in bright skies, or intentionally blow out the background and make the subject more visible. The “Shadows” slider affects only dark areas — it operates by changing the tone mapping, not the exposure. This slider is most useful for high-contrast scenes, letting the user boost shadows to reveal details, or suppress them to create a silhouette.

Screen capture of dual exposure controls in action on an outdoor HDR scene with HDR+ results below. You can see individual images here. Photos courtesy of Florian Kainz.

Here are some of the dramatic renditions we were able to achieve using dual exposure controls.

Different renditions using Dual Exposure Controls. You can see individual images here. Photo credits: Jiawen Chen, Florian Kainz, Alexander Schiffhauer.

Dual Exposure Controls give you the flexibility to capture dramatically different versions of the same subject. They are not limited to tough HDR scenes, so don’t be afraid to experiment with different subjects and lighting. You may be surprised at how much these sliders will change how you shoot!

Acknowledgements
Live HDR+ and Dual Exposure Controls is the result of a collaboration between Google Research, Android, Hardware, and UX Design teams. Key contributors include: Francois Bleibel, Sean Callanan, Yulun Chang, Eric Chen, Michelle Chen, Kourosh Derakshan, Ryan Geiss, Zhijun He, Joy Hsu, Liz Koh, Marc Levoy, Chia-Kai Liang, Diane Liang, Timothy Lin, Gaurav Malik, Hossein Mohtasham, Nandini Mukherjee, Sushil Nath, Gabriel Nava, Karl Rasche, YiChang Shih, Daniel Solomon, Gary Sun, Kelly Tsai, Sung-fang Tsai, Ted Tsai, Ruben Velarde, Lida Wang, Tianfan Xue, Junlan Yang.

Introducing the Model Card Toolkit for Easier Model Transparency Reporting

Posted by Huanming Fang and Hui Miao, Software Engineers, Google Research

Machine learning (ML) model transparency is important across a wide variety of domains that impact peoples’ lives, from healthcare to personal finance to employment. The information needed by downstream users will vary, as will the details that developers need in order to decide whether or not a model is appropriate for their use case. This desire for transparency led us to develop a new tool for model transparency, Model Cards, which provide a structured framework for reporting on ML model provenance, usage, and ethics-informed evaluation and give a detailed overview of a model’s suggested uses and limitations that can benefit developers, regulators, and downstream users alike.

Over the past year, we’ve launched Model Cards publicly and worked to create Model Cards for open-source models released by teams across Google. For example, the MediaPipe team creates state-of-the-art computer vision models for a number of common tasks, and has included Model Cards for each of their open-source models in their GitHub repository. Creating Model Cards like these takes substantial time and effort, often requiring a detailed evaluation and analysis of both data and model performance. In many cases, one needs to additionally evaluate how a model performs on different subsets of data, noting any areas where the model underperforms. Further, Model Card creators may want to report on the model’s intended uses and limitations, as well as any ethical considerations potential users might find useful, compiling and presenting the information in a format that’s accessible and understandable.

To streamline the creation of Model Cards for all ML practitioners, we are sharing the Model Card Toolkit (MCT), a collection of tools that support developers in compiling the information that goes into a Model Card and that aid in the creation of interfaces that will be useful for different audiences. To demonstrate how the MCT can be used in practice, we have also released a Colab tutorial that builds a Model Card for a simple classification model trained on the UCI Census Income dataset.

Introducing the MCT
To guide the Model Card creator in organizing model information, we provide a JSON schema, which specifies the fields to include in the Model Card. Using the model provenance information stored with ML Metadata (MLMD), the MCT automatically populates the JSON with relevant information, such as class distributions in the data and model performance statistics. We also provide a ModelCard data API to represent an instance of the JSON schema and visualize it as a Model Card. The Model Card creator can choose which metrics and graphs to display in the final Model Card, including metrics that highlight areas where the model’s performance might deviate from its overall performance.

Once the MCT has populated the Model Card with key metrics and graphs, the Model Card creator can supplement this with information regarding the model’s intended usage, limitations, trade-offs, and any other ethical considerations that would otherwise be unknown to people using the model. If a model underperforms for certain slices of data, the limitations section would be another place to acknowledge this, along with suggested mitigation strategies to help developers address these issues. This type of information is critical in helping developers decide whether or not a model is suitable for their use case, and helps Model Card creators provide context so that their models are used appropriately. Right now, we’re providing one UI template to visualize the Model Card, but you can create different templates in HTML should you want to visualize the information in other formats.
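
Putting these steps together, a minimal sketch of the workflow looks like the following. The method and field names reflect our reading of the Colab tutorial and the initial release, and may differ between MCT versions, so treat them as assumptions.

```python
import model_card_toolkit

# Scaffold a Model Card, optionally pre-populated from ML Metadata (MLMD).
toolkit = model_card_toolkit.ModelCardToolkit('model_card_output')  # output directory
model_card = toolkit.scaffold_assets()

# Fields that a human still needs to supply.
model_card.model_details.name = 'Census Income Classifier'
model_card.model_details.overview = (
    'A simple classifier trained on the UCI Census Income dataset.')

toolkit.update_model_card_json(model_card)   # persist the populated JSON
html = toolkit.export_format()               # render the Model Card with the UI template
```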

Currently, the MCT is available to anyone using TensorFlow Extended (TFX) in open source or on Google Cloud Platform. Users who are not serving their ML models via TFX can still leverage the JSON schema and the methods to visualize via the HTML template.

Here is an example of the completed Model Card from the Colab tutorial, which leverages the MCT and the provided UI template.

Conclusion
Currently, the MCT includes a standard template for reporting on ML models broadly, but we’re continuing to create UI templates for more specific applications of ML. If you’d like to join the conversation about what fields are important and how best to leverage the MCT for different use cases, you can get started here or with the Colab tutorial. Let us know how you’ve leveraged the MCT for your use case by emailing us at model-cards@google.com. You can learn more about Google’s efforts to promote responsible AI in the TensorFlow ecosystem on our TensorFlow Responsible AI page.

Acknowledgements
Huanming Fang, Hui Miao, Karan Shukla, Dan Nanas, Catherina Xu, Christina Greer, Tulsee Doshi, Tiffany Deng, Margaret Mitchell, Timnit Gebru, Andrew Zaldivar, Mahima Pushkarna, Meena Natarajan, Roy Kim, Parker Barnes, Tom Murray, Susanna Ricco, Lucy Vasserman, and Simone Wu

Announcing ScaNN: Efficient Vector Similarity Search

Posted by Philip Sun, Software Engineer, Google Research

Suppose one wants to search through a large dataset of literary works using queries that require an exact match of title, author, or other easily machine-indexable criteria. Such a task would be well suited for a relational database using a language such as SQL. However, if one wants to support more abstract queries, such as “Civil War poem,” it is no longer possible to rely on naive similarity metrics such as the number of words in common between two phrases. For example, the query “science fiction” is more related to “future” than it is to “earth science” despite the former having zero, and the latter having one, word in common with the query.

Machine learning (ML) has greatly improved computers’ abilities to understand language semantics and therefore answer these abstract queries. Modern ML models can transform inputs such as text and images into embeddings, high dimensional vectors trained such that more similar inputs cluster closer together. For a given query, we can therefore compute its embedding, and find the literary works whose embeddings are closest to the query’s. In this manner, ML has transformed an abstract and previously difficult-to-specify task into a rigorous mathematical one. However, a computational challenge remains: for a given query embedding, how does one quickly find the nearest dataset embeddings? The set of embeddings is often too large for exhaustive search and its high dimensionality makes pruning difficult.

In our ICML 2020 paper, “Accelerating Large-Scale Inference with Anisotropic Vector Quantization,” we address this problem by focusing on how to compress the dataset vectors to enable fast approximate distance computations, and propose a new compression technique that significantly boosts accuracy compared to prior works. This technique is utilized in our recently open-sourced vector similarity search library (ScaNN), and enables us to outperform other vector similarity search libraries by a factor of two, as measured on ann-benchmarks.com.

The Importance of Vector Similarity Search
Embedding-based search is a technique that is effective at answering queries that rely on semantic understanding rather than simple indexable properties. In this technique, machine learning models are trained to map the queries and database items to a common vector embedding space, such that the distance between embeddings carries semantic meaning, i.e., similar items are closer together.

The two-tower neural network model, illustrated above, is a specific type of embedding-based search where queries and database items are mapped to the embedding space by two respective neural networks. In this example the model responds to natural-language queries for a hypothetical literary database.

To answer a query with this approach, the system must first map the query to the embedding space. It then must find, among all database embeddings, the ones closest to the query; this is the nearest neighbor search problem. One of the most common ways to define the query-database embedding similarity is by their inner product; this type of nearest neighbor search is known as maximum inner-product search (MIPS).

Because the database size can easily be in the millions or even billions, MIPS is often the computational bottleneck to inference speed, and exhaustive search is impractical. This necessitates the use of approximate MIPS algorithms that exchange some accuracy for a significant speedup over brute-force search.

A New Quantization Approach for MIPS
Several state-of-the-art solutions for MIPS are based on compressing the database items so that an approximation of their inner product can be computed in a fraction of the time taken by brute-force. This compression is commonly done with learned quantization, where a codebook of vectors is trained from the database and is used to approximately represent the database elements.

Previous vector quantization schemes quantized database elements with the aim of minimizing the average distance between each vector x and its quantized form x̃. While this is a useful metric, optimizing for it is not equivalent to optimizing nearest-neighbor search accuracy. The key idea behind our paper is that encodings with higher average distance may actually result in superior MIPS accuracy.

The intuition for our result is illustrated below. Suppose we have two database embeddings x1 and x2, and must quantize each to one of two centers: c1 or c2. Our goal is to quantize each xi to x̃i such that the inner product <q, x̃i> is as similar to the original inner product <q, xi> as possible. This can be visualized as making the magnitude of the projection of x̃i onto q as similar as possible to the projection of xi onto q. In the traditional approach to quantization (left), we would pick the closest center for each xi, which leads to an incorrect relative ranking of the two points: <q, x̃1> is greater than <q, x̃2>, even though <q, x1> is less than <q, x2>! If we instead assign x1 to c1 and x2 to c2, we get the correct ranking. This is illustrated in the figure below.

The goal is to quantize each xi to x̃i = c1 or x̃i = c2. Traditional quantization (left) results in the incorrect ordering of x1 and x2 for this query. Even though our approach (right) chooses centers farther away from the data points, this in fact leads to lower inner product error and higher accuracy.

It turns out that direction matters as well as magnitude: even though c1 is farther from x1 than c2, c1 is offset from x1 in a direction almost entirely orthogonal to x1, while c2’s offset is parallel (for x2, the same situation applies but flipped). Error in the parallel direction is much more harmful in the MIPS problem because it disproportionately impacts high inner products, which by definition are the ones that MIPS is trying to estimate accurately.

Based on this intuition, we more heavily penalize quantization error that is parallel to the original vector. We refer to our novel quantization technique as anisotropic vector quantization due to the directional dependence of its loss function. The ability of this technique to trade increased quantization error of lower inner products in exchange for superior accuracy for high inner products is the key innovation and the source of its performance gains.
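
The sketch below makes the directional weighting concrete: the quantization residual is split into components parallel and orthogonal to the original datapoint, and the parallel component is weighted more heavily. The specific weight is an illustrative assumption, not the exact loss used in ScaNN.

```python
import numpy as np

def anisotropic_loss(x, x_quantized, parallel_weight=4.0):
    """Penalize quantization error parallel to x more than orthogonal error."""
    residual = x - x_quantized
    x_unit = x / np.linalg.norm(x)
    r_parallel = np.dot(residual, x_unit) * x_unit   # error along the direction of x
    r_orthogonal = residual - r_parallel             # remaining error
    return parallel_weight * np.sum(r_parallel ** 2) + np.sum(r_orthogonal ** 2)

x = np.array([1.0, 0.0])
print(anisotropic_loss(x, np.array([0.8, 0.0])))  # parallel error: heavily penalized (0.16)
print(anisotropic_loss(x, np.array([1.0, 0.2])))  # equal-magnitude orthogonal error (0.04)
```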

In the above diagrams, ellipses denote contours of equal loss. In anisotropic vector quantization, error parallel to the original data point x is penalized more.

Anisotropic Vector Quantization in ScaNN
Anisotropic vector quantization allows ScaNN to better estimate inner products that are likely to be in the top-k MIPS results and therefore achieve higher accuracy. On the glove-100-angular benchmark from ann-benchmarks.com, ScaNN outperformed eleven other carefully tuned vector similarity search libraries, handling roughly twice as many queries per second for a given accuracy as the next-fastest library.*

Recall@k is a commonly used metric for nearest neighbor search accuracy, which measures the proportion of the true nearest k neighbors that are present in an algorithm’s returned k neighbors. ScaNN (upper purple line) consistently achieves superior performance across various points of the speed-accuracy trade-off.

ScaNN is open-source software and you can try it yourself on GitHub. The library can be installed directly via pip and has interfaces for both TensorFlow and NumPy inputs. Please see the GitHub repository for further instructions on installing and configuring ScaNN.
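For reference, usage looks roughly like the sketch below, which follows the example pattern in the ScaNN repository. The builder arguments (leaf counts, anisotropic_quantization_threshold, reorder size) are illustrative values rather than tuned recommendations, and the exact API may change between versions, so treat the README as the authoritative source.

```python
# pip install scann   (see the GitHub repository for supported platforms)
import numpy as np
import scann

# Toy database of unit-normalized embeddings; MIPS on normalized vectors is
# equivalent to cosine-similarity search.
rng = np.random.default_rng(0)
dataset = rng.normal(size=(100_000, 128)).astype(np.float32)
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)

# score_ah enables the asymmetric-hashing scorer that uses anisotropic vector
# quantization; all numeric arguments here are illustrative, not tuned values.
searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=50_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

queries = dataset[:5]  # reuse a few database vectors as queries
neighbors, distances = searcher.search_batched(queries)
print(neighbors.shape, distances.shape)  # (5, 10) each
```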

Conclusion
By modifying the vector quantization objective to align with the goals of MIPS, we achieve state-of-the-art performance on nearest neighbor search benchmarks, a key indicator of embedding-based search performance. Although anisotropic vector quantization is an important technique, we believe it is just one example of the performance gains achievable by optimizing algorithms for the end goal of improving search accuracy rather than an intermediate goal such as compression distortion.

Acknowledgements
This post reflects the work of the entire ScaNN team: David Simcha, Erik Lindgren, Felix Chern, Nathan Cordeiro, Ruiqi Guo, Sanjiv Kumar, and Zonglin Li. We’d also like to thank Dan Holtmann-Rice, Dave Dopson, and Felix Yu.



* ScaNN performs similarly well on the other datasets of ann-benchmarks.com, but the website currently shows outdated, lower numbers. See this pull request for more representative performance figures on other datasets.

Closing data gaps with Lacuna Fund

Machine learning has shown enormous promise for social good, whether in helping respond to global health pandemics or reach citizens before natural disasters hit. But even as machine learning technology becomes increasingly accessible, social innovators still face significant barriers in their efforts to use this technology to unlock new solutions. From languages to health and agriculture, there is a lack of relevant, labeled data to represent and address the challenges that face much of the world’s population.

To help close this gap, Google.org is making a $2.5 million grant alongside The Rockefeller Foundation, Canada’s International Development Research Centre (IDRC) and Germany’s GIZ FAIR Forward to launch Lacuna Fund, the world’s first collaborative nonprofit effort to directly address this missing data. The Fund aims to unlock the power of machine learning by providing data scientists, researchers, and social entrepreneurs in low- and middle-income communities around the world with resources to produce labeled datasets that address urgent problems.

Labeled data is a particular type of data that is useful in generating machine learning models: it provides the “ground truth” that a model can use to make predictions about cases it hasn’t yet seen. To create a labeled dataset, example data is systematically “tagged” by knowledgeable humans with one or more concepts or entities that each example represents. For example, a researcher might label short videos of insects with their type; images of fungi with whether or not they are harmful to plants around them; or passages of Swahili text with the parts of speech that each word represents. In turn, these datasets could enable biologists to track insect migration; farmers to accurately identify threats to their crops; and Swahili speakers to use an automated text messaging service to get vital health information.
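As a toy illustration of what a labeled example looks like in practice, the snippet below pairs a short Swahili phrase with part-of-speech tags assigned by an annotator; the sentence and tags are purely illustrative and not drawn from any real Lacuna Fund dataset.

```python
# A toy labeled dataset for part-of-speech tagging: each example pairs raw text
# with labels assigned by a knowledgeable annotator. The Swahili phrase and its
# tags are illustrative only, not drawn from any real Lacuna Fund dataset.
labeled_examples = [
    {"tokens": ["Habari", "za", "asubuhi"],
     "pos_tags": ["NOUN", "ADP", "NOUN"]},
]

# A supervised model treats the tags as ground truth and learns to predict them
# for sentences it has not yet seen.
for example in labeled_examples:
    for token, tag in zip(example["tokens"], example["pos_tags"]):
        print(f"{token}\t{tag}")
```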

Guided by committees of domain and machine learning experts and facilitated by Meridian Institute, the Fund will provide resources and support to produce new labeled datasets, as well as augment or update existing ones to be more representative, relevant and sustainable. The Fund’s initial work will focus on agriculture and underrepresented languages, but we welcome additional collaborators and anticipate the fund will grow in the years to come. And our work is bigger than just individual datasets: Lacuna Fund will focus explicitly on growing the capacity of local organizations to be data collectors, curators and owners. While following best practices for responsible collection, publication and use, we endeavor to make all datasets as broadly available as possible.

Thanks in part to the rise of cloud computing, in particular services like Cloud AutoML and libraries like TensorFlow, AI is increasingly able to help address society’s most pressing issues. Yet we’ve seen firsthand in our work on the Google AI Impact Challenge the gap between the potential of AI and the ability to successfully implement it. The need for data is quickly becoming one of the most salient barriers to progress. It’s our hope that the Fund provides not only a way for social sector organizations to fund high-impact, immediately-applicable data collection and labeling, but also a foundation from which changemakers can build a better future.

Image at top: A team from AI Impact Challenge grantee Wadhwani Institute for Artificial Intelligence in India is working with local farmers to manage pest damage to crops.

Using AI to identify the aggressiveness of prostate cancer

Prostate cancer diagnoses are common, with 1 in 9 men developing prostate cancer in their lifetime. A cancer diagnosis relies on specialized doctors, called pathologists, looking at biological tissue samples under the microscope for signs of abnormality in the cells. The difficulty and subjectivity of pathology diagnoses led us to develop an artificial intelligence (AI) system that can identify the aggressiveness of prostate cancer.

Since many prostate tumors are non-aggressive, doctors first obtain small samples (biopsies) to better understand the tumor for the initial cancer diagnosis. If signs of tumor aggressiveness are found, radiation or invasive surgery to remove the whole prostate may be recommended. Because these treatments can have painful side effects, understanding tumor aggressiveness is important to avoid unnecessary treatment.

Grading the biopsies

One of the most crucial factors in this process is to “grade” any cancer in the sample for how abnormal it looks, through a process called Gleason grading. Gleason grading involves first matching each cancerous region to one of three Gleason patterns, followed by assigning an overall “grade group” based on the relative amounts of each Gleason pattern in the whole sample. Gleason grading is a challenging task that relies on subjective visual inspection and estimation, resulting in pathologists disagreeing on the right grade for a tumor as much as 50 percent of the time. To explore whether AI could assist in this grading, we previously developed an algorithm that Gleason grades large samples (i.e. surgically-removed prostates) with high accuracy, a step that confirms the original diagnosis and informs patient prognosis.
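For readers unfamiliar with the grading scheme, the snippet below sketches the standard mapping from primary and secondary Gleason patterns to a grade group. It is a simplification for illustration only and omits the additional rules pathologists apply, so it should not be read as a clinical tool or as part of our system.

```python
def grade_group(primary: int, secondary: int) -> int:
    """Map primary + secondary Gleason patterns (3, 4 or 5) to a grade group.

    A simplified illustration of the standard grade-group table, not a clinical
    tool; real Gleason grading involves additional rules and expert judgment.
    """
    score = primary + secondary
    if score <= 6:
        return 1                          # e.g. 3 + 3
    if score == 7:
        return 2 if primary == 3 else 3   # 3 + 4 vs. 4 + 3
    if score == 8:
        return 4
    return 5                              # score 9 or 10

print(grade_group(primary=3, secondary=4))  # 2
print(grade_group(primary=4, secondary=3))  # 3
```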

Our research

In our recent work, “Development and Validation of a Deep Learning Algorithm for Gleason Grading of Prostate Cancer from Biopsy Specimens”, published in JAMA Oncology, we explored whether an AI system could accurately Gleason grade smaller prostate samples (biopsies). Biopsies are performed early in prostate cancer care to establish the initial diagnosis and determine patient treatment, and so are more commonly performed than surgeries. However, biopsies can be more difficult to grade than surgical samples due to the smaller amount of tissue and unintended changes to the sample from the tissue extraction and preparation process. The AI system we developed first “grades” each region of a biopsy, and then summarizes the region-level classifications into an overall biopsy-level score.

Gleason grading

The first stage of the deep learning system Gleason grades every region in a biopsy. In this biopsy, green indicates Gleason pattern 3 while yellow indicates Gleason pattern 4.
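Conceptually, the two-stage pipeline described above can be sketched as follows. Here region_model is a hypothetical stand-in for the first-stage network, and the aggregation rule shown (most common cancerous pattern as primary, highest remaining pattern as secondary) is a simplification of the actual second stage.

```python
from collections import Counter

def grade_biopsy(patches, region_model):
    """Sketch of the two-stage pipeline: grade each region, then summarize.

    region_model is a hypothetical stand-in for the first-stage network; here it
    returns 0 for benign tissue or the Gleason pattern (3, 4 or 5) of a patch.
    """
    patterns = [p for p in (region_model(patch) for patch in patches) if p > 0]
    if not patterns:
        return None  # no cancerous pattern detected
    primary = Counter(patterns).most_common(1)[0][0]
    secondary = max(set(patterns) - {primary}, default=primary)
    return primary, secondary  # e.g. (3, 4) -> Gleason 3 + 4 = 7, grade group 2

# Toy usage with a dummy model that calls every patch pattern 3:
print(grade_biopsy(patches=range(10), region_model=lambda patch: 3))  # (3, 3)
```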

Our results 

Given the complexity of Gleason grading, we worked with six experienced expert pathologists to evaluate the AI system. These experts, who have specialized training in prostate cancer and an average of 25 years of experience, determined the Gleason grades of 498 tumor samples. Highlighting how difficult Gleason grading is, a cohort of 19 “general” pathologists (without specialist training in prostate cancer) achieved an average accuracy of 58 percent on these samples. By contrast, our AI system’s accuracy was substantially higher at 72 percent. Finally, some prostate cancers have ambiguous appearances, resulting in disagreements even amongst experts. Taking this uncertainty into account, the deep learning system’s agreement rate with experts was comparable to the agreement rate between the experts themselves.

Cancer pathology workflow

Potential cancer pathology workflow augmented with AI-based assistive tools: a tumor sample is first collected and digitized using a high-magnification scanner. Next, the AI system provides a grade group for each sample.

These promising results indicate that the deep learning system has the potential to support expert-level diagnoses and expand access to high-quality cancer care. To evaluate if it could improve the accuracy and consistency of prostate cancer diagnoses, this technology needs to be validated as an assistive tool in further clinical studies and on larger and more diverse patient groups. However, we believe that AI-based tools could help pathologists in their work, particularly in situations where specialist expertise is limited.

Our research advancements in both prostate and breast cancer were the result of collaborations with the Naval Medical Center San Diego and support from Verily. Our appreciation also goes to several institutions that provided access to de-identified data, and many pathologists who provided advice or reviewed prostate cancer samples. We look forward to future research and investigation into how our technology can be best validated, designed and used to improve patient care and cancer outcomes.

Improving Holistic Scene Understanding with Panoptic-DeepLab

Posted by Bowen Cheng, Student Researcher, and Liang-Chieh Chen, Research Scientist, Google Research

Real-world computer vision applications, such as self-driving cars and robotics, rely on two core tasks — instance segmentation and semantic segmentation. Instance segmentation identifies the class and extent of individual “things” in an image (i.e., countable objects such as people, animals, cars, etc.) and assigns unique identifiers to each (e.g., car_1 and car_2). This is complemented by semantic segmentation, which labels all pixels in an image, including the “things” that are present as well as the surrounding “stuff” (e.g., amorphous regions of similar texture or material, such as grass, sky or road). This latter task, however, does not differentiate between pixels of the same class that belong to different instances of that class.

Panoptic segmentation represents the unification of these two approaches with the goal of assigning a unique value to every pixel in an image that encodes both the semantic label and the instance ID. Most existing panoptic segmentation algorithms are based on Mask R-CNN, which treats semantic and instance segmentation separately. The instance segmentation step identifies objects in an image, but it often produces object instance masks that overlap one another. To settle the conflict between overlapping instance masks, one commonly employs a heuristic that resolves the discrepancy either based on the mask with a higher confidence score or by use of a pre-defined pairwise relationship between categories (e.g., a tie should always be worn on a person’s front). Additionally, the discrepancies between semantic and instance segmentation results are resolved by favoring the instance predictions. While these methods generally produce good results, they also introduce heavy latency, which makes it challenging to apply them in real-time applications.
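One common way to pack “semantic label + instance ID” into a single per-pixel value is to combine them with a label divisor, as in the NumPy sketch below; the divisor of 1000 and the class IDs are assumed conventions for illustration and are not specified in this post.

```python
import numpy as np

LABEL_DIVISOR = 1000  # assumed convention: panoptic_id = semantic_id * 1000 + instance_id

def encode_panoptic(semantic_map, instance_map):
    """Pack per-pixel semantic labels and instance IDs into one panoptic map.
    "Stuff" regions (road, sky, ...) simply use instance ID 0."""
    return semantic_map.astype(np.int64) * LABEL_DIVISOR + instance_map

def decode_panoptic(panoptic_map):
    return panoptic_map // LABEL_DIVISOR, panoptic_map % LABEL_DIVISOR

# Two cars (illustrative class 13, instances 1 and 2) next to road (class 7, "stuff"):
semantic = np.array([[13, 13, 7],
                     [13, 13, 7]])
instance = np.array([[1, 2, 0],
                     [1, 2, 0]])
print(encode_panoptic(semantic, instance))  # [[13001 13002 7000], [13001 13002 7000]]
```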

Driven by the need for a real-time panoptic segmentation model, we propose “Panoptic-DeepLab: a simple, fast and strong system for panoptic segmentation”, accepted to CVPR 2020. In this work, we extend the commonly used modern semantic segmentation model, DeepLab, to perform panoptic segmentation using only a small number of additional parameters and marginal additional computation. The resulting model, Panoptic-DeepLab, produces semantic and instance segmentation in parallel and without overlap, avoiding the need for the manually designed heuristics adopted by other methods. Additionally, we develop a computationally efficient operation that merges the semantic and instance segmentation results, enabling near real-time end-to-end panoptic segmentation prediction. Unlike methods based on Mask R-CNN, Panoptic-DeepLab does not generate bounding box predictions and requires only three loss functions during training, significantly fewer than current state-of-the-art methods, such as UPSNet, which can have up to eight. Finally, Panoptic-DeepLab has demonstrated state-of-the-art performance on several academic datasets.

Panoptic segmentation results obtained by Panoptic-DeepLab. Left: Video frames used as input to the panoptic segmentation model. Right: Results overlaid on video frames. Each object instance has a unique label, e.g., car_1, car_2, etc.

Overview
Panoptic-DeepLab is simple both conceptually and architecturally. At a high level, it predicts three outputs. The first is semantic segmentation, in which it assigns a semantic class (e.g., car or grass) to each pixel. However, it does not differentiate between multiple instances of the same class. So, for example, if one car is partly behind another, the pixels associated with both would have the same class and would be indistinguishable from one another. This is addressed by the other two outputs of the model: a center-of-mass prediction for each instance, and an instance center regression in which the model learns to regress each instance pixel to its center of mass. This latter step ensures that the model associates pixels of a given class with the appropriate instance. The class-agnostic instance segmentation, obtained by grouping predicted foreground pixels to their closest predicted instance centers, is then fused with the semantic segmentation by a majority-vote rule to generate the final panoptic segmentation.

Overview of Panoptic-DeepLab. Semantic segmentation associates pixels in the image with general classes, while the class-agnostic instance segmentation step identifies the pixels associated with an individual object, regardless of class. Taken together, these produce the final panoptic segmentation image.
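A simplified sketch of that merging step is shown below: each “thing” pixel is assigned to the predicted instance center closest to its regressed location, and each resulting instance takes the majority semantic class of its pixels. The array names and the label-divisor encoding are illustrative assumptions; the actual implementation is more involved.

```python
import numpy as np

def merge_predictions(semantic_pred, center_points, offsets, thing_mask,
                      label_divisor=1000):
    """Fuse semantic segmentation with class-agnostic instances (simplified sketch).

    semantic_pred: (H, W) int array, predicted semantic class per pixel.
    center_points: (K, 2) float array, predicted instance centers as (y, x).
    offsets:       (H, W, 2) float array, predicted offset from each pixel to its center.
    thing_mask:    (H, W) bool array, True where pixels belong to countable "things".
    """
    panoptic = semantic_pred.astype(np.int64) * label_divisor  # "stuff": instance ID 0
    if center_points.shape[0] == 0 or not thing_mask.any():
        return panoptic

    h, w = semantic_pred.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each pixel "votes" for a location by adding its regressed offset ...
    voted = np.stack([ys + offsets[..., 0], xs + offsets[..., 1]], axis=-1)
    # ... and is grouped with the nearest predicted instance center.
    dists = np.linalg.norm(voted[:, :, None, :] - center_points[None, None], axis=-1)
    instance_id = np.where(thing_mask, dists.argmin(axis=-1) + 1, 0)

    for k in range(1, center_points.shape[0] + 1):
        mask = instance_id == k
        if mask.any():
            # Majority vote: the instance takes the most common semantic class inside it.
            majority_class = np.bincount(semantic_pred[mask]).argmax()
            panoptic[mask] = majority_class * label_divisor + k
    return panoptic
```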

Neural Network Design
Panoptic-DeepLab consists of four components: (1) an encoder backbone pre-trained on ImageNet, shared by both the semantic segmentation and instance segmentation branches of the architecture; (2) atrous spatial pyramid pooling (ASPP) modules, similar to that used by DeepLab, which are deployed independently in each branch in order to perform segmentation at a range of spatial scales; (3) similarly decoupled decoder modules specific to each segmentation task; and (4) task-specific prediction heads.

The encoder backbone (1), which has been pre-trained on ImageNet, extracts feature maps that are shared by both the semantic segmentation and instance segmentation branches of the architecture. Typically, the feature map is generated by the backbone model using a standard convolution, which reduces the resolution of the output map to 1/32nd that of the input image and is too coarse for accurate image segmentation. In order to preserve the details of object boundaries, we instead employ atrous convolution, which better retains important features like edges, to generate a feature map with a resolution of 1/16th the original. This is then followed by two ASPP modules (2), one for each branch, which capture multi-scale information for segmentation.
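The contrast between a standard strided convolution and an atrous (dilated) convolution can be seen in a couple of lines of TensorFlow; the layer sizes here are arbitrary and purely illustrative.

```python
import tensorflow as tf

# A standard strided convolution halves spatial resolution ...
standard = tf.keras.layers.Conv2D(256, 3, strides=2, padding="same")
# ... while an atrous (dilated) convolution keeps the resolution and instead
# enlarges the receptive field by "inflating" the kernel with a dilation rate.
atrous = tf.keras.layers.Conv2D(256, 3, strides=1, dilation_rate=2, padding="same")

x = tf.random.normal([1, 64, 64, 3])
print(standard(x).shape)  # (1, 32, 32, 256): resolution reduced
print(atrous(x).shape)    # (1, 64, 64, 256): resolution preserved
```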

The light-weight decoder modules (3) follow those used in the most recent DeepLab version (DeepLabV3+), but with two modifications. First, we reintroduce an additional low-level feature map (1/8th scale) to the decoder, which helps to preserve spatial information from the original image (e.g., object boundaries) that can be significantly degraded in the final feature map output by the backbone. Second, instead of using the typical 3 × 3 kernel, the decoder employs a 5 × 5 depthwise-separable convolution, which yields somewhat better performance at only a minimal cost in additional overhead.

The two prediction heads (4) are tailored to their task. The semantic segmentation head employs a weighted version of the standard bootstrapped cross entropy loss function, which weights each pixel differently and has proven to be more effective for segmentation of small-scale objects. The instance segmentation head is trained to predict the offsets between the center of mass of an object instance and the surrounding pixels, without knowledge of the object class, forming the class-agnostic instance masks.
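Putting the three heads together, a training loss in the spirit of Panoptic-DeepLab might be assembled as below. The specific loss variants (plain cross entropy instead of the weighted bootstrapped version, squared error for the center heatmap, L1 for the offsets) and the weighting constants are illustrative assumptions, not the paper's exact recipe.

```python
import tensorflow as tf

def panoptic_deeplab_loss(outputs, targets,
                          w_semantic=1.0, w_center=200.0, w_offset=0.01):
    """Three-term training loss in the spirit of Panoptic-DeepLab (illustrative).

    outputs and targets are dicts of tensors; the individual loss choices below
    and the weights above are assumptions for illustration, not the exact recipe.
    """
    # Semantic head: per-pixel cross entropy (the post describes a weighted,
    # bootstrapped variant; plain sparse cross entropy is used here for brevity).
    semantic_loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            targets["semantic"], outputs["semantic_logits"], from_logits=True))
    # Instance center head: regression toward a heatmap of instance centers of mass.
    center_loss = tf.reduce_mean(
        tf.square(targets["center_heatmap"] - outputs["center_heatmap"]))
    # Offset head: L1 regression of each pixel's offset to its instance center.
    offset_loss = tf.reduce_mean(
        tf.abs(targets["offsets"] - outputs["offsets"]))
    return (w_semantic * semantic_loss
            + w_center * center_loss
            + w_offset * offset_loss)
```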

Results
To demonstrate the effectiveness of Panoptic-DeepLab, we conduct experiments on three popular academic datasets: Cityscapes, Mapillary Vistas, and COCO. With a simple architecture, Panoptic-DeepLab ranks first on Cityscapes for all three tasks (semantic, instance and panoptic segmentation) without any task-specific fine-tuning. Additionally, Panoptic-DeepLab won the Best Result, Best Paper, and Most Innovative awards on the Mapillary Panoptic Segmentation track at the ICCV 2019 Joint COCO and Mapillary Recognition Challenge Workshop, outperforming the 2018 winner by a healthy margin of 1.5%. Finally, Panoptic-DeepLab sets new state-of-the-art bottom-up (i.e., box-free) panoptic segmentation results on the COCO dataset, and is also comparable to other methods based on Mask R-CNN.

Accuracy (PQ) vs. Speed (GPU inference time) across three datasets.

Conclusion
With a simple architecture and only three training loss functions, Panoptic-DeepLab achieves state-of-the-art performance while being faster than other methods based on Mask R-CNN. To summarize, we develop the first single-shot panoptic segmentation model that attains state-of-the-art performance on several public benchmarks and delivers near real-time end-to-end inference speed. We hope our simple and effective Panoptic-DeepLab can establish a solid baseline and further benefit the research community.

Acknowledgements
We would like to thank Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Florian Schroff, and the Google Mobile Vision team for their support and valuable discussions.
