Data Cascades in Machine Learning

Nithya Sambasivan, Research Scientist, Google Research

Data is a foundational aspect of machine learning (ML) that can impact the performance, fairness, robustness, and scalability of ML systems. Paradoxically, while building ML models is highly prioritized, the work related to the data itself is often the least prioritized aspect. This data work can require multiple roles (such as data collectors, annotators, and ML developers) and often involves multiple teams (such as database, legal, or licensing teams) to power a data infrastructure, which adds complexity to any data-related project. As such, the field of human-computer interaction (HCI), which is focused on making technology useful and usable for people, can help both to identify potential issues and to assess the impact on models when data-related work is not prioritized.

In “‘Everyone wants to do the model work, not the data work’: Data Cascades in High-Stakes AI”, published at the 2021 ACM CHI Conference, we study and validate downstream effects from data issues that result in technical debt over time (defined as “data cascades”). Specifically, we illustrate the phenomenon of data cascades with the data practices and challenges of ML practitioners across the globe working in important ML domains, such as cancer detection, landslide detection, loan allocation and more — domains where ML systems have enabled progress, but also where there is opportunity to improve by addressing data cascades. This work is the first that we know of to formalize, measure, and discuss data cascades in ML as applied to real-world projects. We further discuss the opportunity presented by a collective re-imagining of ML data as a high priority, including rewarding ML data work and workers, recognizing the scientific empiricism in ML data research, improving the visibility of data pipelines, and improving data equity around the world.

Origins of Data Cascades
We observe that data cascades often originate early in the lifecycle of an ML system, at the stage of data definition and collection. Cascades also tend to be complex and opaque in diagnosis and manifestation, so there are often no clear indicators, tools, or metrics to detect and measure their effects. Because of this, small data-related obstacles can grow into larger and more complex challenges that affect how a model is developed and deployed. Challenges from data cascades include the need to perform costly system-level changes much later in the development process, or the decrease in users’ trust due to model mis-predictions that result from data issues. Nevertheless, and encouragingly, we also observe that such data cascades can be avoided through early interventions in ML development.

Different color arrows indicate different types of data cascades, which typically originate upstream, compound over the ML development process, and manifest downstream.

Examples of Data Cascades
One of the most common causes of data cascades is when models that are trained on noise-free datasets are deployed in the often-noisy real world. For example, a common type of data cascade originates from model drifts, which occur when target and independent variables deviate, resulting in less accurate models. Drifts are more common when models closely interact with new digital environments — including high-stakes domains, such as air quality sensing, ocean sensing, and ultrasound scanning — because there are no pre-existing and/or curated datasets. Such drifts can lead to additional factors that further decrease a model’s performance (e.g., factors related to hardware, the environment, and human knowledge). For example, to ensure good model performance, data is often collected in controlled, in-house environments. But in the live systems of new digital environments with resource constraints, it is more common for data to be collected with physical artefacts such as fingerprints, shadows, dust, improper lighting, and pen markings, which can add noise that affects model performance. In other cases, environmental factors such as rain and wind can unexpectedly move image sensors in deployment, which can also trigger cascades. As one of the model developers we interviewed reported, even a small drop of oil or water can affect data that could be used to train a cancer prediction model, thereby affecting the model’s performance. Because drifts are often caused by the noise in real-world environments, they also take the longest — up to two to three years — to manifest, almost always in production.
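As a hedged illustration of how this kind of drift might be caught earlier, the sketch below compares the distribution of a feature between curated training data and live production data with a two-sample Kolmogorov–Smirnov test. The feature name, data, and significance threshold are illustrative assumptions, not details from the study.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_features, live_features, alpha=0.01):
    """Flag features whose live distribution has drifted from training.

    train_features / live_features: dict mapping feature name -> 1-D array.
    Returns (name, KS statistic) pairs for features where a two-sample KS test
    rejects the hypothesis that both samples come from the same distribution.
    """
    drifted = []
    for name, train_values in train_features.items():
        stat, p_value = ks_2samp(train_values, live_features[name])
        if p_value < alpha:
            drifted.append((name, stat))
    return drifted

# Illustrative use: simulated "controlled lab" vs. "noisy field" sensor readings.
rng = np.random.default_rng(0)
train = {"brightness": rng.normal(0.7, 0.05, 5000)}
live = {"brightness": rng.normal(0.55, 0.15, 5000)}  # dust and lighting shift the signal
print(detect_feature_drift(train, live))
```

In practice, a team would run a check like this continuously across many features and alert on persistent drift rather than a single test result.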

Another common type of data cascade can occur when ML practitioners are tasked with managing data in domains in which they have limited expertise. For instance, certain kinds of information, such as identifying poaching locations or data collected during underwater exploration, rely on expertise in the biological sciences, social sciences, and community context. However, some developers in our study described having to take a range of data-related actions that surpassed their domain expertise — e.g., discarding data, correcting values, merging data, or restarting data collection — leading to data cascades that limited model performance. Relying on technical expertise rather than on domain expertise (for example, by engaging with domain experts) is what appeared to set off these cascades.

Two other cascades observed in this paper resulted from conflicting incentives and organizational practices between data collectors, ML developers, and other partners — for example, one cascade was caused by poor dataset documentation. While work related to data requires careful coordination across multiple teams, this is especially challenging when stakeholders are not aligned on priorities or workflows.

How to Address Data Cascades
Addressing data cascades requires a multi-part, systemic approach in ML research and practice:

  1. Develop and communicate the concept of goodness of the data that an ML system starts with, similar to how we think about goodness of fit with models. This includes developing standardized metrics and frequently using those metrics to measure data aspects like phenomenological fidelity (how accurately and comprehensively the data represents the phenomena) and validity (how well the data explains things related to the phenomena it captures), similar to how we have developed good metrics to measure model performance, like F1-scores. A toy sketch of such a data report appears after this list.
  2. Innovate on incentives to recognize work on data, such as welcoming empiricism on data in conference tracks, rewarding dataset maintenance, or rewarding employees for their work on data (collection, labelling, cleaning, or maintenance) in organizations.
  3. Data work often requires coordination across multiple roles and multiple teams, but this is quite limited currently (partly, but not wholly, because of the previously stated factors). Our research points to the value of fostering greater collaboration, transparency, and fairer distribution of benefits between data collectors, domain experts, and ML developers, especially with ML systems that rely on collecting or labelling niche datasets.
  4. Finally, our research across multiple countries indicates that data scarcity is pronounced in lower-income countries, where ML developers face the additional problem of defining and hand-curating new datasets, which makes it difficult to even start developing ML systems. It is important to enable open dataset banks, create data policies, and foster ML literacy of policy makers and civil society to address the current data inequalities globally.
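As a toy sketch of what a "goodness of the data" report could include (referenced in item 1 above), the snippet below summarizes a few cheap proxies — duplicates, missing values, and label balance — with pandas. The column names and the choice of proxies are illustrative assumptions, not standardized metrics from the paper.

```python
import pandas as pd

def data_goodness_report(df: pd.DataFrame, label_column: str) -> dict:
    """Toy 'goodness of data' summary: not a standard metric, just a starting point."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_fraction_per_column": df.isna().mean().round(3).to_dict(),
        # Severe class imbalance is one cheap proxy for poor representativeness.
        "label_distribution": df[label_column].value_counts(normalize=True).round(3).to_dict(),
    }

df = pd.DataFrame(
    {"image_id": [1, 2, 2, 3], "region": ["a", "b", "b", None], "label": [0, 1, 1, 1]}
)
print(data_goodness_report(df, label_column="label"))
```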

Conclusion
In this work we both provide empirical evidence and formalize the concept of data cascades in ML systems. We hope to create an awareness of the potential value that could come from incentivising data excellence. We also hope to introduce an under-explored but significant new research agenda for HCI. Our research on data cascades has led to evidence-backed, state-of-the-art guidelines for data collection and evaluation in the revised PAIR Guidebook, aimed at ML developers and designers.

Acknowledgements
This paper was written in collaboration with Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh and Lora Aroyo. We thank our study participants, and Sures Kumar Thoddu Srinivasan, Jose M. Faleiro, Kristen Olson, Biswajeet Malik, Siddhant Agarwal, Manish Gupta, Aneidi Udo-Obong, Divy Thakkar, Di Dang, and Solomon Awosupin.


A Browsable Petascale Reconstruction of the Human Cortex

Posted by Tim Blakely, Software Engineer and Michał Januszewski, Research Scientist, Connectomics at Google

In January 2020 we released the fly “hemibrain” connectome — an online database providing the morphological structure and synaptic connectivity of roughly half of the brain of a fruit fly (Drosophila melanogaster). This database and its supporting visualization have reframed the way that neural circuits are studied and understood in the fly brain. While the fruit fly brain is small enough to attain a relatively complete map using modern mapping techniques, the insights gained are, at best, only partially informative for understanding the most interesting object in neuroscience — the human brain.

Today, in collaboration with the Lichtman Laboratory at Harvard University, we are releasing the “H01” dataset, a 1.4 petabyte rendering of a small sample of human brain tissue, along with a companion paper, “A connectomic study of a petascale fragment of human cerebral cortex.” The H01 sample was imaged at 4nm-resolution by serial section electron microscopy, reconstructed and annotated by automated computational techniques, and analyzed for preliminary insights into the structure of the human cortex. The dataset comprises imaging data that covers roughly one cubic millimeter of brain tissue, and includes tens of thousands of reconstructed neurons, millions of neuron fragments, 130 million annotated synapses, 104 proofread cells, and many additional subcellular annotations and structures — all easily accessible with the Neuroglancer browser interface. H01 is thus far the largest sample of brain tissue imaged and reconstructed in this level of detail, in any species, and the first large-scale study of synaptic connectivity in the human cortex that spans multiple cell types across all layers of the cortex. The primary goals of this project are to produce a novel resource for studying the human brain and to improve and scale the underlying connectomics technologies.

Petabyte connectomic reconstruction of a volume of human neocortex. Left: Small subvolume of the dataset. Right: A subgraph of 5000 neurons and excitatory (green) and inhibitory (red) connections in the dataset. The full graph (connectome) would be far too dense to visualize.

What is the Human Cortex?
The cerebral cortex is the thin surface layer of the brain found in vertebrate animals that has evolved most recently, showing the greatest variation in size among different mammals (it is especially large in humans). Each part of the cerebral cortex has six layers (e.g., L2), with different kinds of nerve cells (e.g., spiny stellate) in each layer. The cerebral cortex plays a crucial role in most higher-level cognitive functions, such as thinking, memory, planning, perception, language, and attention. Although there has been some progress in understanding the macroscopic organization of this very complicated tissue, its organization at the level of individual nerve cells and their interconnecting synapses is largely unknown.

Human Brain Connectomics: From Surgical Biopsy to a 3D Database
Mapping the structure of the brain at the resolution of individual synapses requires high-resolution microscopy techniques that can image biochemically stabilized (fixed) tissue. We collaborated with brain surgeons at Massachusetts General Hospital in Boston (MGH) who sometimes remove pieces of normal human cerebral cortex when performing a surgery to cure epilepsy in order to gain access to a site in the deeper brain where an epileptic seizure is being initiated. Patients anonymously donated this tissue, which is normally discarded, to our colleagues in the Lichtman lab. The Harvard researchers cut the tissue into ~5300 individual 30 nanometer sections using an automated tape collecting ultra-microtome, mounted those sections onto silicon wafers, and then imaged the brain tissue at 4 nm resolution in a customized 61-beam parallelized scanning electron microscope for rapid image acquisition.

Imaging the ~5300 physical sections produced 225 million individual 2D images. Our team then computationally stitched and aligned this data to produce a single 3D volume. While the quality of the data was generally excellent, these alignment pipelines had to robustly handle a number of challenges, including imaging artifacts, missing sections, variation in microscope parameters, and physical stretching and compression of the tissue. Once aligned, a multiscale flood-filling network pipeline was applied (using thousands of Google Cloud TPUs) to produce a 3D segmentation of each individual cell in the tissue. Additional machine learning pipelines were applied to identify and characterize 130 million synapses, classify each 3D fragment into various “subcompartments” (e.g., axon, dendrite, or cell body), and identify other structures of interest such as myelin and cilia. Automated reconstruction results were imperfect, so manual efforts were used to “proofread” roughly one hundred cells in the data. Over time, we expect to add additional cells to this verified set through additional manual efforts and further advances in automation.
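The actual reconstruction uses flood-filling networks running on thousands of TPUs, which is far beyond a blog-sized example. Purely as a conceptual stand-in for the "grayscale volume in, per-object IDs out" segmentation step, the toy sketch below labels connected components in a thresholded 3-D array with scipy; it does not approximate the real method.

```python
import numpy as np
from scipy import ndimage

def toy_segment_volume(volume: np.ndarray, threshold: float) -> np.ndarray:
    """Assign an integer ID to each connected region in a 3-D volume.

    Only an illustration of turning a volume into per-object IDs; the real H01
    reconstruction uses learned flood-filling networks, not thresholding.
    """
    foreground = volume > threshold
    # Default connectivity is face-adjacency; pass structure=np.ones((3, 3, 3)) for 26-connectivity.
    labels, num_objects = ndimage.label(foreground)
    print(f"found {num_objects} candidate objects")
    return labels

volume = np.random.default_rng(1).random((64, 64, 64))
segmentation = toy_segment_volume(volume, threshold=0.995)
```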

The H01 volume: roughly one cubic millimeter of human brain tissue captured in 1.4 petabytes of images.

The imaging data, reconstruction results, and annotations are viewable through an interactive web-based 3D visualization interface, called Neuroglancer, that was originally developed to visualize the fruit fly brain. Neuroglancer is available as open-source software, and widely used in the broader connectomics community. Several new features were introduced to support analysis of the H01 dataset, in particular support for searching for specific neurons in the dataset based on their type or other properties.

   
Nine views from the Neuroglancer interface to the H01 volume and annotations: the volume spans all six cortical layers; highlighting Layer 2 interneurons; excitatory and inhibitory incoming synapses; neuronal subcompartments classified; a Chandelier cell and some of the Pyramidal neurons it inhibits; blood vessels traced throughout the volume; serial contact between a pair of neurons; an axon with elaborate whorl structures; and a neuron with an unusual propensity for self-contact (credit: Rachael Han). The user can select specific cells on the basis of their layer and type, view incoming and outgoing synapses for the cell, and much more.

Analysis of the Human Cortex
In a companion preprint, we show how H01 has already been used to study several interesting aspects of the organization of the human cortex. In particular, new cell types have been discovered, as well as the presence of “outlier” axonal inputs, which establish powerful synaptic connections with target dendrites. While these findings are a promising start, the vastness of the H01 dataset will provide a basis for many years of further study by researchers interested in the human cortex.

In order to accelerate the analysis of H01, we also provide embeddings of the H01 data that were generated by a neural network trained using a variant of the SimCLR self-supervised learning technique. These embeddings provide highly informative representations of local parts of the dataset that can be used to rapidly annotate new structures and develop new ways of clustering and categorizing brain structures according to purely data-driven criteria. We trained these embeddings using Google Cloud TPU pods and then performed inference at roughly four billion data locations spread throughout the volume.
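As one hedged example of how the released embeddings might be used, the sketch below clusters embedding vectors so that structurally similar locations can be reviewed and annotated together. The file name, array shape, and cluster count are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical file of precomputed SimCLR-style embeddings, one row per location.
embeddings = np.load("h01_embeddings_sample.npy")  # shape: (num_locations, embedding_dim)

# Group locations into purely data-driven clusters; the number of clusters is arbitrary here.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(embeddings)
cluster_ids = kmeans.labels_

# Locations sharing a cluster ID can then be reviewed together when annotating new structures.
print(np.bincount(cluster_ids))
```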

Managing Dataset Size with Improved Compression
H01 is a petabyte-scale dataset, but is only one-millionth the volume of an entire human brain. Serious technical challenges remain in scaling up synapse-level brain mapping to an entire mouse brain (500x bigger than H01), let alone an entire human brain. One of these challenges is data storage: a mouse brain could generate an exabyte of data, which is costly to store. To address this, we are today also releasing a paper, “Denoising-based Image Compression for Connectomics”, that details how a machine learning-based denoising strategy can be used to compress data, such as H01, at least 17-fold (dashed line in the figure below), with negligible loss of accuracy in the automated reconstruction.

Reconstruction quality of noisy and denoised images as a function of compression rate for JPEG XL (JXL) and AV Image Format (AVIF) codecs. Points and lines show the means, and the shaded area covers ±1 standard deviation around the mean.

Random variations in the electron microscopy imaging process lead to image noise that is difficult to compress even in principle, as the noise lacks spatial correlations or other structure that could be described with fewer bytes. Therefore we acquired images of the same piece of tissue in both a “fast” acquisition regime (resulting in high amounts of noise) and a “slow” acquisition regime (resulting in low amounts of noise) and then trained a neural network to infer “slow” scans from “fast” scans. Standard image compression codecs were then able to (lossily) compress the “virtual” slow scans with fewer artifacts compared to the raw data. We believe this advance has the potential to significantly mitigate the costs associated with future large scale connectomics projects.
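A minimal sketch of this denoise-then-compress recipe is shown below. The box-blur "denoiser" is only a stand-in for the learned fast-to-slow network, and plain JPEG stands in for the JPEG XL / AVIF codecs evaluated in the paper.

```python
import numpy as np
from PIL import Image

def denoise_then_compress(fast_scan: np.ndarray, denoise_model, out_path: str, quality: int = 75):
    """Denoise a noisy 'fast' scan, then compress the result with a standard lossy codec.

    denoise_model is assumed to map a noisy uint8 image to a clean float image in [0, 1];
    the actual work uses a learned fast->slow network and JPEG XL / AVIF rather than JPEG.
    """
    virtual_slow_scan = denoise_model(fast_scan)  # remove hard-to-compress noise first
    clean_uint8 = np.clip(virtual_slow_scan * 255, 0, 255).astype(np.uint8)
    Image.fromarray(clean_uint8).save(out_path, format="JPEG", quality=quality)

def box_blur_denoiser(image: np.ndarray) -> np.ndarray:
    """Trivial stand-in 'denoiser' for demonstration only: a 3x3 box blur."""
    padded = np.pad(image.astype(float) / 255.0, 1, mode="edge")
    return sum(
        padded[dy : dy + image.shape[0], dx : dx + image.shape[1]]
        for dy in range(3) for dx in range(3)
    ) / 9.0

noisy = (np.random.default_rng(2).random((256, 256)) * 255).astype(np.uint8)
denoise_then_compress(noisy, box_blur_denoiser, "virtual_slow_scan.jpg")
```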

Next Steps
But storage is not the only problem. The sheer size of future data sets will require developing new strategies for researchers to organize and access the rich information inherent in connectomic data. These are challenges that will require new modes of interaction between humans and the forthcoming brain-mapping data.


Cross-Modal Contrastive Learning for Text-to-Image Generation

Posted by Han Zhang, Research Scientist and Jing Yu Koh, Software Engineer, Google Research

Automatic text-to-image synthesis, in which a model is trained to generate images from text descriptions alone, is a challenging task that has recently received significant attention. Its study provides rich insights into how machine learning (ML) models capture visual attributes and relate them to text. Compared to other kinds of inputs to guide image creation, such as sketches, object masks or mouse traces (which we have highlighted in prior work), descriptive sentences are a more intuitive and flexible way to express visual concepts. Hence, a strong automatic text-to-image generation system can also be a useful tool for rapid content creation and could be applied to many other creative applications, similar to other efforts to integrate machine learning into the creation of art (e.g., Magenta).

State-of-the-art image synthesis results are typically achieved using generative adversarial networks (GANs), which train two models — a generator, which tries to create realistic images, and a discriminator, which tries to determine if an image is real or fabricated. Many text-to-image generation models are GANs that are conditioned using text inputs in order to generate semantically relevant images. This is significantly challenging, especially when long, ambiguous descriptions are provided. Moreover, GAN training can be prone to mode collapse, a common failure case for the training process in which the generator learns to produce only a limited set of outputs, so that the discriminator fails to learn robust strategies to recognize fabricated images. To mitigate mode collapse, some approaches use multi-stage refinement networks that iteratively refine an image. However, such systems require multi-stage training, which is less efficient than simpler single-stage end-to-end models. Other efforts rely on hierarchical approaches that first model object layouts before finally synthesizing a realistic image. This requires the use of labeled segmentation data, which can be difficult to obtain.

In “Cross-Modal Contrastive Learning for Text-to-Image Generation,” to appear at CVPR 2021, we present the Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN), which addresses text-to-image generation by learning to maximize the mutual information between image and text using inter-modal (image-text) and intra-modal (image-image) contrastive losses. This approach helps the discriminator to learn more robust and discriminative features, so XMC-GAN is less prone to mode collapse even with one-stage training. Importantly, XMC-GAN achieves state-of-the-art performance with a simple one-stage generation, as compared to previous multi-stage or hierarchical approaches. It is end-to-end trainable, and only requires image-text pairs (as opposed to labeled segmentation or bounding box data).

Contrastive Losses for Text-to-Image Synthesis
The goal of text-to-image synthesis systems is to produce clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. To achieve this, we propose to maximize the mutual information between the corresponding pairs: (1) images (real or generated) with a sentence describing the scene; (2) a generated image and a real image with the same description; and (3) regions of an image (real or generated) and words or phrases associated with them.

In XMC-GAN, this is enforced using contrastive losses. Similar to other GANs, XMC-GAN contains a generator for synthesizing images, and a discriminator that is trained to act as a critic between real and generated images. Three sets of data contribute to the contrastive loss in this system — the real images, the text that describes those images, and the images generated from the text descriptions. The individual loss functions for both the generator and the discriminator are combinations of the loss calculated from whole images with the full text description, combined with the loss calculated from sub-divided images with associated words or phrases. Then, for each batch of training data, we calculate the cosine similarity score between each text description and the real images, and likewise, between each text description and the batch of generated images. The goal is for the matching pairs (both text-to-image and real image-to-generated image) to have high similarity scores and for non-matching pairs to have low scores. Enforcing such a contrastive loss allows the discriminator to learn more robust and discriminative features.
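As a rough illustration of the inter-modal part of such an objective, the sketch below implements a generic symmetric contrastive loss over batch-wise cosine similarities in PyTorch. It captures the "matching pairs high, non-matching pairs low" idea but is not the exact XMC-GAN loss; the temperature and embedding sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeddings: torch.Tensor, text_embeddings: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """Batch-wise contrastive loss between matching image/text embeddings.

    image_embeddings, text_embeddings: (batch, dim) tensors where row i of each is a matching pair.
    Matching pairs are pushed toward high cosine similarity, all other pairs toward low similarity.
    """
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    logits = image_embeddings @ text_embeddings.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(logits.shape[0], device=logits.device)  # the diagonal holds matching pairs
    # Symmetric cross-entropy: image->text and text->image directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```

The intra-modal (image-image) and region-word losses follow the same pattern with different embedding pairs.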

Inter-modal and intra-modal contrastive learning in our proposed XMC-GAN text-to-image synthesis model.

Results
We apply XMC-GAN to three challenging datasets — the first was a collection of MS-COCO descriptions of MS-COCO images, and the other two were datasets annotated with Localized Narratives, one of which covers MS-COCO images (which we call LN-COCO) and the other of which describes Open Images data (LN-OpenImages). We find that XMC-GAN achieves a new state of the art on each. The images generated by XMC-GAN depict scenes that are of higher quality than those generated using other techniques. On MS-COCO, XMC-GAN improves the state-of-the-art Fréchet inception distance (FID) score from 24.7 to 9.3, and is significantly preferred by human evaluators.

Selected qualitative results for generated images on MS-COCO.

Similarly, human raters prefer the image quality of XMC-GAN generated images 77.3% of the time, and prefer their image-text alignment 74.1% of the time, when compared with three other state-of-the-art approaches (CP-GAN, SD-GAN, and OP-GAN).

Human evaluation on MS-COCO for image quality and text alignment. Annotators rank (anonymized and order-randomized) generated images from best to worst.

XMC-GAN also generalizes well to the challenging Localized Narratives dataset, which contains longer and more detailed descriptions. Our prior work TReCS tackles text-to-image generation for Localized Narratives using mouse trace inputs to improve image generation quality. Despite not receiving mouse trace annotations, XMC-GAN is able to significantly outperform TReCS on image generation on LN-COCO, improving state-of-the-art FID from 48.7 to 14.1. Incorporating mouse traces and other additional inputs into an end-to-end model such as XMC-GAN would be interesting to study in future work.

In addition, we also train and evaluate on LN-OpenImages, which is more challenging than MS-COCO because the dataset is much larger, with images that cover a broader range of subject matter and that are more complex (8.4 objects on average). To the best of our knowledge, XMC-GAN is the first text-to-image synthesis model that is trained and evaluated on Open Images. XMC-GAN is able to generate high-quality results, and sets a strong benchmark FID score of 26.9 on this very challenging task.

Random samples of real and generated images on Open Images.

Conclusion and Future Work
In this work, we present a cross-modal contrastive learning framework to train GAN models for text-to-image synthesis. We investigate several cross-modal contrastive losses that enforce correspondence between image and text. For both human evaluations and quantitative metrics, XMC-GAN establishes a marked improvement over previous models on multiple datasets. It generates high quality images that match their input descriptions well, including for long, detailed narratives, and does so while being a simpler, end-to-end model. We believe that this represents a significant advance towards creative applications for image generation from natural language descriptions. As we continue this research, we are continually evaluating responsible approaches, potential applications and risk mitigation, in accordance with our AI Principles.

Acknowledgements
This is a joint work with Jason Baldridge, Honglak Lee, and Yinfei Yang. We would like to thank Kevin Murphy, Zizhao Zhang, Dilip Krishnan for their helpful feedback. We also want to thank the Google Data Compute team for their work on conducting human evaluations. We are also grateful for general support from the Google Research team.


Understanding Contextual Facial Expressions Across the Globe

Posted by Alan Cowen, Visiting Researcher and Gautam Prasad, Software Engineer, Google Research

It might seem reasonable to assume that people’s facial expressions are universal — so, for example, whether a person is from Brazil, India or Canada, their smile upon seeing close friends or their expression of awe at a fireworks display would look essentially the same. But is that really true? Is the association between these facial expressions and their relevant context across geographies indeed universal? What can similarities — or differences — between the situations where someone grins or frowns tell us about how people may be connected across different cultures?

Scientists seeking to answer these questions and to uncover the extent to which people are connected across cultures and geography often use survey-based studies that can rely heavily on local language, norms, and values. However, such studies are not scalable, and often end up with small sample sizes and inconsistent findings.

In contrast to survey-based studies, studying patterns of facial movement provides a more direct understanding of expressive behavior. But analyzing how facial expressions are actually used in everyday life would require researchers to go through millions of hours of real-world footage, which is too time-consuming to do manually. In addition, facial expressions and the contexts in which they are exhibited are complicated, requiring large sample sizes in order to make statistically sound conclusions. While existing studies have produced diverging answers to the question of the universality of facial expressions in given contexts, applying machine learning (ML) in order to appropriately scale the research has the potential to provide clarity.

In “Sixteen facial expressions occur in similar contexts worldwide”, published in Nature, we present research undertaken in collaboration with UC Berkeley to conduct the first large-scale worldwide analysis of how facial expressions are actually used in everyday life, leveraging deep neural networks (DNNs) to drastically scale up expression analysis in a responsible and thoughtful way. Using a dataset of six million publicly available videos across 144 countries, we analyze the contexts in which people use a variety of facial expressions and demonstrate that rich nuances in facial behavior — including subtle expressions — are used in similar social situations around the world.

A Deep Neural Network Measuring Facial Expression
Facial expressions are not static. If one were to examine a person’s expression instant by instant, what might at first appear to be “anger”, may instead end up being “awe”, “surprise” or “confusion”. The interpretation depends on the dynamics of a person’s face as their expression presents itself. The challenge in building a neural network to understand facial expressions, then, is that it must interpret the expression within its temporal context. Training such a system requires a large and diverse, cross-cultural dataset of videos with fully annotated expressions.

To build the dataset, skilled raters manually searched through a broad collection of publicly available videos to identify those likely to contain clips covering all of our pre-selected expression categories. To ensure that the videos matched the region they were assumed to represent, preference in video selection was given to those that included the geographic location of origin. The faces in the videos were then found using a deep convolutional neural network (CNN) — similar to the Google Cloud Face Detection API — that follows faces over the course of the clip using a method based on traditional optical flow. Using an interface similar to Google Crowdsource, annotators then labeled facial expressions across 28 distinct categories if present at any point during the clip. Because the goal was to sample how an average person would perceive an expression, the annotators were not coached or trained, nor were they provided examples or definitions of the target expressions. Below, we discuss additional experiments that evaluate whether the model trained from these annotations was biased.

Raters were presented videos with a single face highlighted for their attention. They observed the subject throughout the duration of the clip and annotated the facial expressions they exhibited. (source video)

The face detection algorithm established a sequence of locations of each face throughout the video. We then used a pre-trained Inception network to extract features representing the most salient aspects of facial expressions from the faces. The features were then fed into a long short-term memory (LSTM) network, a type of recurrent neural network that is able to model how a facial expression might evolve over time due to its ability to remember salient information from the past.
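This per-frame-features-into-an-LSTM structure can be sketched roughly as follows in PyTorch. An ImageNet-pretrained Inception v3 stands in for the actual feature extractor, and the layer sizes and 28-way output are illustrative rather than the production configuration.

```python
import torch
import torch.nn as nn
from torchvision import models  # torchvision >= 0.13 assumed for the weights enum

class ExpressionOverTime(nn.Module):
    """Per-frame CNN features fed into an LSTM to model how an expression evolves."""

    def __init__(self, num_expressions: int = 28, hidden_size: int = 256):
        super().__init__()
        backbone = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
        backbone.fc = nn.Identity()   # keep the 2048-d pooled features, drop the classifier
        backbone.aux_logits = False
        self.backbone = backbone
        self.lstm = nn.LSTM(input_size=2048, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_expressions)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, 299, 299) cropped face images from the clip.
        b, t = frames.shape[:2]
        features = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        temporal, _ = self.lstm(features)
        return self.head(temporal[:, -1])  # one score vector per clip

model = ExpressionOverTime().eval()
with torch.no_grad():
    scores = model(torch.randn(2, 8, 3, 299, 299))
```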

In order to ensure that the model was making consistent predictions across a range of demographic groups, we evaluated the model’s fairness on an existing dataset that was constructed using similar facial expression labels, targeting a subset of 16 expressions on which it exhibited the best performance.

The model’s performance was consistent across all of the demographic groups represented in the evaluation dataset, which provides supporting evidence that the model trained to annotate facial expressions is not measurably biased. The model’s annotations of those 16 facial expressions across 1,500 images can be explored here.
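A per-group breakdown like the one below is one simple way to surface this kind of consistency check; the column names are hypothetical, and large gaps between groups would warrant further investigation.

```python
import pandas as pd

def per_group_accuracy(df: pd.DataFrame, group_column: str = "demographic_group") -> pd.Series:
    """Accuracy per demographic group; large gaps between groups warrant investigation."""
    return (
        df.assign(correct=df["predicted_expression"] == df["labeled_expression"])
          .groupby(group_column)["correct"]
          .mean()
          .sort_values()
    )

eval_df = pd.DataFrame({
    "predicted_expression": ["awe", "amusement", "awe", "triumph"],
    "labeled_expression":   ["awe", "amusement", "triumph", "triumph"],
    "demographic_group":    ["A", "A", "B", "B"],
})
print(per_group_accuracy(eval_df))
```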

We modeled the selected face in each video by using a CNN to extract features from the face at each frame, which were then fed into an LSTM network to model the changes in the expression over time. (source video)

Measuring the Contexts Captured in Videos
To understand the context of facial expressions across millions of videos, we used DNNs that could capture the fine-grained content and automatically recognize the context. The first DNN modeled a combination of text features (title and description) associated with a video along with the actual visual content (video-topic model). In addition, we used a DNN that only relied on text features without any visual information (text-topic model). These models predict thousands of labels describing the videos. In our experiments these models were able to identify hundreds of unique contexts (e.g., wedding, sporting event, or fireworks) showcasing the diversity of the data we used for the analysis.

The Covariation Between Expressions and Contexts Around the World
In our first experiment, we analyzed 3 million public videos captured on mobile phones. We chose to focus on mobile uploads because they are more likely to contain natural expressions. We correlated the facial expressions that occurred in the videos to the context annotations derived from the video-topic model. We found that 16 kinds of facial expressions had distinct associations with everyday social contexts that were consistent across the world. For instance, the expressions that people associate with amusement occurred more often in videos with practical jokes; expressions that people associate with awe, in videos with fireworks; and triumph, with sporting events. These results have strong implications for discussions about the relative importance of psychologically relevant context in facial expression, compared to other factors, such as those unique to an individual, culture, or society.

Our second experiment analyzed a separate set of 3 million videos, but this time we annotated the contexts with the text-topic model. The results verified that the findings in the first experiment were not driven by subtle influences of facial expressions in the video on the annotations of the video-topic model. In other words, we used this experiment to verify our conclusions from the first experiment, given the possibility that the video-topic model could implicitly be factoring in facial expressions when computing its content labels.

We correlated the expression and context annotations across all of the videos within each region. Each expression was found to have specific associations with different contexts that were preserved across 12 world regions. For example, here, in red, we can see that expressions people associate with awe were found more often in the context of fireworks, pets, and toys than in other contexts.

In both experiments, the correlations between expressions and contexts appeared to be well-preserved across cultures. To quantify exactly how similar the associations between expressions and contexts were across the 12 different world regions we studied, we computed second-order correlations between each pair of regions. These correlations identify the relationships between different expressions and contexts in each region and then compare them with other regions. We found that 70% of the context–expression associations found in each region are shared across the modern world.
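One way to compute such second-order correlations, sketched below with NumPy and purely illustrative inputs: build an expression-by-context correlation matrix for each region, then correlate the flattened matrices between pairs of regions.

```python
import numpy as np

def expression_context_matrix(expressions: np.ndarray, contexts: np.ndarray) -> np.ndarray:
    """Correlate each expression score with each context score across a region's videos.

    expressions: (num_videos, num_expressions), contexts: (num_videos, num_contexts).
    Returns a (num_expressions, num_contexts) correlation matrix.
    """
    full = np.corrcoef(expressions.T, contexts.T)
    return full[: expressions.shape[1], expressions.shape[1]:]

def second_order_correlation(region_a: np.ndarray, region_b: np.ndarray) -> float:
    """Correlation between two regions' expression-context association patterns."""
    return float(np.corrcoef(region_a.ravel(), region_b.ravel())[0, 1])

rng = np.random.default_rng(3)
matrix_a = expression_context_matrix(rng.random((500, 16)), rng.random((500, 300)))
matrix_b = expression_context_matrix(rng.random((500, 16)), rng.random((500, 300)))
print(second_order_correlation(matrix_a, matrix_b))
```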

Finally, we asked how many of the 16 kinds of facial expression we measured had distinct associations with different contexts that were preserved around the world. To do so, we applied a method called canonical correlations analysis, which showed that all 16 facial expressions had distinct associations that were preserved across the world.

Conclusions
We were able to examine the contexts in which facial expressions occur in everyday life across cultures at an unprecedented scale. Machine learning allowed us to analyze millions of videos across the world and discover evidence supporting hypotheses that facial expressions are preserved to a degree in similar contexts across cultures.

Our results also leave room for cultural differences. Although the correlations between facial expressions and contexts were 70% consistent around the world, they were up to 30% variable across regions. Neighboring world regions generally had more similar associations between facial expressions and contexts than distant world regions, indicating that the geographic spread of human culture may also play a role in the meanings of facial expressions.

This work shows that we can use machine learning to better understand ourselves and identify common communication elements across cultures. Tools such as DNNs give us the opportunity to bring vast amounts of diverse data to bear on scientific discovery, enabling more confidence in the statistical conclusions. We hope our work provides a template for using the tools of machine learning in a responsible way and sparks more innovative research in other scientific domains.

Acknowledgements
Special thanks to our co-authors Dacher Keltner from UC Berkeley, along with Florian Schroff, Brendan Jou, and Hartwig Adam from Google Research. We are also grateful for additional support at Google provided by Laura Rapin, Reena Jana, Will Carter, Unni Nair, Christine Robson, Jen Gennai, Sourish Chaudhuri, Greg Corrado, Brian Eoff, Andrew Smart, Raine Serrano, Blaise Aguera y Arcas, Jay Yagnik, and Carson Mcneil.


Finding any Cartier watch in under 3 seconds

Cartier is legendary in the world of luxury — a name that is synonymous with iconic jewelry and watches, timeless design, savoir-faire and exceptional customer service.

Maison Cartier’s collection dates back to the opening of Louis-François Cartier’s very first Paris workshop in 1847. And with over 174 years of history, the Maison’s catalog is extensive, with over a thousand wristwatches, some with only slight variations between them. Finding specific models, or comparing several models at once, could take some time for a sales associate working at one of Cartier’s 265 boutiques — hardly ideal for a brand with a reputation for high-end client service. 

In 2020, Cartier turned to Google Cloud to address this challenge. 

An impressive collection needs an app to match 

Cartier’s goal was to develop an app to help sales associates find any watch in its immense catalog quickly. The app would use an image to find detailed information about any watch the Maison had ever designed (starting with the past decade) and suggest similar-looking watches with possibly different characteristics, such as price. 

But creating this app presented some unique challenges for the Cartier team. Visual product search uses artificial intelligence (AI) technology like machine learning algorithms to identify an item (like a Cartier wristwatch) in a picture and return related products. But visual search technology needs to be “trained” with a huge amount of data to recognize a product correctly — in this case, images of the thousands of watches in Cartier’s collections. 

As a Maison that has always been driven by its exclusive design, Cartier had very few in-store product images available. The photos that did exist weren’t consistent, varying in backgrounds, lighting, quality and styling. This made it very challenging to create an app that could categorize images correctly. 

On top of that, Cartier has very high standards for its client service. For the stores to successfully adopt the app, the visual product search app would need to identify products accurately 90% of the time and ideally return results within five seconds. 

Redefining Cartier’s luxury customer experience with AI technology

Working together with Cartier’s team, we helped them build a visual product search system using Google Cloud AI Platform services, including AutoML Vision and Vision API.

The system can recognize a watch’s colors and materials and then use this information to figure out which collection the watch is from. It analyzes an image and comes back with a list of the three watches that look most similar, which sales associates can click on to get more information. The visual product search system identifies watches with 96.5% accuracy and can return results within three seconds.
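Cartier's implementation details are not public, but a query against a pre-built Cloud Vision Product Search index has roughly the shape sketched below. The product-set path and the "general-v1" category are placeholders, and the sketch assumes the catalog images have already been uploaded and indexed.

```python
from google.cloud import vision

def find_similar_watches(image_path: str, product_set_path: str, max_results: int = 3):
    """Query a pre-built Vision Product Search index with a photo of a watch."""
    client = vision.ImageAnnotatorClient()

    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())

    image_context = vision.ImageContext(
        product_search_params=vision.ProductSearchParams(
            product_set=product_set_path,        # e.g. projects/.../locations/.../productSets/...
            product_categories=["general-v1"],   # placeholder category
        )
    )
    response = client.product_search(image, image_context=image_context)
    results = response.product_search_results.results[:max_results]
    # Return the closest catalog entries with their similarity scores.
    return [(result.product.display_name, result.score) for result in results]
```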

Now, when customers are interested in a specific Cartier watch, the boutique team can take a picture of the desired model (or use any existing photo of it) and use the app to find its equivalent product page online. The app can also locate products that look similar in the catalog, displaying each item with its own image and a detailed description that customers can explore when the boutique team clicks on it. Sales associates can also send feedback about how relevant the recommendations were so that the Cartier team can continually improve the app. For a deeper understanding of the Cloud and AI technology powering this app, check out this blog post.

High-quality design and service never go out of style

Today, the visual product search app is used across all of the Maison’s global boutiques, helping sales associates find information about any of Cartier’s creations across its catalog. Instead of several minutes, associates can now answer customer questions in seconds. And over time, the Maison hopes to add other helpful features to the app. 

The success of this project shows it’s possible to embrace new technology and bring innovation while preserving the quality and services that have established Cartier as a force among luxury brands. With AI technology, the future is looking very bright. 


KELM: Integrating Knowledge Graphs with Language Model Pre-training Corpora

Posted by Siamak Shakeri, Staff Software Engineer and Oshin Agarwal, Research Intern, Google Research

Large pre-trained natural language processing (NLP) models, such as BERT, RoBERTa, GPT-3, T5 and REALM, leverage natural language corpora that are derived from the Web and fine-tuned on task-specific data, and have made significant advances in various NLP tasks. However, natural language text alone represents a limited coverage of knowledge, and facts may be contained in wordy sentences in many different ways. Furthermore, the existence of non-factual information and toxic content in text can eventually cause biases in the resulting models.

Alternate sources of information are knowledge graphs (KGs), which consist of structured data. KGs are factual in nature because the information is usually extracted from more trusted sources, and post-processing filters and human editors ensure that inappropriate and incorrect content is removed. Therefore, models that can incorporate them carry the advantages of improved factual accuracy and reduced toxicity. However, their different structural format makes it difficult to integrate them with the existing pre-training corpora in language models.

In “Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training” (KELM), accepted at NAACL 2021, we explore converting KGs to synthetic natural language sentences to augment existing pre-training corpora, enabling their integration into the pre-training of language models without architectural changes. To that end, we leverage the publicly available English Wikidata KG and convert it into natural language text in order to create a synthetic corpus. We then augment REALM, a retrieval-based language model, with the synthetic corpus as a method of integrating natural language corpora and KGs in pre-training. We have released this corpus publicly for the broader research community.

Converting KG to Natural Language Text
KGs consist of factual information represented explicitly in a structured format, generally in the form of [subject entity, relation, object entity] triples, e.g., [10×10 photobooks, inception, 2012]. A group of related triples is called an entity subgraph. An example of an entity subgraph that builds on the previous example of a triple is { [10×10 photobooks, instance of, Nonprofit Organization], [10×10 photobooks, inception, 2012] }, which is illustrated in the figure below. A KG can be viewed as interconnected entity subgraphs.

Converting subgraphs into natural language text is a standard task in NLP known as data-to-text generation. Although there have been significant advances on data-to-text generation on benchmark datasets such as WebNLG, converting an entire KG into natural text has additional challenges. The entities and relations in large KGs are more vast and diverse than in small benchmark datasets. Moreover, benchmark datasets consist of predefined subgraphs that can form fluent meaningful sentences. With an entire KG, such a segmentation into entity subgraphs needs to be created as well.

An example illustration of how the pipeline converts an entity subgraph (in bubbles) into synthetic natural sentences (far right).

In order to convert the Wikidata KG into synthetic natural sentences, we developed a verbalization pipeline named “Text from KG Generator” (TEKGEN), which is made up of the following components: a large training corpus of heuristically aligned Wikipedia text and Wikidata KG triples, a text-to-text generator (T5) to convert the KG triples to text, an entity subgraph creator for generating groups of triples to be verbalized together, and finally, a post-processing filter to remove low quality outputs. The result is a corpus containing the entire Wikidata KG as natural text, which we call the Knowledge-Enhanced Language Model (KELM) corpus. It consists of ~18M sentences spanning ~45M triples and ~1500 relations.
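The core triple-to-text step can be sketched with a publicly available text-to-text model, as below. The linearization format and the generic `t5-base` checkpoint are stand-ins for the TEKGEN model, which is fine-tuned on the aligned Wikipedia/Wikidata corpus before being used this way, so the generated sentence here is only illustrative.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Generic pretrained checkpoint as a stand-in; TEKGEN fine-tunes T5 on heuristically
# aligned Wikipedia text and Wikidata triples before this step.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def verbalize_subgraph(triples):
    """Turn a list of (subject, relation, object) triples into one natural sentence."""
    linearized = " ".join(f"<S> {s} <R> {r} <O> {o}" for s, r, o in triples)
    inputs = tokenizer(linearized, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(verbalize_subgraph([
    ("10x10 photobooks", "instance of", "Nonprofit Organization"),
    ("10x10 photobooks", "inception", "2012"),
]))
```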

Converting a KG to natural language, which is then used for language model augmentation

Integrating Knowledge Graph and Natural Text for Language Model Pre-training
Our evaluation shows that KG verbalization is an effective method of integrating KGs with natural language text. We demonstrate this by augmenting the retrieval corpus of REALM, which includes only Wikipedia text.

To assess the effectiveness of verbalization, we augment the REALM retrieval corpus with the KELM corpus (i.e., “verbalized triples”) and compare its performance against augmentation with concatenated triples without verbalization. We measure the accuracy with each data augmentation technique on two popular open-domain question answering datasets: Natural Questions and Web Questions.

Augmenting REALM with even the concatenated triples improves accuracy, potentially adding information not expressed in text explicitly or at all. However, augmentation with verbalized triples allows for a smoother integration of the KG with the natural language text corpus, as demonstrated by the higher accuracy. We also observed the same trend on a knowledge probe called LAMA that queries the model using fill-in-the-blank questions.

Conclusion
With KELM, we provide a publicly-available corpus of a KG as natural text. We show that KG verbalization can be used to integrate KGs with natural text corpora to overcome their structural differences. This has real-world applications for knowledge-intensive tasks, such as question answering, where providing factual knowledge is essential. Moreover, such corpora can be applied in pre-training of large language models, and can potentially reduce toxicity and improve factuality. We hope that this work encourages further advances in integrating structured knowledge sources into pre-training of large language models.

Acknowledgements
This work has been a collaborative effort involving Oshin Agarwal, Heming Ge, Siamak Shakeri and Rami Al-Rfou. We thank William Woods, Jonni Kanerva, Tania Rojas-Esponda, Jianmo Ni, Aaron Cohen and Itai Rolnick for rating a sample of the synthetic corpus to evaluate its quality. We also thank Kelvin Guu for his valuable feedback on the paper.


11 ways we’re innovating with AI

AI is integral to so much of the work we do at Google. Fundamental advances in computing are helping us confront some of the greatest challenges of this century, like climate change. Meanwhile, AI is also powering updates across our products, including Search, Maps and Photos — demonstrating how machine learning can improve your life in both big and small ways. 

In case you missed it, here are some of the AI-powered updates we announced at Google I/O.


LaMDA is a breakthrough in natural language understanding for dialogue.

Human conversations are surprisingly complex. They’re grounded in concepts we’ve learned throughout our lives; are composed of responses that are both sensible and specific; and unfold in an open-ended manner. LaMDA — short for “Language Model for Dialogue Applications” — is a machine learning model designed for dialogue and built on Transformer, a neural network architecture that Google invented and open-sourced. We think that this early-stage research could unlock more natural ways of interacting with technology and entirely new categories of helpful applications. Learn more about LaMDA.


And MUM, our new AI language model, will eventually help make Google Search a lot smarter.

In 2019 we launched BERT, a Transformer AI model that can better understand the intent behind your Search queries. Multitask Unified Model (MUM), our latest milestone, is 1000x more powerful than BERT. It can learn across 75 languages at once (most AI models train on one language at a time), and it can understand information across text, images, video and more. We’re still in the early days of exploring MUM, but the goal is that one day you’ll be able to type a long, information-dense, and natural sounding query like “I’ve hiked Mt. Adams and now want to hike Mt. Fuji next fall, what should I do differently to prepare?” and more quickly find relevant information you need. Learn more about MUM.

 

Project Starline will help you feel like you’re there, together.

Imagine looking through a sort of magic window. And through that window, you see another person, life-size, and in three dimensions. You can talk naturally, gesture and make eye contact.

Images of people we brought in to reconnect using Project Starline: a woman communicates with her sister and baby, a woman communicates with her friend, and a woman and a man communicate using sign language.

Project Starline is a technology project that combines advances in hardware and software to enable friends, family and co-workers to feel together, even when they’re cities (or countries) apart. To create this experience, we’re applying research in computer vision, machine learning, spatial audio and real-time compression. And we’ve developed a light field display system that creates a sense of volume and depth without needing additional glasses or headsets. It feels like someone is sitting just across from you, like they’re right there. Learn more about Project Starline.

Within a decade, we’ll build the world’s first useful, error-corrected quantum computer. And our new Quantum AI campus is where it’ll happen. 

Confronting many of the world’s greatest challenges, from climate change to the next pandemic, will require a new kind of computing. A useful, error-corrected quantum computer will allow us to mirror the complexity of nature, enabling us to develop new materials, better batteries, more effective medicines and more. Our new Quantum AI campus — home to research offices, a fabrication facility, and our first quantum data center — will help us build that computer before the end of the decade. Learn more about our work on the Quantum AI campus.


Maps will help reduce hard-braking moments while you drive.

Soon, Google Maps will use machine learning to reduce your chances of experiencing hard-braking moments — incidents where you slam hard on your brakes, caused by things like sudden traffic jams or confusion about which highway exit to take. 

When you get directions in Maps, we calculate your route based on a lot of factors, like how many lanes a road has or how direct the route is. With this update, we’ll also factor in the likelihood of hard-braking. Maps will identify the two fastest route options for you, and then we’ll automatically recommend the one with fewer hard-braking moments (as long as your ETA is roughly the same). We believe these changes have the potential to eliminate over 100 million hard-braking events in routes driven with Google Maps each year. Learn more about our updates to Maps.
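Google has not published the exact routing logic, but the behavior described above — among near-fastest routes, prefer the one with fewer predicted hard-braking events — can be illustrated with the toy rule below; the tolerance and numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class Route:
    name: str
    eta_minutes: float
    predicted_hard_braking_events: float

def recommend_route(routes, eta_tolerance_minutes: float = 2.0) -> Route:
    """Among near-fastest routes, prefer the one with fewer predicted hard-braking events.

    The tolerance and scoring here are illustrative; the product's actual logic is not public.
    """
    fastest_eta = min(r.eta_minutes for r in routes)
    candidates = [r for r in routes if r.eta_minutes <= fastest_eta + eta_tolerance_minutes]
    return min(candidates, key=lambda r: r.predicted_hard_braking_events)

print(recommend_route([
    Route("Highway exit A", 31.0, 4.2),
    Route("Highway exit B", 32.5, 1.1),
]))
```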


Your Memories in Google Photos will become even more personalized.

With Memories, you can already look back on important photos from years past or highlights from the last week. Using machine learning, we’ll soon be able to identify the less-obvious patterns in your photos. Starting later this summer, when we find a set of three or more photos with similarities like shape or color, we’ll highlight these little patterns for you in your Memories. For example, Photos might identify a pattern of your family hanging out on the same couch over the years — something you wouldn’t have ever thought to search for, but that tells a meaningful story about your daily life. Learn more about our updates to Google Photos.


And Cinematic moments will bring your pictures to life.

When you’re trying to get the perfect photo, you usually take the same shot two or three (or 20) times. Using neural networks, we can take two nearly identical images and fill in the gaps by creating new frames in between. This creates vivid, moving images called Cinematic moments. 

Producing this effect from scratch would take professional animators hours, but with machine learning we can automatically generate these moments and bring them to your Recent Highlights. Best of all, you don’t need a specific phone; Cinematic moments will come to everyone across Android and iOS. Learn more about Cinematic moments in Google Photos.
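As a conceptual stand-in for this learned interpolation, the sketch below generates in-between frames from two similar photos using classical optical flow with OpenCV; the real Cinematic moments feature uses neural networks rather than this approach.

```python
import cv2
import numpy as np

def interpolate_frames(frame_a: np.ndarray, frame_b: np.ndarray, num_inbetweens: int = 8):
    """Generate intermediate frames between two similar photos via optical-flow warping.

    A classical stand-in for learned frame interpolation: estimate dense flow from
    frame_a to frame_b, then warp frame_a progressively along that flow.
    """
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    height, width = gray_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(width), np.arange(height))
    frames = []
    for step in range(1, num_inbetweens + 1):
        t = step / (num_inbetweens + 1)
        # Backward warping: sample frame_a at positions displaced against the flow.
        map_x = (grid_x - t * flow[..., 0]).astype(np.float32)
        map_y = (grid_y - t * flow[..., 1]).astype(np.float32)
        frames.append(cv2.remap(frame_a, map_x, map_y, cv2.INTER_LINEAR))
    return frames

# Example usage: in_betweens = interpolate_frames(cv2.imread("shot1.jpg"), cv2.imread("shot2.jpg"))
```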

Two very similar pictures of a child and their baby sibling are transformed into a moving image — a Cinematic moment — thanks to AI.

New features in Google Workspace help make collaboration more inclusive. 

In Google Workspace, assisted writing will suggest more inclusive language when applicable. For example, it may recommend that you use the word “chairperson” instead of “chairman” or “mail carrier” instead of “mailman.” It can also give you other stylistic suggestions to avoid passive voice and offensive language, which can speed up editing and help make your writing stronger. Learn more about our updates to Workspace.
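The assisted-writing models behind this feature are learned, not rule-based, but the basic shape of a word-level suggestion pass can be illustrated with a simple lookup; the word list below is only an example.

```python
import re

# Example-only lookup table; the real feature uses learned language models, not a fixed list.
INCLUSIVE_SUGGESTIONS = {
    "chairman": "chairperson",
    "mailman": "mail carrier",
}

def suggest_inclusive_terms(text: str):
    """Return (found_word, suggestion, position) tuples for words with a more inclusive alternative."""
    suggestions = []
    for match in re.finditer(r"[A-Za-z]+", text):
        replacement = INCLUSIVE_SUGGESTIONS.get(match.group(0).lower())
        if replacement:
            suggestions.append((match.group(0), replacement, match.start()))
    return suggestions

print(suggest_inclusive_terms("The chairman asked the mailman to wait."))
```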

Google Shopping shows you the best products for your particular needs, thanks to our Shopping Graph.

To help shoppers find what they’re looking for, we need to have a deep understanding of all the products that are available, based on information from images, videos, online reviews and even inventory in local stores. Enter the Shopping Graph: our AI-enhanced model tracks products, sellers, brands, reviews, product information and inventory data — as well as how all these attributes relate to one another. With people shopping across Google more than a billion times a day, the Shopping Graph makes those sessions more helpful by connecting people with over 24 billion listings from millions of merchants across the web. Learn how we’re working with merchants to give you more ways to shop.

A dermatology assist tool can help you figure out what’s going on with your skin.

Each year we see billions of Google Searches related to skin, nail and hair issues, but it can be difficult to describe what you’re seeing on your skin through words alone.

With our CE-marked, AI-powered dermatology assist tool, a web-based application that we aim to make available for early testing in the EU later this year, it’s easier to figure out what might be going on with your skin. Simply use your phone’s camera to take three images of the skin, hair or nail concern from different angles. You’ll then be asked questions about your skin type, how long you’ve had the issue and other symptoms that help the AI to narrow down the possibilities. The AI model analyzes all of this information and draws from its knowledge of 288 conditions to give you a list of possible conditions that you can then research further. It’s not meant to be a replacement for diagnosis, but rather a good place to start. Learn more about our AI-powered dermatology assist tool.

And AI could help improve screening for tuberculosis.

Tuberculosis (TB) is one of the leading causes of death worldwide, infecting 10 million people per year and disproportionately impacting people in low-to-middle-income countries. It’s also really tough to diagnose early because of how similar symptoms are to other respiratory diseases. Chest X-rays help with diagnosis, but experts aren’t always available to read the results. That’s why the World Health Organization (WHO) recently recommended using technology to help with screening and triaging for TB. Researchers at Google are exploring how AI can be used to identify potential TB patients for follow-up testing, hoping to catch the disease early and work to eradicate it. Learn more about our ongoing research into tuberculosis screening.


Maysam Moussalem teaches Googlers human-centered AI

Originally, Maysam Moussalem dreamed of being an architect. “When I was 10, I looked up to see the Art Nouveau dome over the Galeries Lafayette in Paris, and I knew I wanted to make things like that,” she says. “Growing up between Austin, Paris, Beirut and Istanbul just fed my love of architecture.” But she found herself often talking to her father, a computer science (CS) professor, about what she wanted in a career. “I always loved art and science and I wanted to explore the intersections between fields. CS felt broader to me, and so I ended up there.”

While in grad school for CS, her advisor encouraged her to apply for a National Science Foundation Graduate Research Fellowship. “Given my lack of publications at the time, I wasn’t sure I should apply,” Maysam remembers. “But my advisor gave me some of the best advice I’ve ever received: ‘If you try, you may not get it. But if you don’t try, you definitely won’t get it.’” Maysam received the fellowship, which supported her throughout grad school. “I’ll always be grateful for that advice.”

Today, Maysam works in AI, in Google’s Machine Learning Education division and also as the co-author and editor-in-chief of the People + AI Research (PAIR) Guidebook. She’s hosting a session at Google I/O on “Building trusted AI products” as well, which you can view when it’s live at 9 am PT Thursday, May 20, as a part of Google Design’s I/O Agenda. We recently took some time to talk to Maysam about what landed her at Google, and her path toward responsible innovation.

How would you explain your job to someone who isn’t in tech?

I create different types of training, like workshops and labs for Googlers who work in machine learning and data science. I also help create guidebooks and courses that people who don’t work at Google use.

What’s something you didn’t realize would help you in your career one day?

I didn’t think that knowing seven languages would come in handy for my work here, but it did! When I was working on the externalization of the Machine Learning Crash Course, I was so happy to be able to review modules and glossary entries for the French translation!

How do you apply Google’s AI Principles in your work? 

I’m applying the AI Principles whenever I’m helping teams learn best practices for building user-centered products with AI. It’s so gratifying when someone who’s taken one of my classes tells me they had a great experience going through the training, they enjoyed learning something new and they feel ready to apply it in their work. Just like when I was an engineer, anytime someone told me the tool I’d worked on helped them do their job better and addressed their needs, it drove home the fourth AI principle: Being accountable to people. It’s so important to put people first in our work. 

This idea was really important when I was working on Google’s People + AI Research (PAIR) Guidebook. I love PAIR’s approach of putting humans at the center of product development. It’s really helpful when people in different roles come together and pool their skills to make better products. 

How did you go from being an engineer to doing what you’re doing now? 

At Google, it feels like I don’t have to choose between learning and working. There are tech talks every week, plus workshops and codelabs constantly. I’ve loved continuing to learn while working here.

Being raised by two professors also gave me a love of teaching. I wanted to share what I’d learned with others. My current role enables me to do this and use a wider range of my skills.

My background as an engineer gives me a strong understanding of how we build software at Google’s scale. This inspires me to think more about how to bring education into the engineering workflow, rather than forcing people to learn from a disconnected experience.

How can aspiring AI thinkers and future technologists prepare for a career in responsible innovation? 

Pick up and exercise a variety of skills! I’m a technical educator, but I’m always happy to pick up new skills that aren’t traditionally specific to my job. For example, I was thinking of a new platform to deliver internal data science training, and I learned how to create a prototype using UX tools so that I could illustrate my ideas really clearly in my proposal. I write, code, teach, design and I’m always interested in learning new techniques from my colleagues in other roles.

And spend time with your audience, the people who will be using your product or the coursework you’re creating or whatever it is you’re working on. When I was an engineer, I’d always look for opportunities to sit with, observe, and talk with the people who were using my team’s products. And I learned so much from this process.


Project Guideline: Enabling Those with Low Vision to Run Independently

Posted by Xuan Yang, Software Engineer, Google Research

For the 285 million people around the world living with blindness or low vision, exercising independently can be challenging. Earlier this year, we announced Project Guideline, an early-stage research project, developed in partnership with Guiding Eyes for the Blind, that uses machine learning to guide runners through a variety of environments that have been marked with a painted line. Using only a phone running Guideline technology and a pair of headphones, Guiding Eyes for the Blind CEO Thomas Panek was able to run independently for the first time in decades and complete an unassisted 5K in New York City’s Central Park.

Safely and reliably guiding a blind runner in unpredictable environments requires addressing a number of challenges. Here, we will walk through the technology behind Guideline and the process by which we were able to create an on-device machine learning model that could guide Thomas on an independent outdoor run. The project is still very much under development, but we’re hopeful it can help explore how on-device technology delivered by a mobile phone can provide reliable, enhanced mobility and orientation experiences for those who are blind or low vision.

Thomas Panek using Guideline technology to run independently outdoors.

Project Guideline
The Guideline system consists of a mobile device worn around the user’s waist with a custom belt and harness, a guideline on the running path marked with paint or tape, and bone conduction headphones. Core to the Guideline technology is an on-device segmentation model that takes frames from a mobile device’s camera as input and classifies every pixel in the frame into two classes, “guideline” and “not guideline”. This per-pixel confidence mask, applied to every frame, allows the Guideline app to predict where runners are with respect to the line on the path, without using location data. Based on this prediction and a subsequent smoothing/filtering function, the app sends audio signals to the runners to help them orient and stay on the line, or audio alerts to tell runners to stop if they veer too far away.
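To make the mask-to-audio step concrete, here is a minimal sketch (not Guideline’s actual code) of how a per-frame confidence mask could be reduced to a horizontal offset and then to an audio decision. The function names, the 50-pixel minimum, and the stop threshold are illustrative assumptions, and the real app applies additional smoothing and filtering across frames.

```python
from typing import Optional

import numpy as np

def line_offset_from_mask(mask: np.ndarray, threshold: float = 0.5) -> Optional[float]:
    """Estimate the guideline's horizontal offset from the frame center.

    `mask` is an (H, W) array of per-pixel "guideline" confidences in [0, 1].
    Returns a value in [-1, 1] (negative means the line is to the runner's left),
    or None if too few pixels look like the line.
    """
    ys, xs = np.nonzero(mask > threshold)
    if xs.size < 50:  # arbitrary minimum; the real system filters more carefully
        return None
    # Weight lower rows more heavily: they correspond to the path nearest the runner.
    weights = (ys + 1).astype(np.float32) / mask.shape[0]
    center_x = np.average(xs, weights=weights)
    half_width = mask.shape[1] / 2
    return float((center_x - half_width) / half_width)

def audio_decision(smoothed_offset: Optional[float], stop_threshold: float = 0.8) -> str:
    """Map a smoothed offset to a placeholder audio decision."""
    if smoothed_offset is None or abs(smoothed_offset) > stop_threshold:
        return "STOP_ALERT"
    side = "LEFT" if smoothed_offset < 0 else "RIGHT"
    return f"STEER_{side}:{abs(smoothed_offset):.2f}"  # would drive stereo panning/intensity
```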

Project Guideline uses Android’s built-in Camera2 and ML Kit APIs and adds custom modules to segment the guideline, detect its position and orientation, filter false signals, and send a stereo audio signal to the user in real time.

We faced a number of important challenges in building the preliminary Guideline system:

  1. System accuracy: Mobility for the blind and low vision community is a challenge in which user safety is of paramount importance. It demands a machine learning model that is capable of generating accurate and generalized segmentation results to ensure the safety of the runner in different locations and under various environmental conditions.
  2. System performance: In addition to addressing user safety, the system needs to be performant, efficient, and reliable. It must process at least 15 frames per second (FPS) in order to provide real-time feedback for the runner. It must also be able to run for at least 3 hours without draining the phone battery, and must work offline, without the need for an internet connection, should the walking/running path be in an area without data service.
  3. Lack of in-domain data: In order to train the segmentation model, we needed a large volume of video consisting of roads and running paths that have a yellow line on them. To generalize the model, data variety is equally as critical as data quantity, requiring video frames taken at different times of day, with different lighting conditions, under different weather conditions, at different locations, etc.

Below, we introduce solutions for each of these challenges.

Network Architecture
To meet the latency and power requirements, we built the line segmentation model on the DeepLabv3 framework, utilizing MobileNetV3-Small as the backbone, while simplifying the outputs to two classes – guideline and background.
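As a rough illustration only, the following Keras sketch wires a MobileNetV3-Small backbone to a two-class per-pixel head. It omits the DeepLabv3 specifics (atrous convolutions and the ASPP module) and the exact output stride, so treat it as the shape of the approach rather than the published architecture; the layer sizes and model name are assumptions.

```python
import tensorflow as tf

def build_line_segmenter(input_size: int = 513, num_classes: int = 2) -> tf.keras.Model:
    """Simplified stand-in for the Guideline segmenter: MobileNetV3-Small
    features plus a small head producing a 65 x 65 two-class mask."""
    backbone = tf.keras.applications.MobileNetV3Small(
        input_shape=(input_size, input_size, 3),
        include_top=False,
        weights="imagenet",
    )
    x = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")(backbone.output)
    x = tf.keras.layers.Conv2D(num_classes, 1)(x)              # per-pixel class logits
    x = tf.keras.layers.Resizing(65, 65, interpolation="bilinear")(x)
    probs = tf.keras.layers.Softmax()(x)                       # channel 1 = "guideline" confidence
    return tf.keras.Model(backbone.input, probs, name="guideline_segmenter_sketch")
```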

The model takes an RGB frame and generates an output grayscale mask, representing the confidence of each pixel’s prediction.

To increase throughput speed, we downsize the camera feed from 1920 x 1080 pixels to 513 x 513 pixels as input to the DeepLab segmentation model. To further speed up the DeepLab model for use on mobile devices, we skipped the last up-sample layer and directly output the 65 x 65 pixel predicted masks, which are then provided as input to post-processing. By minimizing the input resolution in both stages, we’re able to improve the runtime of the segmentation model and speed up post-processing.
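A deploy-time sketch of that flow, assuming the model has been converted to TensorFlow Lite (the file name and output tensor layout are placeholders): the camera frame is downsized to 513 x 513, run through the interpreter, and the 65 x 65 “guideline” confidence channel is handed to post-processing.

```python
import numpy as np
import tensorflow as tf

# Placeholder path; on-device the app runs the converted model through TFLite.
interpreter = tf.lite.Interpreter(model_path="guideline_segmenter.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def segment_frame(frame_rgb: np.ndarray) -> np.ndarray:
    """frame_rgb: (1080, 1920, 3) camera frame -> (65, 65) guideline confidences."""
    small = tf.image.resize(frame_rgb, (513, 513))             # downsize before inference
    small = tf.cast(small, inp["dtype"])[tf.newaxis, ...]
    interpreter.set_tensor(inp["index"], small.numpy())
    interpreter.invoke()
    probs = interpreter.get_tensor(out["index"])[0]            # assumed shape (65, 65, 2)
    return probs[..., 1]                                       # keep the "guideline" channel
```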

Data Collection
To train the model, we required a large set of training images in the target domain that exhibited a variety of path conditions. Not surprisingly, the publicly available datasets were built for autonomous driving use cases, with roof-mounted cameras and cars driving between the lines, and were not in the target domain. We found that training models on these datasets delivered unsatisfactory results due to the large domain gap. Instead, the Guideline model needed data collected with cameras worn around a person’s waist, running on top of the line, without the adversarial objects found on highways and crowded city streets.

The large domain gap between autonomous driving datasets and the target domain. Images on the left courtesy of the Berkeley DeepDrive dataset.

With preexisting open-source datasets proving unhelpful for our use case, we created our own training dataset composed of the following:

  1. Hand-collected data: Team members temporarily placed guidelines on paved pathways using duct tape in bright colors and recorded themselves running on and around the lines at different times of the day and in different weather conditions.
  2. Synthetic data: The data capture efforts were complicated and severely limited due to COVID-19 restrictions. This led us to build a custom rendering pipeline to synthesize tens of thousands of images, varying the environment, weather, lighting, shadows, and adversarial objects. When the model struggled with certain conditions in real-world testing, we were able to generate specific synthetic datasets to address the situation. For example, the model originally struggled with segmenting the guideline amidst piles of fallen autumn leaves. With additional synthetic training data, we were able to correct for that in subsequent model releases.

Rendering pipeline generates synthetic images to capture a broad spectrum of environments.

We also created a small regression dataset, which consisted of annotated samples of the most frequently seen scenarios combined with the most challenging scenarios, including tree and human shadows, fallen leaves, adversarial road markings, sunlight reflecting off the guideline, sharp turns, steep slopes, etc. We used this dataset to compare new models to previous ones and to make sure that an overall improvement in accuracy of the new model did not hide a reduction in accuracy in particularly important or challenging scenarios.
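A minimal sketch of that regression gate, with made-up scenario names, scores, and tolerance: the candidate model’s per-scenario metrics are compared against the previous release so that an improved overall average cannot hide a drop in a hard scenario.

```python
from typing import Dict

def regressed_scenarios(new_scores: Dict[str, float],
                        old_scores: Dict[str, float],
                        tolerance: float = 0.01) -> Dict[str, float]:
    """Return scenarios where the candidate model is worse than the previous
    release by more than `tolerance` (scores could be per-scenario mIoU)."""
    return {
        name: new_scores.get(name, 0.0) - old
        for name, old in old_scores.items()
        if new_scores.get(name, 0.0) < old - tolerance
    }

# Illustrative numbers only:
print(regressed_scenarios(
    new_scores={"tree_shadows": 0.93, "fallen_leaves": 0.88, "sharp_turns": 0.94},
    old_scores={"tree_shadows": 0.90, "fallen_leaves": 0.92, "sharp_turns": 0.92},
))  # flags "fallen_leaves" even though the overall average improved
```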

Training Procedure
We designed a three-stage training procedure and used transfer learning to overcome the limited in-domain training dataset problem. We started with a model that was pre-trained on Cityscapes, and then trained the model using the synthetic images, as this dataset is larger but of lower quality. Finally, we fine-tuned the model using the limited in-domain data we collected.
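The schedule could be sketched as below, assuming a Keras model and tf.data pipelines; the checkpoint path, optimizers, learning rates, and epoch counts are placeholders rather than the values used in the project.

```python
import tensorflow as tf

def staged_training(model: tf.keras.Model,
                    synthetic_ds: tf.data.Dataset,
                    in_domain_ds: tf.data.Dataset) -> tf.keras.Model:
    """Three stages: Cityscapes pre-training (represented here by loading a
    pre-trained checkpoint), training on the large synthetic set, then
    fine-tuning on the small in-domain set at a lower learning rate."""
    model.load_weights("cityscapes_pretrained.ckpt")           # placeholder checkpoint

    loss = tf.keras.losses.SparseCategoricalCrossentropy()

    # Stage 2: large but lower-quality synthetic data.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=loss)
    model.fit(synthetic_ds, epochs=20)

    # Stage 3: limited in-domain data, gentler updates.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss=loss)
    model.fit(in_domain_ds, epochs=5)
    return model
```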

Three-stage training procedure to overcome the limited data issue. Images in the left column courtesy of Cityscapes.

Early in development, it became clear that the segmentation model’s performance suffered at the top of the image frame. As the guidelines travel further away from the camera’s point of view at the top of the frame, the lines themselves start to vanish. This causes the predicted masks to be less accurate at the top parts of the frame. To address this problem, we computed a loss value that was based on the top k pixel rows in every frame. We used this value to select those frames that included the vanishing guidelines with which the model struggled, and trained the model repeatedly on those frames. This process proved to be very helpful not only in addressing the vanishing line problem, but also for solving other problems we encountered, such as blurry frames, curved lines and line occlusion by adversarial objects.
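One plausible way to implement that selection (our sketch, not the project’s code): compute a cross-entropy loss restricted to the top k pixel rows of each predicted mask, then oversample the frames whose top-row loss is highest. The value of k, the quantile cutoff, and the function names are assumptions.

```python
import numpy as np
import tensorflow as tf

def top_rows_loss(y_true: tf.Tensor, y_pred: tf.Tensor, k: int = 16) -> tf.Tensor:
    """Per-frame cross-entropy over only the top k pixel rows, where the
    distant guideline vanishes and predictions tend to be weakest.
    y_true: (B, 65, 65) integer labels; y_pred: (B, 65, 65, 2) probabilities."""
    ce = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)  # (B, 65, 65)
    return tf.reduce_mean(ce[:, :k, :], axis=[1, 2])                      # one score per frame

def hard_frame_indices(frames: tf.Tensor, labels: tf.Tensor,
                       model: tf.keras.Model, quantile: float = 0.9) -> np.ndarray:
    """Indices of the hardest frames by top-row loss; these could be repeated
    in later training rounds (the exact re-training loop is an assumption)."""
    scores = top_rows_loss(labels, model(frames, training=False)).numpy()
    return np.nonzero(scores >= np.quantile(scores, quantile))[0]
```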

The segmentation model’s accuracy and robustness continuously improved even in challenging cases.

System Performance
Together with TensorFlow Lite and ML Kit, the end-to-end system runs remarkably fast on Pixel devices, achieving 29+ FPS on Pixel 4 XL and 20+ FPS on Pixel 5. We deployed the segmentation model entirely on the DSP, running at 6 ms per frame on Pixel 4 XL and 12 ms per frame on Pixel 5 with high accuracy. The end-to-end system achieves a 99.5% frame success rate and 93% mIoU on our evaluation dataset, and passes our regression test. These performance metrics are critically important and enable the system to provide real-time feedback to the user.
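For reference, mIoU over the two classes can be computed as in the sketch below; the post does not define the frame success rate, so this covers only the mIoU figure, and the thresholding choice is an assumption.

```python
import numpy as np

def mean_iou(pred: np.ndarray, truth: np.ndarray, num_classes: int = 2) -> float:
    """Mean intersection-over-union across classes, accumulated over all
    evaluation frames. `pred` and `truth` are integer label arrays of the
    same shape, e.g. stacked 65 x 65 masks thresholded at 0.5."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, truth == c).sum()
        union = np.logical_or(pred == c, truth == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```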

What’s Next
We’re still at the beginning of our exploration, but we’re excited about our progress and what’s to come. We’re starting to collaborate with additional leading non-profit organizations that serve the blind and low vision communities to put more Guidelines in parks, schools, and public places. By painting more lines, getting direct feedback from users, and collecting more data under a wider variety of conditions, we hope to further generalize our segmentation model and improve the existing feature-set. At the same time, we are investigating new research and techniques, as well as new features and capabilities that would improve the overall system robustness and reliability.

To learn more about the project and how it came to be, read Thomas Panek’s story. If you want to help us put more Guidelines in the world, please visit goo.gle/ProjectGuideline.

Acknowledgements
Project Guideline is a collaboration across Google Research, Google Creative Lab, and the Accessibility Team. We especially would like to thank our team members: Mikhail Sirotenko, Sagar Waghmare, Lucian Lonita, Tomer Meron, Hartwig Adam, Ryan Burke, Dror Ayalon, Amit Pitaru, Matt Hall, John Watkinson, Phil Bayer, John Mernacaj, Cliff Lungaretti, Dorian Douglass, Kyndra LoCoco. We also thank Fangting Xia, Jack Sim and our other colleagues and friends from the Mobile Vision team and Guiding Eyes for the Blind.


Google I/O 2021: Being helpful in moments that matter

It’s great to be back hosting our I/O Developers Conference this year. Pulling up to our Mountain View campus this morning, I felt a sense of normalcy for the first time in a long while. Of course, it’s not the same without our developer community here in person. COVID-19 has deeply affected our entire global community over the past year and continues to take a toll. Places such as Brazil, and my home country of India, are now going through their most difficult moments of the pandemic yet. Our thoughts are with everyone who has been affected by COVID and we are all hoping for better days ahead.

The last year has put a lot into perspective. At Google, it’s also given renewed purpose to our mission to organize the world’s information and make it universally accessible and useful. We continue to approach that mission with a singular goal: building a more helpful Google, for everyone. That means being helpful to people in the moments that matter and giving everyone the tools to increase their knowledge, success, health and happiness. 

Helping in moments that matter

Sometimes it’s about helping in big moments, like keeping 150 million students and educators learning virtually over the last year with Google Classroom. Other times it’s about helping in little moments that add up to big changes for everyone. For example, we’re introducing safer routing in Maps. This AI-powered capability in Maps can identify road, weather and traffic conditions where you are likely to brake suddenly; our aim is to reduce up to 100 million events like this every year. 

Reimagining the future of work

One of the biggest ways we can help is by reimagining the future of work. Over the last year, we’ve seen work transform in unprecedented ways, as offices and coworkers have been replaced by kitchen countertops and pets. Many companies, including ours, will continue to offer flexibility even when it’s safe to be in the same office again. Collaboration tools have never been more critical, and today we announced a new smart canvas experience in Google Workspace that enables even richer collaboration. 

Smart Canvas integration with Google Meet.

Responsible next-generation AI

We’ve made remarkable advances over the past 22 years, thanks to our progress in some of the most challenging areas of AI, including translation, images and voice. These advances have powered improvements across Google products, making it possible to talk to someone in another language using Assistant’s interpreter mode, view cherished memories on Photos or use Google Lens to solve a tricky math problem. 

We’ve also used AI to improve the core Search experience for billions of people by taking a huge leap forward in a computer’s ability to process natural language. Yet, there are still moments when computers just don’t understand us. That’s because language is endlessly complex: We use it to tell stories, crack jokes and share ideas — weaving in concepts we’ve learned over the course of our lives. The richness and flexibility of language make it one of humanity’s greatest tools and one of computer science’s greatest challenges. 

Today I am excited to share our latest research in natural language understanding: LaMDA. LaMDA is a language model for dialogue applications. It’s open domain, which means it is designed to converse on any topic. For example, LaMDA understands quite a bit about the planet Pluto. So if a student wanted to discover more about space, they could ask about Pluto and the model would give sensible responses, making learning even more fun and engaging. If that student then wanted to switch over to a different topic — say, how to make a good paper airplane — LaMDA could continue the conversation without any retraining.

This is one of the ways we believe LaMDA can make information and computing radically more accessible and easier to use (and you can learn more about that here). 

We have been researching and developing language models for many years. We’re focused on ensuring LaMDA meets our incredibly high standards on fairness, accuracy, safety and privacy, and that it is developed consistently with our AI Principles. And we look forward to incorporating conversation features into products like Google Assistant, Search and Workspace, as well as exploring how to give capabilities to developers and enterprise customers.

LaMDA is a huge step forward in natural conversation, but it’s still only trained on text. When people communicate with each other they do it across images, text, audio and video. So we need to build multimodal models, like MUM (Multitask Unified Model), to allow people to naturally ask questions across different types of information. With MUM you could one day plan a road trip by asking Google to “find a route with beautiful mountain views.” This is one example of how we’re making progress towards more natural and intuitive ways of interacting with Search.

Pushing the frontier of computing

Translation, image recognition and voice recognition laid the foundation for complex models like LaMDA and multimodal models. Our compute infrastructure is how we drive and sustain these advances, and TPUs, our custom-built machine learning processors, are a big part of that. Today we announced our next generation of TPUs: the TPU v4. These are powered by the v4 chip, which is more than twice as fast as the previous generation. One pod can deliver more than one exaflop, equivalent to the computing power of 10 million laptops combined. This is the fastest system we’ve ever deployed, and a historic milestone for us. Previously, to get to an exaflop, you needed to build a custom supercomputer. And we’ll soon have dozens of TPU v4 pods in our data centers, many of which will be operating at or near 90% carbon-free energy. They’ll be available to our Cloud customers later this year.

Left: TPU v4 chip tray; Right: TPU v4 pods at our Oklahoma data center.

It’s tremendously exciting to see this pace of innovation. As we look further into the future, there are types of problems that classical computing will not be able to solve in reasonable time. Quantum computing can help. Achieving our quantum milestone was a tremendous accomplishment, but we’re still at the beginning of a multiyear journey. We continue to work toward our next big milestone in quantum computing: building an error-corrected quantum computer, which could help us increase battery efficiency, create more sustainable energy and improve drug discovery. To help us get there, we’ve opened a new state-of-the-art Quantum AI campus with our first quantum data center and quantum processor chip fabrication facilities.

Inside our new Quantum AI campus.

Safer with Google

At Google we know that our products can only be as helpful as they are safe. And advances in computer science and AI are how we continue to make them better. We keep more users safe by blocking malware, phishing attempts, spam messages and potential cyber attacks than anyone else in the world.

Our focus on data minimization pushes us to do more with less data. Two years ago at I/O, I announced Auto-Delete, which encourages users to have their activity data automatically and continuously deleted. We’ve since made Auto-Delete the default for all new Google Accounts. Now we automatically delete your activity data after 18 months, unless you tell us to delete it sooner. It’s now active for over 2 billion accounts.

All of our products are guided by three important principles: With one of the world’s most advanced security infrastructures, our products are secure by default. We strictly uphold responsible data practices so every product we build is private by design. And we create easy-to-use privacy and security settings so you’re in control.

Long-term research: Project Starline

We were all grateful to have video conferencing over the last year to stay in touch with family and friends, and keep schools and businesses going. But there is no substitute for being together in the room with someone. 

Several years ago we kicked off a project called Project Starline to use technology to explore what’s possible. Using high-resolution cameras and custom-built depth sensors, it captures your shape and appearance from multiple perspectives, and then fuses them together to create an extremely detailed, real-time 3D model. The resulting data is many gigabits per second, so to send an image this size over existing networks, we developed novel compression and streaming algorithms that reduce the data by a factor of more than 100. We also developed a breakthrough light-field display that shows you the realistic representation of someone sitting in front of you. As sophisticated as the technology is, it vanishes, so you can focus on what’s most important. 

We’ve spent thousands of hours testing it at our own offices, and the results are promising. There’s also excitement from our lead enterprise partners, and we’re working with partners in health care and media to get early feedback. In pushing the boundaries of remote collaboration, we’ve made technical advances that will improve our entire suite of communications products. We look forward to sharing more in the months ahead.

A person having a conversation with someone over Project Starline.

Solving complex sustainability challenges

Another area of research is our work to drive forward sustainability. Sustainability has been a core value for us for more than 20 years. We were the first major company to become carbon neutral in 2007. We were the first to match our operations with 100% renewable energy in 2017, and we’ve been doing it ever since. Last year we eliminated our entire carbon legacy. 

Our next ambition is our biggest yet: operating on carbon-free energy by the year 2030. This represents a significant step change from current approaches and is a moonshot on the same scale as quantum computing. It presents equally hard problems to solve, from sourcing carbon-free energy in every place we operate to ensuring it can run every hour of every day.

Building on the first carbon-intelligent computing platform that we rolled out last year, we’ll soon be the first company to implement carbon-intelligent load shifting across both time and place within our data center network. By this time next year we’ll be shifting more than a third of non-production compute to times and places with greater availability of carbon-free energy. And we are working to apply our Cloud AI with novel drilling techniques and fiber optic sensing to deliver geothermal power in more places, starting in our Nevada data centers next year.

Investments like these are needed to get to 24/7 carbon-free energy, and it’s happening in Mountain View, California, too. We’re building our new campus to the highest sustainability standards. When completed, these buildings will feature a first-of-its-kind dragonscale solar skin, equipped with 90,000 silver solar panels and the capacity to generate nearly 7 megawatts. They will house the largest geothermal pile system in North America to help heat buildings in the winter and cool them in the summer. It’s been amazing to see it come to life.

Left: Rendering of the new Charleston East campus in Mountain View, California; Right: Model view with dragonscale solar skin.

A celebration of technology

I/O isn’t just a celebration of technology but of the people who use it, and build it — including the millions of developers around the world who joined us virtually today. Over the past year we’ve seen people use technology in profound ways: To keep themselves healthy and safe, to learn and grow, to connect and to help one another through really difficult times. It’s been inspiring to see and has made us more committed than ever to being helpful in the moments that matter. 

I look forward to seeing everyone at next year’s I/O — in person, I hope. Until then, be safe and well.
