Recreating Historical Streetscapes Using Deep Learning and Crowdsourcing

Posted by Raimondas Kiveris, Software Engineer, Google Research

For many, gazing at an old photo of a city can evoke feelings of both nostalgia and wonder — what was it like to walk through Manhattan in the 1940s? How much has the street one grew up on changed? While Google Street View allows people to see what an area looks like in the present day, what if you want to explore how places looked in the past?

To create a rewarding “time travel” experience for both research and entertainment purposes, we are launching rǝ (pronounced as “re”turn), an open source, scalable system running on Google Cloud and Kubernetes that can reconstruct cities from historical maps and photos, representing an implementation of our suite of open source tools launched earlier this year. Referencing the common prefix meaning again or anew, rǝ is meant to represent the themes of reconstruction, research, recreation and remembering behind this crowdsourced research effort. It consists of three components:

  • A crowdsourcing platform, which allows users to upload historical maps of cities, georectify (i.e., match them to real world coordinates), and vectorize them
  • A temporal map server, which shows how maps of cities change over time
  • A 3D experience platform, which runs on top of the map server, creating the 3D experience by using deep learning to reconstruct buildings in 3D from limited historical images and maps data.

Our goal is for rǝ to become a compendium that allows history enthusiasts to virtually experience historical cities around the world, aids researchers, policy makers and educators, and provides a dose of nostalgia to everyday users.

Bird’s eye view of Chelsea, Manhattan with a time slider from 1890 to 1970, crafted from historical photos and maps using rǝ’s 3D reconstruction pipeline and colored with a preset Manhattan-inspired palette.

Crowdsourcing Data from Historical Maps
Reconstructing how cities used to look at scale is a challenge — historical image data is more difficult to work with than modern data, as there are far fewer images available and much less metadata captured from the images. To help with this difficulty, the maps module is a suite of open source tools that work together to create a map server with a time dimension, allowing users to jump back and forth between time periods using a slider. These tools allow users to upload scans of historical print maps, georectify them to match real world coordinates, and then convert them to vector format by tracing their geographic features. These vectorized maps are then served on a tile server and rendered as slippy maps, which lets the user zoom in and pan around.
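As a rough illustration of what a time dimension on map data can look like, the sketch below filters traced features by the year selected on a slider. The `Feature` type and its `start_year` and `end_year` fields are assumptions made for this example, not the schema used by the maps module.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feature:
    """A traced map feature with a hypothetical validity interval."""
    name: str
    start_year: int            # first year the feature existed
    end_year: Optional[int]    # last year it existed; None means still standing


def features_at(features: List[Feature], year: int) -> List[Feature]:
    """Return the features that existed in the given slider year."""
    return [
        f for f in features
        if f.start_year <= year and (f.end_year is None or year <= f.end_year)
    ]


if __name__ == "__main__":
    blocks = [
        Feature("warehouse", 1885, 1931),
        Feature("tenement", 1902, 1965),
        Feature("office tower", 1958, None),
    ]
    for slider_year in (1890, 1940, 1970):
        print(slider_year, [f.name for f in features_at(blocks, slider_year)])
```

A production tile server would more likely apply this kind of filter when building or serving tiles, but the interval test itself stays the same.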

Sub-modules of the suite of tools

The entry point of the maps module is Warper, a web app that allows users to upload historical images of maps and georectify them by finding control points on the historical map and corresponding points on a base map. The next app, Editor, allows users to load the georectified historical maps as the background and then trace their geographic features (e.g., building footprints, roads, etc.). This traced data is stored in an OpenStreetMap (OSM) vector format. It is then converted to vector tiles and served from the Server app, a vector tile server. Finally, our map renderer, Kartta, visualizes the spatiotemporal vector tiles, allowing users to navigate space and time on historical maps. These tools were built on top of numerous open source resources including OpenStreetMap, and we intend for our tools and data to be completely open source as well.
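To make the idea of control points concrete, here is a minimal sketch that fits an affine transform from scan pixels to map coordinates by least squares. It assumes an affine model is adequate, which is a simplification chosen for this example; the text does not say which transform Warper uses, and georectification tools often offer higher-order or spline-based transforms. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _solve3(m: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    a = [row[:] + [r] for row, r in zip(m, rhs)]  # augmented matrix
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("control points are collinear or degenerate")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x


def fit_affine(pixels: Sequence[Point], coords: Sequence[Point]) -> List[List[float]]:
    """Least-squares affine map from scan pixels to map coordinates."""
    if len(pixels) != len(coords) or len(pixels) < 3:
        raise ValueError("need at least three matched control points")
    rows = [(px, py, 1.0) for px, py in pixels]
    # Normal equations (A^T A) p = A^T b, solved once per output axis.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for axis in (0, 1):
        atb = [sum(r[i] * c[axis] for r, c in zip(rows, coords)) for i in range(3)]
        params.append(_solve3(ata, atb))
    return params  # [[a, b, c], [d, e, f]]


def apply_affine(params: List[List[float]], point: Point) -> Point:
    """Map one pixel position to map coordinates."""
    (a, b, c), (d, e, f) = params
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)
```

With more than three control points the fit averages out small clicking errors, which is one reason to ask contributors for several well-spread points per map.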

Warper and Editor work together to let users upload a map, anchor it to a base map using control points, and trace geographic features like building footprints and roads.

3D Experience
The 3D Models module aims to reconstruct the detailed full 3D structures of historical buildings using the associated images and maps data, organize these 3D models properly in one repository, and render them on the historical maps with a time dimension.

In many cases, there is only one historical image available for a building, which makes the 3D reconstruction an extremely challenging problem. To tackle this challenge, we developed a coarse-to-fine reconstruction-by-recognition algorithm.

High-level overview of rǝ’s 3D reconstruction pipeline, which takes annotated images and maps and prepares them for 3D rendering.

Starting with footprints on maps and façade regions in historical images (both are annotated by crowdsourcing or detected by automatic algorithms), the footprint of one input building is extruded upwards to generate its coarse 3D structure. The height of this extrusion is set to the number of floors from the corresponding metadata in the maps database.
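The extrusion step can be sketched as follows. This is an illustrative example, not the pipeline's code: the `floor_height` value used to turn a floor count into a height is an assumption, as are the function name and the mesh layout (a vertex list, wall quads and a roof polygon given as vertex indices).

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]
Quad = Tuple[int, int, int, int]


def extrude_footprint(
    footprint: Sequence[Point2D],
    floors: int,
    floor_height: float = 3.0,  # assumed height per floor, in map units
) -> Tuple[List[Point3D], List[Quad], Tuple[int, ...]]:
    """Extrude a footprint polygon upwards into a coarse prism."""
    if len(footprint) < 3:
        raise ValueError("a footprint needs at least three vertices")
    if floors < 1:
        raise ValueError("floors must be a positive integer")

    n = len(footprint)
    height = floors * floor_height
    bottom = [(x, y, 0.0) for x, y in footprint]
    top = [(x, y, height) for x, y in footprint]
    vertices = bottom + top  # indices 0..n-1 are ground, n..2n-1 are roof

    # One wall quad per footprint edge, joining ground and roof vertices.
    walls = [(i, (i + 1) % n, n + (i + 1) % n, n + i) for i in range(n)]
    roof = tuple(range(n, 2 * n))
    return vertices, walls, roof
```

The prism produced here is only the coarse structure; detailed façade components would be merged into it in a later step.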

In parallel, instead of directly inferring the detailed 3D structures of each façade as one entity, the 3D reconstruction pipeline recognizes all individual constituent components (e.g., windows, entries, stairs, etc.) and reconstructs their 3D structures separately based on their categories. Then these detailed 3D structures are merged with the coarse one for the final 3D mesh. The results are stored in a 3D repository and ready for 3D rendering.

The key technology powering this feature is a number of state-of-the-art deep learning models:

  • Faster region-based convolutional neural networks (RCNN) were trained using the façade component annotations for each target semantic class (e.g., windows, entries, stairs, etc.) and are used to localize bounding-box level instances in historical images.
  • DeepLab, a semantic segmentation model, was trained to provide pixel-level labels for each semantic class.
  • A specifically designed neural network was trained to enforce high-level regularities within the same semantic class. This ensured that windows generated on a façade were equally spaced and consistent in shape with each other. This also facilitated consistency across different semantic classes such as stairs to ensure they are placed at reasonable positions and have consistent dimensions relative to the associated entry ways.

Key Results

Street level view of 3D-reconstructed Chelsea, Manhattan

Conclusion
With rǝ, we have developed tools that facilitate crowdsourcing to tackle the main challenge of insufficient historical data when recreating virtual cities. The 3D experience is still a work-in-progress and we aim to improve it with future updates. We hope rǝ acts as a nexus for an active community of enthusiasts and casual users that not only utilizes our historical datasets and open source code, but actively contributes to both.

Acknowledgements
This effort has been successful thanks to the hard work of many people, including, but not limited to the following (in alphabetical order of last name): Yale Cong, Feng Han, Amol Kapoor, Raimondas Kiveris, Brandon Mayer, Mark Phillips, Sasan Tavakkol, and Tim Waters (Waters Geospatial Ltd).

Project Euphonia’s new step: 1,000 hours of speech recordings

Muratcan Cicek, a PhD candidate at UC Santa Cruz, worked as a summer intern on Google’s Project Euphonia, which aims to improve computers’ abilities to understand impaired speech. This work was especially relevant and important for Muratcan, who was born with cerebral palsy and has a severe speech impairment.

Before his internship, Muratcan recorded 2,000 phrases for Project Euphonia. These phrases, expressions like “Turn the lights on” and “Turn up thermostat to 74 degrees,” were used to build a personalized speech recognition model that could better recognize the unique sound of his voice and transcribe his speech. The prototype allowed Muratcan to share the transcription in a video call so others could better understand him. He used the prototype to converse with co-workers, give status updates during team meetings and connect with people in ways that were previously impossible. Muratcan says, “Euphonia transformed my communication skills in a way that I can leverage in my career as an engineer without feeling insecure about my condition.”

Muratcan, a summer research intern on the Euphonia team, uses the Euphonia prototype app

1,000 hours of speech samples

The phrases that Muratcan recorded were key to training custom machine learning models that could help him be more easily understood. To help other people who have impaired speech caused by ALS, Parkinson’s disease or Down Syndrome, we need to gather samples of their speech patterns. So we’ve worked with partners like CDSS, ALS TDI, ALSA, LSVT Global, Team Gleason and CureDuchenne to encourage people with speech impairments to record their voices and contribute to this research.

Since 2018, nearly 1,000 participants have recorded over 1,000 hours of speech samples. For many, it’s been a source of pride and purpose to shape the future of speech recognition, not only for themselves but also for others who struggle to be understood.

“I contribute to this research so that I can help not only myself, but also a larger group of people with communication challenges that are often left out.” (Project Euphonia participant)

While the technology is still under development, the speech samples we’ve collected helped us create personalized speech recognition models for individuals with speech impairments, like Muratcan. For more technical details about how these models work, see the Euphonia and Parrotron blog posts. We’re evaluating these personalized models with a group of early testers. The next phase of our research aims to improve speech recognition systems for many more people, but it requires many more speech samples from a broad range of speakers.

How you can contribute

To continue our research, we hope to collect speech samples from an additional 5,000 participants. If you have difficulty being understood by others and want to contribute to meaningful research to improve speech recognition technologies, learn more and consider signing up to record phrases. We look forward to hearing from more participants and experts and, together, helping everyone be understood.

Measuring Gendered Correlations in Pre-trained NLP Models

Posted by Kellie Webster, Software Engineer, Google Research

Natural language processing (NLP) has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, e.g., Wikipedia, by repeatedly masking out words and trying to predict them (this is called masked language modeling). The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. There is then a second training stage, fine-tuning, in which the model uses task-specific training data to learn how to use the general pre-trained representations to do a concrete task, like classification. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream, to ensure the application of these models aligns with our AI Principles.
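The masking step can be illustrated with a small sketch. It is deliberately simplified: each token is independently replaced by a `[MASK]` placeholder with a fixed probability, whereas BERT's published recipe also sometimes substitutes a random token or leaves the chosen token unchanged. The function name and the whitespace tokenization in the usage note are assumptions for this example.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str],
    mask_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Hide some tokens and return (masked tokens, prediction targets).

    Each target is a (position, original token) pair that a model would be
    trained to recover from the surrounding context.
    """
    rng = rng or random.Random()
    masked: List[str] = []
    targets: List[Tuple[int, str]] = []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append((i, tok))
        else:
            masked.append(tok)
    return masked, targets
```

For instance, calling `mask_tokens("the surgeon picked up the scalpel".split(), 0.3)` might hide “scalpel”, and a model that predicts it from “surgeon” has learned exactly the kind of correlation described above.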

In “Measuring and Reducing Gendered Correlations in Pre-trained Models” we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models. We present experimental results over public model checkpoints and an academic task dataset to illustrate how the best practices apply, providing a foundation for exploring settings beyond the scope of this case study. We will soon release a series of checkpoints, Zari¹, which reduce gendered correlations while maintaining state-of-the-art accuracy on standard NLP task metrics.

Measuring Correlations
To understand how correlations in pre-trained representations can affect downstream task performance, we apply a diverse set of evaluation metrics for studying the representation of gender. Here, we’ll discuss results from one of these tests, based on coreference resolution, which is the capability that allows models to understand the correct antecedent to a given pronoun in a sentence. For example, in the sentence that follows, the model should recognize his refers to the nurse, and not to the patient.

The standard academic formulation of the task is the OntoNotes test (Hovy et al., 2006), and we measure how accurate a model is at coreference resolution in a general setting using an F1 score over this data (as in Tenney et al. 2019). Since OntoNotes represents only one data distribution, we also consider the WinoGender benchmark that provides additional, balanced data designed to identify when model associations between gender and profession incorrectly influence coreference resolution. High values of the WinoGender metric (close to one) indicate a model is basing decisions on normative associations between gender and profession (e.g., associating nurse with the female gender and not male). When model decisions have no consistent association between gender and profession, the score is zero, which suggests that decisions are based on some other information, such as sentence structure or semantics.
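For readers unfamiliar with the F1 score mentioned above, the sketch below computes it from raw counts. It assumes a simple binary setting with true positive, false positive and false negative counts; how those counts are derived from OntoNotes is not covered here, and the function name is our own.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 is the harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives, where precision and
    recall are zero or undefined.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

Because F1 is computed over a single data distribution, a high value says nothing about which cues the model relied on, which is why a complementary measurement such as WinoGender is useful.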

BERT and ALBERT metrics on OntoNotes (accuracy) and WinoGender (gendered correlations). Low values on the WinoGender metric indicate that a model does not preferentially use gendered correlations in reasoning.

In this study, we see that neither the (Large) BERT nor the ALBERT public model achieves a zero score on the WinoGender examples, despite achieving impressive accuracy on OntoNotes (close to 100%). At least some of this is due to models preferentially using gendered correlations in reasoning. This isn’t completely surprising: there is a range of cues available to understand text and it is possible for a general model to pick up on any or all of these. However, there is reason for caution, as it is undesirable for a model to make predictions primarily based on gendered correlations learned as priors rather than the evidence available in the input.

Best Practices
Given that it is possible for unintended correlations in pre-trained model representations to affect downstream task reasoning, we now ask: what can one do to mitigate any risk this poses when developing new NLP models?

  • It is important to measure for unintended correlations: Model quality may be assessed using accuracy metrics, but these only measure one dimension of performance, especially if the test data is drawn from the same distribution as the training data. For example, the BERT and ALBERT checkpoints have accuracy within 1% of each other, but differ by 26% (relative) in the degree to which they use gendered correlations for coreference resolution. This difference might be important for some tasks; selecting a model with low WinoGender score could be desirable in an application featuring texts about people in professions that may not conform to historical social norms, e.g., male nurses.
  • Be careful even when making seemingly innocuous configuration changes: Neural network model training is controlled by many hyperparameters that are usually selected to maximize some training objective. While configuration choices often seem innocuous, we find they can cause significant changes for gendered correlations, both for better and for worse. For example, dropout regularization is used to reduce overfitting by large models. When we increase the dropout rate used for pre-training BERT and ALBERT, we see a significant reduction in gendered correlations even after fine-tuning. This is promising since a simple configuration change allows us to train models with reduced risk of harm, but it also shows that we should be mindful and evaluate carefully when making any change in model configuration.
    Impact of increasing dropout regularization in BERT and ALBERT.
  • There are opportunities for general mitigations: A further corollary from the perhaps unexpected impact of dropout on gendered correlations is that it opens the possibility to use general-purpose methods for reducing unintended correlations: by increasing dropout in our study, we improve how the models reason about WinoGender examples without manually specifying anything about the task or changing the fine-tuning stage at all. Unfortunately, OntoNotes accuracy does start to decline as the dropout rate increases (which we can see in the BERT results), but we are excited about the potential to mitigate this in pre-training, where changes can lead to model improvements without the need for task-specific updates. We explore counterfactual data augmentation as another mitigation strategy with different tradeoffs in our paper.
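As a reminder of what the dropout rate controls, here is a minimal sketch of dropout on a list of activations. It uses the “inverted” formulation, in which kept values are rescaled during training so that no change is needed at inference time; that choice, along with the function name, is an assumption for illustration and says nothing about how the BERT or ALBERT checkpoints were configured.

```python
import random
from typing import List, Optional, Sequence


def dropout(
    values: Sequence[float],
    rate: float,
    rng: Optional[random.Random] = None,
    training: bool = True,
) -> List[float]:
    """Zero each value with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)  # dropout is a no-op at inference time
    rng = rng or random.Random()
    keep = 1.0 - rate
    # Dividing by `keep` preserves the expected value of each activation.
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

Raising `rate` makes the network rely less on any single feature, which is one plausible reading of why it also weakens reliance on gendered correlations.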

What’s Next
We believe these best practices provide a starting point for developing robust NLP systems that perform well across the broadest possible range of linguistic settings and applications. Of course these techniques on their own are not sufficient to capture and remove all potential issues. Any model deployed in a real-world setting should undergo rigorous testing that considers the many ways it will be used, and implement safeguards to ensure alignment with ethical norms, such as Google’s AI Principles. We look forward to developments in evaluation frameworks and data that are more expansive and inclusive to cover the many uses of language models and the breadth of people they aim to serve.

Acknowledgements
This is joint work with Xuezhi Wang, Ian Tenney, Ellie Pavlick, Alex Beutel, Jilin Chen, Emily Pitler, and Slav Petrov. We benefited greatly throughout the project from discussions with Fernando Pereira, Ed Chi, Dipanjan Das, Vera Axelrod, Jacob Eisenstein, Tulsee Doshi, and James Wexler.



¹ Zari is an Afghan Muppet designed to show that ‘a little girl could do as much as everybody else’.

Announcing the 2020 Google PhD Fellows

Posted by Susie Kim, Program Manager, University Relations

Google created the PhD Fellowship Program in 2009 to recognize and support outstanding graduate students who seek to influence the future of technology by pursuing exceptional research in computer science and related fields. Now in its twelfth year, the program has helped support approximately 500 graduate students across North America, Europe, Africa, Australia, East Asia, and India.

It is our ongoing goal to continue to support the academic community as a whole, and these Fellows as they make their mark on the world. We congratulate all of this year’s awardees!

Algorithms, Optimizations and Markets
Jan van den Brand, KTH Royal Institute of Technology
Mahsa Derakhshan, University of Maryland, College Park
Sidhanth Mohanty, University of California, Berkeley

Computational Neuroscience
Connor Brennan, University of Pennsylvania

Human Computer Interaction
Abdelkareem Bedri, Carnegie Mellon University
Brendan David-John, University of Florida
Hiromu Yakura, University of Tsukuba
Manaswi Saha, University of Washington
Muratcan Cicek, University of California, Santa Cruz
Prashan Madumal, University of Melbourne

Machine Learning
Alon Brutzkus, Tel Aviv University
Chin-Wei Huang, Université de Montréal
Eli Sherman, Johns Hopkins University
Esther Rolf, University of California, Berkeley
Imke Mayer, Fondation Sciences Mathématiques de Paris
Jean Michel Sarr, Cheikh Anta Diop University
Lei Bai, University of New South Wales
Nontawat Charoenphakdee, The University of Tokyo
Preetum Nakkiran, Harvard University
Sravanti Addepalli, Indian Institute of Science
Taesik Gong, Korea Advanced Institute of Science and Technology
Vihari Piratla, Indian Institute of Technology – Bombay
Vishakha Patil, Indian Institute of Science
Wilson Tsakane Mongwe, University of Johannesburg
Xinshi Chen, Georgia Institute of Technology
Yadan Luo, University of Queensland

Machine Perception, Speech Technology and Computer Vision
Benjamin van Niekerk, University of Stellenbosch
Eric Heiden, University of Southern California
Gyeongsik Moon, Seoul National University
Hou-Ning Hu, National Tsing Hua University
Nan Wu, New York University
Shaoshuai Shi, The Chinese University of Hong Kong
Yaman Kumar, Indraprastha Institute of Information Technology – Delhi
Yifan Liu, University of Adelaide
Yu Wu, University of Technology Sydney
Zhengqi Li, Cornell University

Mobile Computing
Xiaofan Zhang, University of Illinois at Urbana-Champaign

Natural Language Processing
Anjalie Field, Carnegie Mellon University
Mingda Chen, Toyota Technological Institute at Chicago
Shang-Yu Su, National Taiwan University
Yanai Elazar, Bar-Ilan University

Privacy and Security
Julien Gamba, Universidad Carlos III de Madrid
Shuwen Deng, Yale University
Yunusa Simpa Abdulsalm, Mohammed VI Polytechnic University

Programming Technology and Software Engineering
Adriana Sejfia, University of Southern California
John Cyphert, University of Wisconsin-Madison

Quantum Computing
Amira Abbas, University of KwaZulu-Natal
Mozafari Ghoraba Fereshte, EPFL

Structured Data and Database Management
Yanqing Peng, University of Utah

Systems and Networking
Huynh Nguyen Van, University of Technology Sydney
Michael Sammler, Saarland University, MPI-SWS
Sihang Liu, University of Virginia
Yun-Zhan Cai, National Cheng Kung University

Fernanda Viégas puts people at the heart of AI

When Fernanda Viégas was in college, it took three years with three different majors before she decided she wanted to study graphic design and art history. And even then, she couldn’t have imagined the job she has today: building artificial intelligence and machine learning with fairness and transparency in mind to help people in their daily lives.  

Today Fernanda, who grew up in Rio de Janeiro, Brazil, is a senior researcher at Google. She’s based in London, where she co-leads the global People + AI Research (PAIR) Initiative, which she co-founded with fellow senior research scientist Martin M. Wattenberg and Senior UX Researcher Jess Holbrook, and the Big Picture team. She and her colleagues make sure people at Google think about fairness and values, and about putting Google’s AI Principles into practice, when they work on artificial intelligence. Her team recently launched a series of “AI Explorables,” a collection of interactive articles to better explain machine learning to everyone.

When she’s not looking into the big questions around emerging technology, she’s also an artist, known for her artistic collaborations with Wattenberg. Their data visualization art is a part of the permanent collection of the Museum of Modern Art in New York.  

I recently sat down with Fernanda via Google Meet to talk about her role and the importance of putting people first when it comes to AI. 

How would you explain your job to someone who isn’t in tech?

As a research scientist, I try to make sure that machine learning (ML) systems can be better understood by people, to help people have the right level of trust in these systems. One of the main ways in which our work makes its way to the public is through the People + AI Guidebook, a set of principles and guidelines for user experience (UX) designers, product managers and engineering teams to create products that are easier to understand from a user’s perspective.

What is a key challenge that you’re focusing on in your research? 

My team builds data visualization tools that help people building AI systems to consider issues like fairness proactively, so that their products can work better for more people. Here’s a generic example: Let’s imagine it’s time for your coffee break and you use an app that uses machine learning for recommendations of coffee places near you at that moment. Your coffee app provides 10 recommendations for cafes in your area, and they’re all well-rated. From an accuracy perspective, the app performed its job: It offered information on a certain number of cafes near you. But it didn’t account for unintended unfair bias. For example: Did you get recommendations only for large businesses? Did the recommendations include only chain coffee shops? Or did they also include small, locally owned shops? How about places with international styles of coffee that might be nearby? 

The tools our team makes help ensure that the recommendations people get aren’t unfairly biased. By making these biases easy to spot with engaging visualizations of the data, we can help identify what might be improved. 

What inspired you to join Google? 

It’s so interesting to consider this because my story comes out of repeated failures, actually! When I was a student in Brazil, where I was born and grew up, I failed repeatedly in figuring out what I wanted to do. After spending three years studying for different things—chemical engineering, linguistics, education—someone said to me, “You should try to get a scholarship to go to the U.S.” I asked them why I should leave my country to study somewhere when I wasn’t even sure of my major. “That’s the thing,” they said. “In the U.S. you can be undecided and change majors.” I loved it! 

So I went to the U.S. and by the time I was graduating, I decided I loved design but I didn’t want to be a traditional graphic designer for the rest of my life. That’s when I heard about the Media Lab at MIT and ended up doing a master’s degree and PhD in data visualization there. That’s what led me to IBM, where I met Martin M. Wattenberg. Martin has been my working partner for 15 years now; we created a startup after IBM and then Google hired us. In joining, I knew it was our chance to work on products that have the possibility of affecting the world and regular people at scale. 

Two years ago, we shared our seven AI Principles to guide our work. How do you apply them to your everyday research?

One recent example is from our work with the Google Flights team. They offered users alerts about the “right time to buy tickets,” but users were asking themselves, “Hmm, how do I trust this alert?” So the designers used our PAIR Guidebook to underscore the importance of AI explainability in their discussions with the engineering team. Together, they redesigned the feature to show users how the price for a flight has changed over the past few months and notify them when prices may go up or won’t get any lower. When it launched, people saw our price history graph and responded very well to it. By using our PAIR Guidebook, the team learned that how you explain your technology can significantly shape the user’s trust in your system.

Historically, ML has been evaluated along the lines of mathematical metrics for accuracy—but that’s not enough. Once systems touch real lives, there’s so much more you have to think about, such as fairness, transparency, bias and explainability—making sure people understand why an algorithm does what it does. These are the challenges that inspire me to stay at Google after more than 10 years. 

What’s been one of the most rewarding moments of your career?

Whenever we talk to students and there are women and minorities who are excited about working in tech, that’s incredibly inspiring to me. I want them to know they belong in tech, they have a place here. 

Also, working with my team on a Google Doodle about the composer Johann Sebastian Bach last year was so rewarding. It was the very first time Google used AI for a Doodle and it was thrilling to tell my family in Brazil, look, there’s an AI Doodle that uses our tech! 

How should aspiring AI thinkers and future technologists prepare for a career in this field? 

Try to be deep in your field of interest. If it’s AI, there are so many different aspects to this technology, so try to make sure you learn about them. AI isn’t just about technology. It’s always useful to be looking at the applications of the technology, how it impacts real people in real situations.

Make your everyday smarter with Jacquard

Technology is most helpful when it’s frictionless. We believe that computing should power experiences through the everyday things around you, an idea we call “ambient computing.” That’s why we developed the Jacquard platform: to deliver ambient computing in a familiar, natural way by building it into things you wear, love and use every day.

The heart of Jacquard is the Jacquard Tag, a tiny computer built to make everyday items more helpful. We first used this on the sleeve of a jacket so that it could recognize the gestures of the person wearing it, and we built that same technology into the Cit-E backpack with Saint Laurent. Then, we collaborated with Adidas and EA on our GMR shoe insert, enabling its wearers to combine real-life play with the EA SPORTS FIFA mobile game. 

Whether it’s touch or movement-based, the tag can interpret different inputs customized for the garments and gear we’ve collaborated with brands to create. And now we’re sharing that two new backpacks, developed with Samsonite, will integrate Jacquard technology. A fine addition to our collection, the Konnect-I Backpack comes in two styles: Slim ($199) and Standard ($219).

While they might look like regular backpacks, the left strap unlocks tons of capabilities. Using your Jacquard app, you can customize which gestures control which actions. For instance, you can program Jacquard to deliver call and text notifications, trigger a selfie, control your music or prompt Google Assistant to share the latest news. For an added level of interaction, the LED light on your left strap will light up according to the alerts you’ve set.

This is only the beginning for the Jacquard platform, and thanks to updates, you can expect your Jacquard Tag gear to get better over time. Just like Google wants to make the world’s information universally accessible and useful, we at Jacquard want to help people access information through everyday items and natural movements.

Massively Large-Scale Distributed Reinforcement Learning with Menger

Posted by Amir Yazdanbakhsh, Research Scientist, and Junchaeo Chen, Software Engineer, Google Research

In the last decade, reinforcement learning (RL) has become one of the most promising research areas in machine learning and has demonstrated great potential for solving sophisticated real-world problems, such as chip placement and resource management, and solving challenging games (e.g., Go, Dota 2, and hide-and-seek). In simplest terms, an RL infrastructure is a loop of data collection and training, where actors explore the environment and collect samples, which are then sent to the learners to train and update the model. Most current RL techniques require many iterations over batches of millions of samples from the environment to learn a target task (e.g., Dota 2 learns from batches of 2 million frames every 2 seconds). As such, an RL infrastructure should not only scale efficiently (e.g., increase the number of actors) and collect an immense number of samples, but also be able to swiftly iterate over these extensive amounts of samples during training.

Overview of an RL system in which an actor sends trajectories (i.e., sequences of samples) to a learner. The learner trains a model using the sampled data and pushes the updated model back to the actor (e.g., TF-Agents, IMPALA).
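The collect-then-train loop can be sketched in a few lines. Everything here is a toy stand-in chosen for illustration: the environment is a two-action problem with fixed rewards, the “model” is a table of mean rewards, and the actors run sequentially in one process rather than across machines. None of the names come from Menger, TF-Agents or IMPALA.

```python
import random
from typing import List, Tuple

Sample = Tuple[int, float]   # (action, reward)
REWARDS = [0.2, 1.0]         # toy environment: fixed reward per action


def actor_collect(model: List[float], steps: int, rng: random.Random,
                  epsilon: float = 0.3) -> List[Sample]:
    """Act in the toy environment using a local copy of the model."""
    trajectory: List[Sample] = []
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(model))                      # explore
        else:
            action = max(range(len(model)), key=model.__getitem__)  # exploit
        trajectory.append((action, REWARDS[action]))
    return trajectory


class Learner:
    """Tracks the mean reward per action as a stand-in for a trained model."""

    def __init__(self, num_actions: int) -> None:
        self.totals = [0.0] * num_actions
        self.counts = [0] * num_actions

    def train(self, batch: List[Sample]) -> List[float]:
        for action, reward in batch:
            self.totals[action] += reward
            self.counts[action] += 1
        return self.model()

    def model(self) -> List[float]:
        return [t / c if c else 0.0 for t, c in zip(self.totals, self.counts)]


def run(num_actors: int = 4, iterations: int = 20, steps: int = 16,
        seed: int = 0) -> List[float]:
    """Alternate between data collection by actors and a learner update."""
    rng = random.Random(seed)
    learner = Learner(len(REWARDS))
    model = learner.model()
    for _iteration in range(iterations):
        batch: List[Sample] = []
        for _actor in range(num_actors):
            batch.extend(actor_collect(model, steps, rng))  # actors -> learner
        model = learner.train(batch)                        # learner -> actors
    return model
```

Even in this toy, the two costs discussed in this post are visible: every iteration moves a batch of samples to the learner and a fresh model back to every actor.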

Today we introduce Menger¹, a massively large-scale distributed RL infrastructure with localized inference that scales up to several thousand actors across multiple processing clusters (e.g., Borg cells), reducing the overall training time in the task of chip placement. In this post we describe how we implement Menger using Google TPU accelerators for fast training iterations, and present its performance and scalability on the challenging task of chip placement. Menger reduces the training time by up to 8.6x compared to a baseline implementation.

Menger System Design
There are various distributed RL systems, such as Acme and SEED RL, each of which focuses on optimizing a single design point in the space of distributed reinforcement learning systems. For example, while Acme uses local inference on each actor with frequent model retrieval from the learner, SEED RL benefits from a centralized inference design by allocating a portion of TPU cores for performing batched calls. The tradeoffs between these design points are (1) paying the communication cost of sending/receiving observations and actions to/from a centralized inference server versus paying the communication cost of model retrieval from a learner, and (2) the cost of inference on actors (e.g., CPUs) compared to accelerators (e.g., TPUs/GPUs). Because of the requirements of our target application (e.g., size of observations, actions, and model size), Menger uses local inference in a manner similar to Acme, but pushes the scalability of actors to a virtually unbounded limit. The main challenges to achieving massive scalability and fast training on accelerators include:

  1. Servicing a large number of read requests from actors to a learner for model retrieval can easily throttle the learner and quickly become a major bottleneck (e.g., significantly increasing the convergence time) as the number of actors increases.
  2. The TPU performance is often limited by the efficiency of the input pipeline in feeding the training data to the TPU compute cores. As the number of TPU compute cores increases (e.g., TPU Pod), the performance of the input pipeline becomes even more critical for the overall training runtime.
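The communication tradeoff between centralized and local inference described in this section can be made concrete with a back-of-the-envelope cost model. The sketch below is an assumption-laden illustration: the function names are hypothetical, the observation and action sizes are made up, and it ignores the second tradeoff (the cost of running inference on CPUs versus accelerators).

```python
def centralized_bytes(steps: int, obs_bytes: int, action_bytes: int) -> int:
    """Bytes one actor exchanges with a central inference server over `steps` steps."""
    return steps * (obs_bytes + action_bytes)


def local_bytes(model_bytes: int, retrievals: int = 1) -> int:
    """Bytes one actor downloads when it runs inference locally."""
    return model_bytes * retrievals


# Hypothetical workload: 1 MB observations, 4-byte actions, 100 environment
# steps between model updates, and a 16 MB model retrieved once per update.
central = centralized_bytes(steps=100, obs_bytes=1_000_000, action_bytes=4)
local = local_bytes(model_bytes=16_000_000)
print(f"centralized: {central} bytes, local: {local} bytes")
```

With large observations and a comparatively small model, as in this made-up workload, retrieving the model is cheaper than shipping every observation, which is the kind of regime in which local inference is attractive.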

Efficient Model Retrieval
To address the first challenge, we introduce transparent and distributed caching components between the learner and the actors, optimized in TensorFlow and backed by Reverb (a similar approach was used in Dota). The main responsibility of the caching components is to strike a balance between the large number of requests from actors and the learner job. Adding these caching components not only significantly reduces the pressure on the learner to service the read requests, but also further distributes the actors across multiple Borg cells with a marginal communication overhead. In our study, we show that for a 16 MB model with 512 actors, the introduced caching components reduce the average read latency by a factor of ~4.0x, leading to faster training iterations, especially for on-policy algorithms such as PPO.
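To illustrate why a caching layer relieves the learner, here is a minimal in-process sketch. It is not Menger's implementation and does not use TensorFlow or Reverb: `ModelCache`, `Learner`, and `simulate` are hypothetical names, and the round-robin assignment of actors to caches is an assumption made for the example.

```python
from typing import List, Tuple


class ModelCache:
    """Holds the latest model for the actors in one cell (hypothetical helper)."""

    def __init__(self) -> None:
        self.version = 0
        self.weights = b""
        self.reads_served = 0

    def update(self, version: int, weights: bytes) -> None:
        # Called by the learner once per model update.
        self.version, self.weights = version, weights

    def get_model(self) -> Tuple[int, bytes]:
        # Called by actors; the learner is not involved in this read.
        self.reads_served += 1
        return self.version, self.weights


class Learner:
    def __init__(self, caches: List[ModelCache]) -> None:
        self.caches = caches
        self.version = 0
        self.transfers = 0  # number of model copies the learner has sent

    def publish(self, weights: bytes) -> None:
        # Send the new model to each cache instead of to each actor.
        self.version += 1
        for cache in self.caches:
            cache.update(self.version, weights)
            self.transfers += 1


def simulate(num_actors: int, num_caches: int) -> Tuple[int, int]:
    """Return (learner transfers, reads served by caches) for one model update."""
    caches = [ModelCache() for _ in range(num_caches)]
    learner = Learner(caches)
    learner.publish(b"weights-v1")
    for actor_id in range(num_actors):
        caches[actor_id % num_caches].get_model()
    return learner.transfers, sum(c.reads_served for c in caches)
```

For example, `simulate(512, 4)` has the learner send four copies of the model while the caches absorb all 512 actor reads.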

Overview of a distributed RL system with multiple actors placed in different Borg cells. Servicing the frequent model update requests from a massive number of actors across different Borg cells throttles the learner and the communication network between learner and actors, which leads to a significant increase in the overall convergence time. The dashed lines represent gRPC communication between different machines.
Overview of a distributed RL system with multiple actors placed in different Borg cells with the introduced transparent and distributed caching service. The learner only sends the updated model to the distributed caching services. Each caching service handles the model update requests from nearby actors (i.e., actors placed on the same Borg cells). The caching service not only reduces the load on the learner for servicing the model update requests, but also reduces the average read latency for the actors.

High Throughput Input Pipeline
To deliver a high throughput input data pipeline, Menger uses Reverb, a recently open-sourced data storage system designed for machine learning applications that provides an efficient and flexible platform to implement experience replay in a variety of on-policy/off-policy algorithms. However, using a single Reverb replay buffer service does not currently scale well in a distributed RL setting with thousands of actors, and simply becomes inefficient in terms of write throughput from actors.

A distributed RL system with a single replay buffer. Servicing a massive number of write requests from actors throttles the replay buffer and reduces its overall throughput. In addition, as we scale the learner to a setting with multiple compute engines (e.g., TPU Pod), feeding the data to these engines from a single replay buffer service becomes inefficient, which negatively impacts the overall convergence time.

To better understand the efficiency of the replay buffer in a distributed setting, we evaluate the average write latency for various payload sizes from 16 MB to 512 MB and a number of actors ranging from 16 to 2048. We repeat the experiment with the replay buffer and actors placed on the same Borg cell. As the number of actors grows, the average write latency also increases significantly. When the number of actors expands from 16 to 2048, the average write latency increases by a factor of ~6.2x and ~18.9x for payload sizes of 16 MB and 512 MB, respectively. This increase in write latency lengthens the data collection time and, in turn, the overall training time.

The average write latency to a single Reverb replay buffer for various payload sizes (16 MB to 512 MB) and various numbers of actors (16 to 2048) when the actors and replay buffer are placed on the same Borg cell.

To mitigate this, we use the sharding capability provided by Reverb to increase the throughput between actors, learner, and replay buffer services. Sharding balances the write load from the large number of actors across multiple replay buffer servers, instead of throttling a single replay buffer server, and also minimizes the average write latency for each replay buffer server (as fewer actors share the same server). This enables Menger to scale efficiently to thousands of actors across multiple Borg cells.
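The load-balancing effect of sharding can be sketched as follows. This does not use the Reverb API: `shard_for_actor` and `writers_per_shard` are hypothetical helpers, and the modulo assignment is an assumption (a real deployment might instead group actors by Borg cell).

```python
from collections import Counter
from typing import Dict


def shard_for_actor(actor_id: int, num_shards: int) -> int:
    """Pick the replay buffer shard an actor writes to (simple modulo policy)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return actor_id % num_shards


def writers_per_shard(num_actors: int, num_shards: int) -> Dict[int, int]:
    """Count how many actors write to each shard."""
    return dict(Counter(shard_for_actor(a, num_shards) for a in range(num_actors)))


# With a single replay buffer, all 2048 actors contend for one server;
# with 8 shards, each server handles 256 writers.
print(writers_per_shard(2048, 1))
print(writers_per_shard(2048, 8))
```

Because the write latency grows with the number of actors per server, cutting the writers per server is what lowers the average write latency.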

A distributed RL system with sharded replay buffers. Each replay buffer service is a dedicated data storage for a collection of actors, generally located on the same Borg cells. In addition, the sharded replay buffer configuration provides a higher throughput input pipeline to the accelerator cores.

Case Study: Chip Placement
We studied the benefits of Menger in the complex task of chip placement for a large netlist. Using 512 TPU cores, Menger achieves significant improvements in the training time (up to ~8.6x, reducing the training time from ~8.6 hours down to merely one hour in the fastest configuration) compared to a strong baseline. While Menger was optimized for TPUs, we believe that the key factor for this performance gain is the architecture, and we would expect to see similar gains when tailored to use on GPUs.

The improvement in training time using Menger with variable number of TPU cores compared to a baseline in the task of chip placement.

We believe that the Menger infrastructure and its promising results in the intricate task of chip placement demonstrate an innovative path forward to further shorten the chip design cycle, and that they have the potential to enable further innovations not only in the chip design process, but in other challenging real-world tasks as well.

Acknowledgments
Most of the work was done by Amir Yazdanbakhsh, Junchaeo Chen, and Yu Zheng. We would like to also thank Robert Ormandi, Ebrahim Songhori, Shen Wang, TF-Agents team, Albin Cassirer, Aviral Kumar, James Laudon, John Wilkes, Joe Jiang, Milad Hashemi, Sat Chatterjee, Piotr Stanczyk, Sabela Ramos, Lasse Espeholt, Marcin Michalski, Sam Fishman, Ruoxin Sang, Azalia Mirhosseini, Anna Goldie, and Eric Johnson for their help and support.


¹ A Menger cube is a three-dimensional fractal curve, and the inspiration for the name of this system, given that the proposed infrastructure can virtually scale ad infinitum.

Developing Real-Time, Automatic Sign Language Detection for Video Conferencing

Posted by Amit Moryossef, Research Intern, Google Research

Video conferencing should be accessible to everyone, including users who communicate using sign language. However, since most video conference applications transition window focus to those who speak aloud, it is difficult for signers to “get the floor” so they can communicate easily and effectively. Enabling real-time sign language detection in video conferencing is challenging, since applications need to perform classification using the high-volume video feed as the input, which makes the task computationally heavy. In part due to these challenges, there is only limited research on sign language detection.

In “Real-Time Sign Language Detection using Human Pose Estimation”, presented at SLRTP2020 and demoed at ECCV2020, we present a real-time sign language detection model and demonstrate how it can be used to provide video conferencing systems a mechanism to identify the person signing as the active speaker.

Maayan Gazuli, an Israeli Sign Language interpreter, demonstrates the sign language detection system.

Our Model
To enable a real-time working solution for a variety of video conferencing applications, we needed to design a lightweight model that would be simple to “plug and play.” Previous attempts to integrate models for video conferencing applications on the client side demonstrated the importance of a lightweight model that consumes fewer CPU cycles in order to minimize the effect on call quality. To reduce the input dimensionality, we isolated the information the model needs from the video in order to perform the classification of every frame.

Because sign language involves the user’s body and hands, we start by running a pose estimation model, PoseNet. This reduces the input considerably from an entire HD image to a small set of landmarks on the user’s body, including the eyes, nose, shoulders, hands, etc. We use these landmarks to calculate the frame-to-frame optical flow, which quantifies user motion for use by the model without retaining user-specific information. Each pose is normalized by the width of the person’s shoulders in order to ensure that the model attends to the person signing over a range of distances from the camera. The optical flow is then normalized by the video’s frame rate before being passed to the model.
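The normalization steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the published implementation: the landmark names, the `normalize_pose` and `optical_flow` helpers, and the choice to express "normalized by the frame rate" as multiplying the per-frame displacement by the frames per second (so motion is measured per second) are all assumptions.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]
Pose = Dict[str, Point]


def normalize_pose(pose: Pose) -> Pose:
    """Scale all landmarks so that the shoulder width equals 1."""
    (lx, ly), (rx, ry) = pose["left_shoulder"], pose["right_shoulder"]
    width = math.hypot(rx - lx, ry - ly)
    if width == 0:
        raise ValueError("shoulder width is zero; cannot normalize")
    return {name: (x / width, y / width) for name, (x, y) in pose.items()}


def optical_flow(prev: Pose, curr: Pose, fps: float) -> Dict[str, float]:
    """Per-landmark displacement between two frames, scaled to motion per second."""
    return {
        name: math.hypot(curr[name][0] - prev[name][0],
                         curr[name][1] - prev[name][1]) * fps
        for name in curr
    }
```

Dividing by the shoulder width makes the same gesture produce similar values whether the signer sits near the camera or far from it, and scaling by the frame rate keeps the values comparable between, say, 25 fps and 30 fps videos.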

To test this approach, we used the German Sign Language corpus (DGS), which contains long videos of people signing, and includes span annotations that indicate in which frames signing is taking place. As a naïve baseline, we trained a linear regression model to predict when a person is signing using optical flow data. This baseline reached around 80% accuracy, using only ~3μs (0.000003 seconds) of processing time per frame. With the optical flow of the 50 previous frames included as context, the linear model reaches 83.4% accuracy.
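One simple way to give a linear model context is to concatenate the current frame's optical flow with that of the preceding frames. The sketch below assumes this windowing scheme and zero-padding at the start of the video; `context_features` is a hypothetical helper, and the exact feature layout used for the reported numbers is not stated here.

```python
from typing import List


def context_features(flows: List[List[float]], t: int, context: int = 50) -> List[float]:
    """Concatenate frame t's flow with the flow of the `context` frames before it."""
    dim = len(flows[0])
    features: List[float] = []
    for i in range(t - context, t + 1):
        # Frames before the start of the video are padded with zeros.
        features.extend(flows[i] if i >= 0 else [0.0] * dim)
    return features


flows = [[1.0], [2.0], [3.0]]
print(context_features(flows, t=2, context=3))
```

The resulting vector has `(context + 1) * dim` entries, so a linear model over it sees a fixed window of history at every frame.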

To generalize the use of context, we used a long short-term memory (LSTM) architecture, which contains memory over the previous timesteps, but no lookback. Using a single-layer LSTM, followed by a linear layer, the model achieves up to 91.5% accuracy, with 3.5ms (0.0035 seconds) of processing time per frame.

Classification model architecture. (1) Extract poses from each frame; (2) calculate the optical flow from every two consecutive frames; (3) feed through an LSTM; and (4) classify class.

Proof of Concept
Once we had a functioning sign language detection model, we needed to devise a way to use it for triggering the active speaker function in video conferencing applications. We developed a lightweight, real-time, sign language detection web demo that connects to various video conferencing applications and can set the user as the “speaker” when they sign. This demo leverages PoseNet fast human pose estimation and sign language detection models running in the browser using tf.js, which enables it to work reliably in real-time.

When the sign language detection model determines that a user is signing, it passes an ultrasonic audio tone through a virtual audio cable, which can be detected by any video conferencing application as if the signing user is “speaking.” The audio is transmitted at 20kHz, which is normally outside the hearing range for humans. Because video conferencing applications usually detect the audio “volume” as talking rather than only detecting speech, this fools the application into thinking the user is speaking.
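For illustration, the 20kHz tone can be generated as a plain sine wave. The demo itself runs in the browser, so this Python sketch is only an analogy: `ultrasonic_tone` is a hypothetical helper, and the 48 kHz sample rate and 0.5 amplitude are assumptions.

```python
import math
from typing import List


def ultrasonic_tone(duration_s: float,
                    freq_hz: float = 20_000.0,
                    sample_rate: int = 48_000,
                    amplitude: float = 0.5) -> List[float]:
    """Return sine-wave samples in [-amplitude, amplitude] at the given frequency."""
    # A tone can only be represented if it lies below half the sample rate (Nyquist).
    if freq_hz >= sample_rate / 2:
        raise ValueError("frequency must be below the Nyquist limit")
    num_samples = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
        for i in range(num_samples)
    ]


samples = ultrasonic_tone(0.5)
print(len(samples))
```

Writing these samples to a virtual audio device while the detector reports signing, and silence otherwise, would raise the measured input volume without producing audible sound for most listeners.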

The sign language detection demo takes the webcam’s video feed as input, and transmits audio through a virtual microphone when it detects that the user is signing.

You can try our experimental demo right now! By default, the demo acts as a sign language detector. The training code and models, as well as the web demo source code, are available on GitHub.

Demo
In the following video, we demonstrate how the model might be used. Notice the yellow chart at the top left corner, which reflects the model’s confidence in detecting that the activity is indeed sign language. When the user signs, the chart values rise to nearly 100, and when she stops signing, they fall to zero. This process happens in real time, at 30 frames per second, the maximum frame rate of the camera used.

Maayan Gazuli, an Israeli Sign Language interpreter, demonstrates the sign language detection demo.

User Feedback
To better understand how well the demo works in practice, we conducted a user experience study in which participants were asked to use our experimental demo during a video conference and to communicate via sign language as usual. They were also asked to sign over each other and over speaking participants, to test the speaker switching behavior. Participants responded positively that sign language was being detected and treated as audible speech, and that the demo successfully identified the signing attendee and triggered the conferencing system’s audio meter icon to draw focus to the signing attendee.

Conclusions
We believe video conferencing applications should be accessible to everyone and hope this work is a meaningful step in this direction. We have demonstrated how our model could be leveraged to empower signers to use video conferencing more conveniently.

Acknowledgements
Amit Moryossef, Ioannis Tsochantaridis, Roee Aharoni, Sarah Ebling, Annette Rios, Srini Narayanan, George Sung, Jonathan Baccash, Aidan Bryant, Pavithra Ramasamy and Maayan Gazuli

Audiovisual Speech Enhancement in YouTube Stories

Posted by Inbar Mosseri, Software Engineer and Michael Rubinstein, Research Scientist, Google Research

While tremendous efforts are invested in improving the quality of videos taken with smartphone cameras, the quality of audio in videos is often overlooked. For example, the speech of a subject in a video where there are multiple people speaking or where there is high background noise might be muddled, distorted, or difficult to understand. In an effort to address this, two years ago we introduced Looking to Listen, a machine learning (ML) technology that uses both visual and audio cues to isolate the speech of a video’s subject. By training the model on a large-scale collection of online videos, we are able to capture correlations between speech and visual signals such as mouth movements and facial expressions, which can then be used to separate the speech of one person in a video from another, or to separate speech from background sounds. We showed that this technology not only achieves state-of-the-art results in speech separation and enhancement (a noticeable 1.5dB improvement over audio-only models), but in particular can improve the results over audio-only processing when there are multiple people speaking, as the visual cues in the video help determine who is saying what.

We are now happy to make the Looking to Listen technology available to users through a new audiovisual Speech Enhancement feature in YouTube Stories (on iOS), allowing creators to take better selfie videos by automatically enhancing their voices and reducing background noise. Getting this technology into users’ hands was no easy feat. Over the past year, we worked closely with users to learn how they would like to use such a feature, in what scenarios, and what balance of speech and background sounds they would like to have in their videos. We heavily optimized the Looking to Listen model to make it run efficiently on mobile devices, overall reducing the running time from 10x real-time on a desktop when our paper came out, to 0.5x real-time performance on the phone. We also put the technology through extensive testing to verify that it performs consistently across different recording conditions and for people with different appearances and voices.

From Research to Product
Optimizing Looking to Listen to allow fast and robust operation on mobile devices required us to overcome a number of challenges. First, all processing needed to be done on-device within the client app in order to minimize processing time and to preserve the user’s privacy; no audio or video information would be sent to servers for processing. Further, the model needed to co-exist alongside other ML algorithms used in the YouTube app in addition to the resource-consuming video recording itself. Finally, the algorithm needed to run quickly and efficiently on-device while minimizing battery consumption.

The first step in the Looking to Listen pipeline is to isolate thumbnail images that contain the faces of the speakers from the video stream. By leveraging MediaPipe BlazeFace with GPU-accelerated inference, this step now executes in just a few milliseconds. We then switched the model part that processes each thumbnail separately to a lighter-weight MobileNet (v2) architecture, which outputs visual features learned for the purpose of speech enhancement, extracted from the face thumbnails in 10 ms per frame. Because the compute time to embed the visual features is short, it can be done while the video is still being recorded. This avoids the need to keep the frames in memory for further processing, thereby reducing the overall memory footprint. Then, after the video finishes recording, the audio and the computed visual features are streamed to the audio-visual speech separation model, which produces the isolated and enhanced speech.

We reduced the total number of parameters in the audio-visual model by replacing “regular” 2D convolutions with separable ones (1D in the frequency dimension, followed by 1D in the time dimension) with fewer filters. We then optimized the model further using TensorFlow Lite — a set of tools that enable running TensorFlow models on mobile devices with low latency and a small binary size. Finally, we reimplemented the model within the Learn2Compress framework in order to take advantage of built-in quantized training and QRNN support.
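To see why separable convolutions shrink the model, compare the weight counts of a regular 2D convolution and of a 1D convolution over frequency followed by a 1D convolution over time. The layer sizes below are hypothetical, biases are ignored, and the intermediate channel count is assumed to equal the output channel count.

```python
def conv2d_params(k_freq: int, k_time: int, c_in: int, c_out: int) -> int:
    """Weights in a regular 2D convolution with a k_freq x k_time kernel."""
    return k_freq * k_time * c_in * c_out


def separable_params(k_freq: int, k_time: int, c_in: int, c_out: int) -> int:
    """Weights in a 1D frequency convolution followed by a 1D time convolution."""
    freq_conv = k_freq * c_in * c_out   # k_freq x 1 kernel
    time_conv = k_time * c_out * c_out  # 1 x k_time kernel
    return freq_conv + time_conv


# Hypothetical layer: 5x5 kernel, 64 input channels, 64 output channels.
regular = conv2d_params(5, 5, 64, 64)
separable = separable_params(5, 5, 64, 64)
print(regular, separable, regular / separable)
```

For this made-up layer the separable form needs 40,960 weights instead of 102,400, a 2.5x reduction before any decrease in the number of filters.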

Our Looking to Listen on-device pipeline for audiovisual speech enhancement

These optimizations and improvements reduced the running time from 10x real-time on a desktop using the original formulation of Looking to Listen, to 0.5x real-time performance using only an iPhone CPU, and brought the model size down from 120MB to 6MB, which makes it easier to deploy. Since YouTube Stories videos are short (limited to 15 seconds), the result of the video processing is available within a couple of seconds after the recording is finished.

Finally, to avoid processing videos with clean speech (so as to avoid unnecessary computation), we first run our model only on the first two seconds of the video, then compare the speech-enhanced output to the original input audio. If there is sufficient difference (meaning the model cleaned up the speech), then we enhance the speech throughout the rest of the video.
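The early-exit check described above might look like the following sketch. The difference measure (relative energy of the change) and the 0.1 threshold are assumptions, since the actual criterion is not specified; `should_enhance` and `process` are hypothetical names, and `enhance` stands in for the speech enhancement model.

```python
from typing import Callable, List

Audio = List[float]


def should_enhance(original: Audio, enhanced: Audio, threshold: float = 0.1) -> bool:
    """Decide whether enhancement changed the audio enough to be worth applying."""
    diff_energy = sum((o - e) ** 2 for o, e in zip(original, enhanced))
    orig_energy = sum(o * o for o in original)
    if orig_energy == 0:
        return False  # silent probe: nothing to clean up
    return diff_energy / orig_energy > threshold


def process(audio: Audio, sample_rate: int, enhance: Callable[[Audio], Audio],
            probe_seconds: float = 2.0, threshold: float = 0.1) -> Audio:
    """Enhance only the probe first, and the rest only if the probe changed enough."""
    n = int(probe_seconds * sample_rate)
    head = audio[:n]
    enhanced_head = enhance(head)
    if not should_enhance(head, enhanced_head, threshold):
        return list(audio)  # speech was already clean; skip the remaining work
    return enhanced_head + enhance(audio[n:])
```

When the probe shows little change, only the first two seconds are ever run through the model, which saves most of the computation on already clean recordings.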

Researching User Needs
Early versions of Looking to Listen were designed to entirely isolate speech from the background noise. In a user study conducted together with YouTube, we found that users prefer to leave in some of the background sounds to give context and to retain some of the general ambiance of the scene. Based on this user study, we take a linear combination of the original audio and our produced clean speech channel: output_audio = 0.1 x original_audio + 0.9 x speech. The following video presents clean speech combined with different levels of the background sounds in the scene (10% background is the balance we use in practice).
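The linear combination above is a per-sample weighted sum. A minimal sketch, with `mix` as a hypothetical helper and audio represented as lists of float samples:

```python
from typing import List


def mix(original: List[float], speech: List[float],
        background_level: float = 0.1) -> List[float]:
    """Blend the original audio with the enhanced speech, sample by sample."""
    if len(original) != len(speech):
        raise ValueError("both tracks must have the same number of samples")
    return [
        background_level * o + (1.0 - background_level) * s
        for o, s in zip(original, speech)
    ]


# output_audio = 0.1 x original_audio + 0.9 x speech
print(mix([1.0, 0.0], [0.0, 1.0]))
```

Raising `background_level` keeps more of the scene's ambiance, while lowering it moves the result toward fully isolated speech.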

Below are additional examples of the enhanced speech results from the new Speech Enhancement feature in YouTube Stories. We recommend watching the videos with good speakers or headphones.

Fairness Analysis
Another important requirement is that the model be fair and inclusive. It must be able to handle different types of voices, languages and accents, as well as different visual appearances. To this end, we conducted a series of tests exploring the performance of the model with respect to various visual and speech/auditory attributes: the speaker’s age, skin tone, spoken language, voice pitch, visibility of the speaker’s face (% of video in which the speaker is in frame), head pose throughout the video, facial hair, presence of glasses, and the level of background noise in the (input) video.

For each of the above visual/auditory attributes, we ran our model on segments from our evaluation set (separate from the training set) and measured the speech enhancement accuracy, broken down according to the different attribute values. Results for some of the attributes are summarized in the following plots. Each data point in the plots represents hundreds (in most cases thousands) of videos fitting the criteria.

Speech enhancement quality (signal-to-distortion ratio, SDR, in dB) for different spoken languages, sorted alphabetically. The average SDR was 7.89 dB with a standard deviation of 0.42 dB — deviation that for human listeners is considered hard to notice.
Left: Speech enhancement quality as a function of the speaker’s voice pitch. The fundamental voice frequency (pitch) of an adult male typically ranges from 85 to 180 Hz, and that of an adult female ranges from 165 to 255 Hz. Right: speech enhancement quality as a function of the speaker’s predicted age.
As our method utilizes facial cues and mouth movements to isolate the speech, we tested whether facial hair (e.g., a moustache, beard) may obstruct those visual cues and affect the method’s performance. Our evaluations show that the quality of speech enhancement is maintained well also in the presence of facial hair.

Using the Feature
YouTube creators who are eligible for YouTube Stories creation may record a video on iOS, and select “Enhance speech” from the volume controls editing tool. This will immediately apply speech enhancement to the audio track and will play back the enhanced speech in a loop. It is then possible to toggle the feature on and off multiple times to compare the enhanced speech with the original audio.

In parallel to this new feature in YouTube, we are also exploring additional venues for this technology. More to come later this year — stay tuned!

Acknowledgements
This feature is a collaboration across multiple teams at Google. Key contributors include: from Research-IL: Oran Lang; from VisCAM: Ariel Ephrat, Mike Krainin, JD Velasquez, Inbar Mosseri, Michael Rubinstein; from Learn2Compress: Arun Kandoor; from MediaPipe: Buck Bourdon, Matsvei Zhdanovich, Matthias Grundmann; from YouTube: Andy Poes, Vadim Lavrusik, Aaron La Lau, Willi Geiger, Simona De Rosa, and Tomer Margolin.

What if you could turn your voice into any instrument?

Imagine whistling your favorite song. You might not sound like the real deal, right? Now imagine your rendition using auto-tune software. Better, sure, but the result is still your voice. What if there was a way to turn your voice into something like a violin, or a saxophone, or a flute? 

Google Research’s Magenta team, which has been focused on the intersection of machine learning and creative tools for musicians, has been experimenting with exactly this. The team recently created an open source technology called Differentiable Digital Signal Processing (DDSP). DDSP is a new approach to machine learning that enables models to learn the characteristics of a musical instrument and map them to a different sound. The process can lead to so many creative, quirky results. Try replacing a cappella singing with a saxophone solo, or a dog barking with a trumpet performance. The options are endless.

And so are the sounds you can make. This development is important because it enables music technologies to become more inclusive. Machine learning models inherit biases from the datasets they are trained on, and music models are no different. Many are trained on the structure of western musical scores, which excludes much of the music from the rest of the world. Rather than following the formal rules of western music, like the 12 notes on a piano, DDSP transforms sound by modeling frequencies in the audio itself. This opens up machine learning technologies to a wider range of musical cultures. 

In fact, anyone can give it a try. We created a tool called Tone Transfer to allow musicians and amateurs alike to tap into DDSP as a delightful creative tool. Play with the Tone Transfer showcase to sample sounds, or record your own, and listen to how they can be transformed into a myriad of instruments using DDSP technology. Check out our film that shows artists using Tone Transfer for the first time.

DDSP does not create music on its own; think of it like another instrument that requires skill and thought. It’s an experimental soundscape environment for music, and we’re so excited to see how the world uses it.
