Tools for language access during COVID-19

Translation services make it easier to communicate with someone who doesn’t speak the same language, whether you’re traveling abroad or living in a new country. But in the context of a global pandemic, government and health officials urgently need to deliver vital information to their communities, and every member of the community needs access to information in a language they understand. In the U.S. alone, that means reaching 51 million migrants in at least 350 languages, with information ranging from how to keep people and their families safe, to financial, employment or food resources.

To better understand the challenges in addressing these translation needs, we conducted a research study, and interviewed health and government officials responsible for disseminating critical information. We assessed the current shortcomings in providing this information in the relevant languages, and how translation tools could help mitigate them.

The struggle for language access 

When organizations—from health departments to government agencies—update information on a website, it needs to be quickly accessible in a wide variety of languages. We learned that these organizations are struggling to keep up with the high volume of rapidly-changing content and lack the resources to translate this content into the needed languages. 

Officials, who are already spread thin, can barely keep up with the many updates surrounding COVID-19—from the evolving scientific understanding, to daily policy amendments, to new resources for the public. Nearly all new information is coming in as PDFs several times a day, and many officials report not being able to offer professional translation for all needed languages. This is where machine translation can serve as a useful tool.  

How machine translation can help

Machine translation is an automated way to translate text or speech from one language to another. It can take volumes of data and provide translations into a large number of supported languages. Although not intended to fully replace human translators, it can provide value when immediate translations are needed for a wide variety of languages.

If you’re looking to translate content on the web, you have several options.

Use your browser

Many popular browsers offer translation capabilities, which are either built in (e.g. Chrome) or require installing an add-on or extension (e.g. Microsoft Edge or Firefox). To translate web content in Chrome, all you have to do is go to a webpage in another language, then click “Translate” at the top.

Use a website translation widget

If you are a webmaster of a government, non-profit, and/or non-commercial website (e.g. academic institutions), you may be eligible to sign up for the Google Translate Website Translator widget. This tool translates web page content into 100+ different languages. To find out more, please visit the webmasters blog.

Upload PDFs and documents

Google Translate supports translating many different document formats (.doc, .docx, .odf, .pdf, .ppt, .pptx, .ps, .rtf, .txt, .xls, .xlsx). By simply uploading the document, you can get a translated version in the language that you choose.

Millions of people need translations of resources at this time. Google’s researchers, designers and product developers are listening. We are continuously looking for ways to improve our products and come to people’s aid as we navigate the pandemic. 


An update on our work on AI and responsible innovation


AI is a powerful tool that will have a significant impact on society for many years to come, from improving sustainability around the globe to advancing the accuracy of disease screenings. As a leader in AI, we’ve always prioritized the importance of understanding its societal implications and developing it in a way that gets it right for everyone. 

That’s why we first published our AI Principles two years ago and why we continue to provide regular updates on our work. As our CEO Sundar Pichai said in January, developing AI responsibly and with social benefit in mind can help avoid significant challenges and increase the potential to improve billions of lives. 

The world has changed a lot since January, and in many ways our Principles have become even more important to the work of our researchers and product teams. As we develop AI we are committed to testing safety, measuring social benefits, and building strong privacy protections into products. Our Principles give us a clear framework for the kinds of AI applications we will not design or deploy, like those that violate human rights or enable surveillance that violates international norms. For example, we were the first major company to have decided, several years ago, not to make general-purpose facial recognition commercially available.

Over the last 12 months, we’ve shared our point of view on how to develop AI responsibly—see our 2019 annual report and our recent submission to the European Commission’s Consultation on Artificial Intelligence. This year, we’ve also expanded our internal education programs, applied our principles to our tools and research, continued to refine our comprehensive review process, and engaged with external stakeholders around the world, while identifying emerging trends and patterns in AI. 

Building on previous AI Principles updates we shared here on the Keyword in 2018 and 2019, here’s our latest overview of what we’ve learned, and how we’re applying these learnings in practice.

Internal education

In addition to launching the initial Tech Ethics training that 800+ Googlers have taken since its launch last year, this year we developed a new training for AI Principles issue spotting. We piloted the course with more than 2,000 Googlers, and it is now available as an online self-study course to all Googlers across the company. The course coaches employees on asking critical questions to spot potential ethical issues, such as whether an AI application might lead to economic or educational exclusion, or cause physical, psychological, social or environmental harm. We recently released a version of this training as a mandatory course for customer-facing Cloud teams and 5,000 Cloud employees have already taken it.

Tools and research

Our researchers are working on computer science and technology not just for today, but for tomorrow as well. They continue to play a leading role in the field, publishing more than 200 academic papers and articles in the last year on new methods for putting our principles into practice. These publications address technical approaches to fairness, safety, privacy, and accountability to people, including effective techniques for improving fairness in machine learning at scale, a method for incorporating ethical principles into a machine-learned model, and design principles for interpretable machine learning systems.

Over the last year, a team of Google researchers and collaborators published an academic paper proposing a framework called Model Cards that’s similar to a food nutrition label and designed to report an AI model’s intent of use, and its performance for people from a variety of backgrounds. We’ve applied this research by releasing Model Cards for Face Detection and Object Detection models used in Google Cloud’s Vision API product.

Our goal is for Google to be a helpful partner not only to researchers and developers who are building AI applications, but also to the billions of people who use them in everyday products. We’ve gone a step further, releasing 14 new tools that help explain how responsible AI works, from simple data visualizations on algorithmic bias for general audiences to Explainable AI dashboards and tool suites for enterprise users. You’ll find a number of these within our new Responsible AI with TensorFlow toolkit.

Review process 

As we’ve shared previously, Google has a central, dedicated team that reviews proposals for AI research and applications for alignment with our principles. Operationalizing the AI Principles is challenging work. Our review process is iterative, and we continue to refine and improve our assessments as advanced technologies emerge and evolve. The team also consults with internal domain experts in machine-learning fairness, security, privacy, human rights, and other areas. 

Whenever relevant, we conduct additional expert human rights assessments of new products in our review process, before launch. For example, we enlisted the nonprofit organization BSR (Business for Social Responsibility) to conduct a formal human rights assessment of the new Celebrity Recognition tool, offered within Google Cloud Vision and Video Intelligence products. BSR applied the UN’s Guiding Principles on Business and Human Rights as a framework to guide the product team to consider the product’s implications across people’s privacy and freedom of expression, as well as potential harms that could result, such as discrimination. This assessment informed not only the product’s design, but also the policies around its use. 

In addition, because any robust evaluation of AI needs to consider not just technical methods but also social context(s), we consult a wider spectrum of perspectives to inform our AI review process, including social scientists and Google’s employee resource groups.

As one example, consider how we’ve built upon learnings from a case we published in our last AI Principles update: the review of academic research on text-to-speech (TTS) technology. Since then, we have applied what we learned in that earlier review to establish a Google-wide approach to TTS. Google Cloud’s Text-to-Speech service, used in products such as Google Lens, puts this approach into practice.

Because TTS could be used across a variety of products, a group of senior Google technical and business leads was consulted. They considered the proposal against our AI Principles of being socially beneficial and accountable to people, as well as the need to incorporate privacy by design and to avoid technologies that cause or are likely to cause overall harm.

  • Reviewers identified the benefits of an improved user interface for various products, and significant accessibility benefits for people with hearing impairments. 

  • They considered the risks of voice mimicry and impersonation, media manipulation, and defamation.

  • They took into account how an AI model is used, and recognized the importance of adding layers of barriers for potential bad actors, to make harmful outcomes less likely.

  • They recommended on-device privacy and security precautions that serve as barriers to misuse, reducing the risk of overall harm from use of TTS technology for nefarious purposes.  

  • The reviewers recommended approving TTS technology for use in our products, but only with user consent and on-device privacy and security measures.

  • They did not approve open-sourcing of TTS models, due to the risk that someone might misuse them to build harmful deepfakes and distribute misinformation. 


External engagement

To increase the number and variety of outside perspectives, this year we launched the Equitable AI Research Roundtable, which brings together advocates for communities of people who are currently underrepresented in the technology industry, and who are most likely to be impacted by the consequences of AI and advanced technology. This group of community-based, non-profit leaders and academics meets with us quarterly to discuss AI ethics issues, and learnings from these discussions help shape operational efforts and decision-making frameworks.

Our global efforts this year included new programs to support non-technical audiences in their understanding of, and participation in, the creation of responsible AI systems, whether they are policymakers, first-time ML (machine learning) practitioners or domain experts. These included:

 

  • Partnering with Yielding Accomplished African Women to implement the first-ever Women in Machine Learning Conference in Africa. We built a network of 1,250 female machine learning engineers from six different African countries. Using the Google Cloud Platform, we trained and certified 100 women at the conference in Accra, Ghana. More than 30 universities and 50 companies and organizations were represented. The conference schedule included workshops on Qwiklabs, AutoML, TensorFlow, a human-centered approach to AI, mindfulness and #IamRemarkable.

  • Releasing, in partnership with the Ministry of Public Health in Thailand, the first study of its kind on how researchers apply nurses’ and patients’ input to make recommendations on future AI applications, based on how nurses deployed a new AI system to screen patients for diabetic retinopathy.

  • Launching an ML workshop for policymakers featuring content and case studies covering the topics of Explainability, Fairness, Privacy, and Security. We’ve run this workshop, via Google Meet, with over 80 participants in the policy space with more workshops planned for the remainder of the year. 

  • Hosting the PAIR (People + AI Research) Symposium in London, which focused on participatory ML and marked PAIR’s expansion to the EMEA region. The event drew 160 attendees across academia, industry, engineering, and design, and featured cross-disciplinary discussions on human-centered AI and hands-on demos of ML Fairness and interpretability tools. 

We remain committed to external, cross-stakeholder collaboration. We continue to serve on the board and as a member of the Partnership on AI, a multi-stakeholder organization that studies and formulates best practices on AI technologies. As an example of our work together, the Partnership on AI is developing best practices that draw from our Model Cards proposal as a framework for accountability among its member organizations. 

Trends, technologies and patterns emerging in AI

We know no system, whether human or AI powered, will ever be perfect, so we don’t consider the task of improving it to ever be finished. We continue to identify emerging trends and challenges that surface in our AI Principles reviews. These prompt us to ask questions such as when and how to responsibly develop synthetic media, keep humans in an appropriate loop of AI decisions, launch products with strong fairness metrics, deploy affective technologies, and offer explanations on how AI works, within products themselves. 

As Sundar wrote in January, it’s crucial that companies like ours not only build promising new technologies, but also harness them for good—and make them available for everyone. This is why we believe regulation can offer helpful guidelines for AI innovation, and why we share our principled approach to applying AI. As we continue to responsibly develop and use AI to benefit people and society, we look forward to continuing to update you on specific actions we’re taking, and on our progress.


AutoML-Zero: Evolving Code that Learns


Posted by Esteban Real, Staff Software Engineer, and Chen Liang, Software Engineer, Google Research, Brain Team

Machine learning (ML) has seen tremendous successes recently, which were made possible by ML algorithms like deep neural networks that were discovered through years of expert research. The difficulty involved in this research fueled AutoML, a field that aims to automate the design of ML algorithms. So far, AutoML has focused on constructing solutions by combining sophisticated hand-designed components. A typical example is neural architecture search, a heavily researched subfield in which one builds neural networks automatically out of complex layers (e.g., convolutions, batch-norm, and dropout).

An alternative approach to using these hand-designed components in AutoML is to search for entire algorithms from scratch. This is challenging because it requires the exploration of vast and sparse search spaces, yet it has great potential benefits: it is not biased toward what we already know and potentially allows for the discovery of new and better ML architectures. By analogy, if one were building a house from scratch, there is more potential for flexibility or improvement than if one were constructing a house using only prefabricated rooms. However, the discovery of such housing designs may be more difficult because there are many more possible ways to combine the bricks and mortar than there are of combining pre-made designs of entire rooms. As such, early research into algorithm learning from scratch focused on one aspect of the algorithm, such as the learning rule, to reduce the search space and compute required, and has not been revisited much since the early 90s. Until now.

Extending our research into evolutionary AutoML, our recent paper, to be published at ICML 2020, demonstrates that it is possible to successfully evolve ML algorithms from scratch. The approach we propose, called AutoML-Zero, starts from empty programs and, using only basic mathematical operations as building blocks, applies evolutionary methods to automatically find the code for complete ML algorithms. Given small image classification problems, our method rediscovered fundamental ML techniques, such as 2-layer neural networks with backpropagation, linear regression and the like, which have been invented by researchers throughout the years. This result demonstrates the plausibility of automatically discovering more novel ML algorithms to address harder problems in the future.

Evolving Learning Algorithms from Scratch
We use a variant of classic evolutionary methods to search the space of algorithms. These methods have proved useful in discovering computer programs since the 80s. Their simplicity and scalability make them especially suitable for the discovery of learning algorithms.

In our case, a population is initialized with empty programs. It then evolves in repeating cycles to produce better and better learning algorithms. At each cycle, two (or more) random models compete and the most accurate model gets to be a parent. The parent clones itself to produce a child, which gets mutated. That is, the child’s code is modified in a random way, which could mean, for example, arbitrarily inserting, removing or modifying a line in the code. The mutated algorithm is then evaluated on image classification tasks.

A population is initialized with empty programs. Many generations later, we see a more evolved population and two of its algorithms compete. The most accurate wins to produce a child. After many such events, the final population contains highly accurate classifiers.
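The cycle described above can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: programs here are hypothetical lists of `(op, constant)` lines applied to a single number, fitness is the negative squared error on a small regression task rather than image classification accuracy, and removing the oldest program each cycle is an assumption, since the post does not say how the population size is kept fixed.

```python
import random
from collections import deque

# Hypothetical instruction set: each program line applies one basic operation.
OPS = {
    "add": lambda v, c: v + c,
    "mul": lambda v, c: v * c,
}

def run(program, x):
    """Execute a program: a list of (op_name, constant) lines applied in order to x."""
    v = x
    for op, c in program:
        v = OPS[op](v, c)
    return v

def fitness(program, data):
    """Higher is better: negative mean squared error on the toy task."""
    return -sum((run(program, x) - y) ** 2 for x, y in data) / len(data)

def random_line(rng):
    return (rng.choice(list(OPS)), rng.choice([-2, -1, 1, 2, 3]))

def mutate(program, rng):
    """Clone the parent, then insert, remove, or modify one random line."""
    child = list(program)
    choice = rng.choice(["insert", "remove", "modify"])
    if choice == "insert" or not child:
        child.insert(rng.randint(0, len(child)), random_line(rng))
    elif choice == "remove":
        del child[rng.randrange(len(child))]
    else:
        child[rng.randrange(len(child))] = random_line(rng)
    return child

def evolve(data, population_size=50, tournament_size=2, cycles=2000, seed=0):
    rng = random.Random(seed)
    # The population starts with empty programs.
    population = deque([[] for _ in range(population_size)])
    best = []
    for _ in range(cycles):
        # Random programs compete; the most accurate becomes the parent.
        contestants = rng.sample(list(population), tournament_size)
        parent = max(contestants, key=lambda p: fitness(p, data))
        child = mutate(parent, rng)
        population.append(child)
        population.popleft()  # Assumption: the oldest program is removed.
        if fitness(child, data) > fitness(best, data):
            best = child
    return best
```

Calling `evolve([(x, 3 * x + 2) for x in range(-3, 4)])` returns the best program seen during the run. It is never worse than the empty program, because `best` is replaced only by strictly fitter children.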

Exploring a Difficult Search Space
Our AutoML-Zero setup, in contrast to much previous AutoML work, makes the search space very sparse: an accurate algorithm might be as rare as 1 in 10^12 candidates. This is due to the granularity of the building blocks provided to the algorithm, which include only basic operations such as variable assignment, addition, and matrix multiplication. In such an environment, a random search will not find a solution in a reasonable amount of time, yet evolution can be tens of thousands of times faster, according to our measurements. We distributed the search on multiple machines that occasionally exchange algorithms (analogous to migration in real life). We also constructed small proxy classification tasks on which to evaluate each child algorithm, and executed this evaluation with highly optimized code.

Despite the sparsity, the evolutionary search discovers more complex and effective techniques as time passes. Initially, the simplest algorithms appear, which represent linear models with hard-coded weights. In time, stochastic gradient descent (SGD) is invented to learn the weights, in spite of the gradient itself not having been provided as a building block. Though flawed at first, SGD gets fixed relatively quickly, starting a series of improvements to the prediction and learning algorithm. Within our toy scenario, the process discovers several concepts known to have been useful to the research community. In the end, our approach manages to construct a model that outperforms hand-designs of comparable complexity.

Progress of an evolution experiment. As time passes, from left to right, we see the algorithms becoming more complex and more accurate.

The Evolved Algorithm
The figure above includes the best evolved algorithm produced by our method. This final algorithm includes techniques such as noise injection as data augmentation, bilinear model, gradient normalization, and weight averaging, and the improvement over the baseline also transfers to datasets that are not used during search. Our paper describes how the different lines in the evolved code implement each of these techniques, and verifies their value through ablation studies.

Through more experiments, we show that it is possible to guide the evolutionary search by controlling “the habitat” — i.e., the tasks on which the evolutionary process evaluates the fitness of the algorithms. For example, when we reduce the amount of data, the noisy ReLU emerges, which helps with regularization. Or when we reduce the number of training steps, we witness the emergence of learning rate decay, which enables faster convergence. Targeted discoveries such as these are important — while it may be interesting if an automatic tool-inventing machine comes up with a hammer or a needle, it is much more interesting if it comes up with a hammer when you show it some nails and a needle when you show it some thread. By analogy, in our work the noisy ReLU (“hammer”) is discovered when in the presence of little data (“nails”) and the learning rate decay when in the presence of few training steps.

Conclusion
We consider this to be preliminary work. We have yet to evolve fundamentally new algorithms, but it is encouraging that the evolved algorithm can surpass simple neural networks that exist within the search space. Right now, the search process requires significant compute.* As the coming years scale up available hardware and as the search methods become more efficient, it is likely that the search space will become more inclusive and the results will improve. We are excited at the prospects of discovering novel machine learning algorithms as we further our understanding of AutoML-Zero.

Acknowledgements
We want to thank our co-authors, David R. So and Quoc V. Le, and the many who helped us through discussions during the project and paper writing, including Samy Bengio, Vincent Vanhoucke, Doug Eck, Charles Sutton, Yanping Huang, Jacques Pienaar, Jeff Dean, and particularly Gabriel Bender, Hanxiao Liu, Rishabh Singh, Chiyuan Zhang, and Hieu Pham. We also want to especially thank Tom Small for contributing the animations in this post.


* The electricity consumption for the experiments (run in 2019) was matched with the purchase of renewable energy.


Ask a Techspert: How do machine learning models explain themselves?


Editor’s Note: Do you ever feel like a fish out of water? Try being a tech novice and talking to an engineer at a place like Google. Ask a Techspert is a series on the Keyword asking Googler experts to explain complicated technology for the rest of us. This isn’t meant to be comprehensive, but just enough to make you sound smart at a dinner party. 

A few years ago, I learned that a translation from Finnish to English using Google Translate led to an unexpected outcome. The sentence “hän on lentäjä” became “he is a pilot” in English, even though “hän” is a gender-neutral word in Finnish. Why did Translate assume it was “he” as the default? 

As I started looking into it, I became aware that just like humans, machines are affected by society’s biases. The machine learning model for Translate relied on training data, which consisted of the input from hundreds of millions of already-translated examples from the web. “He” was more associated with some professions than “she” was, and vice versa. 

Now, Google provides options for both feminine and masculine translations when adapting gender-neutral words in several languages, and there’s a continued effort to roll it out more broadly. But it’s still a good example of how machine learning can reflect the biases we see all around us. Thankfully, there are teams at Google dedicated to finding human-centered solutions to making technology inclusive for everyone. I sat down with Been Kim, a Google researcher working on the People + AI Research (PAIR) team, who devotes her time to making sure artificial intelligence puts people, not machines, at its center, and helping others understand the full spectrum of human interaction with machine intelligence. We talked about how you make machine learning models easy to interpret and understand, and why it’s important for everybody to have a basic idea of how the technology works.


Why is this field of work so important?

Machine learning is such a powerful tool, and because of that, you want to make sure you’re using it responsibly. Let’s take an electric machine saw as an example. It’s a super powerful tool, but you need to learn how to use it in order not to cut your fingers. Once you learn, it’s so useful and efficient that you’ll never want to go back to using a hand saw. And the same goes for machine learning. We want to help you understand and use machine learning correctly, fairly and safely. 

Since machine learning is used in our everyday lives, it’s also important for everyone to understand how it impacts us. Whether you’re a coffee shop owner using machine learning to optimize the purchase of your beans based on seasonal trends, or a patient whose doctor diagnoses a disease with the help of this technology, it’s often crucial to understand why a machine learning model has produced the outcome it has. It’s also important for developers and decision-makers to be able to explain or present a machine learning model to people so that they can reach that understanding. This is what we call “interpretability.”

How do you make machine learning models easier to understand and interpret? 

There are many different ways to make an ML model easier to understand. One way is to make the model reflect how humans think from the start, and have the model “trained” to provide explanations along with predictions, meaning when it gives you an outcome, it also has to explain how it got there. 

Another way is to try and explain a model after the training on data is done. This is something you can do when the model has been built to use input to provide an output from its own perspective, optimizing for prediction, without a clear “how” included. This means you’re able to plug things into it and see what comes out, and that can give you some insight into how the model generally makes decisions, but you don’t necessarily know exactly how specific inputs are interpreted by the model in specific cases. 

One way to try and explain models after they’ve been trained is using low level features or high level concepts. Let me give you an example of what this means. Imagine a system that classifies pictures: you give it a picture and it says, “This is a cat.” A low level feature is when I then ask the machine which pixels mattered for that prediction, it can tell us if it was one pixel or the other, and we might be able to see that the pixels in question show the cat’s whiskers. But we might also see that it is a scattering of pixels that don’t appear meaningful to the human eye, or that it’s made the wrong interpretation. High level concepts are more similar to the way humans communicate with one another. Instead of asking about pixels, I’d ask, “Did the whiskers matter for the prediction? or the paws?” and again, the machine can show me what imagery led it to reach this conclusion. Based on the outcome, I can understand the model better. (Together with researchers from Stanford, we’ve published papers that go into further detail on this for those who are interested.)

Can machines understand some things that we humans can’t? 

Yes! This is an area that I am very interested in myself. I am currently working on a way to showcase how technology can help humans learn new things. Machine learning technology is better at some things than we are; for example it can analyze and interpret data at a much larger scale than humans can. Leveraging this technology, I believe we can enlighten human scientists with knowledge they haven’t previously been aware of. 

What do you need to be careful of when you’re making conclusions based on machine learning models?

First of all, we have to be careful that human bias doesn’t come into play. Humans carry biases that we simply cannot help and are often unaware of, so if an explanation is up to a human’s interpretation, and often it is, then we have a problem. Humans read what they want to read. Now, this doesn’t mean that you should remove humans from the loop. Humans communicate with machines, and vice versa. Machines need to communicate their outcomes in the form of a clear statement using quantitative data, not one that is vague and completely open for interpretation. If the latter happens, then the machine hasn’t done a very good job and the human isn’t able to provide good feedback to the machine. It could also be that the outcome simply lacks additional context only the human can provide, or that it could benefit from having caveats, in order for them to make an informed judgement about the results of the model. 

What are some of the main challenges of this work? 

Well, one of the challenges for computer scientists in this field is dealing with non mathematical objectives, which are things you might want to optimize for, but don’t have an equation for. You can’t always define what is good for humans using math. That requires us to test and evaluate methods with rigor, and have a table full of different people to discuss the outcome. Another thing has to do with complexity. Humans are so complex that we have a whole field of work – psychology – to study this. So in my work, we don’t just have computational challenges, but also complex humans that we have to consider. Value-based questions such as “what defines fairness?” are even harder. They require interdisciplinary collaboration, and a diverse group of people in the room to discuss each individual matter.

What’s the most exciting part? 

I think interpretability research and methods are making a huge impact. Machine learning technology is a powerful tool that will transform society as we know it, and helping others to use it safely is very rewarding. 

On a more personal note, I come from South Korea and grew up in circumstances where I feel I didn’t have too many opportunities. I was incredibly lucky to get a scholarship to MIT and come to the U.S. When I think about the people who haven’t had these opportunities to be educated in science or machine learning, and knowing that this machine learning technology can really help and be useful to them in their everyday lives if they use it safely, I feel really motivated to be working on democratizing this technology. There are many ways to do it, and interpretability is one of the things that I can contribute.


Duality — A New Approach to Reinforcement Learning


Posted by Ofir Nachum and Bo Dai, Research Scientists, Google Research

Reinforcement learning (RL) is an approach commonly used to train agents to make sequences of decisions that will be successful in complex environments. Examples include robotic navigation, where an agent controls the joint motors of a robot to seek a path to a target location, and game-playing, where the goal might be to solve a game level in minimal time. Many modern successful RL algorithms, such as Q-learning and actor-critic, propose to reduce the RL problem to a constraint-satisfaction problem, where a constraint exists for every possible “state” of the environment. For example, in vision-based robotic navigation, the “states” of the environment correspond to every possible camera input.

Despite how ubiquitous the constraint-satisfaction approach is in practice, this strategy is often difficult to reconcile with the complexity of real-world settings. In practical scenarios (like the robotic navigation example) the space of states is large, sometimes even uncountable, so how can one learn to satisfy the tremendous number of constraints associated with arbitrary input? Implementations of Q-learning and actor-critic often ignore these mathematical issues or obscure them through a series of rough approximations, which results in a stark divide between the practical implementations of these algorithms and their mathematical foundations.

In “Reinforcement Learning via Fenchel-Rockafellar Duality” we have developed a new approach to RL that enables algorithms that are both useful in practice and mathematically principled — that is to say, the proposed algorithms avoid the use of exceedingly rough approximations to translate their mathematical foundations to practical implementation. This approach is based on convex duality, which is a well-studied mathematical tool used to transform problems expressed in one form into equivalent problems in distinct forms that may be more computationally friendly. In our case, we develop specific ways to apply duality in RL to transform the traditional constraint-satisfaction mathematical form to an unconstrained, and thus more practical, mathematical problem.
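For readers who have not met convex duality before, the basic building block is the Fenchel conjugate, f*(y) = sup_x (x·y − f(x)). The sketch below approximates it numerically for f(x) = x²/2, whose conjugate is known to be y²/2, and checks that conjugating twice recovers the original function. The grid, the function names, and the choice of f are illustrative assumptions; this is not the construction used in the paper, only the underlying idea that the same convex function can be expressed in two equivalent forms.

```python
def conjugate(f, y, grid):
    """Approximate f*(y) = sup_x (x*y - f(x)) by a maximum over a finite grid."""
    return max(x * y - f(x) for x in grid)


def half_square(x):
    """f(x) = x^2 / 2, a convex function whose conjugate is again y^2 / 2."""
    return 0.5 * x * x


# Coarse grid on [-5, 5]; fine enough for a demonstration.
GRID = [i / 100.0 for i in range(-500, 501)]

# Tabulate f* once on the grid so the second conjugation stays cheap.
F_STAR = {y: conjugate(half_square, y, GRID) for y in GRID}


def biconjugate(x):
    """Approximate f**(x) = sup_y (x*y - f*(y)) using the tabulated f*."""
    return max(x * y - F_STAR[y] for y in GRID)


print("f*(2.0)   =", conjugate(half_square, 2.0, GRID))
print("f**(1.5)  =", biconjugate(1.5))
print("f(1.5)    =", half_square(1.5))
```

For a convex, well-behaved f the last two printed values agree, which is the sense in which the dual form is "equivalent" to the original one.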

A Duality-Based Solution
The duality-based approach begins by formulating the reinforcement learning problem as a mathematical objective along with a number of constraints, potentially infinite in number. Applying duality to this mathematical problem yields a different formulation of the same problem. Still, this dual formulation has the same format as the original problem — a single objective with a large number of constraints — although the specific objective and constraints are changed.

The next step is key to the duality-based solution. We augment the dual objective with a convex regularizer, a method often used in optimization as a way to smooth a problem and make it easier to solve. The choice of the regularizer is crucial to the final step, in which we apply duality once again to yield another formulation of an equivalent problem. In our case, we use the f-divergence regularizer, which results in a final formulation that is now unconstrained. Although there exist other choices of convex regularizers, regularization via the f-divergence is uniquely desirable for yielding an unconstrained problem that is especially amenable to optimization in practical and real-world settings which require off-policy or offline learning.

Notably, in many cases, the applications of duality and regularization prescribed by the duality-based approach do not change the optimality of the original solution. In other words, although the form of the problem has changed, the solution has not. This way, the result obtained with the new formulation is the same result as for the original problem, albeit achieved in a much easier way.

Experimental Evaluation
As a test of our new approach, we implemented duality-based training on a navigational agent. The agent starts at one corner of a multi-room map and must navigate to the opposite corner. We compare our algorithm to an actor-critic approach. Although both of these algorithms are based on the same underlying mathematical problem, actor-critic uses a number of approximations due to the infeasibility of satisfying the large number of constraints. In contrast, our algorithm is more amenable to practical implementation as can be seen by comparing the performance of the two algorithms. In the figure below, we plot the average reward achieved by the learned agent against the number of iterations of training for each algorithm. The duality-based implementation achieves significantly higher reward compared to actor-critic.

A plot of the average reward achieved by an agent using the duality-based approach (blue) compared to an agent using standard actor-critic (orange). In addition to being more mathematically principled, our approach also yields better practical results.

Conclusion
In summary, we’ve shown that if one formulates the RL problem as a mathematical objective with constraints, then repeated applications of convex duality in conjunction with a cleverly chosen convex regularizer yield an equivalent problem without constraints. The resulting unconstrained problem is easy to implement in practice and applicable in a wide range of settings. We’ve already applied our general framework to agent behavior policy optimization as well as policy evaluation, and imitation learning. We’ve found that our algorithms are not only more mathematically principled than existing RL methods, but they also often yield better practical performance, showing the value of unifying mathematical principles with practical implementation.

Google at ACL 2020

Posted by Cat Armato and Emily Knapp, Program Managers

This week, the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), a premier conference covering a broad spectrum of research areas that are concerned with computational approaches to natural language, takes place online.

As a leader in natural language processing and understanding, and a Diamond Level sponsor of ACL 2020, Google will showcase the latest research in the field with over 30 publications, and the organization of and participation in a variety of workshops and tutorials.

If you’re registered for ACL 2020, we hope that you’ll visit the Google virtual booth to learn more about the projects and opportunities at Google that go into solving interesting problems for billions of people. You can also learn more about the Google research being presented at ACL 2020 below (Google affiliations bolded).

Committees
Diversity & Inclusion (D&I) Chair: Vinodkumar Prabhakaran
Accessibility Chair: Sushant Kafle
Local Sponsorship Chair: Kristina Toutanova
Virtual Infrastructure Committee: Yi Luan
Area Chairs: Anders Søgaard, Ankur Parikh, Annie Louis, Bhuvana Ramabhadran, Christo Kirov, Daniel Cer, Dipanjan Das, Diyi Yang, Emily Pitler, Eunsol Choi, George Foster, Idan Szpektor, Jacob Eisenstein, Jason Baldridge, Jun Suzuki, Kenton Lee, Luheng He, Marius Pasca, Ming-Wei Chang, Sebastian Gehrmann, Shashi Narayan, Slav Petrov, Vinodkumar Prabhakaran, Waleed Ammar, William Cohen

Long Papers
Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage
Ashish V. Thapliyal, Radu Soricut

Automatic Detection of Generated Text is Easiest when Humans are Fooled
Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, Douglas Eck

On Faithfulness and Factuality in Abstractive Summarization
Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou

BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps
Wang Zhu, Hexiang Hu, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, Fei Sha

Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
Xuanli He, Gholamreza Haffari, Mohammad Norouzi

GoEmotions: A Dataset of Fine-Grained Emotions
Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, Sujith Ravi

TaPas: Weakly Supervised Table Parsing via Pre-training (see blog post)
Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, Julian Eisenschlos

Toxicity Detection: Does Context Really Matter?
John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon, Nithum Thain, Ion Androutsopoulos

(Re)construing Meaning in NLP
Sean Trott, Tiago Timponi Torrent, Nancy Chang, Nathan Schneider

Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models
Dan Iter, Kelvin Guu, Larry Lansing, Dan Jurafsky

Probabilistic Assumptions Matter: Improved Models for Distantly-Supervised Document-Level Question Answering
Hao Cheng, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

AdvAug: Robust Adversarial Augmentation for Neural Machine Translation
Yong Cheng, Lu Jiang, Wolfgang Macherey, Jacob Eisenstein

Named Entity Recognition as Dependency Parsing
Juntao Yu, Bernd Bohnet, Massimo Poesio

Cross-modal Coherence Modeling for Caption Generation
Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, Matthew Stone

Representation Learning for Information Extraction from Form-like Documents (see blog post)
Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, Marc Najork

Low-Dimensional Hyperbolic Knowledge Graph Embeddings
Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, Christopher Ré

What Question Answering can Learn from Trivia Nerds
Jordan Boyd-Graber, Benjamin Börschinger

Learning a Multi-Domain Curriculum for Neural Machine Translation (see blog post)
Wei Wang, Ye Tian, Jiquan Ngiam, Yinfei Yang, Isaac Caswell, Zarana Parekh

Translationese as a Language in “Multilingual” NMT
Parker Riley, Isaac Caswell, Markus Freitag, David Grangier

Mapping Natural Language Instructions to Mobile UI Action Sequences
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge

BLEURT: Learning Robust Metrics for Text Generation (see blog post)
Thibault Sellam, Dipanjan Das, Ankur Parikh

Exploring Unexplored Generalization Challenges for Cross-Database Semantic Parsing
Alane Suhr, Ming-Wei Chang, Peter Shaw, Kenton Lee

Frugal Paradigm Completion
Alexander Erdmann, Tom Kenter, Markus Becker, Christian Schallhart

Short Papers
Reverse Engineering Configurations of Neural Text Generation Models
Yi Tay, Dara Bahri, Che Zheng, Clifford Brunk, Donald Metzler, Andrew Tomkins

Syntactic Data Augmentation Increases Robustness to Inference Heuristics
Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, Tal Linzen

Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation
Aditya Siddhant, Ankur Bapna, Yuan Cao, Orhan Firat, Mia Chen, Sneha Kudugunta, Naveen Arivazhagan, Yonghui Wu

Social Biases in NLP Models as Barriers for Persons with Disabilities
Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, Stephen Denuyl

Toward Better Storylines with Sentence-Level Language Models
Daphne Ippolito, David Grangier, Douglas Eck, Chris Callison-Burch

TACL Papers
TYDI QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages (see blog post)
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, Jennimaria Palomaki

Phonotactic Complexity and Its Trade-offs
Tiago Pimentel, Brian Roark, Ryan Cotterell

Demos
Multilingual Universal Sentence Encoder for Semantic Retrieval (see blog post)
Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil

Workshops
IWPT – The 16th International Conference on Parsing Technologies
Yuji Matsumoto, Stephan Oepen, Kenji Sagae, Anders Søgaard, Weiwei Sun and Reut Tsarfaty

ALVR – Workshop on Advances in Language and Vision Research
Xin Wang, Jesse Thomason, Ronghang Hu, Xinlei Chen, Peter Anderson, Qi Wu, Asli Celikyilmaz, Jason Baldridge and William Yang Wang

WNGT – The 4th Workshop on Neural Generation and Translation
Alexandra Birch, Graham Neubig, Andrew Finch, Hiroaki Hayashi, Kenneth Heafield, Ioannis Konstas, Yusuke Oda and Xian Li

NLPMC – NLP for Medical Conversations
Parminder Bhatia, Chaitanya Shivade, Mona Diab, Byron Wallace, Rashmi Gangadharaiah, Nan Du, Izhak Shafran and Steven Lin

AutoSimTrans – The 1st Workshop on Automatic Simultaneous Translation
Hua Wu, Colin Cherry, James Cross, Liang Huang, Zhongjun He, Mark Liberman and Yang Liu

Tutorials
Interpretability and Analysis in Neural NLP (cutting-edge)
Yonatan Belinkov, Sebastian Gehrmann, Ellie Pavlick

Commonsense Reasoning for Natural Language Processing (Introductory)
Maarten Sap, Vered Shwartz, Antoine Bosselut, Yejin Choi, Dan Roth

SmartReply for YouTube Creators

Posted by Rami Al-Rfou, Research Scientist, Google Research

It has been more than 4 years since SmartReply was launched, and since then, it has expanded to more users with the Gmail launch and Android Messages and to more devices with Android Wear. Developers now use SmartReply to respond to reviews within the Play Developer Console and can set up their own versions using APIs offered within MLKit and TFLite. With each launch there has been a unique challenge in modeling and serving that required customizing SmartReply for the task requirements.

We are now excited to share an updated SmartReply built for YouTube and implemented in YouTube Studio that helps creators engage more easily with their viewers. This model learns comment and reply representation through a computationally efficient dilated self-attention network, and represents the first cross-lingual and character byte-based SmartReply model. SmartReply for YouTube is currently available for English and Spanish creators, and this approach simplifies the process of extending the SmartReply feature to many more languages in the future.

YouTube creators receive a large volume of responses to their videos. Moreover, the community of creators and viewers on YouTube is diverse, as reflected by the creativity of their comments, discussions and videos. In comparison to emails, which tend to be long and dominated by formal language, YouTube comments reveal complex patterns of language switching, abbreviated words, slang, inconsistent usage of punctuation, and heavy utilization of emoji. Following is a sample of comments that illustrate this challenge:

Deep Retrieval
The initial release of SmartReply for Inbox encoded input emails word-by-word with a recurrent neural network, and then decoded potential replies with yet another word-level recurrent neural network. Despite the expressivity of this approach, it was computationally expensive. Instead, we found that one can achieve the same ends by designing a system that searches through a predefined list of suggestions for the most appropriate response.

This retrieval system encoded the message and its suggestion independently. First, the text was preprocessed to extract words and short phrases. This preprocessing included, but was not limited to, language identification, tokenization, and normalization. Two neural networks then simultaneously and independently encoded the message and the suggestion. This factorization allowed one to pre-compute the suggestion encodings and then search through the set of suggestions using an efficient maximum inner product search data structure. This deep retrieval approach enabled us to expand SmartReply to Gmail and since then, it has been the foundation for several SmartReply systems including the current YouTube system.
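The retrieval idea can be sketched in a few lines. In the toy version below, `toy_encode` is a hypothetical stand-in for the learned neural encoders (it just builds a normalized byte histogram), the same function is used for both messages and suggestions, and the search is a brute-force scan rather than an efficient maximum inner product search structure. What it does preserve is the factorization: suggestion encodings are computed once, and each incoming message only needs one encoding plus inner products.

```python
import math


def toy_encode(text, dim=32):
    """Hypothetical encoder: L2-normalized histogram of bytes folded into `dim` buckets."""
    counts = [0.0] * dim
    for byte in text.encode("utf-8"):
        counts[byte % dim] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def build_index(suggestions):
    """Pre-compute suggestion encodings once, ahead of serving time."""
    return [(text, toy_encode(text)) for text in suggestions]


def top_k(message, index, k=2):
    """Brute-force inner product search over the pre-computed encodings."""
    query = toy_encode(message)
    scored = [(sum(q * v for q, v in zip(query, vec)), text) for text, vec in index]
    scored.sort(reverse=True)
    return scored[:k]


SUGGESTIONS = ["Thank you!", "Sorry about that", "See you soon"]
INDEX = build_index(SUGGESTIONS)

for score, text in top_k("Thank you!", INDEX):
    print(round(score, 3), text)
```

A production system would replace the linear scan with an approximate nearest-neighbor index, but the interface stays the same: encode once, then rank by inner product.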

Beyond Words
The previous SmartReply systems described above relied on word level preprocessing that is well tuned for a limited number of languages and narrow genres of writing. Such systems face significant challenges in the YouTube case, where a typical comment might include heterogeneous content, like emoji, ASCII art, language switching, etc. In light of this, and taking inspiration from our recent work on byte and character language modeling, we decided to encode the text without any preprocessing. This approach is supported by research demonstrating that a deep Transformer network is able to model words and phrases from the ground up just by feeding it text as a sequence of characters or bytes, with comparable quality to word-based models.

Although initial results were promising, especially for processing comments with emoji or typos, the inference speed was too slow for production because character sequences are longer than their word equivalents and the computational complexity of self-attention layers grows quadratically with sequence length. We found that shrinking the sequence length by applying temporal reduction layers at each layer of the network, similar to the dilation technique applied in WaveNet, provides a good trade-off between computation and quality.

The figure below presents a dual encoder network that encodes both the comment and the reply to maximize the mutual information between their latent representations by training the network with a contrastive objective. The encoding starts with feeding the transformer a sequence of bytes after they have been embedded. The input for each subsequent layer will be reduced by dropping a percentage of characters at equal offsets. After applying several transformer layers the sequence length is greatly truncated, significantly reducing the computational complexity. This sequence compression scheme could be substituted by other operators such as average pooling, though we did not notice any gains from more sophisticated methods, and therefore, opted to use dilation for simplicity.
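A rough sketch of why dropping positions helps: if self-attention cost grows with the square of the sequence length, then shrinking the sequence at every layer cuts the total cost substantially. The reduction rate used below (keep every second position), the 256-byte input, and the four-layer depth are illustrative assumptions, since the post does not state the actual values.

```python
import math


def dilate(sequence, keep_every=2):
    """Keep one element out of every `keep_every`, at equal offsets."""
    return sequence[::keep_every]


def layer_lengths(initial_length, num_layers, keep_every=2):
    """Sequence length seen by each layer when the input is dilated between layers."""
    lengths = []
    n = initial_length
    for _ in range(num_layers):
        lengths.append(n)
        n = math.ceil(n / keep_every)  # matches len(seq[::keep_every])
    return lengths


def attention_cost(lengths):
    """Self-attention cost up to a constant factor: sum of squared lengths."""
    return sum(n * n for n in lengths)


# A comment with an emoji, fed as raw UTF-8 bytes with no preprocessing.
comment_bytes = list("gr8 video \U0001F602".encode("utf-8"))
print(len(comment_bytes), "->", len(dilate(comment_bytes)))

reduced = layer_lengths(256, num_layers=4)
full = [256] * 4
print(reduced, attention_cost(reduced) / attention_cost(full))
```

Under these assumptions the dilated stack needs roughly a third of the attention computation of a stack that keeps the full byte sequence at every layer.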

A dual encoder network that maximizes the mutual information between the comments and their replies through a contrastive objective. Each encoder is fed a sequence of bytes and is implemented as a computationally efficient dilated transformer network.
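The post does not spell out the exact contrastive objective, so the following is only one common choice, shown as an assumption: an in-batch softmax loss, where each comment must pick out its own reply from all replies in the batch. The two-dimensional vectors stand in for encoder outputs and are made up for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(comment_vecs, reply_vecs):
    """Mean cross-entropy of selecting each comment's own reply among the batch replies."""
    total = 0.0
    for i, comment in enumerate(comment_vecs):
        scores = [dot(comment, reply) for reply in reply_vecs]
        peak = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[i]
    return total / len(comment_vecs)


comments = [[1.0, 0.0], [0.0, 1.0]]
aligned_replies = [[1.0, 0.0], [0.0, 1.0]]    # reply i matches comment i
shuffled_replies = [[0.0, 1.0], [1.0, 0.0]]   # replies paired with the wrong comment

print("aligned: ", in_batch_contrastive_loss(comments, aligned_replies))
print("shuffled:", in_batch_contrastive_loss(comments, shuffled_replies))
```

The loss is lower when each comment's encoding is closest to its own reply, which is the behavior training pushes the two encoders toward.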

A Model to Learn Them All
Instead of training a separate model for each language, we opted to train a single cross-lingual model for all supported languages. This allows the support of mixed-language usage in the comments, and enables the model to utilize the learning of common elements in one language for understanding another, such as emoji and numbers. Moreover, having a single model simplifies the logistics of maintenance and updates. While the model has been rolled out to English and Spanish, the flexibility inherent in this approach will enable it to be expanded to other languages in the future.

Inspecting the encodings of a multilingual set of suggestions produced by the model reveals that the model clusters appropriate replies, regardless of the language to which they belong. This cross-lingual capability emerged without exposing the model during training to any parallel corpus. We demonstrate in the figure below for three languages how the replies are clustered by their meaning when the model is probed with an input comment. For example, the English comment “This is a great video,” is surrounded by appropriate replies, such as “Thanks!” Moreover, inspection of the nearest replies in other languages reveal them also to be appropriate and similar in meaning to the English reply. The 2D projection also shows several other cross-lingual clusters that consist of replies of similar meaning. This clustering demonstrates how the model can support a rich cross-lingual user experience in the supported languages.

A 2D projection of the model encodings when presented with a hypothetical comment and a small list of potential replies. The neighborhood surrounding English comments (black color) consists of appropriate replies in English and their counterparts in Spanish and Arabic. Note that the network learned to align English replies with their translations without access to any parallel corpus.

When to Suggest?
Our goal is to help creators, so we have to make sure that SmartReply only makes suggestions when it is very likely to be useful. Ideally, suggestions would only be displayed when it is likely that the creator would reply to the comment and when the model has a high chance of providing a sensible and specific response. To accomplish this, we trained auxiliary models to identify which comments should trigger the SmartReply feature.

Conclusion
We’ve launched YouTube SmartReply, starting with English and Spanish comments, the first cross-lingual and character byte-based SmartReply. YouTube is a global product with a diverse user base that generates heterogeneous content. Consequently, it is important that we continuously improve comments for this global audience, and SmartReply represents a strong step in this direction.

Acknowledgements
SmartReply for YouTube creators was developed by Golnaz Farhadi, Ezequiel Baril, Cheng Lee, Claire Yuan, Coty Morrison, Joe Simunic, Rachel Bransom, Rajvi Mehta, Jorge Gonzalez, Mark Williams, Uma Roy and many more. We are grateful for the leadership support from Nikhil Dandekar, Eileen Long, Siobhan Quinn, Yun-hsuan Sung, Rachel Bernstein, and Ray Kurzweil.

SpineNet: A Novel Architecture for Object Detection Discovered with Neural Architecture Search

Posted by Xianzhi Du, Software Engineer and Jaeyoun Kim, Technical Program Manager, Google Research

Convolutional neural networks created for image tasks typically encode an input image into a sequence of intermediate features that capture the semantics of an image (from local to global), where each subsequent layer has a lower spatial dimension. However, this scale-decreased model may not be able to deliver strong features for multi-scale visual recognition tasks where recognition and localization are both important (e.g., object detection and segmentation). Several works including FPN and DeepLabv3+ propose multi-scale encoder-decoder architectures to address this issue, where a scale-decreased network (e.g., a ResNet) is taken as the encoder (commonly referred to as a backbone model). A decoder network is then applied to the backbone to recover the spatial information.

While this architecture has yielded improved success for image recognition and localization tasks, it still relies on a scale-decreased backbone that throws away spatial information by down-sampling, which the decoder then must attempt to recover. What if one were to design an alternate backbone model that avoids this loss of spatial information, and is thus inherently well-suited for simultaneous image recognition and localization?

In our recent CVPR 2020 paper “SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization”, we propose a meta architecture called a scale-permuted model that enables two major improvements on backbone architecture design. First, the spatial resolution of intermediate feature maps should be able to increase or decrease at any time so that the model can retain spatial information as it grows deeper. Second, the connections between feature maps should be able to go across feature scales to facilitate multi-scale feature fusion. We then use neural architecture search (NAS) with a novel search space design that includes these features to discover an effective scale-permuted model. We demonstrate that this model is successful in multi-scale visual recognition tasks, outperforming networks with standard, scale-reduced backbones. To facilitate continued work in this space, we have open sourced the SpineNet code in the TensorFlow TPU GitHub repository (TensorFlow 1) and in the TensorFlow Model Garden GitHub repository (TensorFlow 2).

A scale-decreased backbone is shown on the left and a scale-permuted backbone is shown on the right. Each rectangle represents a building block. Colors and shapes represent different spatial resolutions and feature dimensions. Arrows represent connections among building blocks.

Design of SpineNet Architecture
In order to efficiently design the architecture for SpineNet, and avoid a time-intensive manual search of what is optimal, we leverage NAS to determine an optimal architecture. The backbone model is learned on the object detection task using the COCO dataset, which requires simultaneous recognition and localization. During architecture search, we learn three things:

  • Scale permutations: The orderings of network building blocks are important because each block can only be built from those that already exist (i.e., with a “lower ordering”). We define the search space of scale permutations by rearranging intermediate and output blocks, respectively.
  • Cross-scale connections: We define two input connections for each block in the search space. The parent blocks can be any block with a lower ordering or a block from the stem network.
  • Block adjustments (optional): We allow the block to adjust its scale level and type.

The architecture search process from a scale-decreased backbone to a scale-permuted backbone.
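The first two search dimensions listed above can be sketched as a random sampler over candidate architectures. Everything concrete here is an assumption for illustration: the level numbers, the two-block stem, the convention of placing permuted intermediate blocks before permuted output blocks, and the use of uniform random sampling in place of a learned NAS controller. Block adjustments are left out.

```python
import random


def sample_candidate(intermediate_levels, output_levels, num_stem_blocks=2, seed=0):
    """Sample one scale-permuted candidate: a block ordering plus two parents per block."""
    rng = random.Random(seed)

    # Scale permutation: shuffle intermediate and output blocks separately.
    intermediate = list(intermediate_levels)
    outputs = list(output_levels)
    rng.shuffle(intermediate)
    rng.shuffle(outputs)
    ordered_levels = intermediate + outputs

    stem = ["stem%d" % i for i in range(num_stem_blocks)]
    blocks = []
    for position, level in enumerate(ordered_levels):
        # Cross-scale connections: parents come from the stem or from blocks
        # with a lower ordering, regardless of their scale level.
        earlier = ["block%d" % j for j in range(position)]
        parents = rng.sample(stem + earlier, 2)
        blocks.append({"name": "block%d" % position, "level": level, "parents": parents})
    return blocks


candidate = sample_candidate(intermediate_levels=[2, 2, 3, 4, 5],
                             output_levels=[3, 4, 5, 6, 7])
for block in candidate:
    print(block["name"], "level", block["level"], "<-", block["parents"])
```

Because a block may only connect to blocks that already exist, the ordering chosen by the permutation directly limits which cross-scale connections are available.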

Taking the ResNet-50 backbone as the seed for the NAS search, we first learn scale-permutation and cross-scale connections. All candidate models in the search space have roughly the same computation as ResNet-50 since we just permute the ordering of feature blocks to obtain candidate models. The learned scale-permuted model outperforms ResNet-50-FPN by +2.9% average precision (AP) in the object detection task. The efficiency can be further improved (-10% FLOPs) by adding search options to adjust scale and type (e.g., residual block or bottleneck block, used in the ResNet model family) of each candidate feature block.

We name the learned 49-layer scale-permuted backbone architecture SpineNet-49. SpineNet-49 can be further scaled up to SpineNet-96/143/190 by repeating blocks two, three, or four times and increasing the feature dimension. An architecture comparison between ResNet-50-FPN and the final SpineNet-49 is shown below.

The architecture comparison between a ResNet backbone (left) and the SpineNet backbone (right) derived from it using NAS.

Performance
We demonstrate the performance of SpineNet models through comparison with ResNet-FPN. Using similar building blocks, SpineNet models outperform their ResNet-FPN counterparts by ~3% AP at various scales while using 10-20% fewer FLOPs. In particular, our largest model, SpineNet-190, achieves 52.1% AP on COCO for a single model without multi-scale testing during inference, significantly outperforming prior detectors. SpineNet also transfers to classification tasks, achieving a 5% top-1 accuracy improvement on the challenging iNaturalist fine-grained dataset.

Performance comparisons of SpineNet models and ResNet-FPN models adopting the RetinaNet detection framework on COCO bounding box detection.
Performance comparisons of SpineNet models and ResNet models on ImageNet classification and iNaturalist fine-grained image classification.

Conclusion
In this work, we identify that the conventional scale-decreased model, even with a decoder network, is not effective for simultaneous recognition and localization. We propose the scale-permuted model, a new meta-architecture, to address the issue. To prove the effectiveness of scale-permuted models, we learn SpineNet by Neural Architecture Search in object detection and demonstrate it can be used directly in image classification. In the future, we hope the scale-permuted model will become the meta-architecture design of backbones across many visual tasks beyond detection and classification.

Acknowledgements
Special thanks to the co-authors of the paper: Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, and Xiaodan Song. We also would like to acknowledge Yeqing Li, Youlong Cheng, Jing Li, Jianwei Xie, Russell Power, Hongkun Yu, Chad Richards, Liang-Chieh Chen, Anelia Angelova, and the larger Google Brain Team for their help.

Leveraging Temporal Context for Object Detection

Posted by Sara Beery, Student Researcher, and Jonathan Huang, Research Scientist, Google Research

Ecological monitoring helps researchers to understand the dynamics of global ecosystems, quantify biodiversity, and measure the effects of climate change and human activity, including the efficacy of conservation and remediation efforts. In order to monitor effectively, ecologists need high-quality data, often expending significant efforts to place monitoring sensors, such as static cameras, in the field. While it is increasingly cost effective to build and operate networks of such sensors, the manual data analysis of global biodiversity data remains a bottleneck to accurate, global, real-time ecological monitoring. While there are ways to automate this analysis via machine learning, the data from static cameras, widely used to monitor the world around us for purposes ranging from mountain pass road conditions to ecosystem phenology, still pose a strong challenge for traditional computer vision systems — due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and sometimes are irregular due to the use of a motion trigger.

In order to perform well in this setting, computer vision models must be robust to objects of interest that are often off-center, out of focus, poorly lit, or at a variety of scales. In addition, a static camera will always take images of the same scene unless it is moved, which causes the data from any one camera to be highly repetitive. Without sufficient data variability, machine learning models may learn to focus on correlations in the background, leading to poor generalization to novel deployments. The machine learning and ecological communities have been working together through venues like LILA BC and Wildlife Insights to curate expert-labeled training data from many research groups, each of which may operate anywhere from one to hundreds of camera traps, in order to increase data variability. This process of data collection and annotation is slow, and is confounded by the need to have diverse, representative data across geographic regions and taxonomic groups.

What’s in this image? Objects in images from static cameras can be very challenging to detect and categorize. Here, a foggy morning has made it very difficult to see a herd of wildebeest walking along the crest of a hill. [Image from Snapshot Serengeti]

In “Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection”, we present a complementary approach that increases global scalability by improving generalization to novel camera deployments algorithmically. This new object detection architecture leverages contextual clues across time for each camera deployment in a network, improving recognition of objects in novel camera deployments without relying on additional training data from a large number of cameras. Echoing the approach a person might use when faced with challenging images, Context R-CNN leverages up to a month’s worth of images from the same camera for context to determine what objects might be present and identify them. Using this method, the model outperforms a single-frame Faster R-CNN baseline by significant margins across multiple domains, including wildlife camera traps. We have open sourced the code and models for this work as part of the TF Object Detection API to make it easy to train and test Context R-CNN models on new static camera datasets.

Here, we can see how additional examples from the same scene help experts determine that the object is an animal and not background. Context such as the shape & size of the object, its attachment to a herd, and habitual grazing at certain times of day help determine that the species is a wildebeest. Useful examples occur throughout the month.

The Context R-CNN Model
Context R-CNN is designed to take advantage of the high degree of correlation within images taken by a static camera to boost performance on challenging data and improve generalization to new camera deployments without additional human data labeling. It is an adaptation of Faster R-CNN, a popular two-stage object detection architecture. To extract context for a camera, we first use a frozen feature extractor to build up a contextual memory bank from images across a large time horizon (up to a month or more). Next, objects are detected in each image using Context R-CNN which aggregates relevant context from the memory bank to help detect objects under challenging conditions (such as the heavy fog obscuring the wildebeests in our previous example). This aggregation is performed using attention, which is robust to the sparse and irregular sampling rates often seen in static monitoring cameras.

High-level architecture diagram, showing how Context R-CNN incorporates long-term context within the Faster R-CNN model architecture.

The first stage of Faster R-CNN proposes potential objects, and the second stage categorizes each proposal as either background or one of the target classes. In Context R-CNN, we take the proposed objects from the first stage of Faster R-CNN, and for each one we use similarity-based attention to determine how relevant each of the features in our memory bank (M) is to the current object, and construct a per-object context feature by taking a relevance-weighted sum over M and adding it back to the original object features. Then each object, now with added contextual information, is finally categorized using the second stage of Faster R-CNN.
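
The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: it assumes plain scaled dot-product similarity followed by a softmax, whereas the actual model is likely to apply learned projections to the features first. The function names `softmax` and `add_context` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add_context(object_feature, memory_bank):
    """Add a relevance-weighted sum over the memory bank M to one object feature.

    Assumption: relevance is scaled dot-product similarity. The real model
    may compute similarity differently.
    """
    scale = math.sqrt(len(object_feature))
    scores = [
        sum(q * k for q, k in zip(object_feature, entry)) / scale
        for entry in memory_bank
    ]
    weights = softmax(scores)
    # Per-object context feature: weighted sum of the memory bank entries.
    context = [
        sum(w * entry[i] for w, entry in zip(weights, memory_bank))
        for i in range(len(object_feature))
    ]
    # Add the context back to the original object feature.
    enriched = [o + c for o, c in zip(object_feature, context)]
    return enriched, weights


# Toy usage: one 2-D object feature and a memory bank with three entries.
enriched, weights = add_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

In this toy example the memory entry most similar to the object receives the largest weight, which mirrors the attention weights shown for each boxed object in the figures.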

Context R-CNN is able to leverage context (spanning up to 1 month) to correctly categorize the challenging wildebeest example we saw above. The green values are the corresponding attention weights for each boxed object.
Compared to a Faster R-CNN baseline (left), Context R-CNN (right) is able to capture challenging objects such as an elephant occluded by a tree, two poorly-lit impala, and a vervet monkey leaving the frame. [Images from Snapshot Serengeti]

Results
We have tested Context R-CNN on Snapshot Serengeti (SS) and Caltech Camera Traps (CCT), both ecological datasets of animal species in camera traps but from highly different geographic regions (Tanzania vs. the Southwestern United States). Improvements over the Faster R-CNN baseline for each dataset can be seen in the table below. Notably, we see a 47.5% relative increase in mean average precision (mAP) on SS, and a 34.3% relative mAP increase on CCT. We also compare Context R-CNN to S3D (a 3D convolution based baseline) and see performance improve from 44.7% mAP to 55.9% mAP (a 25.1% relative increase). Finally, we find that the performance increases as the contextual time horizon increases, from a minute of context to a month.

Comparison to a single frame Faster R-CNN baseline, showing both mean average precision (mAP) and average recall (AR) detection metrics.
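
The relative increases quoted above are ratios of the improvement to the baseline score. As a quick check, the S3D comparison can be reproduced from the two mAP values given in the text. The helper name `relative_increase` is only illustrative.

```python
def relative_increase(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# S3D baseline (44.7% mAP) versus Context R-CNN (55.9% mAP).
print(round(relative_increase(44.7, 55.9), 1))  # 25.1
```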

Ongoing and Future Work
We are working to implement Context R-CNN within the Wildlife Insights platform, to facilitate large-scale, global ecological monitoring via camera traps. We also host competitions such as the yearly iWildCam species identification competition at the CVPR Fine-Grained Visual Recognition Workshop to help bring these challenges to the attention of the computer vision community. The challenges seen in automatic species identification in static cameras are shared by numerous applications of static cameras outside of the ecological monitoring domain, as well as other static sensors used to monitor biodiversity, such as audio and sonar devices. Our method is general, and we anticipate the per-sensor context approach taken by Context R-CNN would be beneficial for any static sensor.

Acknowledgements
This post reflects the work of the authors as well as the following group of core contributors: Vivek Rathod, Guanhang Wu, Ronny Votel. We are also grateful to Zhichao Lu, David Ross, Tanya Birch and the Wildlife Insights AI team, and Pietro Perona and the Caltech Computational Vision Lab.

Sensing Force-Based Gestures on the Pixel 4

Posted by Philip Quinn and Wenxin Feng, Research Scientists, Android UX

Touch input has traditionally focussed on two-dimensional finger pointing. Beyond tapping and swiping gestures, long pressing has been the main alternative path for interaction. However, a long press is sensed with a time-based threshold where a user’s finger must remain stationary for 400–500 ms. By its nature, a time-based threshold has negative effects for usability and discoverability as the lack of immediate feedback disconnects the user’s action from the system’s response. Fortunately, fingers are dynamic input devices that can express more than just location: when a user touches a surface, their finger can also express some level of force, which can be used as an alternative to a time-based threshold.

While a variety of force-based interactions have been pursued, sensing touch force requires dedicated hardware sensors that are expensive to design and integrate. Further, research indicates that touch force is difficult for people to control, and so most practical force-based interactions focus on discrete levels of force (e.g., a soft vs. firm touch) — which do not require the full capabilities of a hardware force sensor.

For a recent update to the Pixel 4, we developed a method for sensing force gestures that allowed us to deliver a more expressive touch interaction experience. By studying how the human finger interacts with touch sensors, we designed the experience to complement and support the long-press interactions that apps already have, but with a more natural gesture. In this post we describe the core principles of touch sensing and finger interaction, how we designed a machine learning algorithm to recognise press gestures from touch sensor data, and how we integrated it into the user experience for Pixel devices.

Touch Sensor Technology and Finger Biomechanics
A capacitive touch sensor is constructed from two conductive electrodes (a drive electrode and a sense electrode) that are separated by a non-conductive dielectric (e.g., glass). The two electrodes form a tiny capacitor (a cell) that can hold some charge. When a finger (or another conductive object) approaches this cell, it ‘steals’ some of the charge, which can be measured as a drop in capacitance. Importantly, the finger doesn’t have to come into contact with the electrodes (which are protected under another layer of glass) as the amount of charge stolen is inversely proportional to the distance between the finger and the electrodes.

Left: A finger interacts with a touch sensor cell by ‘stealing’ charge from the projected field around two electrodes. Right: A capacitive touch sensor is constructed from rows and columns of electrodes, separated by a dielectric. The electrodes overlap at cells, where capacitance is measured.

The cells are arranged as a matrix over the display of a device, but with a much lower density than the display pixels. For instance, the Pixel 4 has a 2280 × 1080 pixel display, but a 32 × 15 cell touch sensor. When scanned at a high rate (at least 120 Hz), readings from these cells form a video of the finger’s interaction.

Slowed touch sensor recordings of a user tapping (left), pressing (middle), and scrolling (right).

Capacitive touch sensors don’t respond to changes in force per se, but are tuned to be highly sensitive to changes in distance within a couple of millimeters above the display. That is, a finger contact on the display glass should saturate the sensor near its centre, but will retain a high dynamic range around the perimeter of the finger’s contact (where the finger curls up).

When a user’s finger presses against a surface, its soft tissue deforms and spreads out. The nature of this spread depends on the size and shape of the user’s finger, and its angle to the screen. At a high level, we can observe a couple of key features in this spread (shown in the figures): it is asymmetric around the initial contact point, and the overall centre of mass shifts along the axis of the finger. This is also a dynamic change that occurs over some period of time, which differentiates it from contacts that have a long duration or a large area.

Touch sensor signals are saturated around the centre of the finger’s contact, but fall off at the edges. This allows us to sense small deformations in the finger’s contact shape caused by changes in the finger’s force.
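
To make the centre-of-mass observation concrete, the sketch below computes the signal-weighted centroid of two hand-made sensor frames. The 4 × 4 grids and their values are hypothetical stand-ins for real 32 × 15 sensor readings, and this is only an illustration of the feature, not the method used on the device.

```python
def centre_of_mass(frame):
    """Signal-weighted centroid (row, column) of a 2-D grid of cell readings."""
    total = sum(sum(row) for row in frame)
    if total == 0:
        return None  # no contact on the sensor
    row_pos = sum(i * value for i, row in enumerate(frame) for value in row) / total
    col_pos = sum(j * value for row in frame for j, value in enumerate(row)) / total
    return row_pos, col_pos


# Hypothetical readings: a light touch, then the same finger pressing harder,
# with the contact area spreading along the finger's axis (down the rows).
light_touch = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
firm_press = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

before = centre_of_mass(light_touch)
after = centre_of_mass(firm_press)
shift = (after[0] - before[0], after[1] - before[1])
print(shift)  # (0.5, 0.0)
```

The centroid moves half a cell along one axis while staying put on the other, which is the asymmetric shift the text describes.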

However, the differences between users (and fingers) make it difficult to encode these observations with heuristic rules. We therefore designed a machine learning solution that would allow us to learn these features and their variances directly from user interaction samples.

Machine Learning for Touch Interaction
We approached the analysis of these touch signals as a gesture classification problem. That is, rather than trying to predict an abstract parameter, such as force or contact spread, we wanted to sense a press gesture — as if engaging a button or a switch. This allowed us to connect the classification to a well-defined user experience, and allowed users to perform the gesture during training at a comfortable force and posture.

Any classification model we designed had to operate within users’ high expectations for touch experiences. In particular, touch interaction is extremely latency-sensitive and demands real-time feedback. Users expect applications to be responsive to their finger movements as they make them, and application developers expect the system to deliver timely information about the gestures a user is performing. This means that classification of a press gesture needs to occur in real-time, and be able to trigger an interaction at the moment the finger’s force reaches its apex.

We therefore designed a neural network that combined convolutional (CNN) and recurrent (RNN) components. The CNN could attend to the spatial features we observed in the signal, while the RNN could attend to their temporal development. The RNN also helps provide a consistent runtime experience: each frame is processed by the network as it is received from the touch sensor, and the RNN state vectors are preserved between frames (rather than processing them in batches). The network was intentionally kept simple to minimise on-device inference costs when running concurrently with other applications (taking approximately 50 µs of processing per frame and less than 1 MB of memory using TensorFlow Lite).

An overview of the classification model’s architecture.
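
The frame-by-frame processing with preserved recurrent state can be illustrated with a toy detector. Everything here is an assumption made for the example: the class name, the fixed weights, and the use of the summed signal as a stand-in for the convolutional stage. The point is only that each call consumes one frame and carries state forward to the next.

```python
import math


class StreamingPressDetector:
    """Toy recurrent detector that keeps its state between frames."""

    def __init__(self, w_in=0.5, w_state=0.8, bias=-1.0):
        # Hypothetical fixed weights; a real model would learn these.
        self.w_in = w_in
        self.w_state = w_state
        self.bias = bias
        self.state = 0.0

    def reset(self):
        """Clear the recurrent state, for example when a new touch begins."""
        self.state = 0.0

    def process_frame(self, frame):
        """Consume one sensor frame and return a press score in (0, 1)."""
        # Stand-in for the convolutional stage: a single spatial feature.
        feature = sum(sum(row) for row in frame)
        # Simple recurrent update: the new state depends on the previous one.
        self.state = math.tanh(
            self.w_in * feature + self.w_state * self.state + self.bias
        )
        return 1.0 / (1.0 + math.exp(-self.state))


# Frames arrive one at a time; no batching is needed.
detector = StreamingPressDetector()
growing_contact = [[[1, 0], [0, 0]], [[1, 1], [0, 0]], [[1, 1], [1, 0]], [[1, 1], [1, 1]]]
scores = [detector.process_frame(frame) for frame in growing_contact]
```

Because the state is stored on the object, the cost of each call is constant regardless of how long the touch has lasted.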

The model was trained on a dataset of press gestures and other common touch interactions (tapping, scrolling, dragging, and long-pressing without force). As the model would be evaluated after each frame, we designed a loss function that temporally shaped the label probability distribution of each sample, and applied a time-increasing weight to errors. This ensured that the output probabilities were temporally smooth and converged towards the correct gesture classification.
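
The idea of weighting errors more heavily as a sample progresses can be sketched as follows. The post does not give the exact loss, so this example assumes a per-frame cross-entropy with a linear weight ramp, and it omits the temporal shaping of the labels. The function name `time_weighted_loss` is hypothetical.

```python
import math


def time_weighted_loss(frame_probs, target_class):
    """Cross-entropy averaged over frames, with linearly increasing weights.

    frame_probs: one probability distribution per frame, in time order.
    Assumption: a linear ramp; the actual weighting schedule is not specified.
    """
    n = len(frame_probs)
    weights = [(t + 1) / n for t in range(n)]
    # Clamp probabilities to avoid log(0).
    losses = [-math.log(max(probs[target_class], 1e-12)) for probs in frame_probs]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# The same single-frame mistake, made early versus late in the gesture.
early_mistake = [[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
late_mistake = [[0.1, 0.9], [0.1, 0.9], [0.9, 0.1]]
early_loss = time_weighted_loss(early_mistake, target_class=1)
late_loss = time_weighted_loss(late_mistake, target_class=1)
```

With this weighting, a wrong prediction on the last frame is penalised more than the same mistake on the first frame, which pushes the output to converge on the correct class as the gesture unfolds.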

User Experience Integration
Our UX research found that it was hard for users to discover force-based interactions, and that users frequently confused a force press with a long press because of the difficulty in coordinating the amount of force they were applying with the duration of their contact. Rather than creating a new interaction modality based on force, we therefore focussed on improving the user experience of long press interactions by accelerating them with force in a unified press gesture. A press gesture has the same outcome as a long press gesture, whose time threshold remains effective, but provides a stronger connection between the outcome and the user’s action when force is used.

A user long pressing (left) and firmly pressing (right) on a launcher icon.

This also means that users can take advantage of this gesture without developers needing to update their apps. Applications that use Android’s GestureDetector or View APIs will automatically get these press signals through their existing long-press handlers. Developers who implement custom long-press detection logic can receive these press signals through the MotionEvent classification API introduced in Android Q.

Through this integration of machine-learning algorithms and careful interaction design, we were able to deliver a more expressive touch experience for Pixel users. We plan to continue researching and developing these capabilities to refine the touch experience on Pixel, and explore new forms of touch interaction.

Acknowledgements
This project is a collaborative effort between the Android UX, Pixel software, and Android framework teams.