How The Trevor Project is using AI to help prevent suicide

How The Trevor Project is using AI to help prevent suicide

Suicide disproportionately affects LGBTQ+ youth. In the U.S. alone, more than 1.8 million LGBTQ+ youth between the ages of 13 and 24 seriously consider suicide or experience a significant crisis each year. Additionally, LGBTQ+ youth are over four times more likely to attempt suicide than their peers, while up to 50 percent of all trans people have made a suicide attempt—most before the age of 25. Black LGBTQ+ young people are even more impacted as they hold multiple marginalized identities, and research shows that Black youth ages five to 12 are dying by suicide at roughly twice the rate of their white peers. 

To support this particularly vulnerable and diverse community, The Trevor Project takes an intersectional approach to crisis intervention and suicide prevention. The organization offers free and confidential crisis services that they provide 24/7 via phone, chat, and text. In this time of emotional stress, isolation and civil unrest, these services offer much needed support to LGBTQ youth experiencing fear, hopelessness, confusion, and race-based trauma. Sadly, the volume of callers sometimes outnumbers the available crisis counselors who are trained to assist. With support from Google.org, The Trevor Project is incorporating artificial intelligence into its crisis services to connect more people to the resources they need.  

Last year, Google.org provided The Trevor Project with $1.5 million and 11 Googlers from the Google.org Fellowship, a pro-bono program that matches teams of Googlers with Google.org grantees and civic entities for up to six months to work full-time on technical projects. Google.org Fellows assisted The Trevor Project in building an artificial intelligence system that could identify and prioritize high-risk contacts while simultaneously reaching more people. 

Here’s how it works. When someone first contacts The Trevor Project, they’re asked a few intake questions like: “What’s going on?” After that, they talk to a crisis counselor who assesses their risk using a clinical assessment model. Looking at anonymized historical data, the team used natural language processing (NLP) to train the system to learn which types of responses on the intake form were most likely linked to a particular diagnosis risk level. While some specific words or phrases are known to correlate with high risk, the NLP model interprets the entire sentence to determine risk level. Now if a person is identified as a high or imminent risk based on their initial intake questions, they are automatically placed in a priority queue and connected to a counselor sooner. 

To help accelerate this work, Google.org has committed an additional $1.2 million in grant funding and is planning to engage a new cohort of Google.org Fellows set to start in July to expand Trevor’s application of NLP to new contexts. This will include developing a conversation simulator to enhance and scale Trevor’s virtual counselor training program, and automating the moderation of TrevorSpace, the organization’s affirming international online community, to flag and address unsafe content. At the same time, Google.org is partnering with Google’s LGBTQ+ employee groups to build a pool of volunteer digital crisis counselors to help respond to Trevor’s increased need for crisis services due to COVID-19 impacts. More than fifty Googlers have signed up already. 

The Trevor Project is the world’s largest suicide prevention and crisis intervention organization for LGBTQ+ youth. We’re honored to support their critical mission and stand with LGBTQ+ people of color, trans and non-binary communities, LGBTQ+ families, and so many more

Read More

Google at CVPR 2020

Google at CVPR 2020

Posted by Emily Knapp, Program Manager and Benjamin Hütteroth, Program Specialist

This week marks the start of the fully virtual 2020 Conference on Computer Vision and Pattern Recognition (CVPR 2020), the premier annual computer vision event consisting of the main conference, workshops and tutorials. As a leader in computer vision research and a Supporter Level Virtual Sponsor, Google will have a strong presence at CVPR 2020, with nearly 70 publications accepted, along with the organization of, and participation in, multiple workshops/tutorials.

If you are participating in CVPR this year, please visit our virtual booth to learn about what Google is actively pursuing for the next generation of intelligent systems that utilize the latest machine learning techniques applied to various areas of machine perception.

You can also learn more about our research being presented at CVPR 2020 in the list below (Google affiliations are bolded).

Organizing Committee

General Chairs: Terry Boult, Gerard Medioni, Ramin Zabih
Program Chairs: Ce Liu, Greg Mori, Kate Saenko, Silvio Savarese
Workshop Chairs: Tal Hassner, Tali Dekel
Website Chairs: Tianfan Xue, Tian Lan
Technical Chair: Daniel Vlasic
Area Chairs include: Alexander Toshev, Alexey Dosovitskiy, Boqing Gong, Caroline Pantofaru, Chen Sun, Deqing Sun, Dilip Krishnan, Feng Yang, Liang-Chieh Chen, Michael Rubinstein, Rodrigo Benenson, Timnit Gebru, Thomas Funkhouser, Varun Jampani, Vittorio Ferrari, William Freeman

Oral Presentations

Evolving Losses for Unsupervised Video Representation Learning
AJ Piergiovanni, Anelia Angelova, Michael Ryoo

CvxNet: Learnable Convex Decomposition
Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, Andrea Tagliasacchi

Neural SDE: Stabilizing Neural ODE Networks with Stochastic Noise
Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, Cho-Jui Hsieh

Scalability in Perception for Autonomous Driving: Waymo Open Dataset
Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla‎, Aurélien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev‎, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi‎, Sheng Zhao, Shuyang Chen, Yu Zhang, Jon Shlens, Zhifeng Chen, Dragomir Anguelov

Deep Implicit Volume Compression
Saurabh Singh, Danhang Tang, Cem Keskin, Philip Chou, Christian Haene, Mingsong Dou, Sean Fanello, Jonathan Taylor, Andrea Tagliasacchi, Philip Davidson, Yinda Zhang, Onur Guleryuz, Shahram Izadi, Sofien Bouaziz

Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model
Dongdong Wan, Yandong Li, Liqiang Wang, and Boqing Gong

Google Landmarks Dataset v2 – A Large-Scale Benchmark for Instance-Level Recognition and Retrieval (see the blog post)
Tobias Weyand, Andre Araujo, Jack Sim, Bingyi Cao

CycleISP: Real Image Restoration via Improved Data Synthesis
Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, Ling Shao

Dynamic Graph Message Passing Networks
Li Zhang, Dan Xu, Anurag Arnab, Philip Torr

Local Deep Implicit Functions for 3D Shape
Kyle Genova, Forrester Cole, Avneesh Sud, Aaron Sarna, Thomas Funkhouser

GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models
Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, William Freeman, Rahul Sukthankar, Cristian Sminchisescu

Search to Distill: Pearls are Everywhere but not the Eyes
Yu Liu, Xuhui Jia, Mingxing Tan, Raviteja Vemulapalli, Yukun Zhu, Bradley Green, Xiaogang Wang

Semantic Pyramid for Image Generation
Assaf Shocher, Yossi Gandelsman, Inbar Mosseri, Michal Yarom, Michal Irani, William Freeman, Tali Dekel

Flow Contrastive Estimation of Energy-Based Models
Ruiqi Gao, Erik Nijkamp, Diederik Kingma, Zhen Xu, Andrew Dai, Ying Nian Wu

Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition from A Domain Adaptation Perspective
Muhammad Abdullah Jamal, Matthew Brown, Ming-Hsuan Yang, Liqiang Wang, Boqing Gong

Category-Level Articulated Object Pose Estimation
Xiaolong Li, He Wang, Li Yi, Leonidas Guibas, Amos Abbott, Shuran Song

AdaCoSeg: Adaptive Shape Co-Segmentation with Group Consistency Loss
Chenyang Zhu, Kai Xu, Siddhartha Chaudhuri, Li Yi, Leonidas Guibas, Hao Zhang

SpeedNet: Learning the Speediness in Videos
Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William Freeman, Michael Rubinstein, Michal Irani, Tali Dekel

BSP-Net: Generating Compact Meshes via Binary Space Partitioning
Zhiqin Chen, Andrea Tagliasacchi, Hao Zhang

SAPIEN: A SimulAted Part-based Interactive ENvironment
Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel Chang, Leonidas Guibas, Hao Su

SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving
Zhenpei Yang, Yuning Chai, Dragomir Anguelov, Yin Zhou, Pei Sun, Dumitru Erhan, Sean Rafferty, Henrik Kretzschmar

Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks
Saurabh Singh, Shankar Krishnan

RL-CycleGAN: Reinforcement Learning Aware Simulation-To-Real
Kanishka Rao, Chris Harris, Alex Irpan, Sergey Levine, Julian Ibarz, Mohi Khansari

Open Compound Domain Adaptation
Ziwei Liu, Zhongqi Miao, Xingang Pan, Xiaohang Zhan, Dahua Lin, Stella X.Yu, and Boqing Gong

Posters
Single-view view synthesis with multiplane images
Richard Tucker, Noah Snavely

Adversarial Examples Improve Image Recognition
Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, Quoc V. Le

Adversarial Texture Optimization from RGB-D Scans
Jingwei Huang, Justus Thies, Angela Dai, Abhijit Kundu, Chiyu “Max” Jiang,Leonidas Guibas, Matthias Niessner, Thomas Funkhouser

Single-Image HDR Reconstruction by Learning to Reverse the Camera Pipeline
Yu-Lun Liu, Wei-Sheng Lai, Yu-Sheng Chen, Yi-Lung Kao, Ming-Hsuan Yang,Yung-Yu Chuang, Jia-Bin Huang

Collaborative Distillation for Ultra-Resolution Universal Style Transfer
Huan Wang, Yijun Li, Yuehai Wang, Haoji Hu, Ming-Hsuan Yang

Learning to Autofocus
Charles Herrmann, Richard Strong Bowen, Neal Wadhwa, Rahul Garg, Qiurui He, Jonathan T. Barron, Ramin Zabih

Multi-Scale Boosted Dehazing Network with Dense Feature Fusion
Hang Dong, Jinshan Pan, Lei Xiang, Zhe Hu, Xinyi Zhang, Fei Wang, Ming-Hsuan Yang

Composing Good Shots by Exploiting Mutual Relations
Debang Li, Junge Zhang, Kaiqi Huang, Ming-Hsuan Yang

PatchVAE: Learning Local Latent Codes for Recognition
Kamal Gupta, Saurabh Singh, Abhinav Shrivastava

Neural Voxel Renderer: Learning an Accurate and Controllable Rendering Tool
Konstantinos Rematas, Vittorio Ferrari

Local Implicit Grid Representations for 3D Scenes
Chiyu “Max” Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Niessner, Thomas Funkhouser

Large Scale Video Representation Learning via Relational Graph Clustering
Hyodong Lee, Joonseok Lee, Joe Yue-Hei Ng, Apostol (Paul) Natsev

Deep Homography Estimation for Dynamic Scenes
Hoang Le, Feng Liu, Shu Zhang, Aseem Agarwala

C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds
Albert Pumarola, Stefan Popov, Francesc Moreno-Noguer, Vittorio Ferrari

Lighthouse: Predicting Lighting Volumes for Spatially-Coherent Illumination
Pratul Srinivasan, Ben Mildenhall, Matthew Tancik, Jonathan T. Barron, Richard Tucker, Noah Snavely

Scale-space flow for end-to-end optimized video compression
Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Ballé, Sung Jin Hwang, George Toderici

StructEdit: Learning Structural Shape Variations
Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy Mitra, Leonidas Guibas

3D-MPA: Multi Proposal Aggregation for 3D Semantic Instance Segmentation
Francis Engelmann, Martin Bokeloh, Alireza Fathi, Bastian Leibe, Matthias Niessner

Sequential mastery of multiple tasks: Networks naturally learn to learn and forget to forget
Guy Davidson, Michael C. Mozer

Distilling Effective Supervision from Severe Label Noise
Zizhao Zhang, Han Zhang, Sercan Ö. Arik, Honglak Lee, Tomas Pfister

ViewAL: Active Learning With Viewpoint Entropy for Semantic Segmentation
Yawar Siddiqui, Julien Valentin, Matthias Niessner

Attribution in Scale and Space
Shawn Xu, Subhashini Venugopalan, Mukund Sundararajan

Weakly-Supervised Semantic Segmentation via Sub-category Exploration
Yu-Ting Chang, Qiaosong Wang, Wei-Chih Hung, Robinson Piramuthu, Yi-Hsuan Tsai, Ming-Hsuan Yang

Speech2Action: Cross-modal Supervision for Action Recognition
Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

Counting Out Time: Class Agnostic Video Repetition Counting in the Wild
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction
Junwei Liang, Lu Jiang, Kevin Murphy, Ting Yu, Alexander Hauptmann

Self-training with Noisy Student improves ImageNet classification
Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le

EfficientDet: Scalable and Efficient Object Detection (see the blog post)
Mingxing Tan, Ruoming Pang, Quoc Le

ACNe: Attentive Context Normalization for Robust Permutation-Equivariant Learning
Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, Kwang Moo Yi

VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation
Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Cordelia Schmid, Congcong Li

SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization
Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc Le, Xiaodan Song

KeyPose: Multi-View 3D Labeling and Keypoint Estimation for Transparent Objects
Xingyu Liu, Rico Jonschkowski, Anelia Angelova, Kurt Konolige

Structured Multi-Hashing for Model Compression
Elad Eban, Yair Movshovitz-Attias, Hao Wu, Mark Sandler, Andrew Poon, Yerlan Idelbayev, Miguel A. Carreira-Perpinan

DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes
Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Tom Funkhouser, Caroline Pantofaru, David Ross, Larry Davis, Alireza Fathi

Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation
Bowen Cheng, Maxwell Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh Chen

Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection
Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, Jonathan Huang

Distortion Agnostic Deep Watermarking
Xiyang Luo, Ruohan Zhan, Huiwen Chang, Feng Yang, Peyman Milanfar

Can weight sharing outperform random architecture search? An investigation with TuNAS
Gabriel Bender, Hanxiao Liu, Bo Chen, Grace Chu, Shuyang Cheng, Pieter-Jan Kindermans, Quoc Le

GIFnets: Differentiable GIF Encoding Framework
Innfarn Yoo, Xiyang Luo, Yilin Wang, Feng Yang, Peyman Milanfar

Your Local GAN: Designing Two Dimensional Local Attention Mechanisms for Generative Models
Giannis Daras, Augustus Odena, Han Zhang, Alex Dimakis

Fast Sparse ConvNets
Erich Elsen, Marat Dukhan, Trevor Gale, Karen Simonyan

RetinaTrack: Online Single Stage Joint Detection and Tracking
Zhichao Lu, Vivek Rathod, Ronny Votel, Jonathan Huang

Learning to See Through Obstructions
Yu-Lun Liu, Wei-Sheng Lai, Ming-Hsuan Yang,Yung-Yu Chuang, Jia-Bin Huang

Self-Supervised Learning of Video-Induced Visual Invariances
Michael Tschannen, Josip Djolonga, Marvin Ritter, Aravindh Mahendran, Neil Houlsby, Sylvain Gelly, Mario Lucic

Workshops

3rd Workshop and Challenge on Learned Image Compression
Organizers include: George Toderici, Eirikur Agustsson, Lucas Theis, Johannes Ballé, Nick Johnston

CLVISION 1st Workshop on Continual Learning in Computer Vision
Organizers include: Zhiyuan (Brett) Chen, Marc Pickett

Embodied AI
Organizers include: Alexander Toshev, Jie Tan, Aleksandra Faust, Anelia Angelova

The 1st International Workshop and Prize Challenge on Agriculture-Vision: Challenges & Opportunities for Computer Vision in Agriculture
Organizers include: Zhen Li, Jim Yuan

Embodied AI
Organizers include: Alexander Toshev, Jie Tan, Aleksandra Faust, Anelia Angelova

New Trends in Image Restoration and Enhancement workshop and challenges on image and video restoration and enhancement (NTIRE)
Talk: “Sky Optimization: Semantically aware image processing of skies in low-light photography”
Orly Liba, Longqi Cai, Yun-Ta Tsai, Elad Eban, Yair Movshovitz-Attias, Yael Pritch, Huizhong Chen, Jonathan Barron

The End-of-End-to-End A Video Understanding Pentathlon
Organizers include: Rahul Sukthankar

4th Workshop on Media Forensics
Organizers include: Christoph Bregler

4th Workshop on Visual Understanding by Learning from Web Data
Organizers include: Jesse Berent, Rahul Sukthankar

AI for Content Creation
Organizers include: Deqing Sun, Lu Jiang, Weilong Yang

Fourth Workshop on Computer Vision for AR/VR
Organizers include: Sofien Bouaziz

Low-Power Computer Vision Competition (LPCVC)
Organizers include: Bo Chen, Andrew Howard, Jaeyoun Kim

Sight and Sound
Organizers include: William Freeman

Workshop on Efficient Deep Learning for Computer Vision
Organizers include: Pete Warden

Extreme classification in computer vision
Organizers include: Ramin Zabih, Zhen Li

Image Matching: Local Features and Beyond (see the blog post)
Organizers include: Eduard Trulls

The DAVIS Challenge on Video Object Segmentation
Organizers include: Alberto Montes, Jordi Pont-Tuset, Kevis-Kokitsi Maninis

2nd Workshop on Precognition: Seeing through the Future
Organizers include: Utsav Prabhu

Computational Cameras and Displays (CCD)
Talk: Orly Liba

2nd Workshop on Learning from Unlabeled Videos (LUV)
Organizers include:Honglak Lee, Rahul Sukthankar

7th Workshop on Fine Grained Visual Categorization (FGVC7) (see the blog post)
Organizers include: Christine Kaeser-Chen, Serge Belongie

Language & Vision with applications to Video Understanding
Organizers include: Lu Jiang

Neural Architecture Search and Beyond for Representation Learning
Organizers include: Barret Zoph

Tutorials

Disentangled 3D Representations for Relightable Performance Capture of Humans
Organizers include: Sean Fanello, Christoph Rhemann, Jonathan Taylor, Sofien Bouaziz, Adarsh Kowdle, Rohit Pandey, Sergio Orts-Escolano, Paul Debevec, Shahram Izadi

Learning Representations via Graph-Structured Networks
Organizers include:Chen Sun, Ming-Hsuan Yang

Novel View Synthesis: From Depth-Based Warping to Multi-Plane Images and Beyond
Organizers include:Varun Jampani

How to Write a Good Review
Talks by:Vittorio Ferrari, Bill Freeman, Jordi Pont-Tuset

Neural Rendering
Organizers include:Ricardo Martin-Brualla, Rohit K. Pandey, Sean Fanello,Maneesh Agrawala, Dan B. Goldman

Fairness Accountability Transparency and Ethics and Computer Vision
Organizers: Timnit Gebru, Emily Denton

Extracting Structured Data from Templatic Documents

Extracting Structured Data from Templatic Documents

Posted by Sandeep Tata, Software Engineer, Google Research

Templatic documents, such as receipts, bills, insurance quotes, and others, are extremely common and critical in a diverse range of business workflows. Currently, processing these documents is largely a manual effort, and automated systems that do exist are based on brittle and error-prone heuristics. Consider a document type like invoices, which can be laid out in thousands of different ways — invoices from different companies, or even different departments within the same company, may have slightly different formatting. However, there is a common understanding of the structured information that an invoice should contain, such as an invoice number, an invoice date, the amount due, the pay-by date, and the list of items for which the invoice was sent. A system that can automatically extract all this data has the potential to dramatically improve the efficiency of many business workflows by avoiding error-prone, manual work.

In “Representation Learning for Information Extraction from Form-like Documents”, accepted to ACL 2020, we present an approach to automatically extract structured data from templatic documents. In contrast to previous work on extraction from plain-text documents, we propose an approach that uses knowledge of target field types to identify candidate fields. These are then scored using a neural network that learns a dense representation of each candidate using the words in its neighborhood. Experiments on two corpora (invoices and receipts) show that we’re able to generalize well to unseen layouts.

Why Is This Hard?
The challenge in this information extraction problem arises because it straddles the natural language processing (NLP) and computer vision worlds. Unlike classic NLP tasks, such documents do not contain “natural language” as might be found in regular sentences and paragraphs, but instead resemble forms. Data is often presented in tables, but in addition many documents have multiple pages, frequently with a varying number of sections, and have a variety of layout and formatting clues to organize the information. An understanding of the two-dimensional layout of text on the page is key to understanding such documents. On the other hand, treating this purely as an image segmentation problem makes it difficult to take advantage of the semantics of the text.

Solution Overview
Our approach to this problem allows developers to train and deploy an extraction system for a given domain (like invoices) using two inputs — a target schema (i.e., a list of fields to extract and their corresponding types) and a small collection of documents labeled with the ground truth for use as a training set. Supported field types include basics, such as dates, integers, alphanumeric codes, currency amounts, phone-numbers, and URLs. We also take advantage of entity types commonly detected by the Google Knowledge Graph, such as addresses, names of companies, etc.

The input document is first run through an Optical Character Recognition (OCR) service to extract the text and layout information, which allows this to work with native digital documents, such as PDFs, and document images (e.g., scanned documents). We then run a candidate generator that identifies spans of text in the OCR output that might correspond to an instance of a given field. The candidate generator utilizes pre-existing libraries associated with each field type (date, number, phone-number, etc.), which avoids the need to write new code for each candidate generator. Each of these candidates is then scored using a trained neural network (the “scorer”, described below) to estimate the likelihood that it is indeed a value one might extract for that field. Finally, an assigner module matches the scored candidates to the target fields. By default, the assigner simply chooses the highest scoring candidate for the field, but additional domain-specific constraints can be incorporated, such as requiring that the invoice date field is chronologically before the payment date field.

The processing steps in the extraction system using a toy schema with two fields on an input invoice document. Blue boxes show the candidates for the invoice_date field and gold boxes for the amount_due field.

Scorer
The scorer is a neural model that is trained as a binary classifier. It takes as input the target field from the schema along with the extraction candidate and produces a prediction score between 0 and 1. The target label for a candidate is determined by whether the candidate matches the ground truth for that document and field. The model learns how to represent each field and each candidate in a vector space in which the nearer a field and candidate are in the vector space, the more likely it is that the candidate is the true extraction value for that field and document.

Candidate Representation
A candidate is represented by the tokens in its neighborhood along with the relative position of the token on the page with respect to the centroid of the bounding box identified for the candidate. Using the invoice_date field as an example, phrases in the neighborhood like “Invoice Date’” or “Inv Date” might indicate to the scorer that this is a likely candidate, while phrases like “Delivery Date” would indicate that this is likely not the invoice_date. We do not include the value of the candidate in its representation in order to avoid overfitting to values that happen to be present in a small training data set — e.g., “2019” for the invoice date, if the training corpus happened to include only invoices from that year.

A small snippet of an invoice. The green box shows a candidate for the invoice_date field, and the red box is a token in the neighborhood along with the arrow representing the relative position. Each of the other tokens (‘number’, ‘date’, ‘page’, ‘of’, etc along with the other occurrences of ‘invoice’) are part of the neighborhood for the invoice candidate.

Model Architecture
The figure below shows the general structure of the network. In order to construct the candidate encoding (i), each token in the neighborhood is embedded using a word embedding table (a). The relative position of each neighbor (b) is embedded using two fully connected ReLU layers that capture fine-grained non-linearities. The text and position embeddings for each neighbor are concatenated to form a neighbor encoding (d). A self attention mechanism is used to incorporate the neighborhood context for each neighbor (e), which is combined into a neighborhood encoding (f) using max-pooling. The absolute position of the candidate on the page (g) is embedded in a manner similar to the positional embedding for a neighbor, and concatenated with the neighborhood encoding for the candidate encoding (i). The final scoring layer computes the cosine similarity between the field embedding (k) and the candidate encoding (i) and then rescales it to be between 0 and 1.

Results
For training and validation, we used an internal dataset of invoices with a large variety of layouts. In order to test the ability of the model to generalize to unseen layouts, we used a test-set of invoices with layouts that were disjoint from the training and validation set. We report the F1 score of the extractions from this system on a few key fields below (higher is better):

Field F1 Score
amount_due 0.801
delivery_date 0.667
due_date 0.861
invoice_date 0.940
invoice_id 0.949
purchase_order 0.896
total_amount 0.858
total_tax_amount 0.839

As you can see from the table above, the model does well on most fields. However, there’s room for improvement for fields like delivery_date. Additional investigation revealed that this field was present in a very small subset of the examples in our training data. We expect that gathering additional training data will help us improve on it.

What’s next?
Google Cloud recently announced an invoice parsing service as part of the Document AI product. The service uses the methods described above, along with other recent research breakthroughs like BERT, to extract more than a dozen key fields from invoices. You can upload an invoice at the demo page and see this technology in action!

For a given document type we expect to be able to build an extraction system given a modest sized labeled corpus. There are several follow-ons we are currently pursuing, including the improvement of data efficiency and accurately handling nested and repeated fields, and fields for which it is difficult to define a good candidate generator.

Acknowledgements
This work was a collaboration between Google Research and several engineers in Google Cloud. I’d like to thank Navneet Potti, James Wendt, Marc Najork, Qi Zhao, and Ivan Kuznetsov in Google Research as well as Lauro Costa, Evan Huang, Will Lu, Lukas Rutishauser, Mu Wang, and Yang Xu on the Cloud AI team for their support. And finally, our research interns Bodhisattwa Majumder and Beliz Gunel for their tireless experimentation on dozens of ideas.

Unlocking the "Chemome" with DNA-Encoded Chemistry and Machine Learning

Unlocking the “Chemome” with DNA-Encoded Chemistry and Machine Learning

Posted by Patrick Riley, Principal Engineer, Accelerated Science Team, Google Research

Much of the development of therapeutics for human disease is built around understanding and modulating the function of proteins, which are the main workhorses of many biological activities. Small molecule drugs such as ibuprofen often work by inhibiting or promoting the function of proteins or their interactions with other biomolecules. Developing useful “virtual screening” methods where potential small molecules can be evaluated computationally rather than in a lab, has long been an area of research. However, the persistent challenge is to build a method that works well enough across a wide range of chemical space to be useful for finding small molecules with physically verified useful interaction with a protein of interest, i.e., “hits”.

In “Machine learning on DNA-encoded libraries: A new paradigm for hit-finding”, recently published in the Journal of Medicinal Chemistry, we worked in collaboration with X-Chem Pharmaceuticals to demonstrate an effective new method for finding biologically active molecules using a combination of physical screening with DNA-encoded small molecule libraries and virtual screening using a graph convolutional neural network (GCNN). This research has led to the creation of the Chemome initiative, a cooperative project between our Accelerated Science team and ZebiAI that will enable the discovery of many more small molecule chemical probes for biological research.

Background on Chemical Probes
Making sense of the biological networks that support life and produce disease is an immensely complex task. One approach to study these processes is using chemical probes, small molecules that aren’t necessarily useful as drugs, but that selectively inhibit or promote the function of specific proteins. When you have a biological system to study (such as cancer cells growing in a dish), you can add the chemical probe at a specific time and observe how the biological system responds differently when the targeted protein has increased or decreased activity. But, despite how useful chemical probes are for this kind of basic biomedical research, only 4% of human proteins have a known chemical probe available.

The process of finding chemical probes begins similarly to the earliest stages of small molecule drug discovery. Given a protein target of interest, the space of small molecules is scanned to find “hit” molecules that can be further tested. Robotic assisted high throughput screening where up to hundred of thousands or millions of molecules are physically tested is a cornerstone of modern drug research. However, the number of small molecules you can easily purchase (1.2×109) is much larger than that, which in turn is much smaller than the number of small drug like molecules (estimates from 1020 to 1060). “Virtual screening” could possibly quickly and efficiently search this vast space of potentially synthesizable molecules and greatly speed up the discovery of therapeutic compounds.

DNA-Encoded Small Molecule Library Screening
The physical part of the screening process uses DNA-encoded small molecule libraries (DELs), which contain many distinct small molecules in one pool, each of which is attached to a fragment of DNA serving as a unique barcode for that molecule. While this basic technique has been around for several decades, the quality of the library and screening process is key to producing meaningful results.

DELs are a very clever idea to solve a biochemical challenge, which is how to collect small molecules into one place with an easy way to identify each. The key is to use DNA as a barcode to identify each molecule, similar to Nobel Prize winning phage display technology. First, one generates many chemical fragments, each with a unique DNA barcode attached, along with a common chemical handle (the NH2 in this case). The results are then pooled and split into separate reactions where a set of distinct chemical fragments with another common chemical handle (e.g., OH) are added. The chemical fragments from the two steps react and fuse together at the common chemical handles. The DNA fragments are also connected to build one continuous barcode for each molecule. The net result is that by performing 2N operations, one gets N2 unique molecules, each of which is identified by its own unique DNA barcode. By using more fragments or more cycles, it’s relatively easy to make libraries with millions or even billions of distinct molecules.

An overview of the process of creating a DNA encoded small molecule library. First, DNA “barcodes” (represented here with numbered helices) are attached to small chemical fragments (the blue shapes) which expose a common chemical “handle” (e.g. the NH2 shown here). When mixed with other chemical fragments (the orange shapes) each of which has another exposed chemical “handle” (the OH) with attached DNA fragments, reactions merge the sets of chemical and DNA fragments, resulting in a voluminous library of small molecules of interest, each with a unique DNA “barcode”.

Once the library has been generated, it can be used to find the small molecules that bind to the protein of interest by mixing the DEL together with the protein and washing away the small molecules that do not attach. Sequencing the remaining DNA barcodes produces millions of individual reads of DNA fragments, which can then be carefully processed to estimate which of the billions of molecules in the original DEL interact with the protein.

Machine Learning on DEL Data
Given the physical screening data returned for a particular protein, we build an ML model to predict whether an arbitrarily chosen small molecule will bind to that protein. The physical screening with the DEL provides positive and negative examples for an ML classifier. To simplify slightly, the small molecules that remain at the end of the screening process are positive examples and everything else are negative examples. We use a graph convolutional neural network, which is a type of neural network specially designed for small graph-like inputs, such as the small molecules in which we are interested.

Results
We physically screened three diverse proteins using DEL libraries: sEH (a hydrolase), ERα (a nuclear receptor), and c-KIT (a kinase). Using the DEL-trained models, we virtually screened large make-on-demand libraries from Mcule and an internal molecule library at X-Chem to identify a diverse set of molecules predicted to show affinity with each target. We compared the results of the GCNN models to a random forest (RF) model, a common method for virtual screening that uses standard chemical fingerprints, which we use as baseline. We find that the GCNN model significantly outperforms the RF model in discovering more potent candidates.

Fraction of molecules (“hit rates”) from those tested showing various levels of activity, comparing predictions from two different machine learned models (a GCNN and random forests, RF) on three distinct protein targets. The color scale on the right uses a common metric IC50 for representing the potency of a molecule. nM means “nanomolar” and µM means “micromolar”. Smaller values / darker colors are generally better molecules. Note that typical virtual screening approaches not built with DEL data normally only reach a few percent on this scale.

Importantly, unlike many other uses of virtual screening, the process to select the molecules to test was automated or easily automatable given the results of the model, and we did not rely on review and selection of the most promising molecules by a trained chemist. In addition, we tested almost 2000 molecules across the three targets, the largest published prospective study of virtual screening of which we are aware. While providing high confidence on the hit rates above, this also allows one to carefully examine the diversity of hits and the usefulness of the model for molecules near and far from the training set.

The Chemome Initiative
ZebiAI Therapeutics was founded based on the results of this research and has partnered with our team and X-Chem Pharmaceuticals to apply these techniques to efficiently deliver new chemical probes to the research community for human proteins of interest, an effort called the Chemome Initiative.

As part of the Chemome Initiative, ZebiAI will work with researchers to identify proteins of interest and source screening data, which our team will use to build machine learning models and make predictions on commercially available libraries of small molecules. ZebiAI will provide the predicted molecules to researchers for activity testing and will collaborate with researchers to advance some programs through discovery. Participation in the program requires that the validated hits be published within a reasonable time frame so that the whole community can benefit. While more validation must be done to make the hit molecules useful as chemical probes, especially for specifically targeting the protein of interest and the ability to function correctly in common assays, having potent hits is a big step forward in the process.

We’re excited to be a part of the Chemome Initiative enabled by the effective ML techniques described here and look forward to its discovery of many new chemical probes. We expect the Chemome will spur significant new biological discoveries and ultimately accelerate new therapeutic discovery for the world.

Acknowledgements
This work represents a multi-year effort between the Accelerated Science Team and X-Chem Pharmaceuticals with many people involved. This project would not have worked without the combined diverse skills of biologists, chemists, and ML researchers. We should especially acknowledge Eric Sigel (of X-Chem, now at ZebiAI) and Kevin McCloskey (of Google), the first authors on the paper and Steve Kearnes (of Google) for core modelling ideas and technical work.

PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization

PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization

Posted by Peter J. Liu and Yao Zhao, Software Engineers, Google Research

Students are often tasked with reading a document and producing a summary (for example, a book report) to demonstrate both reading comprehension and writing ability. This abstractive text summarization is one of the most challenging tasks in natural language processing, involving understanding of long passages, information compression, and language generation. The dominant paradigm for training machine learning models to do this is sequence-to-sequence (seq2seq) learning, where a neural network learns to map input sequences to output sequences. While these seq2seq models were initially developed using recurrent neural networks, Transformer encoder-decoder models have recently become favored as they are more effective at modeling the dependencies present in the long sequences encountered in summarization.

Transformer models combined with self-supervised pre-training (e.g., BERT, GPT-2, RoBERTa, XLNet, ALBERT, T5, ELECTRA) have shown to be a powerful framework for producing general language learning, achieving state-of-the-art performance when fine-tuned on a wide array of language tasks. In prior work, the self-supervised objectives used in pre-training have been somewhat agnostic to the down-stream application in favor of generality; we wondered whether better performance could be achieved if the self-supervised objective more closely mirrored the final task.

In “PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization” (to appear at the 2020 International Conference on Machine Learning), we designed a pre-training self-supervised objective (called gap-sentence generation) for Transformer encoder-decoder models to improve fine-tuning performance on abstractive summarization, achieving state-of-the-art results on 12 diverse summarization datasets. Supplementary to the paper, we are also releasing the training code and model checkpoints on GitHub.

A Self-Supervised Objective for Summarization
Our hypothesis is that the closer the pre-training self-supervised objective is to the final down-stream task, the better the fine-tuning performance. In PEGASUS pre-training, several whole sentences are removed from documents and the model is tasked with recovering them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together. This is an incredibly difficult task that may seem impossible, even for people, and we don’t expect the model to solve it perfectly. However, such a challenging task encourages the model to learn about language and general facts about the world, as well as how to distill information taken from throughout a document in order to generate output that closely resembles the fine-tuning summarization task. The advantage of this self-supervision is that you can create as many examples as there are documents, without any human annotation, which is often the bottleneck in purely supervised systems.

A self-supervised example for PEGASUS during pre-training. The model is trained to output all the masked sentences.

We found that choosing “important” sentences to mask worked best, making the output of self-supervised examples even more similar to a summary. We automatically identified these sentences by finding those that were most similar to the rest of the document according to a metric called ROUGE. ROUGE computes the similarity of two texts by computing n-gram overlaps using a score from 0 to 100 (ROUGE-1, ROUGE-2, and ROUGE-L are three common variants).

Similar to other recent methods, such as T5, we pre-trained our model on a very large corpus of web-crawled documents, then we fine-tuned the model on 12 public down-stream abstractive summarization datasets, resulting in new state-of-the-art results as measured by automatic metrics, while using only 5% of the number of parameters of T5. The datasets were chosen to be diverse, including news articles, scientific papers, patents, short stories, e-mails, legal documents, and how-to directions, showing that the model framework is adaptive to a wide-variety of topics.

Fine-Tuning with Small Numbers of Examples
While PEGASUS showed remarkable performance with large datasets, we were surprised to learn that the model didn’t require a large number of examples for fine-tuning to get near state-of-the-art performance:

ROUGE scores (three variants, higher is better) vs. the number of supervised examples across four selected summarization datasets. The dotted-line shows the Transformer encoder-decoder performance with full-supervision, but without pre-training.

With only 1000 fine-tuning examples, we were able to perform better in most tasks than a strong baseline (Transformer encoder-decoder) that used the full supervised data, which in some cases had many orders of magnitude more examples. This “sample efficiency” greatly increases the usefulness of text summarization models as it significantly lowers the scale and cost of supervised data collection, which in the case of summarization is very expensive.

Human-Quality summaries
While we find automatic metrics such as ROUGE are useful proxies for measuring progress during model development, they only provide limited information and don’t tell us the whole story, such as fluency or a comparison to human performance. To this end, we conducted a human evaluation, where raters were asked to compare summaries from our model with human ones (without knowing which is which). This has some similarities to the Turing test.

Human raters were asked to rate model and human-written summaries without knowing which was which. The document is truncated here for illustration, but raters see the full text.

We performed the experiment with 3 different datasets and found that human raters do not consistently prefer the human summaries to those from our model. Furthermore, our models trained with only 1000 examples performed nearly as well. In particular, with the much studied XSum and CNN/Dailymail datasets, the model achieves human-like performance using only 1000 examples. This suggests large datasets of supervised examples are no longer necessary for summarization, opening up many low-cost use-cases.

A Test of Comprehension: Counting Ships
Following this post is an example article from the XSum dataset along with the model-generated abstractive summary. The model correctly abstracts and paraphrases four named frigates (HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall) as “four Royal Navy frigates”, something an extractive approach could not do since “four” is not mentioned anywhere. Was this a fluke or did the model actually count? One way to find out is to add and remove ships to see if the count changes.

As can be seen below, the model successfully “counts” ships from 2 to 5. However, when we add a sixth ship, the “HMS Alphabet”, it miscounts it as “seven”. So it appears the model has learned to count small numbers of items in a list, but does not yet generalize as elegantly as we would hope. Still, we think this rudimentary counting ability is impressive as it was not explicitly programmed into the model, and it demonstrates a limited amount of “symbolic reasoning” by the model.

PEGASUS code and model release
To support on-going research in this field and ensure reproducibility, we are releasing the PEGASUS code and model checkpoints on GitHub. This includes fine-tuning code which can be used to adapt PEGASUS to other summarization datasets.

Acknowledgements
This work has been a collaborative effort involving Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. We thank the T5 and Google News teams for providing datasets for pre-training PEGASUS.

  • The decommissioned Type 22 frigates
    HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall
    are currently moored in Portsmouth Harbour.
    Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government’s Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK’s industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: “For anyone that has served on a ship it’s your home, you’ve literally been through the wars with it… and you want them to have a noble second life. “My preference is to go for the reef and diving attraction. “We’ve got to get best value for the budget but a reef would also generate income for part of the country through tourism.” The Ministry of Defence has previously said it will “consider all options” for the frigates to ensure “best financial return for the taxpayer”. A spokeswoman would not comment on the number or nature of the bids received due to “commercial sensitivity”. Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.


    Model Summary: No proposals have been submitted to preserve four Royal Navy frigates for reuse, the BBC has learned.

  • The decommissioned Type 22 frigates
    HMS Cumberland, HMS Campbeltown, HMS Chatham, HMS Google and HMS Cornwall
    are currently moored in Portsmouth Harbour.
    Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government’s Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK’s industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: “For anyone that has served on a ship it’s your home, you’ve literally been through the wars with it… and you want them to have a noble second life. “My preference is to go for the reef and diving attraction. “We’ve got to get best value for the budget but a reef would also generate income for part of the country through tourism.” The Ministry of Defence has previously said it will “consider all options” for the frigates to ensure “best financial return for the taxpayer”. A spokeswoman would not comment on the number or nature of the bids received due to “commercial sensitivity”. Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.


    Model Summary: No bids have been submitted for the sale of five Royal Navy frigates, the BBC understands.

  • The decommissioned Type 22 frigates
    HMS Google and HMS Alphabet
    are currently moored in Portsmouth Harbour.
    Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government’s Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK’s industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: “For anyone that has served on a ship it’s your home, you’ve literally been through the wars with it… and you want them to have a noble second life. “My preference is to go for the reef and diving attraction. “We’ve got to get best value for the budget but a reef would also generate income for part of the country through tourism.” The Ministry of Defence has previously said it will “consider all options” for the frigates to ensure “best financial return for the taxpayer”. A spokeswoman would not comment on the number or nature of the bids received due to “commercial sensitivity”. Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.


    Model Summary: Two Royal Navy frigates set to be sold as scrap are unlikely to be preserved, the BBC understands.

  • The decommissioned Type 22 frigates
    HMS Cumberland, HMS Campbeltown and HMS Cornwall
    are currently moored in Portsmouth Harbour.
    Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government’s Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK’s industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: “For anyone that has served on a ship it’s your home, you’ve literally been through the wars with it… and you want them to have a noble second life. “My preference is to go for the reef and diving attraction. “We’ve got to get best value for the budget but a reef would also generate income for part of the country through tourism.” The Ministry of Defence has previously said it will “consider all options” for the frigates to ensure “best financial return for the taxpayer”. A spokeswoman would not comment on the number or nature of the bids received due to “commercial sensitivity”. Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.


    Model Summary: No proposals have been submitted to preserve three Royal Navy frigates for reuse, the BBC has learned.

  • The decommissioned Type 22 frigates
    HMS Cumberland, HMS Campbeltown, HMS Chatham, HMS Google, HMS Alphabet and HMS Cornwall
    are currently moored in Portsmouth Harbour.
    Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government’s Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK’s industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: “For anyone that has served on a ship it’s your home, you’ve literally been through the wars with it… and you want them to have a noble second life. “My preference is to go for the reef and diving attraction. “We’ve got to get best value for the budget but a reef would also generate income for part of the country through tourism.” The Ministry of Defence has previously said it will “consider all options” for the frigates to ensure “best financial return for the taxpayer”. A spokeswoman would not comment on the number or nature of the bids received due to “commercial sensitivity”. Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.


    Model Summary: Seven Royal Navy frigates are set to be put up for sale.

Recent Advances in Google Translate

Recent Advances in Google Translate

Posted by Isaac Caswell and Bowen Liang, Software Engineers, Google Research

Advances in machine learning (ML) have driven improvements to automated translation, including the GNMT neural translation model introduced in Translate in 2016, that have enabled great improvements to the quality of translation for over 100 languages. Nevertheless, state-of-the-art systems lag significantly behind human performance in all but the most specific translation tasks. And while the research community has developed techniques that are successful for high-resource languages like Spanish and German, for which there exist copious amounts of training data, performance on low-resource languages, like Yoruba or Malayalam, still leaves much to be desired. Many techniques have demonstrated significant gains for low-resource languages in controlled research settings (e.g., the WMT Evaluation Campaign), however these results on smaller, publicly available datasets may not easily transition to large, web-crawled datasets.

In this post, we share some recent progress we have made in translation quality for supported languages, especially for those that are low-resource, by synthesizing and expanding a variety of recent advances, and demonstrate how they can be applied at scale to noisy, web-mined data. These techniques span improvements to model architecture and training, improved treatment of noise in datasets, increased multilingual transfer learning through M4 modeling, and use of monolingual data. The quality improvements, which averaged +5 BLEU score over all 100+ languages, are visualized below.

BLEU score of Google Translate models since shortly after its inception in 2006. The improvements since the implementation of the new techniques over the last year are highlighted at the end of the animation.

Advances for Both High- and Low-Resource Languages
Hybrid Model Architecture: Four years ago we introduced the RNN-based GNMT model, which yielded large quality improvements and enabled Translate to cover many more languages. Following our work decoupling different aspects of model performance, we have replaced the original GNMT system, instead training models with a transformer encoder and an RNN decoder, implemented in Lingvo (a TensorFlow framework). Transformer models have been demonstrated to be generally more effective at machine translation than RNN models, but our work suggested that most of these quality gains were from the transformer encoder, and that the transformer decoder was not significantly better than the RNN decoder. Since the RNN decoder is much faster at inference time, we applied a variety of optimizations before coupling it with the transformer encoder. The resulting hybrid models are higher-quality, more stable in training, and exhibit lower latency.

Web Crawl: Neural Machine Translation (NMT) models are trained using examples of translated sentences and documents, which are typically collected from the public web. Compared to phrase-based machine translation, NMT has been found to be more sensitive to data quality. As such, we replaced the previous data collection system with a new data miner that focuses more on precision than recall, which allows the collection of higher quality training data from the public web. Additionally, we switched the web crawler from a dictionary-based model to an embedding based model for 14 large language pairs, which increased the number of sentences collected by an average of 29 percent, without loss of precision.

Modeling Data Noise: Data with significant noise is not only redundant but also lowers the quality of models trained on it. In order to address data noise, we used our results on denoising NMT training to assign a score to every training example using preliminary models trained on noisy data and fine-tuned on clean data. We then treat training as a curriculum learning problem — the models start out training on all data, and then gradually train on smaller and cleaner subsets.

Advances That Benefited Low-Resource Languages in Particular
Back-Translation: Widely adopted in state-of-the-art machine translation systems, back-translation is especially helpful for low-resource languages, where parallel data is scarce. This technique augments parallel training data (where each sentence in one language is paired with its translation) with synthetic parallel data, where the sentences in one language are written by a human, but their translations have been generated by a neural translation model. By incorporating back-translation into Google Translate, we can make use of the more abundant monolingual text data for low-resource languages on the web for training our models. This is especially helpful in increasing fluency of model output, which is an area in which low-resource translation models underperform.

M4 Modeling: A technique that has been especially helpful for low-resource languages has been M4, which uses a single, giant model to translate between all languages and English. This allows for transfer learning at a massive scale. As an example, a lower-resource language like Yiddish has the benefit of co-training with a wide array of other related Germanic languages (e.g., German, Dutch, Danish, etc.), as well as almost a hundred other languages that may not share a known linguistic connection, but may provide useful signal to the model.

Judging Translation Quality
A popular metric for automatic quality evaluation of machine translation systems is the BLEU score, which is based on the similarity between a system’s translation and reference translations that were generated by people. With these latest updates, we see an average BLEU gain of +5 points over the previous GNMT models, with the 50 lowest-resource languages seeing an average gain of +7 BLEU. This improvement is comparable to the gain observed four years ago when transitioning from phrase-based translation to NMT.

Although BLEU score is a well-known approximate measure, it is known to have various pitfalls for systems that are already high-quality. For instance, several works have demonstrated how the BLEU score can be biased by translationese effects on the source side or target side, a phenomenon where translated text can sound awkward, containing attributes (like word order) from the source language. For this reason, we performed human side-by-side evaluations on all new models, which confirmed the gains in BLEU.

In addition to general quality improvements, the new models show increased robustness to machine translation hallucination, a phenomenon in which models produce strange “translations” when given nonsense input. This is a common problem for models that have been trained on small amounts of data, and affects many low-resource languages. For example, when given the string of Telugu characters “ష ష ష ష ష ష ష ష ష ష ష ష ష ష ష”, the old model produced the nonsensical output “Shenzhen Shenzhen Shaw International Airport (SSH)”, seemingly trying to make sense of the sounds, whereas the new model correctly learns to transliterate this as “Sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh”.

Conclusion
Although these are impressive strides forward for a machine, one must remember that, especially for low-resource languages, automatic translation quality is far from perfect. These models still fall prey to typical machine translation errors, including poor performance on particular genres of subject matter (“domains”), conflating different dialects of a language, producing overly literal translations, and poor performance on informal and spoken language.

Nonetheless, with this update, we are proud to provide automatic translations that are relatively coherent, even for the lowest-resource of the 108 supported languages. We are grateful for the research that has enabled this from the active community of machine translation researchers in academia and industry.

Acknowledgements
This effort is built on contributions from Tao Yu, Ali Dabirmoghaddam, Klaus Macherey, Pidong Wang, Ye Tian, Jeff Klingner, Jumpei Takeuchi, Yuichiro Sawai, Hideto Kazawa, Apu Shah, Manisha Jain, Keith Stevens, Fangxiaoyu Feng, Chao Tian, John Richardson, Rajat Tibrewal, Orhan Firat, Mia Chen, Ankur Bapna, Naveen Arivazhagan, Dmitry Lepikhin, Wei Wang, Wolfgang Macherey, Katrin Tomanek, Qin Gao, Mengmeng Niu, and Macduff Hughes.

DADS: Unsupervised Reinforcement Learning for Skill Discovery

DADS: Unsupervised Reinforcement Learning for Skill Discovery

Posted by Archit Sharma, AI Resident, Google Research

Recent research has demonstrated that supervised reinforcement learning (RL) is capable of going beyond simulation scenarios to synthesize complex behaviors in the real world, such as grasping arbitrary objects or learning agile locomotion. However, the limitations of teaching an agent to perform complex behaviors using well-designed task-specific reward functions are also becoming apparent. Designing reward functions can require significant engineering effort, which becomes untenable for a large number of tasks. For many practical scenarios, designing a reward function can be complicated, for example, requiring additional instrumentation for the environment (e.g., sensors to detect the orientation of doors) or manual-labelling of “goal” states. Considering that the ability to generate complex behaviors is limited by this form of reward-engineering, unsupervised learning presents itself as an interesting direction for RL.

In supervised RL, the extrinsic reward function from the environment guides the agent towards the desired behaviors, reinforcing the actions which bring the desired changes in the environment. With unsupervised RL, the agent uses an intrinsic reward function (such as curiosity to try different things in the environment) to generate its own training signals to acquire a broad set of task-agnostic behaviors. The intrinsic reward functions can bypass the problems of the engineering extrinsic reward functions, while being generic and broadly applicable to several agents and problems without any additional design. While much research has recently focused on different approaches to unsupervised reinforcement learning, it is still a severely under-constrained problem — without the guidance of rewards from the environment, it can be hard to learn behaviors which will be useful. Are there meaningful properties of the agent-environment interaction that can help discover better behaviors (“skills”) for the agents?

In this post, we present two recent publications that develop novel unsupervised RL methods for skill discovery. In “Dynamics-Aware Unsupervised Discovery of Skills” (DADS), we introduce the notion of “predictability” to the optimization objective for unsupervised learning. In this work we posit that a fundamental attribute of skills is that they bring about a predictable change in the environment. We capture this idea in our unsupervised skill discovery algorithm, and show applicability in a broad range of simulated robotic setups. In our follow-up work “Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning”, we improve the sample-efficiency of DADS to demonstrate that unsupervised skill discovery is feasible in the real world.

The behavior on the left is random and unpredictable, while the behavior on the right demonstrates systematic motion with predictable changes in the environment. Our goal is to learn potentially useful behaviors such as those on the right, without engineered reward functions.

Overview of DADS
DADS designs an intrinsic reward function that encourages discovery of “predictable” and “diverse” skills. The intrinsic reward function is high if (a) the changes in the environment are different for different skills (encouraging diversity) and (b) changes in the environment for a given skill are predictable (predictability). Since DADS does not obtain any rewards from the environment, optimizing the skills to be diverse enables the agent to capture as many potentially useful behaviors as possible.

In order to determine if a skill is predictable, we train another neural network, called the skill-dynamics network, to predict the changes in the environment state when given the current state and the skill being executed. The better the skill-dynamics network can predict the change of state in the environment, the more “predictable” the skill is. The intrinsic reward defined by DADS can be maximized using any conventional reinforcement learning algorithm.

An overview of DADS.

The algorithm enables several different agents to discover predictable skills purely from reward-free interaction with the environment. DADS, unlike prior work, can scale to high-dimensional continuous control environments such as Humanoid, a simulated bipedal robot. Since DADS is environment agnostic, it can be applied to both locomotion and manipulation oriented environments. We show some of the skills discovered by different continuous control agents.

Ant discovers galloping (top left) and skipping (bottom left), Humanoid discovers different locomotive gaits (middle, sped up 2x), and D’Claw from ROBEL (right) discovers different ways to rotate an object, all using DADS. More sample videos are available here.

Model-Based Control Using Skill-Dynamics
Not only does DADS enable the discovery of predictable and potentially useful skills, it allows for an efficient approach to apply the learned skills to downstream tasks. We can leverage the learned skill-dynamics to predict the state-transitions for each skill. The predicted state-transitions can be chained together to simulate the complete trajectory of states for any learned skill without executing it in the environment. Therefore, we can simulate the trajectory for different skills and choose the skill which gets the highest reward for the given task. The model-based planning approach described here can be very sample-efficient as no additional training is required for the skills. This is a significant step up from the prior approaches, which require additional training on the environment to combine the learned skills.

Using the skills discovered by the agents, we can traverse an arbitrary sequence of checkpoints without any additional training. The plot on the right follows the agent’s traversal from one checkpoint to another.

Real-World Results
The demonstration of unsupervised learning in real-world robotics has been fairly limited, with results being restricted to simulation environments. In “Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning”, we develop a sample-efficient version of our earlier algorithm, called off-DADS, through algorithmic and systematic improvements in an off-policy learning setup. Off-policy learning enables the use of data collected from different policies to improve the current policy. In particular, reusing the previously collected data can dramatically improve the sample-efficiency of reinforcement learning algorithms. Leveraging the improvement from off-policy learning, we train D’Kitty (a quadruped from ROBEL) in the real-world starting from random policy initialization without any rewards from the environment or hand-crafted exploration strategies. We observe the emergence of complex behaviors with diverse gaits and directions by optimizing the intrinsic reward defined by DADS.

Using off-DADS, we train D’Kitty from ROBEL to acquire diverse locomotion behaviors, which can then be used for goal-navigation through model-based control.

Future Work
We have contributed a novel unsupervised skill discovery algorithm with broad applicability that is feasible to be executed in the real-world. This work provides a foundation for future work, where robots can solve a broad range of tasks with minimal human effort. One possibility is to study the relationship between the state-representation and the skills discovered by DADS in order to learn a state-representation that encourages discovery of skills for a known distribution of downstream tasks. Another interesting direction for exploration is provided by the formulation of skill-dynamics that separates high-level planning and low-level control, and study its general applicability to reinforcement learning problems.

Acknowledgements
We would like to thank our coauthors, Michael Ahn, Sergey Levine, Vikash Kumar, Shixiang Gu and Karol Hausman. We would also like to acknowledge the support and feedback provided by various members of the Google Brain team and the Robotics at Google team.

Responding to the European Commission’s AI white paper

In January, our CEO Sundar Pichai visited Brussels to talk about artificial intelligence and how Google could help people and businesses succeed in the digital age through partnership. Much has changed since then due to COVID-19, but one thing hasn’t—our commitment to the potential of partnership with Europe on AI, especially to tackle the pandemic and help people and the economy recover. 

As part of that effort, we earlier today filed our response to the European Commission’s Consultation on Artificial Intelligence, giving our feedback on the Commission’s initial proposal for how to regulate and accelerate the adoption of AI. 

Excellence, skills, trust

Our filing applauds the Commission’s focus on building out the European “ecosystem of excellence.” European universities already boast renowned leaders in dozens of areas of AI research—Google partners with some of them via our machine learning research hubs in Zurich, Amsterdam, Berlin, Paris and London—and many of their students go on to make important contributions to European businesses.  

We support the Commission’s plans to help businesses develop the AI skills they need to thrive in the new digital economy. Next month, we’ll contribute to those efforts by extending our machine learning check-up tool to 11 European countries to help small businesses implement AI and grow their businesses. Google Cloud already works closely with scores of businesses across Europe to help them innovate using AI.  

We also support the Commission’s goal of building a framework for AI innovation that will create trust and guide ethical development and use of this widely applicable technology. We appreciate the Commission’s proportionate, risk-based approach. It’s important that AI applications in sensitive fields—such as medicine or transportation—are held to the appropriate standards. 

Based on our experience working with AI, we also offered a couple of suggestions for making future regulation more effective. We want to be a helpful and engaged partner to policymakers, and we have provided more details in our formal response to the consultation.

Definition of high-risk AI applications

AI has a broad range of current and future applications, including some that involve significant benefits and risks.  We think any future regulation would benefit from a more carefully nuanced definition of “high-risk” applications of AI. We agree that some uses warrant extra scrutiny and safeguards to address genuine and complex challenges around safety, fairness, explainability, accountability, and human interactions. 

Assessment of AI applications

When thinking about how to assess high-risk AI applications, it’s important to strike a balance. While AI won’t always be perfect, it has great potential to help us improve over the performance of existing systems and processes. But the development process for AI must give people confidence that the AI system they’re using is reliable and safe. That’s especially true for applications like new medical diagnostic techniques, which potentially allow skilled medical practitioners to offer more accurate diagnoses, earlier interventions, and better patient outcomes. But the requirements need to be proportionate to the risk, and shouldn’t unduly limit innovation, adoption, and impact. 

This is not an easy needle to thread. The Commission’s proposal suggests “ex ante” assessment of AI applications (i.e., upfront assessment, based on forecasted rather than actual use cases). Our contribution recommends having established due diligence and regulatory review processes expand to include the assessment of AI applications. This would avoid unnecessary duplication of efforts and likely speed up implementation.

For the (probably) rare instances when high-risk applications of AI are not obviously covered by existing regulations, we would encourage clear guidance on the “due diligence” criteria companies should use in their development processes. This would enable robust upfront self-assessment and documentation of any risks and their mitigations, and could also include further scrutiny after launch.

This approach would give European citizens confidence about the trustworthiness of AI applications, while also fostering innovation across the region. And it would encourage companies—especially smaller ones—to launch a range of valuable new services. 

Principles and process

Responsible development of AI presents new challenges and critical questions for all of us. In 2018 we published our own AI Principles to help guide our ethical development and use of AI, and also established internal review processes to help us avoid bias, test rigorously for safety, design with privacy top of mind.  Our principles also specify areas where we will not design or deploy AI, such as to support mass surveillance or violate human rights. Look out for an update on our work around these principles in the coming weeks. 

AI is an important part of Google’s business and our aspirations for the future. We share a common goal with policymakers—a desire to build trust in AI through responsible innovation and thoughtful regulation, so that European citizens can safely enjoy the full social and economic benefits of AI. We hope that our contribution to the consultation is useful, and we look forward to participating in the discussion in coming months.

Read More

Federated Analytics: Collaborative Data Science without Data Collection

Federated Analytics: Collaborative Data Science without Data Collection

Posted by Daniel Ramage, Research Scientist and Stefano Mazzocchi, Software Engineer, Google Research

Federated learning, introduced in 2017, enables developers to train machine learning (ML) models across many devices without centralized data collection, ensuring that only the user has a copy of their data, and is used to power experiences like suggesting next words and expressions in Gboard for Android and improving the quality of smart replies in Android Messages. Following the success of these applications, there is a growing interest in using federated technologies to answer more basic questions about decentralized data — like computing counts or rates — that often don’t involve ML at all. Analyzing user behavior through these techniques can lead to better products, but it is essential to ensure that the underlying data remains private and secure.

Today we’re introducing federated analytics, the practice of applying data science methods to the analysis of raw data that is stored locally on users’ devices. Like federated learning, it works by running local computations over each device’s data, and only making the aggregated results — and never any data from a particular device — available to product engineers. Unlike federated learning, however, federated analytics aims to support basic data science needs. This post describes the basic methodologies of federated analytics that were developed in the pursuit of federated learning, how we extended those insights into new domains, and how recent advances in federated technologies enable better accuracy and privacy for a growing range of data science needs.

Origin of Federated Analytics
The first exploration into federated analytics was in support of federated learning: how can engineers measure the quality of federated learning models against real-world data when that data is not available in a data center? The answer was to re-use the federated learning infrastructure but without the learning part. In federated learning, the model definition can include not only the loss function that is to be optimized, but also code to compute metrics that indicate the quality of the model’s predictions. We could use this code to directly evaluate model quality on phones’ data.

As an example, Gboard engineers measured the overall quality of next word prediction models against raw typing data held on users’ phones. Participating phones downloaded a candidate model, locally computed a metric of how well the model’s predictions matched the words that were actually typed, and then uploaded the metric without any adjustment to the model’s weights or any change to the Gboard typing experience. By averaging the metrics uploaded by many phones, engineers learned a population-level summary of model performance. The technique also easily extended to estimate basic statistics like dataset sizes.

Federated Analytics for Song Recognition Measurement
Beyond model evaluation, federated analytics is used to support the Now Playing feature on Google’s Pixel phones, a tool that shows you what song is playing in the room around you. Under the hood, Now Playing uses an on-device database of song fingerprints to identify music playing near the phone without the need for a network connection. The architecture is good for privacy and for users — it is fast, works offline, and no raw or processed audio data leaves the phone. Because every phone in a region receives the same database, and only songs in the database can be recognized, it’s important for the database to hold the right songs.

To measure and improve each regional database quality, engineers needed to answer a basic question: which of its songs are most often recognized? Federated analytics provides an answer without revealing which songs are heard by any individual phone. It is enabled for users who agreed to send device related usage and diagnostics information to Google.

When Now Playing recognizes a song, it records the track name into the on-device Now Playing history, where users can see recently recognized songs and add them to a music app’s playlist. Later, when the phone is idle, plugged in, and connected to WiFi, Google’s federated learning and analytics server may invite the phone to join a “round” of federated analytics computation, along with several hundred other phones. Each phone in the round computes the recognition rate for the songs in its Now Playing History, and uses the secure aggregation protocol to encrypt the results. The encrypted rates are sent to the federated analytics server, which does not have the keys to decrypt them individually. But when combined with the encrypted counts from the other phones in the round, the final tally of all song counts (and nothing else) can be decrypted by the server.

The result enables Google engineers to improve the song database (for example, by making sure the database contains truly popular songs), without any phone revealing which songs were heard. In its first improvement iteration, this resulted in a 5% increase in overall song recognition across all Pixel phones globally.

Protecting Federated Analytics with Secure Aggregation
Secure aggregation can enable stronger privacy properties for federated analytics applications. For intuition about the secure aggregation protocol, consider a simpler version of the song recognition measurement problem. Let’s say that Rakshita wants to know how often her friends Emily and Zheng have listened to a particular song. Emily has heard it SEmily times and Zheng SZheng times, but neither is comfortable sharing their counts with Rakshita or each other. Instead, the trio could perform a secure aggregation: Emily and Zheng meet to decide on a random number M, which they keep secret from Rakshita. Emily reveals to Rakshita the sum SEmily + M, while Zheng reveals the difference SZheng M. Rakshita sees two numbers that are effectively random (they are masked by M), but she can add them together (SEmily + M) + (SZheng M) = SEmily + SZheng to reveal the total number of times that the song was heard by both Emily and Zheng.

The privacy properties of this approach can be strengthened by summing over more people or by adding small random values to the counts (e.g. in support of differential privacy). For Now Playing, song recognition rates from hundreds of devices are summed together, before the result is revealed to the engineers.

An illustration of the secure aggregation protocol, from the federated learning comic book.

Toward Learning and Analytics with Greater Privacy
The methods of federated analytics are an active area of research and already go beyond analyzing metrics and counts. Sometimes, training ML models with federated learning can be used for obtaining aggregate insights about on-device data, without any of the raw data leaving the devices. For example, Gboard engineers wanted to discover new words commonly typed by users and add them to dictionaries used for spell-checking and typing suggestions, all without being able to see any words that users typed. They did it by training a character-level recurrent neural network on phones, using only the words typed on these phones that were not already in the global dictionary. No typed words ever left the phones, but the resulting model could then be used in the datacenter to generate samples of frequently typed character sequences – the new words!

We are also developing techniques for answering even more ambiguous questions on decentralized datasets like “what patterns in the data are difficult for my model to recognize?” by training federated generative models. And we’re exploring ways to apply user-level differentially private model training to further ensure that these models do not encode information unique to any one user.

Google’s commitment to our privacy principles means pushing the state of the art in safeguarding user data, be it through differential privacy in the data center or advances in privacy during data collection. Google’s earliest system for decentralized data analysis, RAPPOR, was introduced in 2014, and we’ve learned a lot about making effective decisions even with a great deal of noise (often introduced for local differential privacy) since. Federated analytics continues this line of work.

It’s still early days for the federated analytics approach and more progress is needed to answer many common data science questions with good accuracy. The recent Advances and Open Problems in Federated Learning paper offers a comprehensive survey of federated research, while Federated Heavy Hitters Discovery with Differential Privacy introduces a federated analytics method for the discovery of most frequent items in the dataset. Federated analytics enables us to think about data science differently, with decentralized data and privacy-preserving aggregation in a central role. We welcome new contributions and extensions in this emerging field.

Acknowledgments
This post reflects the work of many people, including Blaise Agüera y Arcas, Galen Andrew, Sean Augenstein, Françoise Beaufays, Kallista Bonawitz, Mingqing Chen, Hubert Eichner, Úlfar Erlingsson, Christian Frank, Anna Goralska, Marco Gruteser, Alex Ingerman, Vladimir Ivanov, Peter Kairouz, Chloé Kiddon, Ben Kreuter, Alison Lentz, Wei Li, Xu Liu, Antonio Marcedone, Rajiv Mathews, Brendan McMahan, Tom Ouyang, Sarvar Patel, Swaroop Ramaswamy, Aaron Segal, Karn Seth, Haicheng Sun, Timon Van Overveldt, Sergei Vassilvitskii, Scott Wegner, Yuanbo Zhang, Li Zhang, and Wennan Zhu.

Evaluating Natural Language Generation with BLEURT

Evaluating Natural Language Generation with BLEURT

Posted by Thibault Sellam, Software Engineer, and Ankur P. Parikh, Research Scientist, Google Research

In the last few years, research in natural language generation (NLG) has made tremendous progress, with models now able to translate text, summarize articles, engage in conversation, and comment on pictures with unprecedented accuracy, using approaches with increasingly high levels of sophistication. Currently, there are two methods to evaluate these NLG systems: human evaluation and automatic metrics. With human evaluation, one runs a large-scale quality survey for each new version of a model using human annotators, but that approach can be prohibitively labor intensive. In contrast, one can use popular automatic metrics (e.g., BLEU), but these are oftentimes unreliable substitutes for human interpretation and judgement. The rapid progress of NLG and the drawbacks of existing evaluation methods calls for the development of novel ways to assess the quality and success of NLG systems.

In “BLEURT: Learning Robust Metrics for Text Generation” (presented during ACL 2020), we introduce a novel automatic metric that delivers ratings that are robust and reach an unprecedented level of quality, much closer to human annotation. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) builds upon recent advances in transfer learning to capture widespread linguistic phenomena, such as paraphrasing. The metric is available on Github.

Evaluating NLG Systems
In human evaluation, a piece of generated text is presented to annotators, who are tasked with assessing its quality with respect to its fluency and meaning. The text is typically shown side-by-side with a reference, authored by a human or mined from the Web.

An example questionnaire used for human evaluation in machine translation.

The advantage of this method is that it is accurate: people are still unrivaled when it comes to evaluating the quality of a piece of text. However, this method of evaluation can easily take days and involve dozens of people for just a few thousand examples, which disrupts the model development workflow.

In contrast, the idea behind automatic metrics is to provide a cheap, low-latency proxy for human-quality measurements. Automatic metrics often take two sentences as input, a candidate and a reference, and they return a score that indicates to what extent the former resembles the latter, typically using lexical overlap. A popular metric is BLEU, which counts the sequences of words in the candidate that also appear in the reference (the BLEU score is very similar to precision).

The advantages and weaknesses of automatic metrics are the opposite of those that come with human evaluation. Automatic metrics are convenient — they can be computed in real-time throughout the training process (e.g., for plotting with Tensorboard). However, they are often inaccurate due to their focus on surface-level similarities and they fail to capture the diversity of human language. Frequently, there are many perfectly valid sentences that can convey the same meaning. Overlap-based metrics that rely exclusively on lexical matches unfairly reward those that resemble the reference in their surface form, even if they do not accurately capture meaning, and penalize other paraphrases.

BLEU scores for three candidate sentences. Candidate 2 is semantically close to the reference, and yet its score is lower than Candidate 3.

Ideally, an evaluation method for NLG should combine the advantages of both human evaluation and automatic metrics — it should be relatively cheap to compute, but flexible enough to cope with linguistic diversity.

Introducing BLEURT
BLEURT is a novel, machine learning-based automatic metric that can capture non-trivial semantic similarities between sentences. It is trained on a public collection of ratings (the WMT Metrics Shared Task dataset) as well as additional ratings provided by the user.

Three candidate sentences rated by BLEURT. BLEURT captures that candidate 2 is similar to the reference, even though it contains more non-reference words than candidate 3.

Creating a metric based on machine learning poses a fundamental challenge: the metric should do well consistently on a wide range of tasks and domains, and over time. However, there is only a limited amount of training data. Indeed, public data is sparse — the WMT Metrics Task dataset, the largest collection of human ratings at the time of writing, contains ~260K human ratings covering the news domain only. This is too limited to train a metric suited for the evaluation of NLG systems of the future.

To address this problem, we employ transfer learning. First, we use the contextual word representations of BERT, a state-of-the-art unsupervised representation learning method for language understanding that has already been successfully incorporated into NLG metrics (e.g., YiSi or BERTscore).

Second, we introduce a novel pre-training scheme to increase BLEURT’s robustness. Our experiments reveal that training a regression model directly over publicly available human ratings is a brittle approach, since we cannot control in what domain and across what time span the metric will be used. The accuracy is likely to drop in the presence of domain drift, i.e., when the text used comes from a different domain than the training sentence pairs. It may also drop when there is a quality drift, when the ratings to be predicted are higher than those used during training — a feature which would normally be good news because it indicates that ML research is making progress.

The success of BLEURT relies on “warming-up” the model using millions of synthetic sentence pairs before fine-tuning on human ratings. We generated training data by applying random perturbations to sentences from Wikipedia. Instead of collecting human ratings, we use a collection of metrics and models from the literature (including BLEU), which allows the number of training examples to be scaled up at very low cost.

BLEURT’s data generation process combines random perturbations and scoring with pre-existing metrics and models.

Experiments reveal that pre-training significantly increases BLEURT’s accuracy, especially when the test data is out-of-distribution.

We pre-train BLEURT twice, first with a language modelling objective (as explained in the original BERT paper), then with a collection of NLG evaluation objectives. We then fine-tune the model on the WMT Metrics dataset, on a set of ratings provided by the user, or a combination of both.The following figure illustrates BLEURT’s training procedure end-to-end.

Results
We benchmark BLEURT against competing approaches and show that it offers superior performance, correlating well with human ratings on the WMT Metrics Shared Task (machine translation) and the WebNLG Challenge (data-to-text). For example, BLEURT is ~48% more accurate than BLEU on the WMT Metrics Shared Task of 2019. We also demonstrate that pre-training helps BLEURT cope with quality drift.

Correlation between different metrics and human ratings on the WMT’19 Metrics Shared Task.

Conclusion
As NLG models have gotten better over time, evaluation metrics have become an important bottleneck for the research in this field. There are good reasons why overlap-based metrics are so popular: they are simple, consistent, and they do not require any training data. In the use cases where multiple reference sentences are available for each candidate, they can be very accurate. While they play a critical part in our infrastructure, they are also very conservative, and only give an incomplete picture of NLG systems’ performance. Our view is that ML engineers should enrich their evaluation toolkits with more flexible, semantic-level metrics.

BLEURT is our attempt to capture NLG quality beyond surface overlap. Thanks to BERT’s representations and a novel pre-training scheme, our metric yields SOTA performance on two academic benchmarks, and we are currently investigating how it can improve Google products. Future research includes investigating multilinguality and multimodality.

Acknowledgements
This project was co-advised by Dipanjan Das. We thank Slav Petrov, Eunsol Choi, Nicholas FitzGerald, Jacob Devlin, Madhavan Kidambi, Ming-Wei Chang, and all the members of the Google Research Language team.