Multi-Task Robotic Reinforcement Learning at Scale

Posted by Karol Hausman, Senior Research Scientist and Yevgen Chebotar, Research Scientist, Robotics at Google

For general-purpose robots to be most useful, they would need to be able to perform a range of tasks, such as cleaning, maintenance and delivery. But training even a single task (e.g., grasping) using offline reinforcement learning (RL), a trial and error learning method where the agent uses training previously collected data, can take thousands of robot-hours, in addition to the significant engineering needed to enable autonomous operation of a large-scale robotic system. Thus, the computational costs of building general-purpose everyday robots using current robot learning methods becomes prohibitive as the number of tasks grows.

Multi-task data collection across multiple robots where different robots collect data for different tasks.

In other large-scale machine learning domains, such as natural language processing and computer vision, a number of strategies have been applied to amortize the effort of learning over multiple skills. For example, pre-training on large natural language datasets can enable few- or zero-shot learning of multiple tasks, such as question answering and sentiment analysis. However, because robots collect their own data, robotic skill learning presents a unique set of opportunities and challenges. Automating this process is a large engineering endeavour, and effectively reusing past robotic data collected by different robots remains an open problem.

Today we present two new advances for robotic RL at scale, MT-Opt, a new multi-task RL system for automated data collection and multi-task RL training, and Actionable Models, which leverages the acquired data for goal-conditioned RL. MT-Opt introduces a scalable data-collection mechanism that is used to collect over 800,000 episodes of various tasks on real robots and demonstrates a successful application of multi-task RL that yields ~3x average improvement over baseline. Additionally, it enables robots to master new tasks quickly through use of its extensive multi-task dataset (new task fine-tuning in <1 day of data collection). Actionable Models enables learning in the absence of specific tasks and rewards by training an implicit model of the world that is also an actionable robotic policy. This drastically increases the number of tasks the robot can perform (via visual goal specification) and enables more efficient learning of downstream tasks.

Large-Scale Multi-Task Data Collection System
The cornerstone for both MT-Opt and Actionable Models is the volume and quality of training data. To collect diverse, multi-task data at scale, users need a way to specify tasks, decide for which tasks to collect the data, and finally, manage and balance the resulting dataset. To that end, we create a scalable and intuitive multi-task success detector using data from all of the chosen tasks. The multi-task success is trained using supervised learning to detect the outcome of a given task and it allows users to quickly define new tasks and their rewards. When this success detector is being applied to collect data, it is periodically updated to accommodate distribution shifts caused by various real-world factors, such as varying lighting conditions, changing background surroundings, and novel states that the robots discover.

Second, we simultaneously collect data for multiple distinct tasks across multiple robots by using solutions to easier tasks to effectively bootstrap learning of more complex tasks. This allows training of a policy for the harder tasks and improves the data collected for them. As such, the amount of per-task data and the number of successful episodes for each task grows over time. To further improve the performance, we focus data collection on underperforming tasks, rather than collecting data uniformly across tasks.

This system collected 9600 robot hours of data (from 57 continuous data collection days on seven robots). However, while this data collection strategy was effective at collecting data for a large number of tasks, the success rate and data volume was imbalanced between tasks.

Learning with MT-Opt
We address the data collection imbalance by transferring data across tasks and re-balancing the per-task data. The robots generate episodes that are labelled as success or failure for each task and are then copied and shared across other tasks. The balanced batch of episodes is then sent to our multi-task RL training pipeline to train the MT-Opt policy.

Data sharing and task re-balancing strategy used by MT-Opt. The robots generate episodes which then get labelled as success or failure for the current task and are then shared across other tasks.

MT-Opt uses Q-learning, a popular RL method that learns a function that estimates the future sum of rewards, called the Q-function. The learned policy then picks the action that maximizes this learned Q-function. For multi-task policy training, we specify the task as an extra input to a large Q-learning network (inspired by our previous work on large-scale single-task learning with QT-Opt) and then train all of the tasks simultaneously with offline RL using the entire multi-task dataset. In this way, MT-Opt is able to train on a wide variety of skills that include picking specific objects, placing them into various fixtures, aligning items on a rack, rearranging and covering objects with towels, etc.

Compared to single-task baselines, MT-Opt performs similarly on the tasks that have the most data and significantly improves performance on underrepresented tasks. So, for a generic lifting task, which has the most supporting data, MT-Opt achieved an 89% success rate (compared to 88% for QT-Opt) and achieved a 50% average success rate across rare tasks, compared to 1% with a single-task QT-Opt baseline and 18% using a naïve, multi-task QT-Opt baseline. Using MT-Opt not only enables zero-shot generalization to new but similar tasks, but also can quickly (in about 1 day of data collection on seven robots) be fine-tuned to new, previously unseen tasks. For example, when applied to an unseen towel-covering task, the system achieved a zero-shot success rate of 92% for towel-picking and 79% for object-covering, which wasn’t present in the original dataset.

Example tasks that MT-Opt is able to learn, such as instance and indiscriminate grasping, chasing, placing, aligning and rearranging.

<!–

Example tasks that MT-Opt is able to learn, such as instance and indiscriminate grasping, chasing, placing, aligning and rearranging.

–>

Towel-covering task that was not present in the original dataset. We fine-tune MT-Opt on this novel task in 1 day to achieve a high (>90%) success rate.

Learning with Actionable Models
While supplying a rigid definition of tasks facilitates autonomous data collection for MT-Opt, it limits the number of learnable behaviors to a fixed set. To enable learning a wider range of tasks from the same data, we use goal-conditioned learning, i.e., learning to reach given goal configurations of a scene in front of the robot, which we specify with goal images. In contrast to explicit model-based methods that learn predictive models of future world observations, or approaches that employ online data collection, this approach learns goal-conditioned policies via offline model-free RL.

To learn to reach any goal state, we perform hindsight relabeling of all trajectories and sub-sequences in our collected dataset and train a goal-conditioned Q-function in a fully offline manner (in contrast to learning online using a fixed set of success examples as in recursive classification). One challenge in this setting is the distributional shift caused by learning only from “positive” hindsight relabeled examples. This we address by employing a conservative strategy to minimize Q-values of unseen actions using artificial negative actions. Furthermore, to enable reaching temporary-extended goals, we introduce a technique for chaining goals across multiple episodes.

Actionable Models relabel sub-sequences with all intermediate goals and regularize Q-values with artificial negative actions.

Training with Actionable Models allows the system to learn a large repertoire of visually indicated skills, such as object grasping, container placing and object rearrangement. The model is also able to generalize to novel objects and visual objectives not seen in the training data, which demonstrates its ability to learn general functional knowledge about the world. We also show that downstream reinforcement learning tasks can be learned more efficiently by either fine-tuning a pre-trained goal-conditioned model or through a goal-reaching auxiliary objective during training.

Example tasks (specified by goal-images) that our Actionable Model is able to learn.

Conclusion
The results of both MT-Opt and Actionable Models indicate that it is possible to collect and then learn many distinct tasks from large diverse real-robot datasets within a single model, effectively amortizing the cost of learning across many skills. We see this an important step towards general robot learning systems that can be further scaled up to perform many useful services and serve as a starting point for learning downstream tasks.

This post is based on two papers, “MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale” and “Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills,” with additional information and videos on the project websites for MT-Opt and Actionable Models.

Acknowledgements
This research was conducted by Dmitry Kalashnikov, Jake Varley, Yevgen Chebotar, Ben Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, Yao Lu, Alex Irpan, Ben Eysenbach, Ryan Julian and Ted Xiao. We’d like to give special thanks to Josh Weaver, Noah Brown, Khem Holden, Linda Luu and Brandon Kinman for their robot operation support; Anthony Brohan for help with distributed learning and testing infrastructure; Tom Small for help with videos and project media; Julian Ibarz, Kanishka Rao, Vikas Sindhwani and Vincent Vanhoucke for their support; Tuna Toksoz and Garrett Peake for improving the bin reset mechanisms; Satoshi Kataoka, Michael Ahn, and Ken Oslund for help with the underlying control stack, and the rest of the Robotics at Google team for their overall support and encouragement. All the above contributions were incredibly enabling for this research.

Read More

Presenting the iGibson Challenge on Interactive and Social Navigation

Posted by Anthony Francis, Software Engineer and Alexander Toshev, Staff Research Scientist, Google Research

Computer vision has significantly advanced over the past decade thanks to large-scale benchmarks, such as ImageNet for image classification or COCO for object detection, which provide vast datasets and criteria for evaluating models. However, these traditional benchmarks evaluate passive tasks in which the emphasis is on perception alone, whereas more recent computer vision research has tackled active tasks, which require both perception and action (often called “embodied AI”).

The First Embodied AI Workshop, co-organized by Google at CVPR 2020, hosted several benchmark challenges for active tasks, including the Stanford and Google organized Sim2Real Challenge with iGibson, which provided a real-world setup to test navigation policies trained in photo-realistic simulation environments. An open-source setup in the challenge enabled the community to train policies in simulation, which could then be run in repeatable real world navigation experiments, enabling the evaluation of the “sim-to-real gap” — the difference between simulation and the real world. Many research teams submitted solutions during the pandemic, which were run safely by challenge organizers on real robots, with winners presenting their results virtually at the workshop.

This year, Stanford and Google are proud to announce a new version of the iGibson Challenge on Interactive and Social Navigation, one of the 10 active visual challenges affiliated with the Second Embodied AI Workshop at CVPR 2021. This year’s Embodied AI Workshop is co-organized by Google and nine other research organizations, and explores issues such as simulation, sim-to-real transfer, visual navigation, semantic mapping and change detection, object rearrangement and restoration, auditory navigation, and following instructions for navigation and interaction tasks. In addition, this year’s interactive and social iGibson challenge explores interactive navigation and social navigation — how robots can learn to interact with people and objects in their environments — by combining the iGibson simulator, the Google Scanned Objects Dataset, and simulated pedestrians within realistic human environments.

New Challenges in Navigation
Active perception tasks are challenging, as they require both perception and actions in response. For example, point navigation involves navigating through mapped space, such as driving robots over kilometers in human-friendly buildings, while recognizing and avoiding obstacles. Similarly object navigation involves looking for objects in buildings, requiring domain invariant representations and object search behaviors. Additionally, visual language instruction navigation involves navigating through buildings based on visual images and commands in natural language. These problems become even harder in a real-world environment, where robots must be able to handle a variety of physical and social interactions that are much more dynamic and challenging to solve. In this year’s iGibson Challenge, we focus on two of those settings:

  • Interactive Navigation: In a cluttered environment, an agent navigating to a goal must physically interact with objects to succeed. For example, an agent should recognize that a shoe can be pushed aside, but that an end table should not be moved and a sofa cannot be moved.
  • Social Navigation: In a crowded environment in which people are also moving about, an agent navigating to a goal must move politely around the people present with as little disruption as possible.

New Features of the iGibson 2021 Dataset
To facilitate research into techniques that address these problems, the iGibson Challenge 2021 dataset provides simulated interactive scenes for training. The dataset includes eight fully interactive scenes derived from real-world apartments, and another seven scenes held back for testing and evaluation.

iGibson provides eight fully interactive scenes derived from real-world apartments.

To enable interactive navigation, these scenes are populated with small objects drawn from the Google Scanned Objects Dataset, a dataset of common household objects scanned in 3D for use in robot simulation and computer vision research, licensed under a Creative Commons license to give researchers the freedom to use them in their research.

The Google Scanned Objects Dataset contains 3D models of many common objects.

The challenge is implemented in Stanford’s open-source iGibson simulation platform, a fast, interactive, photorealistic robotic simulator with physics based on Bullet. For this year’s challenge, iGibson has been expanded with fully interactive environments and pedestrian behaviors based on the ORCA crowd simulation algorithm.

iGibson environments include ORCA crowd simulations and movable objects.

Participating in the Challenge
The iGibson Challenge has launched and its leaderboard is open in the Dev phase, in which participants are encouraged to submit robotic control to the development leaderboard, where they will be tested on the Interactive and Social Navigation challenges on our holdout dataset. The Test phase opens for teams to submit final solutions on May 16th and closes on May 31st, with the winner demo scheduled for June 20th, 2021. For more details on participating, please check out the iGibson Challenge Page.

Acknowledgements
We’d like to thank our colleagues at at the Stanford Vision and Learning Lab (SVL) for working with us to advance the state of interactive and social robot navigation, including Chengshu Li, Claudia Pérez D’Arpino, Fei Xia, Jaewoo Jang, Roberto Martin-Martin and Silvio Savarese. At Google, we would like to thank Aleksandra Faust, Anelia Angelova, Carolina Parada, Edward Lee, Jie Tan, Krista Reyman and the rest of our collaborators on mobile robotics. We would also like to thank our co-organizers on the Embodied AI Workshop, including AI2, Facebook, Georgia Tech, Intel, MIT, SFU, Stanford, UC Berkeley, and University of Washington.

Read More

Monster Mash: A Sketch-Based Tool for Casual 3D Modeling and Animation

Posted by Cassidy Curtis, Visual Designer and David Salesin, Principal Scientist, Google Research

3D computer animation is a time-consuming and highly technical medium — to complete even a single animated scene requires numerous steps, like modeling, rigging and animating, each of which is itself a sub-discipline that can take years to master. Because of its complexity, 3D animation is generally practiced by teams of skilled specialists and is inaccessible to almost everyone else, despite decades of advances in technology and tools. With the recent development of tools that facilitate game character creation and game balance, a natural question arises: is it possible to democratize the 3D animation process so it’s accessible to everyone?

To explore this concept, we start with the observation that most forms of artistic expression have a casual mode: a classical guitarist might jam without any written music, a trained actor could ad-lib a line or two while rehearsing, and an oil painter can jot down a quick gesture drawing. What these casual modes have in common is that they allow an artist to express a complete thought quickly and intuitively without fear of making a mistake. This turns out to be essential to the creative process — when each sketch is nearly effortless, it is possible to iteratively explore the space of possibilities far more effectively.

In this post, we describe Monster Mash, an open source tool presented at SIGGRAPH Asia 2020 that allows experts and amateurs alike to create rich, expressive, deformable 3D models from scratch — and to animate them — all in a casual mode, without ever having to leave the 2D plane. With Monster Mash, the user sketches out a character, and the software automatically converts it to a soft, deformable 3D model that the user can immediately animate by grabbing parts of it and moving them around in real time. There is also an online demo, where you can try it out for yourself.

Creating a walk cycle using Monster Mash. Step 1: Draw a character. Step 2: Animate it.

Creating a 2D Sketch
The insight that makes this casual sketching approach possible is that many 3D models, particularly those of organic forms, can be described by an ordered set of overlapping 2D regions. This abstraction makes the complex task of 3D modeling much easier: the user creates 2D regions by drawing their outlines, then the algorithm creates a 3D model by stitching the regions together and inflating them. The result is a simple and intuitive user interface for sketching 3D figures.

For example, suppose the user wants to create a 3D model of an elephant. The first step is to draw the body as a closed stroke (a). Then the user adds strokes to depict other body parts such as legs (b). Drawing those additional strokes as open curves provides a hint to the system that they are meant to be smoothly connected with the regions they overlap. The user can also specify that some new parts should go behind the existing ones by drawing them with the right mouse button (c), and mark other parts as symmetrical by double-clicking on them (d). The result is an ordered list of 2D regions.

Steps in creating a 2D sketch of an elephant.

Stitching and Inflation
To understand how a 3D model is created from these 2D regions, let’s look more closely at one part of the elephant. First, the system identifies where the leg must be connected to the body (a) by finding the segment (red) that completes the open curve. The system cuts the body’s front surface along that segment, and then stitches the front of the leg together with the body (b). It then inflates the model into 3D by solving a modified form of Poisson’s equation to produce a surface with a rounded cross-section (c). The resulting model (d) is smooth and well-shaped, but because all of the 3D parts are rooted in the drawing plane, they may intersect each other, resulting in a somewhat odd-looking “elephant”. These intersections will be resolved by the deformation system.

Illustration of the details of the stitching and inflation process. The schematic illustrations (b, c) are cross-sections viewed from the elephant’s front.

Layered Deformation
At this point we just have a static model — we need to give the user an easy way to pose the model, and also separate the intersecting parts somehow. Monster Mash’s layered deformation system, based on the well-known smooth deformation method as-rigid-as-possible (ARAP), solves both of these problems at once. What’s novel about our layered “ARAP-L” approach is that it combines deformation and other constraints into a single optimization framework, allowing these processes to run in parallel at interactive speed, so that the user can manipulate the model in real time.

The framework incorporates a set of layering and equality constraints, which move body parts along the z axis to prevent them from visibly intersecting each other. These constraints are applied only at the silhouettes of overlapping parts, and are dynamically updated each frame.

In steps (d) through (h) above, ARAP-L transforms a model from one with intersecting 3D parts to one with the depth ordering specified by the user. The layering constraints force the leg’s silhouette to stay in front of the body (green), and the body’s silhouette to stay behind the leg (yellow). Equality constraints (red) seal together the loose boundaries between the leg and the body.

Meanwhile, in a separate thread of the framework, we satisfy point constraints to make the model follow user-defined control points (described in the section below) in the xy-plane. This ARAP-L method allows us to combine modeling, rigging, deformation, and animation all into a single process that is much more approachable to the non-specialist user.

The model deforms to match the point constraints (red dots) while the layering constraints prevent the parts from visibly intersecting.

Animation
To pose the model, the user can create control points anywhere on the model’s surface and move them. The deformation system converges over multiple frames, which gives the model’s movement a soft and floppy quality, allowing the user to intuitively grasp its dynamic properties — an essential prerequisite for kinesthetic learning.

Because the effect of deformations converges over multiple frames, our system lends 3D models a soft and dynamic quality.

To create animation, the system records the user’s movements in real time. The user can animate one control point, then play back that movement while recording additional control points. In this way, the user can build up a complex action like a walk by layering animation, one body part at a time. At every stage of the animation process, the only task required of the user is to move points around in 2D, a low-risk workflow meant to encourage experimentation and play.

Conclusion
We believe this new way of creating animation is intuitive and can thus help democratize the field of computer animation, encouraging novices who would normally be unable to try it on their own as well as experts who often require fast iteration under tight deadlines. Here you can see a few of the animated characters that have been created using Monster Mash. Most of these were created in a matter of minutes.

A selection of animated characters created using Monster Mash. The original hand-drawn outline used to create each 3D model is visible as an inset above each character.

All of the code for Monster Mash is available as open source, and you can watch our presentation and read our paper from SIGGRAPH Asia 2020 to learn more. We hope this software will make creating 3D animations more broadly accessible. Try out the online demo and see for yourself!

Acknowledgements
Monster Mash is the result of a collaboration between Google Research, Czech Technical University in Prague, ETH Zürich, and the University of Washington. Key contributors include Marek Dvorožňák, Daniel Sýkora, Cassidy Curtis, Brian Curless, Olga Sorkine-Hornung, and David Salesin. We are also grateful to Hélène Leroux, Neth Nom, David Murphy, Samuel Leather, Pavla Sýkorová, and Jakub Javora for participating in the early interactive sessions.

Read More

Announcing the 2021 Research Scholar Program Recipients

Posted by Negar Saei, Program Manager, University Relations

In March 2020 we introduced the Research Scholar Program, an effort focused on developing collaborations with new professors and encouraging the formation of long-term relationships with the academic community. In November we opened the inaugural call for proposals for this program, which was received with enthusiastic interest from faculty who are working on cutting edge research across many research areas in computer science, including machine learning, human computer interaction, health research, systems and more.

Today, we are pleased to announce that in this first year of the program we have granted 77 awards, which included 86 principal investigators representing 15+ countries and over 50 universities. Of the 86 award recipients, 43% identify as an historically marginalized group within technology. Please see the full list of 2021 recipients on our web page, as well as in the list below.

We offer our congratulations to this year’s recipients, and look forward to seeing what they achieve!

Algorithms and Optimization
Alexandros Psomas, Purdue University
Auction Theory Beyond Independent, Quasi-Linear Bidders
Julian Shun, Massachusetts Institute of Technology
Scalable Parallel Subgraph Finding and Peeling Algorithms
Mary Wootters, Stanford University
The Role of Redundancy in Algorithm Design
Pravesh K. Kothari, Carnegie Mellon University
Efficient Algorithms for Robust Machine Learning
Sepehr Assadi, Rutgers University
Graph Clustering at Scale via Improved Massively Parallel Algorithms

Augmented Reality and Virtual Reality
Srinath Sridhar, Brown University
Perception and Generation of Interactive Objects

Geo
Miriam E. Marlier, University of California, Los Angeles
Mapping California’s Compound Climate Hazards in Google Earth Engine
Suining He, The University of Connecticut
Fairness-Aware and Cross-Modality Traffic Learning and Predictive Modeling for Urban Smart Mobility Systems

Human Computer Interaction
Arvind Satyanarayan, Massachusetts Institute of Technology
Generating Semantically Rich Natural Language Captions for Data Visualizations to Promote Accessibility
Dina EL-Zanfaly, Carnegie Mellon University
In-the-making: An intelligence mediated collaboration system for creative practices
Katharina Reinecke, University of Washington
Providing Science-Backed Answers to Health-related Questions in Google Search
Misha Sra, University of California, Santa Barbara
Hands-free Game Controller for Quadriplegic Individuals
Mohsen Mosleh, University of Exeter Business School
Effective Strategies to Debunk False Claims on Social Media: A large-scale digital field experiments approach
Tanushree Mitra, University of Washington
Supporting Scalable Value-Sensitive Fact-Checking through Human-AI Intelligence

Health Research
Catarina Barata, Instituto Superior Técnico, Universidade de Lisboa
DeepMutation – A CNN Model To Predict Genetic Mutations In Melanoma Patients
Emma Pierson, Cornell Tech, the Jacobs Institute, Technion-Israel Institute of Technology, and Cornell University
Using cell phone mobility data to reduce inequality and improve public health
Jasmine Jones, Berea College
Reachout: Co-Designing Social Connection Technologies for Isolated Young Adults
Mojtaba Golzan, University of Technology Sydney, Jack Phu, University of New South Wales
Autonomous Grading of Dynamic Blood Vessel Markers in the Eye using Deep Learning
Serena Yeung, Stanford University
Artificial Intelligence Analysis of Surgical Technique in the Operating Room

Machine Learning and data mining
Aravindan Vijayaraghavan, Northwestern University, Sivaraman Balakrishnan, Carnegie Mellon University
Principled Approaches for Learning with Test-time Robustness
Cho-Jui Hsieh, University of California, Los Angeles
Scalability and Tunability for Neural Network Optimizers
Golnoosh Farnadi, University of Montreal, HEC Montreal/MILA
Addressing Algorithmic Fairness in Decision-focused Deep Learning
Harrie Oosterhuis, Radboud University
Search and Recommendation Systems that Learn from Diverse User Preferences
Jimmy Ba, University of Toronto
Model-based Reinforcement Learning with Causal World Models
Nadav Cohen, Tel-Aviv University
A Dynamical Theory of Deep Learning
Nihar Shah, Carnegie Mellon University
Addressing Unfairness in Distributed Human Decisions
Nima Fazeli, University of Michigan
Semi-Implicit Methods for Deformable Object Manipulation
Qingyao Ai, University of Utah
Metric-agnostic Ranking Optimization
Stefanie Jegelka, Massachusetts Institute of Technology
Generalization of Graph Neural Networks under Distribution Shifts
Virginia Smith, Carnegie Mellon University
A Multi-Task Approach for Trustworthy Federated Learning

Mobile
Aruna Balasubramanian, State University of New York – Stony Brook
AccessWear: Ubiquitous Accessibility using Wearables
Tingjun Chen, Duke University
Machine Learning- and Optical-enabled Mobile Millimeter-Wave Networks

Machine Perception
Amir Patel, University of Cape Town
WildPose: 3D Animal Biomechanics in the Field using Multi-Sensor Data Fusion
Angjoo Kanazawa, University of California, Berkeley
Practical Volumetric Capture of People and Scenes
Emanuele Rodolà, Sapienza University of Rome
Fair Geometry: Toward Algorithmic Debiasing in Geometric Deep Learning
Minchen Wei, The Hong Kong Polytechnic University
Accurate Capture of Perceived Object Colors for Smart Phone Cameras
Mohsen Ali, Information Technology University of the Punjab, Pakistan, Izza Aftab, Information Technology University of the Punjab, Pakistan
Is Economics From Afar Domain Generalizable?
Vineeth N Balasubramanian, Indian Institute of Technology Hyderabad
Bridging Perspectives of Explainability and Adversarial Robustness
Xin Yu, University of Technology Sydney, Linchao Zhu, University of Technology Sydney
Sign Language Translation in the Wild

Networking
Aurojit Panda, New York University
Bertha: Network APIs for the Programmable Network Era
Cristina Klippel Dominicini, Instituto Federal do Espirito Santo
Polynomial Key-based Architecture for Source Routing in Network Fabrics
Noa Zilberman, University of Oxford
Exposing Vulnerabilities in Programmable Network Devices
Rachit Agarwal, Cornell University
Designing Datacenter Transport for Terabit Ethernet

Natural Language Processing
Danqi Chen, Princeton University
Improving Training and Inference Efficiency of NLP Models
Derry Tanti Wijaya, Boston University, Anietie Andy, University of Pennsylvania
Exploring the evolution of racial biases over time through framing analysis
Eunsol Choi, University of Texas at Austin
Answering Information Seeking Questions In The Wild
Kai-Wei Chang, University of California, Los Angeles
Certified Robustness to against language differences in Cross-Lingual Transfer
Mohohlo Samuel Tsoeu, University of Cape Town
Corpora collection and complete natural language processing of isiXhosa, Sesotho and South African Sign languages
Natalia Diaz Rodriguez, University of Granada (Spain) + ENSTA, Institut Polytechnique Paris, Inria. Lorenzo Baraldi, University of Modena and Reggio Emilia
SignNet: Towards democratizing content accessibility for the deaf by aligning multi-modal sign representations

Other Research Areas
John Dickerson, University of Maryland – College Park, Nicholas Mattei, Tulane University
Fairness and Diversity in Graduate Admissions
Mor Nitzan, Hebrew University
Learning representations of tissue design principles from single-cell data
Nikolai Matni, University of Pennsylvania
Robust Learning for Safe Control

Privacy
Foteini Baldimtsi, George Mason University
Improved Single-Use Anonymous Credentials with Private Metabit
Yu-Xiang Wang, University of California, Santa Barbara
Stronger, Better and More Accessible Differential Privacy with autodp

Quantum Computing
Ashok Ajoy, University of California, Berkeley
Accelerating NMR spectroscopy with a Quantum Computer
John Nichol, University of Rochester
Coherent spin-photon coupling
Jordi Tura i Brugués, Leiden University
RAGECLIQ – Randomness Generation with Certification via Limited Quantum Devices
Nathan Wiebe, University of Toronto
New Frameworks for Quantum Simulation and Machine Learning
Philipp Hauke, University of Trento
ProGauge: Protecting Gauge Symmetry in Quantum Hardware
Shruti Puri, Yale University
Surface Code Co-Design for Practical Fault-Tolerant Quantum Computing

Structured data, extraction, semantic graph, and database management
Abolfazl Asudeh, University Of Illinois, Chicago
An end-to-end system for detecting cherry-picked trendlines
Eugene Wu, Columbia University
Interactive training data debugging for ML analytics
Jingbo Shang, University of California, San Diego
Structuring Massive Text Corpora via Extremely Weak Supervision

Security
Chitchanok Chuengsatiansup, The University of Adelaide, Markus Wagner, The University of Adelaide
Automatic Post-Quantum Cryptographic Code Generation and Optimization
Elette Boyle, IDC Herzliya, Israel
Cheaper Private Set Intersection via Advances in “Silent OT”
Joseph Bonneau, New York University
Zeroizing keys in secure messaging implementations
Yu Feng , University of California, Santa Barbara, Yuan Tian, University of Virginia
Exploit Generation Using Reinforcement Learning

Software engineering and programming languages
Kelly Blincoe, University of Auckland
Towards more inclusive software engineering practices to retain women in software engineering
Fredrik Kjolstad, Stanford University
Sparse Tensor Algebra Compilation to Domain-Specific Architectures
Milos Gligoric, University of Texas at Austin
Adaptive Regression Test Selection
Sarah E. Chasins, University of California, Berkeley
If you break it, you fix it: Synthesizing program transformations so that library maintainers can make breaking changes

Systems
Adwait Jog, College of William & Mary
Enabling Efficient Sharing of Emerging GPUs
Heiner Litz, University of California, Santa Cruz
Software Prefetching Irregular Memory Access Patterns
Malte Schwarzkopf, Brown University
Privacy-Compliant Web Services by Construction
Mehdi Saligane, University of Michigan
Autonomous generation of Open Source Analog & Mixed Signal IC
Nathan Beckmann, Carnegie Mellon University
Making Data Access Faster and Cheaper with Smarter Flash Caches
Yanjing Li, University of Chicago
Resilient Accelerators for Deep Learning Training Tasks

Read More

How fact checkers and Google.org are fighting misinformation

Misinformation can have dramatic consequences on people’s lives — from finding reliable information on everything from elections to vaccinations — and the pandemic has only exacerbated the problem as accurate information can save lives. To help fight the rise in minsformation, Full Fact, a nonprofit that provides tools and resources to fact checkers, turned to Google.org for help. Today, ahead of International Fact Checking Day, we’re sharing the impact of this work.

Every day, millions of claims, like where to vote and COVID-19 vaccination rates, are made across a multitude of platforms and media. It was becoming increasingly difficult for fact checkers to identify the most important claims to investigate.

We’re not just fighting an epidemic; we’re fighting an infodemic. Fake news spreads faster and more easily than this virus and is just as dangerous. Tedros Adhanom
Director General of the World Health Organization

Last year, Google.org provided Full Fact with $2 million and seven Googlers from the Google.org Fellowship, a pro-bono program that matches teams of Googlers with nonprofits for up to six months to work full-time on technical projects. The Fellows helped Full Fact build AI tools to help fact checkers detect claims made by key politicians, then group them by topic and match them with similar claims from across press, social networks and even radio using speech to text technology. Over the past year, Full Fact boosted the amount of claims they could process by 1000x, detecting and clustering over 100,000 claims per day — that’s more than 36.5 million total claims per year!

The AI-powered tools empower fact checkers to be more efficient, so that they can spend more time actually checking and debunking facts rather than identifying which facts to check. Using a machine learning BERT-based model, the technology now works across four languages (English, French, Portuguese and Spanish). And Full Fact’s work has expanded to South Africa, Nigeria, Kenya with their partner Africa Check and Argentina with Chequeado. In total in 2020, Full Fact’s fact checks appeared 237 million times across the internet. 

Graphic showing the following impact statistics: 1000x increase in detected claims, fact checks appeared 237 million times in search results, the technology works across 4 languages, and  50K claims were detected per day in the UK election.

If you’re interested in learning more about how you can use Google to fact check and spot misinformation, check out some of our tips and tricks. Right now more than ever we need to empower citizens to find reliable authoritative information, and we’re excited about the impact that Full Fact and its partners have had in making the internet a safer place for everyone. 

Read More

Redefining what a map can be with new information and AI

Sixteen years ago, many of us held a printout of directions in one hand and the steering wheel in the other to get around— without information about the traffic along your route or details about when your favorite restaurant was open. Since then, we’ve been pushing the boundaries of what a map can do, propelled by the latest machine learning. This year, we’re on track to bring over 100 AI-powered improvements to Google Maps so you can get the most accurate, up-to-date information about the world, exactly when you need it. Here’s a snapshot of how we’re using AI to make Maps work better for you with a number of updates coming this year.

Navigate indoors with Live View

We all know that awkward moment when you’re walking in the opposite direction of where you want to go — Live View uses AR cues to avoid just that. Live View is powered by a technology called global localization, which uses AI to scan tens of billions of Street View images to understand your orientation. Thanks to new advancements that help us understand the precise altitude and placement of objects inside a building, we’re now able to bring Live View to some of the trickiest-to-navigate places indoors: airports, transit stations and malls. 

If you’re catching a plane or train, Live View can help you find the nearest elevator and escalators, your gate, platform, baggage claim, check-in counters, ticket office, restrooms, ATMs and more. Arrows and accompanying directions will point you the right way. And if you need to pick something up from the mall, use Live View to see what floor a store is on and how to get there so you can get in and out in a snap. Indoor Live View is live now on Android and iOS in a number of malls in Chicago, Long Island, Los Angeles, Newark, San Francisco, San Jose, and Seattle. It starts rolling out in the coming months in select airports, malls, and transit stations in Tokyo and Zurich, with more cities on the way. 

  

Indoor live view 2

Find your way inside airports, train stations, and malls with Indoor Live View

Plan ahead with more information about weather and air quality 

With the new weather layer, you can quickly see current and forecasted temperature and weather conditions in an area — so you’ll never get caught in the rain without an umbrella. And the new air quality layer shows you how healthy (or unhealthy) the air is —  information that’s especially helpful if you have allergies or are in a smoggy or fire-prone area. Data from partners like The Weather Company, AirNow.gov and the Central Pollution Board power these layers that start rolling out on Android and iOS in the coming months. The weather layer will be available globally and the air quality layer will launch in Australia, India, and the U.S., with more countries to come. 

Head outside with the new weather and air quality layers

See helpful air quality and weather information with new layers in Google Maps

Find more eco-friendly options to get around

With insights from the U.S. Department of Energy’s National Renewable Energy Lab, we’re building a new routing model that optimizes for lower fuel consumption based on factors like road incline and traffic congestion. This is all part of the commitment we made last September to help one billion people who use our products take action to reduce their environmental footprint. Soon, Google Maps will default to the route with the lowest carbon footprint when it has approximately the same ETA as the fastest route. In cases where the eco-friendly route could significantly increase your ETA, we’ll let you compare the relative CO2 impact between routes so you can choose. Always want the fastest route? That’s OK too — simply adjust your preferences in Settings. Eco-friendly routes launch in the U.S. on Android and iOS later this year, with a global expansion on the way.

Eco-friendly routes let you choose the route with the lowest carbon footprint

Moe eco-friendly routes let you choose the route with the lowest carbon footprint

From Amsterdam to Jakarta, cities around the world have established low emission zones — areas that restrict polluting vehicles like certain diesel cars or cars with specific emissions stickers —  to help keep the air clean. To support these efforts, we’re working on alerts to help drivers better understand when they’ll be navigating through one of these zones. You can quickly know if your vehicle is allowed in the area, choose an alternative mode of transportation, or take another route. Low emission zone alerts launch this June in Germany, the Netherlands, France, Spain, and the UK on Android and iOS, with more countries coming soon. 

Low emission zone alerts launch in

Quickly know if your vehicle is allowed in the area, choose an alternative mode of transportation, or take another route with low emission zone alerts

But we know that getting around sustainably goes beyond driving. So we’re making it easier to choose more sustainable options when you’re on the go. Soon you’ll get a comprehensive view of all routes and transportation modes available to your destination — you can compare how long it’ll take to get there by car, transit or bike without toggling between tabs. Using advanced machine learning models, Maps will automatically prioritize your preferred modes  — and even boost modes that are popular in your city. For example, if you bike a lot, we’ll automatically show you more biking routes. And if you live in a city like New York, London, Tokyo, or Buenos Aires where taking the subway is popular, we’ll rank that mode higher. This rolls out globally in the coming months on Android and iOS.

new directions experience

Easily compare different routes and modes of transportation with the new directions experience

Save time with curbside grocery pickup on Maps

Delivery and curbside pickup have grown in popularity during the pandemic — they’re convenient and minimize contact. To make this process easier, we’re bringing helpful shopping information to stores’ Business Profiles on Maps and Search, like delivery providers, pickup and delivery windows, fees, and order minimums. We’re rolling this out on mobile Search starting with Instacart and Albertsons Cos. stores in the U.S., with plans to expand to Maps and other partners.

pickup and delivery actions

Check out helpful information about grocery delivery providers, pickup and delivery windows, fees, and order minimums

This summer, we’re also teaming up with U.S. supermarket Fred Meyer, a division of The Kroger Co., on a pilot in select stores in Portland, Oregon to make grocery pickup easier. After you place an order for pickup on the store’s app, you can add it to Maps. We’ll send you a notification when it’s time to leave, and let you share your arrival time with the store. Your ETA is continuously updated, based on location and traffic. This helps the store prioritize your order so it’s ready as soon as you get there. Check in on the Google Maps app, and they’ll bring your order right out for a seamless, fast, no-contact pickup. 

pickup with google maps

Track your grocery order status, share your ETA, and let the store know you’ve arrived – all from Google Maps

All of these updates are possible thanks to AI advancements that have transformed Google Maps into a map that can reflect the millions of changes made around the world every day —  in the biggest cities and the smallest towns. Whether you’re getting around, exploring an area, or knocking out errands, let Google Maps help you find your way.

Read More

Constructing Transformers For Longer Sequences with Sparse Attention Methods

Posted by Avinava Dubey, Research Scientist, Google Research

Natural language processing (NLP) models based on Transformers, such as BERT, RoBERTa, T5, or GPT3, are successful for a wide variety of tasks and a mainstay of modern NLP research. The versatility and robustness of Transformers are the primary drivers behind their wide-scale adoption, leading them to be easily adapted for a diverse range of sequence-based tasks — as a seq2seq model for translation, summarization, generation, and others, or as a standalone encoder for sentiment analysis, POS tagging, machine reading comprehension, etc. The key innovation in Transformers is the introduction of a self-attention mechanism, which computes similarity scores for all pairs of positions in an input sequence, and can be evaluated in parallel for each token of the input sequence, avoiding the sequential dependency of recurrent neural networks, and enabling Transformers to vastly outperform previous sequence models like LSTM.

A limitation of existing Transformer models and their derivatives, however, is that the full self-attention mechanism has computational and memory requirements that are quadratic with the input sequence length. With commonly available current hardware and model sizes, this typically limits the input sequence to roughly 512 tokens, and prevents Transformers from being directly applicable to tasks that require larger context, like question answering, document summarization or genome fragment classification. Two natural questions arise: 1) Can we achieve the empirical benefits of quadratic full Transformers using sparse models with computational and memory requirements that scale linearly with the input sequence length? 2) Is it possible to show theoretically that these linear Transformers preserve the expressivity and flexibility of the quadratic full Transformers?

We address both of these questions in a recent pair of papers. In “ETC: Encoding Long and Structured Inputs in Transformers”, presented at EMNLP 2020, we present the Extended Transformer Construction (ETC), which is a novel method for sparse attention, in which one uses structural information to limit the number of computed pairs of similarity scores. This reduces the quadratic dependency on input length to linear and yields strong empirical results in the NLP domain. Then, in “Big Bird: Transformers for Longer Sequences”, presented at NeurIPS 2020, we introduce another sparse attention method, called BigBird that extends ETC to more generic scenarios where prerequisite domain knowledge about structure present in the source data may be unavailable. Moreover, we also show that theoretically our proposed sparse attention mechanism preserves the expressivity and flexibility of the quadratic full Transformers. Our proposed methods achieve a new state of the art on challenging long-sequence tasks, including question answering, document summarization and genome fragment classification.

Attention as a Graph
The attention module used in Transformer models computes similarity scores for all pairs of positions in an input sequence. It is useful to think of the attention mechanism as a directed graph, with tokens represented by nodes and the similarity score computed between a pair of tokens represented by an edge. In this view, the full attention model is a complete graph. The core idea behind our approach is to carefully design sparse graphs, such that one only computes a linear number of similarity scores.

Full attention can be viewed as a complete graph.

Extended Transformer Construction (ETC)
On NLP tasks that require long and structured inputs, we propose a structured sparse attention mechanism, which we call Extended Transformer Construction (ETC). To achieve structured sparsification of self attention, we developed the global-local attention mechanism. Here the input to the Transformer is split into two parts: a global input where tokens have unrestricted attention, and a long input where tokens can only attend to either the global input or to a local neighborhood. This achieves linear scaling of attention, which allows ETC to significantly scale input length.

In order to further exploit the structure of long documents, ETC combines additional ideas: representing the positional information of the tokens in a relative way, rather than using their absolute position in the sequence; using an additional training objective beyond the usual masked language model (MLM) used in models like BERT; and flexible masking of tokens to control which tokens can attend to which other tokens. For example, given a long selection of text, a global token is applied to each sentence, which connects to all tokens within the sentence, and a global token is also applied to each paragraph, which connects to all tokens within the same paragraph.

An example of document structure based sparse attention of ETC model. The global variables are denoted by C (in blue) for paragraph, S (yellow) for sentence while the local variables are denoted by X (grey) for tokens corresponding to the long input.

With this approach, we report state-of-the-art results in five challenging NLP datasets requiring long or structured inputs: TriviaQA, Natural Questions (NQ), HotpotQA, WikiHop, and OpenKP.

Test set result on Question Answering. For both verified TriviaQA and WikiHop, using ETC achieved a new state of the art.

BigBird
Extending the work of ETC, we propose BigBird — a sparse attention mechanism that is also linear in the number of tokens and is a generic replacement for the attention mechanism used in Transformers. In contrast to ETC, BigBird doesn’t require any prerequisite knowledge about structure present in the source data. Sparse attention in the BigBird model consists of three main parts:

  • A set of global tokens attending to all parts of the input sequence
  • All tokens attending to a set of local neighboring tokens
  • All tokens attending to a set of random tokens
BigBird sparse attention can be seen as adding few global tokens on Watts-Strogatz graph.

In the BigBird paper, we explain why sparse attention is sufficient to approximate quadratic attention, partially explaining why ETC was successful. A crucial observation is that there is an inherent tension between how few similarity scores one computes and the flow of information between different nodes (i.e., the ability of one token to influence each other). Global tokens serve as a conduit for information flow and we prove that sparse attention mechanisms with global tokens can be as powerful as the full attention model. In particular, we show that BigBird is as expressive as the original Transformer, is computationally universal (following the work of Yun et al. and Perez et al.), and is a universal approximator of continuous functions. Furthermore, our proof suggests that the use of random graphs can further help ease the flow of information — motivating the use of the random attention component.

This design scales to much longer sequence lengths for both structured and unstructured tasks. Further scaling can be achieved by using gradient checkpointing by trading off training time for sequence length. This lets us extend our efficient sparse transformers to include generative tasks that require an encoder and a decoder, such as long document summarization, on which we achieve a new state of the art.

Summarization ROUGE score for long documents. Both for BigPatent and ArXiv datasets, we achieve a new state of the art result.

Moreover, the fact that BigBird is a generic replacement also allows it to be extended to new domains without pre-existing domain knowledge. In particular, we introduce a novel application of Transformer-based models where long contexts are beneficial — extracting contextual representations of genomic sequences (DNA). With longer masked language model pre-training, BigBird achieves state-of-the-art performance on downstream tasks, such as promoter-region prediction and chromatin profile prediction.

On multiple genomics tasks, such as promoter region prediction (PRP), chromatin-profile prediction including transcription factors (TF), histone-mark (HM) and DNase I hypersensitive (DHS) detection, we outperform baselines. Moreover our results show that Transformer models can be applied to multiple genomics tasks that are currently underexplored.

Main Implementation Idea
One of the main impediments to the large scale adoption of sparse attention is the fact that sparse operations are quite inefficient in modern hardware. Behind both ETC and BigBird, one of our key innovations is to make an efficient implementation of the sparse attention mechanism. As modern hardware accelerators like GPUs and TPUs excel using coalesced memory operations, which load blocks of contiguous bytes at once, it is not efficient to have small sporadic look-ups caused by a sliding window (for local attention) or random element queries (random attention). Instead we transform the sparse local and random attention into dense tensor operations to take full advantage of modern single instruction, multiple data (SIMD) hardware.

To do this, we first “blockify” the attention mechanism to better leverage GPUs/TPUs, which are designed to operate on blocks. Then we convert the sparse attention mechanism computation into a dense tensor product through a series of simple matrix operations such as reshape, roll, and gather, as illustrated in the animation below.

Illustration of how sparse window attention is efficiently computed using roll and reshape, and without small sporadic look-ups.

Recently, “Long Range Arena: A Benchmark for Efficient Transformers“ provided a benchmark of six tasks that require longer context, and performed experiments to benchmark all existing long range transformers. The results show that the BigBird model, unlike its counterparts, clearly reduces memory consumption without sacrificing performance.

Conclusion
We show that carefully designed sparse attention can be as expressive and flexible as the original full attention model. Along with theoretical guarantees, we provide a very efficient implementation which allows us to scale to much longer inputs. As a consequence, we achieve state-of-the-art results for question answering, document summarization and genome fragment classification. Given the generic nature of our sparse attention, the approach should be applicable to many other tasks like program synthesis and long form open domain question answering. We have open sourced the code for both ETC (github) and BigBird (github), both of which run efficiently for long sequences on both GPUs and TPUs.

Acknowledgements
This research resulted as a collaboration with Amr Ahmed, Joshua Ainslie, Chris Alberti, Vaclav Cvicek, Avinava Dubey, Zachary Fisher, Guru Guruganesh, Santiago Ontañón, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, Li Yang, Manzil Zaheer, who co-authored EMNLP and NeurIPS papers.

Read More

Recursive Classification: Replacing Rewards with Examples in RL

Posted by Benjamin Eysenbach, Student Researcher, Google Research

A general goal of robotics research is to design systems that can assist in a variety of tasks that can potentially improve daily life. Most reinforcement learning algorithms for teaching agents to perform new tasks require a reward function, which provides positive feedback to the agent for taking actions that lead to good outcomes. However, actually specifying these reward functions can be quite tedious and can be very difficult to define for situations without a clear objective, such as whether a room is clean or if a door is sufficiently shut. Even for tasks that are easy to describe, actually measuring whether the task has been solved can be difficult and may require adding many sensors to a robot’s environment.

Alternatively, training a model using examples, called example-based control, has the potential to overcome the limitations of approaches that rely on traditional reward functions. This new problem statement is most similar to prior methods based on “success detectors”, and efficient algorithms for example-based control could enable non-expert users to teach robots to perform new tasks, without the need for coding expertise, knowledge of reward function design, or the installation of environmental sensors.

In “Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification,” we propose a machine learning algorithm for teaching agents how to solve new tasks by providing examples of success (e.g., if “success” examples show a nail embedded into a wall, the agent will learn to pick up a hammer and knock nails into the wall). This algorithm, recursive classification of examples (RCE), does not rely on hand-crafted reward functions, distance functions, or features, but rather learns to solve tasks directly from data, requiring the agent to learn how to solve the entire task by itself, without requiring examples of any intermediate states. Using a version of temporal difference learning — similar to Q-learning, but replacing the typical reward function term using only examples of success — RCE outperforms prior approaches based on imitation learning on simulated robotics tasks. Coupled with theoretical guarantees similar to those for reward-based learning, the proposed method offers a user-friendly alternative for teaching robots new tasks.

Top: To teach a robot to hammer a nail into a wall, most reinforcement learning algorithms require that the user define a reward function. Bottom: The example-based control method uses examples of what the world looks like when a task is completed to teach the robot to solve the task, e.g., examples where the nail is already hammered into the wall.

Example-Based Control vs Imitation Learning
While the example-based control method is similar to imitation learning, there is an important distinction — it does not require expert demonstrations. In fact, the user can actually be quite bad at performing the task themselves, as long as they can look back and pick out the small fraction of states where they did happen to solve the task.

Additionally, whereas previous research used a stage-wise approach in which the model first uses success examples to learn a reward function and then applies that reward function with an off-the-shelf reinforcement learning algorithm, RCE learns directly from the examples and skips the intermediate step of defining the reward function. Doing so avoids potential bugs and bypasses the process of defining the hyperparameters associated with learning a reward function (such as how often to update the reward function or how to regularize it) and, when debugging, removes the need to examine code related to learning the reward function.

Recursive Classification of Examples
The intuition behind the RCE approach is simple: the model should predict whether the agent will solve the task in the future, given the current state of the world and the action that the agent is taking. If there were data that specified which state-action pairs lead to future success and which state-action pairs lead to future failure, then one could solve this problem using standard supervised learning. However, when the only data available consists of success examples, the system doesn’t know which states and actions led to success, and while the system also has experience interacting with the environment, this experience isn’t labeled as leading to success or not.

Left: The key idea is to learn a future success classifier that predicts for every state (circle) in a trajectory whether the task will be solved in the future (thumbs up/down). Right: In the example-based control approach, the model is provided only with unlabeled experience (grey circles) and success examples (green circles), so one cannot apply standard supervised learning. Instead, the model uses the success examples to automatically label the unlabeled experience.

Nonetheless, one can piece together what these data would look like, if it were available. First, by definition, a successful example must be one that solves the given task. Second, even though it is unknown whether an arbitrary state-action pair will lead to success in solving a task, it is possible to estimate how likely it is that the task will be solved if the agent started at the next state. If the next state is likely to lead to future success, it can be assumed that the current state is also likely to lead to future success. In effect, this is recursive classification, where the labels are inferred based on predictions at the next time step.

The underlying algorithmic idea of using a model’s predictions at a future time step as a label for the current time step closely resembles existing temporal-difference methods, such as Q-learning and successor features. The key difference is that the approach described here does not require a reward function. Nonetheless, we show that this method inherits many of the same theoretical convergence guarantees as temporal difference methods. In practice, implementing RCE requires changing only a few lines of code in an existing Q-learning implementation.

Evaluation
We evaluated the RCE method on a range of challenging robotic manipulation tasks. For example, in one task we required a robotic hand to pick up a hammer and hit a nail into a board. Previous research into this task [1, 2] have used a complex reward function (with terms corresponding to the distance between the hand and the hammer, the distance between the hammer and the nail, and whether the nail has been knocked into the board). In contrast, the RCE method requires only a few observations of what the world would look like if the nail were hammered into the board.

We compared the performance of RCE to a number of prior methods, including those that learn an explicit reward function and those based on imitation learning , all of which struggle to solve this task. This experiment highlights how example-based control makes it easy for users to specify even complex tasks, and demonstrates that recursive classification can successfully solve these sorts of tasks.

Compared with prior methods, the RCE approach solves the task of hammering a nail into a board more reliably that prior approaches based on imitation learning [SQIL, DAC] and those that learn an explicit reward function [VICE, ORIL, PURL].

Conclusion
We have presented a method to teach autonomous agents to perform tasks by providing them with examples of success, rather than meticulously designing reward functions or collecting first-person demonstrations. An important aspect of example-based control, which we discuss in the paper, is what assumptions the system makes about the capabilities of different users. Designing variants of RCE that are robust to differences in users’ capabilities may be important for applications in real-world robotics. The code is available, and the project website contains additional videos of the learned behaviors.

Acknowledgements
We thank our co-authors, Ruslan Salakhutdinov and Sergey Levine. We also thank Surya Bhupatiraju, Kamyar Ghasemipour, Max Igl, and Harini Kannan for feedback on this post, and Tom Small for helping to design figures for this post.

Read More

Progress and Challenges in Long-Form Open-Domain Question Answering

Posted by Aurko Roy, Research Scientist, Google Research

Open-domain long-form question answering (LFQA) is a fundamental challenge in natural language processing (NLP) that involves retrieving documents relevant to a given question and using them to generate an elaborate paragraph-length answer. While there has been remarkable recent progress in factoid open-domain question answering (QA), where a short phrase or entity is enough to answer a question, much less work has been done in the area of long-form question answering. LFQA is nevertheless an important task, especially because it provides a testbed to measure the factuality of generative text models. But, are current benchmarks and evaluation metrics really suitable for making progress on LFQA?

In “Hurdles to Progress in Long-form Question Answering” (to appear at NAACL 2021), we present a new system for open-domain long-form question answering that leverages two recent advances in NLP: 1) state-of-the-art sparse attention models, such as Routing Transformer (RT), which allow attention-based models to scale to long sequences, and 2) retrieval-based models, such as REALM, which facilitate retrievals of Wikipedia articles related to a given query. To encourage more factual grounding, our system combines information from several retrieved Wikipedia articles related to the given question before generating an answer. It achieves a new state of the art on ELI5, the only large-scale publicly available dataset for long-form question answering.

However, while our system tops the public leaderboard, we discover several troubling trends with the ELI5 dataset and its associated evaluation metrics. In particular, we find 1) little evidence that models actually use the retrievals on which they condition; 2) that trivial baselines (e.g., input copying) beat modern systems, like RAG / BART+DPR; and 3) that there is a significant train/validation overlap in the dataset. Our paper suggests mitigation strategies for each of these issues.

Text Generation
The main workhorse of NLP models is the Transformer architecture, in which each token in a sequence attends to every other token in a sequence, resulting in a model that scales quadratically with sequence length. The RT model introduces a dynamic, content-based sparse attention mechanism that reduces the complexity of attention in the Transformer model from n2 to n1.5, where n is the sequence length, which enables it to scale to long sequences. This allows each word to attend to other relevant words anywhere in the entire piece of text, unlike methods such as Transformer-XL where a word can only attend to words in its immediate vicinity.

The key insight of the RT work is that each token attending to every other token is often redundant, and may be approximated by a combination of local and global attention. Local attention allows each token to build up a local representation over several layers of the model, where each token attends to a local neighborhood, facilitating local consistency and fluency. Complementing local attention, the RT model also uses mini-batch k-means clustering to enable each token to attend only to a set of most relevant tokens.

Attention maps for the content-based sparse attention mechanism used in Routing Transformer. The word sequence is represented by the diagonal dark colored squares. In the Transformer model (left), each token attends to every other token. The shaded squares represent the tokens in the sequence to which a given token (the dark square) is attending. The RT model uses both local attention (middle), where tokens attend only to other tokens in their local neighborhood, and routing attention (right), in which a token only attends to clusters of tokens most relevant to it in context. The dark red, green and blue tokens only attend to the corresponding color of lightly shaded tokens.

We pre-train an RT model on the Project Gutenberg (PG-19) data-set with a language modeling objective, i.e, the model learns to predict the next word given all the previous words, so as to be able to generate fluent paragraph long text.

Information Retrieval
To demonstrate the effectiveness of the RT model on the task of LFQA, we combine it with retrievals from REALM. The REALM model (Guu et al. 2020) is a retrieval-based model that uses the maximum inner product search to retrieve Wikipedia articles relevant to a particular query or question. The model was fine-tuned for factoid-based question answering on the Natural Questions dataset. REALM utilizes the BERT model to learn good representations for a question and uses SCANN to retrieve Wikipedia articles that have a high topical similarity with the question representation. This is then trained end-to-end to maximize the log-likelihood on the QA task.

We further improve the quality of REALM retrievals by using a contrastive loss. The idea behind this is to encourage the representation of a question to get close to its ground truth answer and diverge from the other answers in its mini-batch. This ensures that when the system retrieves relevant items using this question representation, it returns articles that are “similar” to ground truth answers. We call this retriever contrastive-REALM or c-REALM.

The combined RT + c-REALM system for LFQA.

Evaluation
We test the model on long-form question answering using the ELI5 dataset, which is a part of the KILT benchmark, and is the only publicly available large-scale LFQA dataset. The KILT benchmark measures text retrievals using Precision (R-Prec) and text generation using ROUGE-L. The two scores are combined to give a KILT R-L score, which determines a model’s ranking on the leaderboard. We fine-tune the pre-trained RT model together with retrievals from c-REALM on the ELI5 dataset from KILT.

Our submission tops the KILT leaderboard for long-form question answering on ELI5 with a combined KILT R-L score of 2.36. It improves on the previous leaderboard entry of BART + DPR (KILT R-L score of 1.9), while having a similar number of parameters as the other models on the leaderboard. In terms of text generation quality, we see an improvement of +4.11, +5.78 and +9.14 Rouge-L over T5, BART + DPR and RAG, respectively.

Our results on the KILT leaderboard for long form question answering on ELI5.

Example Generations from the RT Model

  • Q: Why are almost all boats white?

    Boats are generally painted white for obvious reasons, as they are inexpensive and easy to clean. Ships are also often painted to provide a visible visual contrast to the sea. Boats also often have high visibility ports to allow for easy international trade. In addition, many boats have an air-breathing function that makes them easier to spot and stay afloat, which is a good thing if you’re a pirate or trying to get into some country’s waters. Lastly, many ships are built in the last few decades, so they are not as exotic looking as they were thirty years ago. And lastly, many ships are made out of materials that are easy to work with, and so many of them are painted white as a practical matter. There are many, many more reasons for why ships are white, but those are the ones I know of. I’m sure there are many other reasons, but those are the ones I know of. Edit: I forgot to mention that the other answers about visibility and easy to see ports are also true. Ships are painted to appear to be larger, and more expensive. That makes them appear more impressive, and so they are more likely to be bought by people who want to show off their wealth and status.

  • Q: What causes the trail behind jets at high altitude?

    The same thing that causes the clouds you see at the start of a storm to form in the first place. The air pressure changes as the jet moves through the air, causing it to cool down and condense into water droplets. These droplets are in fact a lot bigger than the water droplets in the clouds, so when the plane passes through them, they can cause the air pressure to change, causing the water droplets to condense into clouds. This is what you see when you look at a cloud formation from the ground. In addition to this, when the plane passes through the clouds, it also moves through the air, causing it to change its temperature, causing the water droplets to condense and form clouds. These clouds are often quite large, and can block out the sun. This is what you see when you look at a contrail. Finally, the water vapor that forms at high altitudes is cooled by the air above, and as it cools, it condenses into clouds. This is what you see when you look at a cumulonimbus cloud. These clouds are even larger than the water droplets in the clouds, and they block out the sun.

  • Q: How do shampoo and conditioner work?

    The key ingredients in shampoo and conditioner are usually surfactants, they tend to attract oil from your skin. Shampoo will usually contain an anti-dandruff agent (dandruff is caused by fungi in your skin which cause dandruff). They also contain some moisturizing agents (moisturizers are things that stay in your skin longer and protect your scalp from cleaning), and a lot of anti-dandruff (dandruff is a side effect of high doses of both surfactants and moisturizers) Conditioner will contain moisturizing agents, like a skin moisturizer. They tend to cause oils to be separated from your hair (e.g. wash your hair with conditioner instead of shampoo). They both have a foaming agent that makes them foam up (usually Hydrogen Peroxide). They also have a foaming agent that stops dirt from attaching to your hair. Some conditioner will also suppress the action of the shampoo (to help prevent dirt from attaching to your hair).

  • Q: When glancing at a clock, why does the first second after glancing at it sometimes feel longer than the rest?

    This is known as chronostasis. Your brain gets used to seeing the second hand move after the first, so it ignores the first one. If you were to look at the clock a lot, you’d see the second hand move a lot more than the first, and you’d be able to tell the difference between the second and the first. The second hand will appear to move much slower than the first hand because your brain is used to seeing the second hand move a lot more than the first. If you were to go on a long trip, you’d get used to the second hand moving a lot more than the first, and you’d have no problem telling the difference between the second and the first hand. It’s kind of like how you don’t notice your nose when you’re driving, but if you look at it for a while, you start to notice it. Your brain is just getting used to seeing the nose move a lot more than the other, and it can’t tell the difference between the two. It’s kind of like that with your eyes.

  • Q: Why do Britain and other English empire countries still bow to monarchs? What real purpose does the queen serve?

    Depends what you define as “purpose”. There is no specific purpose for the British monarch – rather, they are figureheads that serve a role. The role is to have a high degree of influence over a country, though the rules of the monarchy can somewhat vary from country to country, and the ceremony of “kissing the royal corpse” does not always involve kissing the monarch. Whether that’s the only reason for the custom, or if it was the only reason, I can’t say, but that’s the reasoning. When the U.K. was in full power, the monarch was a puppet of Parliament, and the powers of the monarch were transferred to the Prime Minister and thus to the Prime Minister’s deputy, who then became the Prime Minister. Since then, Parliament has been able to vote on legislation that goes through the monarch, although they may still act as the monarch’s representative in negotiating treaties, which can have very very deep consequences. The Queen’s role, as a representative of Britain, doesn’t necessarily involve her formal approval of any of the laws or legislation that goes through Parliament, though.

  • Q: What exactly is fire, in detail? How can light and heat come from something we can’t really touch?

    Fire is the chemical reaction of fuel (oxygen) with (what we call) impurities (ash, soot, oil etc). These impurities are created due to a number of things. If you were to study the chemical composition of a log (or a candle) and analyse it you would discover the chemical formula (A, C, and G) which is a simplified version of what actually happens. There is always fuel, and there is always the ash/soot. Wood is made up of mostly carbon and hydrogen and these are the fuels that create the fire. Hydrogen and oxygen are reactive, and they form water (H2O) and CO2 (CO2). If we split a log into smaller pieces, we start to see the process of combustion as it occurs. The carbon is burned, and with the explosion of CO, H2O and H2CO2, we get water and carbon dioxide (mostly). If we smash a log together, we start to see what happens when we also hit another log with another log…imagine smashing two sand castles together. We see how the sand castles and brick castles form and collapse at different rates…then it starts to burn. The smoke, ash and flames are the actual combustion of fuel.

  • Q: What goes on in those tall tower buildings owned by major banks?

    The actual buildings are not there for profit but for show. There are many real world historical buildings that have little to no effect other than being a symbol of an economic entity or symbol of an era or location, and nothing more. For example look at Sears, Sears didn’t care what went on inside, it was all about the _appearance_ of its location, the prestige of the location, the facilities and so on. It didn’t care about how long it took it to operate, it was about how much people would pay to go see it. Sears was a landmark as a cultural movement and other big companies followed suit, so if you want to see a building you’ve never seen before, you have to go see Sears, just like you have to see a Toyota Camry for Toyota Camry. They used to be all about building new factories, some of them if I recall, but now that they’re bigger, that means that more factory jobs are coming to them. You’ve probably seen them in stores as stores where people buy and sell stuff, so there aren’t that many places for them to come from. Instead, it’s just for show, a symbol of rich people.

Hurdles Towards Progress in LFQA
However, while the RT system described here tops the public leaderboard, a detailed analysis of the model and the ELI5 dataset reveal some concerning trends.

  • Many held-out questions are paraphrased in the training set. Best answer to similar train questions gets 27.4 ROUGE-L.

  • Simply retrieving answers to random unrelated training questions yields relatively high ROUGE-L, while actual gold answers underperform generations.

  • Conditioning answer generation on random documents instead of relevant ones does not measurably impact its factual correctness. Longer outputs get higher ROUGE-L.

We find little to no evidence that the model is actually grounding its text generation in the retrieved documents — fine-tuning an RT model with random retrievals from Wikipedia (i.e., random retrieval + RT) performs nearly as well as the c-REALM + RT model (24.2 vs 24.4 ROUGE-L). We also find significant overlap in the training, validation and test sets of ELI5 (with several questions being paraphrases of each other), which may eliminate the need for retrievals. The KILT benchmark measures the quality of retrievals and generations separately, without making sure that the text generation actually use the retrievals.

Trivial baselines get higher Rouge-L scores than RAG and BART + DPR.

Moreover, we find issues with the Rouge-L metric used to evaluate the quality of text generation, with trivial nonsensical baselines, such as a Random Training Set answer and Input Copying, achieving relatively high Rouge-L scores (even beating BART + DPR and RAG).

Conclusion
We proposed a system for long form-question answering based on Routing Transformers and REALM, which tops the KILT leaderboard on ELI5. However, a detailed analysis reveals several issues with the benchmark that preclude using it to inform meaningful modelling advances. We hope that the community works together to solve these issues so that researchers can climb the right hills and make meaningful progress in this challenging but important task.

Acknowledgments
The Routing Transformer work has been a team effort involving Aurko Roy, Mohammad Saffar, Ashish Vaswani and David Grangier. The follow-up work on open-domain long-form question answering has been a collaboration involving Kalpesh Krishna, Aurko Roy and Mohit Iyyer. We wish to thank Vidhisha Balachandran, Niki Parmar and Ashish Vaswani for several helpful discussions, and the REALM team (Kenton Lee, Kelvin Guu, Ming-Wei Chang and Zora Tung) for help with their codebase and several useful discussions, which helped us improve our experiments. We are grateful to Tu Vu for help with the QQP classifier used to detect paraphrases in ELI5 train and test sets. We thank Jules Gagnon-Marchand and Sewon Min for suggesting useful experiments on checking ROUGE-L bounds. Finally we thank Shufan Wang, Andrew Drozdov, Nader Akoury and the rest of the UMass NLP group for helpful discussions and suggestions at various stages in the project.

Read More

Helping newsrooms experiment together with AI

In our JournalismAI report, journalists around the world told researchers they are eager to collaborate and explore the benefits of AI, especially as it applies to newsgathering, production and distribution. 

To facilitate their collaboration, the Google News Initiative and Polis – the journalism think tank at the London School of Economics and Political Science – are launching the JournalismAI Collab Challenges, an opportunity for three groups of five newsrooms from the Americas, Europe, the Middle East and Africa, and Asia Pacific to experiment together.

Each cohort – selected by Polis – will have six months to either cover global news stories using AI-powered storytelling techniques or to develop prototypes of new AI-based products and processes.

Participants will receive support from the JournalismAI team and partner institutions in each region: in the Americas, the challenge will be co-hosted with the Knight Lab at Northwestern University; in Europe, the Middle East and Africa, the challenge will be co-hosted with BBC News Labs and Clwstwr. JournalismAI’s partner in Asia Pacific will be announced later this year.

The Collab Challenges build on a successful pilot run by JournalismAI last year. More than 20 global newsrooms joined forces to solve four common problems using AI, from creating automated news summaries to mitigating newsroom biases, and from powering archives to increasing audience loyalty. JournalismAI online trainings are available on the Google News Initiative Training Center, where they have already been seen by more than 110,000 participants.

Newsrooms interested in participating in this free, year-long program must have made AI a strategic priority, must guarantee the participation of two staff members – one from editorial and one from the technical department – who can participate two to four hours a week, and must embrace collaboration with other publishers.

The outcome of their work – whose ownership will be shared among participants – will be presented at the second edition of the JournalismAI Festival in November.

Applications for the Americas challenge and the Europe, the Middle East and Africa Challenge open today and close at 11:59 PM GMT on April 5. The Challenge will open later this year in Asia Pacific.

To learn more about the process, please visit Polis blog and sign up for the JournalismAI newsletter.

Read More