iGibson: A Simulation Environment to Train AI Agents in Large Realistic Scenes

Why simulation for AI?

We are living in a Golden Age of simulation environments in AI and robotics. Ten years ago, simulation environments were rare: only a handful of solutions were available, and they were complex and used only by experts. Today, many simulation environments are available, and most papers in AI and robotics at first-tier conferences such as NeurIPS, CoRL, or even ICRA and IROS make some use of them. What has changed?

This extensive use of simulation environments is the result of several trends:

  • First, the increasing role of machine learning in robotics creates a demand for more data (for example, interactive experiences) than can be generated in real time [1, 2, 3, 4]. Also, the initial data collection process often involves random exploration that may be dangerous for physical robots or their surroundings.
  • Second, simulation environments have matured to be more robust, realistic (visually and physically), user friendly and accessible to all types of users, and the necessary computation to simulate complex physics is reasonably fast on most modern machines. Therefore, simulation environments have the potential to lower the barrier to entry in robotics, even for researchers without the funds to acquire expensive real robot platforms.
  • Finally, the increasing number of robotic solutions to tasks such as grasping, navigation, or manipulation has brought more attention to a critical absence in our community: the lack of repeatable benchmarks. Mature sciences are based on experiments that can be easily and reliably replicated, so that different techniques, theories, and solutions can be compared in fair conditions. Simulation environments can help us establish repeatable benchmarks, something very difficult to achieve with real robots, and in turn help us understand the state of our field.

Why iGibson?

These ideas motivated us in the Stanford Vision and Learning Lab (SVL) to develop a simulation environment that can serve as a “playground” to train and test interactive AI agents – an environment we call iGibson [5]. What makes iGibson special? To understand this, let’s first define what a simulation environment is and how it differs from a physics simulator. A physics simulator is an engine capable of computing the physical effect of actions on an environment (e.g. the motion of bodies when a force is applied, or the flow of liquid particles when being poured). There are many existing physics simulation engines. The best known in robotics are Bullet and its Python extension PyBullet, MuJoCo, Nvidia PhysX and Flex, Unreal Engine, DART, Unity, and ODE. Given a physical problem (objects, forces, particles, and physics parameters), these engines compute the temporal evolution of the system.

A simulation environment, on the other hand, is a framework that includes a physics simulator, a renderer of virtual signals, and a set of assets (i.e. models of scenes, objects, and robots) that can be used to create simulations of problems to study and to develop solutions for different tasks. The decision of which physics engine to use is based on the type of physical process that dominates the problem, for example rigid-body physics or the motion of fluids. To decide which simulation environment to use, however, researchers are guided by the application domain they are interested in and the research questions they want to explore. With iGibson, we aim to support the study of interactive tasks in large realistic scenes, guided by high-quality virtual visual signals.
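To make the distinction concrete, here is a minimal sketch of driving a physics engine directly, using PyBullet (one of the engines listed above). This is an illustrative snippet, not iGibson code: we hand the engine bodies and a force, and it computes the temporal evolution of the system; there is no renderer, no task, and no robot embodiment.

```python
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)  # headless physics, no GUI
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.8)

plane = p.loadURDF("plane.urdf")
cube = p.loadURDF("cube_small.urdf", basePosition=[0, 0, 0.5])

# Push the cube sideways, then let the engine integrate the dynamics.
p.applyExternalForce(cube, -1, [20, 0, 0], [0, 0, 0], p.WORLD_FRAME)
for _ in range(240):  # one simulated second at the default 240 Hz
    p.stepSimulation()

print(p.getBasePositionAndOrientation(cube))
p.disconnect()
```

A simulation environment such as iGibson sits one level above this: it pairs the engine with a renderer and a library of scene, object, and robot assets, and exposes tasks through an agent-centric interface.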

Comparison to existing simulators

No existing simulation environment supports developing solutions for problems involving interactions in large-scale scenes such as full houses. There are several simulation environments for tasks with stationary arms, such as Meta-World, RLBench, RoboSuite, or DoorGym, but none of them include large realistic scenes like homes with multiple rooms for tasks that involve navigation. For navigation, our previous version, Gibson (v1), and Habitat have proven to be great environments that allow researchers to study visual and language-guided navigation. However, the included assets (scenes) are single meshes that cannot change when interactions are applied, like opening doors or moving objects.

Finally, a set of recent simulation environments allow for scene-level interactive tasks, such as Sapien, AI2Thor, and ThreeDWorld (TDW). Sapien focuses on interaction with articulated objects (doors, cabinets, and drawers). TDW is a multi-modal simulator with audio, high-quality visuals, and simulation of flexible materials and liquids via Nvidia Flex. But neither Sapien nor TDW includes fully interactive scenes aligned with real object distributions and layouts as part of the environment. AI2Thor includes fully interactive scenes, but the interactions are scripted: interactable objects are annotated with the possible actions they can receive. When the agent is close enough to an object and the object is in the right state (precondition), the agent can select a predefined action, and the object is “transitioned” to the next state (postcondition). RoboThor, an alternative version of AI2Thor, enables continuous interactions but focuses on navigation. It provides limited sensory signals to the agent (only RGB-D images), which is always embodied as a LoCoBot, a low-cost platform with limited interaction capabilities. Here at SVL, we want to study complex, long-horizon mobile manipulation tasks such as tidying a house or searching for objects, which require access to fully interactive, realistic, large-scale scenes.

iGibson’s new features

The main focus of iGibson is interactivity: enabling realistic interactions in large scenes. To that end, we have included several key features:

  • Fifteen fully interactive, visually realistic scenes representing real-world homes, with furniture and articulated object models annotated with materials and dynamics properties.
  • Capabilities to import models from CubiCasa5K [6] and 3D-Front [7], giving access to more than 8000 additional interactive home scenes.
  • Realistic virtual sensor signals, including high-quality RGB images from a physics-based renderer, depth maps, 1-beam and 16-beam virtual LiDAR signals, semantic/instance/material segmentation, optical and scene flow, and surface normals.
  • Domain randomization of visual textures, dynamics properties, and object instances, enabling endless variations of scenes.
  • A human-computer interface that allows humans to provide demonstrations of fully physical interactions with the scenes.
  • Integration with sampling-based motion planners to facilitate motion of robotic bases (navigation in a 2D layout) and arms (interaction in 3D space).
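In practice, an agent interacts with all of these features through a gym-style loop. The sketch below follows the general shape of iGibson’s Python interface; the exact module path, class name, config file, and observation keys ("rgb", "depth") are assumptions that may vary across releases, so consult the documentation for your installed version.

```python
from igibson.envs.igibson_env import iGibsonEnv  # module path may vary by release

# The YAML config (hypothetical name) selects the scene, robot, task,
# and which virtual sensors to render.
env = iGibsonEnv(config_file="turtlebot_interactive_nav.yaml", mode="headless")

state = env.reset()
for _ in range(100):
    action = env.action_space.sample()           # random exploration
    state, reward, done, info = env.step(action)
    rgb, depth = state["rgb"], state["depth"]    # assumed sensor keys
    if done:
        state = env.reset()
env.close()
```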



Using iGibson for robot learning

These novel features in iGibson allow us to study and develop solutions for new interactive tasks in large environments. One of these new problems is Interactive Navigation, where agents need to interact with the environment to change its configuration, for example to open doors or push obstacles away. This is a common type of navigation in our homes and offices, but non-interactive simulation environments cannot be used to study it. In iGibson we have developed hierarchical reinforcement learning solutions for interactive navigation that decide explicitly what part of the body to use in the next phase of the task: the arm (for interactions), the base (for navigation), or a combination of both [8]. We also propose a new learning solution for interactive navigation that integrates a motion planner: the learning algorithm decides on the next point of interaction, and the motion planner finds a collision-free path to it [9] (see the sketch below). But this is just the tip of the iceberg: many of SVL’s projects are leveraging iGibson to study a wide variety of interactive tasks in large realistic scenes.
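To illustrate the second idea [9] schematically: the learned policy proposes where to interact, and a motion planner works out how to get there. Everything below (`policy`, `plan_collision_free_path`, the env interface) is a hypothetical sketch of the control flow, not the actual RelMoGen implementation.

```python
def interactive_navigation_episode(env, policy, plan_collision_free_path):
    """One episode where learning picks subgoals and planning executes them."""
    state = env.reset()
    done = False
    while not done:
        # The policy outputs a point of interaction (e.g., a door handle
        # or a spot to push an obstacle), not low-level motor commands.
        interaction_point = policy(state)

        # A sampling-based planner returns a collision-free path of
        # waypoints to that point, or None if the subgoal is unreachable.
        path = plan_collision_free_path(state, interaction_point)
        if path is None:
            continue  # sample a different subgoal from the (stochastic) policy

        for waypoint in path:
            state, reward, done, info = env.step(waypoint)
            if done:
                break
    return state
```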


Summary

Simulation environments have the potential to support researchers in their study of robotics and embodied AI problems. With iGibson, SVL contributes to the community with an open source, fully academically developed simulation environment for interactive tasks in large realistic scenes. If you want to start using it, visit our website and download – setup should be straightforward, and we’re happy to answer any questions about getting the simulator up and running for your research! We hope we can facilitate new avenues of research in robotics and AI.

  1. Andrychowicz, Marcin, et al. “Learning dexterous in-hand manipulation.” The International Journal of Robotics Research 39.1 (2020): 3-20. 

  2. Rajeswaran, Aravind, et al. “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.” Robotics: Science and Systems, 2017. 

  3. Peng, Xue Bin, et al. “SFV: Reinforcement learning of physical skills from videos.” ACM Transactions on Graphics (TOG) 37.6 (2018): 1-14. 

  4. Zhu, Yuke, et al. “robosuite: A modular simulation framework and benchmark for robot learning.” arXiv preprint arXiv:2009.12293 (2020). 

  5. A note on Gibson – Our simulation environment takes its name from James J. Gibson [1904-1979]. Gibson was an influential psychologist and cognitive scientist with, at the time, disruptive ideas. He pushed forward a new concept of perception as 1) an ecological process that cannot and should not be studied in isolation from the environment, and 2) an active process that needs agency and interactivity. This was in contrast to the then-predominant view of perception as a passive process in which signals “arrive” and “are processed” by the brain. Instead, he argued that agents seek out information, interacting with the environment to reveal it. He also coined the term “affordance” as the opportunity the environment offers to an agent to perform a task. A quote from a colleague summarizing his research directly connects to the guiding principle behind our work in the iGibson team: “ask not what’s inside your head, but what your head is inside of”. 

  6. Kalervo, Ahti, et al. “Cubicasa5k: A dataset and an improved multi-task model for floorplan image analysis.” Scandinavian Conference on Image Analysis. Springer, Cham, 2019. 

  7. Fu, Huan, et al. “3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics.” arXiv preprint arXiv:2011.09127 (2020). 

  8. Li, Chengshu, et al. “Hrl4in: Hierarchical reinforcement learning for interactive navigation with mobile manipulators.” Conference on Robot Learning. PMLR, 2020. 

  9. Xia, Fei, et al. “Relmogen: Leveraging motion generation in reinforcement learning for mobile manipulation.” arXiv preprint arXiv:2008.07792 (2020). 


Stanford AI Lab Papers and Talks at NeurIPS 2020

The Neural Information Processing Systems (NeurIPS) 2020 conference is being hosted virtually from Dec 6th – Dec 12th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers


Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration


Authors: Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill

Contact: zanette@stanford.edu

Keywords: reinforcement learning, function approximation, exploration


Acceleration with a Ball Optimization Oracle


Authors: Yair Carmon, Arun Jambulapati, Qijia Jiang, Yujia Jin, Yin Tat Lee, Aaron Sidford, Kevin Tian

Contact: kjtian@stanford.edu

Award nominations: Oral presentation

Links: Paper

Keywords: convex optimization, local search, trust region methods


BanditPAM: Almost Linear Time k-Medoids Clustering via Multi-Armed Bandits


Authors: Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris Piech, Ilan Shomorony

Contact: Motiwari@stanford.edu

Links: Paper | Video

Keywords: clustering, k-means, k-medoids, multi-armed bandits


CaSPR: Learning Canonical Spatiotemporal Point Cloud Representations


Authors: Davis Rempe, Tolga Birdal, Yongheng Zhao, Zan Gojcic, Srinath Sridhar, Leonidas J. Guibas

Contact: drempe@stanford.edu

Links: Paper | Video | Website

Keywords: 3d vision, dynamic point clouds, representation learning


Compositional Explanations of Neurons


Authors: Jesse Mu, Jacob Andreas

Contact: muj@stanford.edu

Award nominations: oral

Links: Paper

Keywords: interpretability, explanation, deep learning, computer vision, natural language processing, adversarial examples


Continuous Meta-Learning without Tasks


Authors: James Harrison, Apoorva Sharma, Chelsea Finn, Marco Pavone

Contact: jharrison@stanford.edu

Links: Paper

Keywords: meta-learning, continuous learning, changepoint detection


Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel


Authors: Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, Surya Ganguli

Contact: sfort1@stanford.edu

Links: Paper

Keywords: loss landscape, neural tangent kernel, linearization, taylorization, basin, nonlinear advantage


Diversity can be Transferred: Output Diversification for White- and Black-box Attacks


Authors: Yusuke Tashiro, Yang Song, Stefano Ermon

Contact: ytashiro@stanford.edu

Links: Paper | Website

Keywords: adversarial examples, deep learning, robustness


Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders


Authors: Masha Itkina, Boris Ivanovic, Ransalu Senanayake, Mykel J. Kochenderfer, and Marco Pavone

Contact: mitkina@stanford.edu

Links: Paper | Website

Keywords: sparse distributions, generative models, discrete latent spaces, behavior prediction, image generation


Federated Accelerated Stochastic Gradient Descent


Authors: Honglin Yuan, Tengyu Ma

Contact: yuanhl@stanford.edu

Award nominations: Best Paper Award of Federated Learning for User Privacy and Data Confidentiality in Conjunction with ICML 2020 (FL-ICML’20)

Links: Paper | Website

Keywords: federated learning, local sgd, acceleration, fedac


Fourier-transform-based attribution priors improve the interpretability and stability of deep learning models for genomics


Authors: Alex Michael Tseng, Avanti Shrikumar, Anshul Kundaje

Contact: amtseng@stanford.edu

Links: Paper | Website

Keywords: deep learning, interpretability, attribution prior, computational biology, genomics


From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering


Authors: Ines Chami, Albert Gu, Vaggos Chatziafratis, Christopher Re

Contact: chami@stanford.edu

Links: Paper | Video | Website

Keywords: hierarchical clustering, hyperbolic embeddings


FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply


Authors: Lingjiao Chen, Matei Zaharia, James Zou

Contact: lingjiao@stanford.edu

Links: Paper | Blog Post | Website

Keywords: machine learning as a service, ensemble learning, meta learning, systems for machine learning


Generative 3D Part Assembly via Dynamic Graph Learning


Authors: Jialei Huang*, Guanqi Zhan*, Qingnan Fan, Kaichun Mo, Lin Shao, Baoquan Chen, Leonidas J. Guibas, Hao Dong

Contact: kaichun@cs.stanford.edu

Links: Paper | Website

Keywords: 3d part assembly, graph neural network


Gradient Surgery for Multi-Task Learning


Authors: Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, Chelsea Finn

Contact: tianheyu@cs.stanford.edu

Links: Paper | Website

Keywords: multi-task learning, deep reinforcement learning


HiPPO: Recurrent Memory with Optimal Polynomial Projections


Authors: Albert Gu*, Tri Dao*, Stefano Ermon, Atri Rudra, Chris Ré

Contact: albertgu@stanford.edu, trid@stanford.edu

Links: Paper | Blog Post

Keywords: representation learning, time series, recurrent neural networks, lstm, orthogonal polynomials


Identifying Learning Rules From Neural Network Observables


Authors: Aran Nayebi, Sanjana Srivastava, Surya Ganguli, Daniel L.K. Yamins

Contact: anayebi@stanford.edu

Award nominations: Spotlight Presentation

Links: Paper | Website

Keywords: computational neuroscience, learning rule, deep networks


Improved Techniques for Training Score-Based Generative Models


Authors: Yang Song, Stefano Ermon

Contact: songyang@stanford.edu

Links: Paper

Keywords: score-based generative modeling, score matching, deep generative models


Language Through a Prism: A Spectral Approach for Multiscale Language Representations


Authors: Alex Tamkin, Dan Jurafsky, Noah Goodman

Contact: atamkin@stanford.edu

Links: Paper

Keywords: bert, signal processing, self-supervised learning, interpretability, multiscale


Large-Scale Methods for Distributionally Robust Optimization


Authors: Daniel Levy, Yair Carmon, John Duchi, Aaron Sidford

Contact: danilevy@stanford.edu

Links: Paper

Keywords: robustness, dro, optimization, large-scale, optimal


Learning Physical Graph Representations from Visual Scenes


Authors: Daniel Bear, Chaofei Fan, Damian Mrowca, Yunzhu Li, Seth Alter, Aran Nayebi, Jeremy Schwartz, Li F. Fei-Fei, Jiajun Wu, Josh Tenenbaum, Daniel L. Yamins

Contact: dbear@stanford.edu

Links: Paper | Blog Post | Website

Keywords: structure learning, graph learning, visual scene representations, unsupervised learning, unsupervised segmentation, object-centric representation, intuitive physics


MOPO: Model-based Offline Policy Optimization


Authors: Tianhe Yu*, Garrett Thomas*, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, Tengyu Ma

Contact: tianheyu@cs.stanford.edu

Links: Paper | Website

Keywords: offline reinforcement learning, model-based reinforcement learning


Measuring Robustness to Natural Distribution Shifts in Image Classification


Authors: Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, Ludwig Schmidt

Contact: rtaori@stanford.edu

Award nominations: Spotlight

Links: Paper | Website

Keywords: machine learning, robustness, image classification


Minibatch Stochastic Approximate Proximal Point Methods


Authors: Hilal Asi, Karan Chadha, Gary Cheng, John Duchi

Contact: chenggar@stanford.edu

Award nominations: Spotlight talk

Links: Paper

Keywords: stochastic optimization, sgd, aprox


Model-based Adversarial Meta-Reinforcement Learning


Authors: Zichuan Lin, Garrett Thomas, Guangwen Yang, Tengyu Ma

Contact: lzcthu12@gmail.com, gwthomas@stanford.edu

Links: Paper

Keywords: model-based rl, meta-rl, minimax


Multi-Plane Program Induction with 3D Box Priors


Authors: Yikai Li, Jiayuan Mao, Xiuming Zhang, William T. Freeman, Joshua B. Tenenbaum, Noah Snavely, Jiajun Wu

Contact: jiajunwu@cs.stanford.edu

Links: Paper | Video | Website

Keywords: visual program induction, 3d vision, image editing


Multi-label Contrastive Predictive Coding


Authors: Jiaming Song, Stefano Ermon

Contact: jiaming.tsong@gmail.com

Links: Paper

Keywords: representation learning, mutual information


Neuron Shapley: Discovering the Responsible Neurons


Authors: Amirata Ghorbani, James Zou

Contact: amiratag@stanford.edu

Links: Paper

Keywords: interpretability, deep learning, shapley value


No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems


Authors: Nimit Sharad Sohoni, Jared Alexander Dunnmon, Geoffrey Angus, Albert Gu, Christopher Ré

Contact: nims@stanford.edu

Links: Paper | Blog Post | Video

Keywords: classification, robustness, clustering, neural feature representations


Off-policy Policy Evaluation For Sequential Decisions Under Unobserved Confounding


Authors: Hongseok Namkoong, Ramtin Keramati, Steve Yadlowsky, Emma Brunskill

Contact: keramati@stanford.edu

Links: Paper

Keywords: off-policy policy evaluation, unobserved confounding, reinforcement learning


One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL


Authors: Saurabh Kumar, Aviral Kumar, Sergey Levine, Chelsea Finn

Contact: szk@stanford.edu

Links: Paper

Keywords: robustness, diversity, reinforcement learning


Point process models for sequence detection in high-dimensional neural spike trains


Authors: Alex H. Williams, Anthony Degleris, Yixin Wang, Scott W. Linderman

Contact: ahwillia@stanford.edu

Award nominations: Selected for Oral Presentation

Links: Paper | Website

Keywords: bayesian nonparametrics, unsupervised learning


Predictive coding in balanced neural networks with noise, chaos and delays


Authors: Jonathan Kadmon, Jonathan Timcheck, Surya Ganguli

Contact: kadmonj@stanford.edu

Links: Paper

Keywords: neuroscience, predictive coding, chaos


Probabilistic Circuits for Variational Inference in Discrete Graphical Models


Authors: Andy Shih, Stefano Ermon

Contact: andyshih@stanford.edu

Links: Paper

Keywords: variational inference, discrete, high-dimensions, sum product networks, probabilistic circuits, graphical models


Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration


Authors: Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill

Contact: yaoliu@stanford.edu

Links: Paper

Keywords: reinforcement learning, off-policy, batch reinforcement learning


Pruning neural networks without any data by iteratively conserving synaptic flow


Authors: Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, Surya Ganguli

Contact: kunin@stanford.edu

Links: Paper | Video | Website

Keywords: network pruning, sparse initialization, lottery ticket


Robust Sub-Gaussian Principal Component Analysis and Width-Independent Schatten Packing


Authors: Arun Jambulapati, Jerry Li, Kevin Tian

Contact: kjtian@stanford.edu

Award nominations: Spotlight presentation

Links: Paper

Keywords: robust statistics, principal component analysis, positive semidefinite programming


Self-training Avoids Using Spurious Features Under Domain Shift


Authors: Yining Chen*, Colin Wei*, Ananya Kumar, Tengyu Ma (*equal contribution)

Contact: cynnjjs@stanford.edu

Links: Paper

Keywords: self-training, pseudo-labeling, domain shift, robustness


Wasserstein Distances for Stereo Disparity Estimation


Authors: Divyansh Garg, Yan Wang, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, Wei-Lun Chao

Contact: divgarg@stanford.edu

Award nominations: Spotlight

Links: Paper | Video | Website

Keywords: depth estimation, disparity estimation, autonomous driving, 3d object detection, statistical learning


We look forward to seeing you at NeurIPS 2020!


Learning from Language Explanations

Imagine you’re a machine learning practitioner and you want to solve some classification problem, like classifying groups of colored squares as being either 1s or 0s. Here’s what you would typically do: collect a large dataset of examples, label the data, and train a classifier.

But humans don’t learn like this. We have a very powerful and intuitive mechanism for communicating information about the world – language!

With just the phrase at least 2 red squares, we’ve summarized the entire dataset presented above in a much more efficient manner.

Language is a crucial medium for human learning: we use it to convey beliefs about the world, teach others, and describe things that are hard to experience directly. Thus, language ought to be a simple and effective way to supervise machine learning models. Yet past approaches to learning from language have struggled to scale up to the general tasks targeted by modern deep learning systems and the freeform language explanations used in these domains. In two short papers presented at ACL 2020 this year, we use deep neural models to learn from language explanations to help tackle a variety of challenging tasks in natural language processing (NLP) and computer vision.

What’s the challenge?

Given that language is such an intuitive interface for humans to teach others,
why is it so hard to use language for machine learning?

The principal challenge is the grounding problem: understanding language explanations in the context of other inputs. Building models that can understand rich and ambiguous language is tricky enough, but building models that can relate language to the surrounding world is even more challenging. For instance, given the explanation at least two red squares, a model must not only understand the terms red and square, but also how they refer to particular parts of (often complex) inputs.

Past work (1, 2, 3) has relied on semantic parsers, which convert natural language statements (e.g. at least two red squares) to formal logical representations (e.g. Count(Square AND Red) >= 2). If we can easily check whether explanations apply to our inputs by executing these logical formulas, we can use our explanations as features to train our model. However, semantic parsers only work in simple domains where we can hand-engineer a logical grammar of explanations we might expect to see. They struggle to handle richer and vaguer language, or to scale up to more complex inputs, such as images.
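As a toy example of that pipeline, the logical form for at least two red squares can bottom out in an executable feature function like the one below (our own illustration; a real semantic parser would compose such checks from a grammar):

```python
def at_least_two_red_squares(squares):
    """Executable form of Count(Square AND Red) >= 2.

    squares: list of (color, shape) pairs describing one input.
    Returns 1 if the explanation holds for this input, else 0.
    """
    count = sum(1 for color, shape in squares
                if color == "red" and shape == "square")
    return int(count >= 2)

# The output can be used directly as a binary feature for a classifier.
example = [("red", "square"), ("blue", "circle"), ("red", "square")]
print(at_least_two_red_squares(example))  # -> 1
```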

Fortunately, modern deep neural language models such as
BERT are beginning to show promise at
solving many language understanding tasks. Our papers propose to alleviate the
grounding problem by using neural language models that are either trained to
ground language explanations in the domain of interest, or come pre-trained
with general-purpose “knowledge” that can be used to interpret explanations. We
will show that these neural models allow us to learn from richer and more
diverse language for more challenging settings.

Representation Engineering with Natural Language Explanations

In our first paper, we examine how to build text classifiers with language explanations. Consider the task of relation extraction, where we are given a short paragraph and must identify whether two people mentioned in the paragraph are married. While state-of-the-art NLP models can likely solve this task from data alone, humans might use language to describe ways to tell whether two people are married; for example, people who go on honeymoons are typically married. Can such language explanations be used to train better classifiers?

In the same way that we might take an input and extract features (e.g. the presence of certain words) to train a model, we can use explanations to provide additional features. For example, knowing that honeymoons are relevant for this task, if we can create a honeymoon feature that reliably activates whenever the two people in a paragraph are described as going on a honeymoon, this should be useful signal for training a better model.

But creating such features requires some sort of explanation interpretation mechanism that tells us whether an explanation is true for an input. Semantic parsers are one such tool: given the explanation went on honeymoon, we could parse it into a logical form which, when run on an input, produces 1 if the word honeymoon appears between the two people’s mentions. But what about a vaguer explanation like they are in love? How can we parse this?

While semantic parsing is efficient and accurate in small domains, it can be overly brittle, as it can only interpret explanations which adhere to a fixed set of grammatical rules and functions that we must specify in advance (e.g. contains and extract_text). Instead, we turn to the soft reasoning capabilities of BERT, a neural language model. BERT is particularly effective at the task of textual entailment: determining whether a sentence implies or contradicts another sentence (e.g. does She ate pizza imply She ate food? Yes!). In our proposed ExpBERT model, we take a BERT model trained for textual entailment and instead ask it to identify whether a paragraph in our task entails an explanation. The features produced by BERT during this process replace the indicator features produced by the semantic parser above.
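A rough sketch of this feature-extraction step, using an off-the-shelf entailment model from the Hugging Face transformers library (the model name and the choice of logits-as-features are our illustrative assumptions, not the exact ExpBERT setup):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-large-mnli"  # any NLI-finetuned model works for this sketch
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def explanation_features(paragraph, explanations):
    """Concatenate entailment scores of the paragraph against each explanation."""
    features = []
    for explanation in explanations:
        inputs = tokenizer(paragraph, explanation,
                           return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits  # (contradiction, neutral, entailment)
        features.append(logits.squeeze(0))
    return torch.cat(features)  # feed these to a downstream classifier

feats = explanation_features(
    "Anna and Ben just returned from their honeymoon in Rome.",
    ["The two people went on a honeymoon.", "The two people are in love."])
```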

Does the soft reasoning power of BERT improve over semantic parsing? On the marriage identification task, we find that ExpBERT leads to substantial improvements over a classifier trained on the input features only (No Explanations). Importantly, using a semantic parser to parse explanations doesn’t help much, since there are general explanations (in love) that are difficult to convert to logical forms.

In the full paper, we compare to more baselines, explore larger relation
extraction tasks (e.g. TACRED),
conduct ablation studies to understand what kinds of explanations are
important, and examine how much more efficient explanations are compared to
additional data.

Shaping Visual Representations with Language

The work we’ve just described uses natural language explanations for a single task like marriage identification. However, work in cognitive science suggests that language also equips us with the right features and abstractions that help us solve future tasks. For example, explanations that indicate whether one person is married to another also highlight other concepts that are crucial to human relationships: children, daughters, honeymoons, and more. These additional concepts are not just useful for identifying married people; they are also important if we would later like to identify other relationships (e.g. siblings, mother, father).

In machine learning, we might ask: how can language point out the right
features for challenging and underspecified domains, if we
ultimately wish to solve new tasks where no language is available? In our
second paper, we explore this setting,
additionally increasing the challenge by seeing whether language can improve
the learning of representations across modalities—here, vision.

We’re specifically interested in few-shot visual reasoning tasks like the following (here, from the ShapeWorld dataset):

Given a small training set of examples of a visual concept, the task is to determine whether a held-out test image expresses the same concept. Now, what if we assume access to language explanations of the relevant visual concepts at training time? Can we use these to learn a better model, even if no language is available at test time?

We frame this as a meta-learning task: instead of training and testing a model on a single task, we train a model on a set of tasks, each with a small training set and an accompanying language description (the meta-train set). We then test generalization to a meta-test set of unseen tasks, for which no language is available.

First, let’s look at how we might solve this task without language. One typical approach is Prototype Networks, where we learn a model (here, a deep convolutional neural network) that embeds the training images, averages the embeddings into a prototype, and compares the prototype to an embedding of the test image.
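A minimal sketch of that prototype computation in PyTorch (the encoder stands in for the convolutional network, and the negative-distance scoring is one common choice):

```python
import torch

def prototype_score(encoder, support_images, query_image):
    """Score whether query_image matches the concept in support_images.

    support_images: (N, C, H, W) training examples of the concept.
    query_image:    (1, C, H, W) held-out test image.
    """
    support_emb = encoder(support_images)        # (N, D)
    prototype = support_emb.mean(dim=0)          # average -> concept prototype
    query_emb = encoder(query_image).squeeze(0)  # (D,)
    # Higher (less negative) means more likely the same concept.
    return -torch.sum((query_emb - prototype) ** 2)
```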

To use language, we propose a simple approach called Language Shaped Learning (LSL): if we have access to explanations at training time, we encourage the model to learn representations that are not only helpful for classification, but also predictive of the language explanations. We do this by introducing an auxiliary training objective (one not directly related to the ultimate task of interest): we simultaneously train a recurrent neural network (RNN) decoder to predict the explanation(s) from the representations of the input images. Crucially, training this decoder depends on the parameters of our image model, so this process should encourage the image model to better encode the features and abstractions exposed in language.

In effect, we are training the model to “think out loud” when representing
concepts at training time. At test time, we simply discard the RNN decoder, and
do classification as normal with the “language-shaped” image embeddings.
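In pseudocode, the resulting objective is just the classification loss plus a weighted language-decoding loss. This is a sketch with placeholder modules; the weight `lam` and the decoder interface (returning a token-level negative log-likelihood) are our assumptions:

```python
import torch.nn.functional as F

def lsl_loss(encoder, classifier, decoder, images, labels,
             explanation_tokens, lam=1.0):
    """Language Shaped Learning objective (sketch)."""
    z = encoder(images)                          # "language-shaped" embeddings
    cls_loss = F.cross_entropy(classifier(z), labels)
    # Auxiliary loss: NLL of the explanation tokens under an RNN decoder
    # conditioned on z; gradients flow back into the encoder.
    lang_loss = decoder(z, explanation_tokens)
    return cls_loss + lam * lang_loss            # decoder is discarded at test time
```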

We apply this model to both the ShapeWorld dataset described above and a more realistic Birds dataset with real images and human language. In both cases, this auxiliary training objective improves performance over a no-explanation baseline (Meta) and over Learning with Latent Language (L3), a similar model proposed for this setting that uses language as a discrete bottleneck (see the paper for details).

In the full paper, we also explore which parts of language are most important (spoiler: a little bit of everything), and how much language is needed for LSL to improve over models that don’t use language (spoiler: surprisingly little!).

Moving Forward

As NLP systems grow in their ability to understand and produce language, so too
grows the potential for machine learning systems to learn from language to
solve other challenging tasks. In the papers above, we’ve shown that deep
neural language models can be used to successfully learn from language
explanations to improve generalization across a variety of tasks in vision and
NLP.

We think this is an exciting new avenue for training machine learning models, and similar ideas are already being explored in areas such as reinforcement learning (4, 5). We envision a future where, in order to solve a machine learning task, we no longer have to collect a large labeled dataset, but instead interact naturally and expressively with a model in the same way that humans have interacted with each other for millennia—through language.

Acknowledgments

Thanks to our coauthors (Pang Wei Koh, Percy Liang, and Noah Goodman), and to Nelson Liu, Pang Wei Koh, and the rest of the SAIL blog team for reviewing and publishing this blog post. This research was supported in part by the Facebook Fellowship (to Pang Wei Koh), the NSF Graduate Research Fellowship (to Jesse Mu), the Toyota Research Institute, and the Office of Naval Research.


Stanford AI Lab Papers and Talks at CoRL 2020

The Conference on Robot Learning (CoRL) 2020 is being hosted virtually from November 16th – November 18th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers


Learning 3D Dynamic Scene Representations for Robot Manipulation


Authors: Zhenjia Xu, Zhanpeng He, Jiajun Wu, Shuran Song

Contact: jiajunwu@cs.stanford.edu

Links: Paper | Video | Website

Keywords: scene representations, 3d perception, robot manipulation


Learning Latent Representations to Influence Multi-Agent Interaction


Authors: Annie Xie, Dylan P. Losey, Ryan Tolsma, Chelsea Finn, Dorsa Sadigh

Contact: anniexie@stanford.edu

Links: Paper | Blog Post | Website

Keywords: multi-agent systems, human-robot interaction, reinforcement learning


Learning Object-conditioned Exploration using Distributed Soft Actor Critic


Authors: Ayzaan Wahid (Google), Austin Stone (Google), Brian Ichter (Google Brain), Kevin Chen (Stanford), Alexander Toshev (Google)

Contact: ayzaan@google.com

Links: Paper

Keywords: object navigation, visual navigation


MATS: An Interpretable Trajectory Forecasting Representation for Planning and Control


Authors: Boris Ivanovic, Amine Elhafsi, Guy Rosman, Adrien Gaidon, Marco Pavone

Contact: borisi@stanford.edu

Links: Paper | Video

Keywords: trajectory forecasting, learning dynamical systems, motion planning, autonomous vehicles


Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous


Authors: Rose E. Wang, J. Chase Kew, Dennis Lee, Tsang-Wei Edward Lee, Tingnan Zhang, Brian Ichter, Jie Tan, Aleksandra Faust

Contact: rewang@stanford.edu

Links: Paper | Video | Website

Keywords: multiagent systems; model-based reinforcement learning


Reinforcement Learning with Videos: Combining Offline Observations with Interaction


Authors: Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, Chelsea Finn

Contact: karls@seas.upenn.edu

Links: Paper | Website

Keywords: reinforcement learning, learning from observation


Sampling-based Reachability Analysis: A Random Set Theory Approach with Adversarial Sampling


Authors: Thomas Lew, Marco Pavone

Contact: thomas.lew@stanford.edu

Links: Paper

Keywords: reachability analysis, robust planning and control, neural networks

Keynote


Walking the Boundary of Learning and Interaction (Dorsa Sadigh)

Overview: There have been significant advances in the field of robot learning in the past decade. However, many challenges still remain when considering how robot learning can advance interactive agents such as robots that collaborate with humans. This includes autonomous vehicles that interact with human-driven vehicles or pedestrians, service robots collaborating with their users at homes over short or long periods of time, or assistive robots helping patients with disabilities. This introduces an opportunity for developing new robot learning algorithms that can help advance interactive autonomy.

In this talk, I will discuss a formalism for human-robot interaction built upon ideas from representation learning. Specifically, I will first discuss the notion of latent strategies: low-dimensional representations sufficient for capturing non-stationary interactions. I will then talk about the challenges of learning such representations when interacting with humans, and how we can develop data-efficient techniques that enable actively learning computational models of human behavior from demonstrations, preferences, or physical corrections. Finally, I will introduce an intuitive control paradigm that enables seamless collaboration based on learned representations, and further discuss how it can be used to influence humans.

Live Event: November 17th, 7:00AM – 7:45AM PST


We look forward to seeing you at CoRL!


Stanford AI Lab Papers and Talks at EMNLP 2020

The Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020 is being hosted virtually from November 16th – November 20th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

Main Conference


Pre-Training Transformers as Energy-Based Cloze Models


Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

Contact: kevclark@cs.stanford.edu

Links: Paper

Keywords: representation learning, self-supervised learning, energy-based models


ALICE: Active Learning with Contrastive Natural Language Explanations


Authors: Weixin Liang, James Zou, Zhou Yu

Contact: wxliang@stanford.edu

Links: Paper

Keywords: natural language explanation, class-based active learning, contrastive explanation


CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT


Authors: Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, Matthew P. Lungren

Contact: akshaysm@stanford.edu

Links: Paper

Keywords: bert, natural language processing, radiology, medical imaging, deep learning


AutoQA: From Databases To QA Semantic Parsers With Only Synthetic Training Data


Authors: Silei Xu, Sina J. Semnani, Giovanni Campagna, Monica S. Lam

Contact: silei@cs.stanford.edu

Links: Paper

Keywords: question answering, semantic parsing, language models, synthetic training data, data augmentation


Data and Representation for Turkish Natural Language Inference


Authors: Emrah Budur, Rıza Özçelik, Tunga Güngör, Christopher Potts

Contact: emrah.budur@boun.edu.tr

Links: Paper | Website

Keywords: sentence-level semantics, natural language inference, neural machine translation, morphologically rich language


Intrinsic Evaluation of Summarization Datasets


Authors: Rishi Bommasani, Claire Cardie

Contact: nlprishi@stanford.edu

Links: Paper | Video | Website | Virtual Conference Room

Keywords: summarization, datasets, evaluation


Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models


Authors: Isabel Papadimitriou, Dan Jurafsky

Contact: isabelvp@stanford.edu

Links: Paper

Keywords: transfer learning, analysis, music, hierarchical structure


Localizing Open-Ontology QA Semantic Parsers in a Day Using Machine Translation


Authors: Mehrad Moradshahi, Giovanni Campagna, Sina J. Semnani, Silei Xu, Monica S. Lam

Contact: mehrad@cs.stanford.edu

Links: Paper | Website

Keywords: machine translation, semantic parsing, localization


SLM: Learning a Discourse Language Representation with Sentence Unshuffling


Authors: Haejun Lee, Drew A. Hudson, Kangwook Lee, Christopher D. Manning

Contact: dorarad@stanford.edu

Links: Paper

Keywords: transformer, bert, language, understanding, nlp, squad, glue, sentences, discourse


Utility is in the Eye of the User: A Critique of NLP Leaderboards


Authors: Kawin Ethayarajh, Dan Jurafsky

Contact: kawin@stanford.edu

Links: Paper | Website

Keywords: nlp, leaderboard, utility, benchmark, fairness, efficiency


With Little Power Comes Great Responsibility


Authors: Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, Dan Jurafsky

Contact: dcard@stanford.edu

Links: Paper | Website

Keywords: statistical power, experimental methodology, leaderboards, machine translation, human evaluation


Findings of EMNLP


DeSMOG: Detecting Stance in Media On Global Warming


Authors: Yiwei Luo, Dallas Card, Dan Jurafsky

Contact: yiweil@stanford.edu

Links: Paper | Website

Keywords: computational social science; framing; argumentation; stance; bias; climate change


Investigating Transferability in Pretrained Language Models


Authors: Alex Tamkin, Trisha Singh, Davide Giovanardi, Noah Goodman

Contact: atamkin@stanford.edu

Links: Paper | Website | Virtual Conference Room

Keywords: finetuning, transfer learning, language models, bert, probing


Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations


Authors: Peng Qi, Yuhao Zhang, Christopher D. Manning

Contact: pengqi@cs.stanford.edu

Links: Paper | Blog Post

Keywords: conversational agents, question generation, natural language generation


Do Language Embeddings Capture Scales?


Authors: Xikun Zhang*, Deepak Ramachandran*, Ian Tenney, Yanai Elazar, Dan Roth

Contact: xikunz2@cs.stanford.edu

Links: Paper | Virtual Conference Room

Keywords: probing, analysis, bertology, scales, common sense knowledge


On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks


Authors: Stephen Mussmann, Robin Jia, Percy Liang

Contact: robinjia@cs.stanford.edu

Links: Paper | Website

Keywords: active learning, robustness, label imbalance


Pragmatic Issue-Sensitive Image Captioning


Authors: Allen Nie, Reuben Cohn-Gordon, Christopher Potts

Contact: anie@stanford.edu

Links: Paper | Video

Keywords: controllable caption generation, question under discussion, discourse, pragmatics


Workshops and Co-Located Conferences


BLEU Neighbors: A Reference-less Approach to Automatic Evaluation


Authors: Kawin Ethayarajh, Dorsa Sadigh

Contact: kawin@stanford.edu

Links: Paper | Website

Keywords: nlp, bleu, evaluation, nearest neighbors, dialogue


Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning


Authors: Rachel Gardner, Maya Varma, Clare Zhu, Ranjay Krishna

Contact: rachel0@cs.stanford.edu

Links: Paper

Keywords: noisy text, bert, plausibility, multi-task learning


Explaining the ‘Trump Gap’ in Social Distancing Using COVID Discourse


Authors: Austin van Loon, Sheridan Stewart, Brandon Waldon, Shrinidhi K. Lakshmikanth, Ishan Shah, Sharath Chandra Guntuku, Garrick Sherman, James Zou, Johannes Eichstaedt

Contact: avanloon@stanford.edu

Links: Paper

Keywords: computational social science, social distancing, word2vec, vector semantics, twitter, bert


Learning Adaptive Language Interfaces through Decomposition


Authors: Siddharth Karamcheti, Dorsa Sadigh, Percy Liang

Contact: skaramcheti@cs.stanford.edu

Links: Paper | Virtual Conference Room

Keywords: semantic parsing, interaction, decomposition


Modeling Subjective Assessments of Guilt in Newspaper Crime Narratives


Authors: Elisa Kreiss*, Zijian Wang*, Christopher Potts

Contact: ekreiss@stanford.edu

Links: Paper | Website

Keywords: psycholinguistics, pragmatics, token-level supervision, model attribution, news, guilt, hedges, corpus, subjectivity


Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation


Authors: Atticus Geiger, Kyle Richardson, Chris Potts

Contact: atticusg@stanford.edu

Links: Paper | Website

Keywords: entailment, intervention, causality, systematic generalization


Structured Self-Attention Weights Encode Semantics in Sentiment Analysis


Authors: Zhengxuan Wu, Thanh-Son Nguyen, Desmond C. Ong

Contact: wuzhengx@stanford.edu

Links: Paper

Keywords: attention, explainability, sentiment analysis


We look forward to seeing you at EMNLP 2020!


Learning to Influence Multi-Agent Interaction

Interaction with others is an important part of everyday life. No matter
the situation – whether it be playing a game of chess, carrying a
box together, or navigating lanes of traffic – we’re able to
seamlessly compete against, collaborate with, and acclimate to other
people.



Likewise, as robots become increasingly prevalent and capable, their
interaction with humans and other robots is inevitable. However, despite
the many advances in robot learning, most current algorithms are
designed for robots that act in isolation. These methods miss out on the
fact that other agents are also learning and changing – and so the
behavior the robot learns for the current interaction may not work
during the next one! Instead, can robots learn to seamlessly interact
with humans and other robots by taking their changing strategies into
account? In our new work (paper,
website), we
begin to investigate this question.

A standard reinforcement learning agent (left) based on Soft
Actor-Critic
(SAC) assumes that
the opponent (right) follows a fixed strategy, and only blocks on its
left side.

Interactions with humans are difficult for robots because humans and
other intelligent agents don’t have fixed behavior – their
strategies and habits change over time. In other words, they update
their actions in response to the robot and thus continually change the
robot’s learning environment. Consider the robot on the left (the agent)
learning to play air hockey against the non-stationary robot on the
right. Rather than hitting the same shot every time, the other robot
modifies its policy between interactions to exploit the agent’s
weaknesses. If the agent ignores how the other robot changes, then it
will fail to adapt accordingly and learn a poor policy.

The best defense for the agent is to block where it thinks the opponent
will next target. The robot therefore needs to anticipate how the
behavior of the other agent will change, and model how its own actions
affect the other’s behavior. People can deal with these scenarios on a
daily basis (e.g., driving, walking), and they do so without explicitly
modeling every low-level aspect of each other’s policy.

Humans tend to be bounded-rational (i.e., their rationality is limited
by knowledge and computational capacity), and so likely keep track of
much less complex entities during interaction. Inspired by how humans
solve these problems, we recognize that robots also do not need to
explicitly model every low-level action another agent will make.
Instead, we can capture the hidden, underlying intent – what we call
latent strategy (in the sense that it underlies the actions of the
agent) – of other agents through learned low-dimensional
representations. These representations are learned by optimizing neural
networks based on experience interacting with these other agents.

Learning and Influencing Latent Intent

We propose a framework for learning latent representations of another
agent’s policy: Learning and Influencing Latent Intent (LILI). The
agent of our framework identifies the relationship between its behavior
and the other agent’s future strategy, and then leverages these latent
dynamics to influence the other agent, purposely guiding them towards
policies suitable for co-adaptation. At a high level, the robot learns
two things: a way to predict latent strategy, and a policy for
responding to that strategy. The robot learns these during interaction
by “thinking back” to prior experiences, and figuring out what
strategies and policies it should have used.

Modeling Agent Strategies

The first step, shown in the left side of the diagram above, is to learn
to represent the behavior of other agents. Many prior works assume
access to the underlying intentions or actions of other agents, which
can be a restrictive assumption. We instead recognize that a
low-dimensional representation of their behavior, i.e., their latent
strategy, can be inferred from the dynamics and rewards experienced by
the agent during the current interaction. Therefore, given a sequence of
interactions, we can train an
encoder-decoder
model; the encoder embeds interaction and predicts the next
latent strategy , and the decoder takes this prediction
and reconstructs the transitions and rewards observed during interaction
.
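A minimal sketch of this encoder-decoder in PyTorch (architectures, dimensions, and the flattened-interaction input are placeholder choices of ours, not the paper’s exact design):

```python
import torch
import torch.nn as nn

class StrategyEncoder(nn.Module):
    """Embed the previous interaction and predict the next latent strategy z."""
    def __init__(self, interaction_dim, latent_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(interaction_dim, 64), nn.ReLU(),
                                 nn.Linear(64, latent_dim))

    def forward(self, prev_interaction):  # flattened (state, action, reward) sequence
        return self.net(prev_interaction)

class InteractionDecoder(nn.Module):
    """Reconstruct the next state and reward from z and a (state, action) pair."""
    def __init__(self, latent_dim, state_action_dim, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + state_action_dim, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, state_dim + 1))  # next state + reward

    def forward(self, z, state_action):
        return self.net(torch.cat([z, state_action], dim=-1))
```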

Influencing by Optimizing for Long-Term Rewards

Given a prediction of what strategy the other agent will follow next,
the agent can learn how to react to it, as illustrated on the right
side of the diagram above. Specifically, we train an agent policy
with reinforcement learning (RL) to
make decisions conditioned on the latent strategy predicted
by the encoder.

However, beyond simply reacting to the predicted latent strategy, an intelligent agent should proactively influence this strategy to maximize rewards over repeated interactions. Returning to our hockey example, consider an opponent with three different strategies: it fires to the left, down the middle, or to the right. Moreover, left-side shots are easier for the agent to block, and so give a higher reward when successfully blocked. The agent should influence its opponent to adopt the left strategy more frequently in order to earn higher long-term rewards.

For learning this influential behavior, we train the agent policy to maximize the rewards accumulated across multiple consecutive interactions, rather than within a single one.
With this objective, the agent learns to generate interactions that
influence the other agent, and hence the system, toward outcomes that
are more desirable for the agent or for the team as a whole.

Experiments

2D Navigation

We first consider a simple point mass navigation task. Similar to
pursuit-evasion games, the agent needs to reach the other agent (i.e.,
the target) in a 2D plane. This target moves one step clockwise or
counterclockwise around a circle depending on where the agent ended the
previous interaction. Because the agent starts off-center, some target
locations can be reached more efficiently than others. Importantly, the
agent never observes the location of the target.

Below, we visualize 25 consecutive interactions from policies learned by
Soft Actor-Critic (SAC) (a standard RL algorithm), LILI (no influence),
and LILI. LILI (no influence) corresponds to our approach without the
influencing objective; i.e., the agent optimizes rewards accumulated in
a single interaction. The gray circle represents the target, while the
teal line marks the trajectory taken by the agent and the teal circle
marks the agent’s position at the final timestep of the interaction.

[Animated rollouts, left to right: SAC, LILI (no influence), LILI]

The SAC policy, at convergence, moves to the center of the circle in
every interaction. Without knowledge of or any mechanism to infer where
the other agent is, the center of the circle gives the highest stable
rewards. In contrast, LILI (no influence) successfully models the other
agent’s behavior dynamics and correctly navigates to the other agent,
but isn’t trained to influence the other agent. Our full approach LILI
does learn to influence: it traps the other agent at the top of the
circle, where the other agent is closest to the agent’s starting
position and yields the highest rewards.

Robotic Air Hockey

Next, we evaluate our approach on the air hockey task, played between
two robotic agents. The agent first learns alongside a robot opponent,
then plays against a human opponent. The opponent is a rule-based agent
which always aims away from where the agent last blocked. When blocking,
the robot does not know where the opponent is aiming, and only observes
the vertical position of the puck. We additionally give the robot a
bonus reward if it blocks a shot on the left of the board, which
incentivizes the agent to influence the opponent into aiming left.

In contrast to the SAC agent, the LILI agent learns to anticipate
the opponent’s future strategies and successfully block the different
incoming shots.

Because the agent receives a bonus reward for blocking left, it should
lead the opponent into firing left more often. LILI (no influence) fails
to guide the opponent into taking advantage of this bonus: the
distribution over the opponent’s strategies is uniform. In contrast,
LILI leads the opponent to strike left 41% of the time, demonstrating
the agent’s ability to influence the opponent. Specifically, the agent
manipulates the opponent into alternating between the left and middle
strategies.

Finally, we test the policy learned by LILI (no influence) against a
human player following the same strategy pattern as the robot opponent.
Importantly, the human has imperfect aim and so introduces new noise to
the environment. We originally intended to test our approach LILI with
human opponents, but we found that – although LILI worked well when
playing against another robot – the learned policy was too brittle
and did not generalize to playing alongside human opponents. However,
the policy learned with LILI (no influence) was able to block 73% of
shots from the human.

Final Thoughts

We proposed a framework for multi-agent interaction that represents the
behavior of other agents with learned high-level strategies, and
incorporates these strategies into an RL algorithm. Robots with our
approach were able to anticipate how their behavior would affect another
agent’s latent strategy, and actively influenced that agent for more
seamless co-adaptation.

Our work represents a step towards building robots that act alongside
humans and other agents. To this end, we’re excited about these next
steps:

  • The agents we examined in our experiments had a small number of simple strategies determining their behavior. We’d like to study the scalability of our approach to more complex agent strategies that we’re likely to see in humans and intelligent agents.

  • Instead of training alongside artificial agents, we hope to study the human-in-the-loop setting in order to adapt to the dynamic needs and preferences of real people.


This post is based on the following paper:

Annie Xie, Dylan P. Losey, Ryan Tolsma, Chelsea Finn, Dorsa Sadigh.
Learning Latent Representations to Influence Multi-Agent Interaction.
Project webpage

Finally, thanks to Dylan Losey, Chelsea Finn, Dorsa Sadigh, Andrey Kurenkov, and Michelle Lee for valuable feedback on this post.


Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation


Named entity disambiguation (NED) is the process of mapping “strings” to “things” in a knowledge base. You have likely already used a system that requires NED multiple times today. Every time you ask a question to your personal assistant or issue a search query on your favorite browser, these systems use NED to understand what people, places, and things (entities) are being talked about.

Named entity disambiguation example. The ambiguous “Lincoln” refers to the car, not the person or location.

Take the example shown above. You ask your personal assistant “What is the average gas mileage of a Lincoln?”. The assistant would need NED to know that “Lincoln” refers to Lincoln Motors (the car company)—not the former president or city in Nebraska. The ambiguity of mentions in text is what makes NED so challenging as it requires the use of subtle cues.

The spectrum of entities. Popular (head) entities occur frequently in data while rare (tail) entities are infrequent.

NED gets more interesting when we examine the full spectrum of entities shown above, specifically the more rare tail and unseen entities. These are entities that occur infrequently or not at all in data. Performance over the tail is critical because the majority of entities are rare. In Wikidata, only 13% of entities even have Wikipedia pages as a source of textual information.

Bootleg compared to a BERT-based baseline model (Févry et al. 2020), showing average F1 versus the number of times an entity occurred in the training data. As there are 15x as many entities in Wikidata as in Wikipedia (most of them rare) and the baseline model needs to see an entity on average 100 times to achieve 60 F1, it follows that the baseline model would need to train on data 1,500x the size of Wikipedia to achieve 60 F1 over all entities.

Prior approaches to NED use BERT-based systems to memorize textual patterns associated with an entity (e.g., Abraham Lincoln is associated with “president”). As shown above, the SotA BERT-based baseline from Févry does a great job at memorizing patterns over popular entities (it achieves 86 F1 points over all entities). For the rare entities, it does much worse (58 F1 points lower on the tail). One possible solution to better tail performance is to simply train over more data, but this would likely require training over data 1,500x the size of Wikipedia for the model to achieve 60 F1 points over all entities!

In this blog post, we present Bootleg, a self-supervised approach to NED that is better able to handle rare entities.

Tail Disambiguation through NED Reasoning Patterns

The question we are left with is: how do we disambiguate these rare entities? Our insight is that humans disambiguate entities, including rare ones, by using signals from the text as well as from entity relations and types. For example, the sentence “What is the gas mileage of a Lincoln?” requires reasoning that cars have a gas mileage, while people and locations do not. The same pattern can be used to reason that the mention of “Bluebird” in “What is the average gas mileage of a Bluebird?” refers to the car – a Nissan Bluebird – not the animal. Our goal in Bootleg is to train a model to reason over entity types and relations and better identify these tail entities.

Through empirical analysis, we found four reasoning patterns for NED, shown and defined in the figure below.

Four reasoning patterns of NED. Each pattern uses some combination of entity, type, and relation information.

These patterns rely on signals from entities, types, and relations. Luckily, tail entities do not have equally rare types and relations. This means we should be able to learn type and relation patterns from our data that can apply to tail entities.

Bootleg: A Model for Tail NED

Bootleg takes as input a sentence, determines the possible entity candidates that could be mentioned in the sentence, and outputs the most likely candidates. The core insight that enables Bootleg to better identify rare entities is in how it internally represents entities.

The creation of an entity candidate representation. Each candidate is a combination of an entity, type, and relation learned embedding.

Similar to how words are often represented by continuous word embeddings (e.g., in BERT or ELMo), Bootleg represents each entity candidate as a combination of a unique entity embedding, a type embedding, and a relation embedding, as shown above. For example, every car entity gets the same car type embedding (and likewise for relations), which encodes patterns learned over all cars in the training data. A rare car can then use this global “car type” knowledge for disambiguation, since the car type embedding is part of its representation.

To output the correct entities, Bootleg uses these representations in a stacked Transformer module to allow the model to naturally learn the useful patterns for disambiguation without hard-coded rules. Bootleg then scores the output candidate representations and returns the most likely candidates.
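As a rough illustration of these two steps, here is a minimal PyTorch sketch. The vocabulary sizes, the single type and relation per candidate, and the stand-in mention encoding are all hypothetical simplifications; the actual Bootleg architecture is considerably richer.

```python
import torch
import torch.nn as nn

# Sketch: each candidate = entity + type + relation embedding, so a rare
# entity still benefits from globally trained type/relation knowledge.
class CandidateEncoder(nn.Module):
    def __init__(self, n_entities, n_types, n_relations, dim=128):
        super().__init__()
        self.entity_emb = nn.Embedding(n_entities, dim)
        self.type_emb = nn.Embedding(n_types, dim)     # shared by, e.g., all cars
        self.rel_emb = nn.Embedding(n_relations, dim)  # shared within a relation

    def forward(self, entity_ids, type_ids, rel_ids):
        return (self.entity_emb(entity_ids)
                + self.type_emb(type_ids)
                + self.rel_emb(rel_ids))

encoder = CandidateEncoder(n_entities=50_000, n_types=300, n_relations=100)
cands = encoder(torch.tensor([3, 41, 977]),   # three candidates for one mention
                torch.tensor([5, 5, 9]),      # say, two cars and one person
                torch.tensor([7, 7, 2]))

# Candidates are contextualized with stacked Transformer layers, then scored.
layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=2)
contextualized = transformer(cands.unsqueeze(0)).squeeze(0)  # (3, 128)
mention_vec = torch.randn(128)         # stand-in for the mention's encoding
scores = contextualized @ mention_vec  # one score per candidate
print(scores.argmax())                 # index of the most likely candidate
```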

There are other exciting techniques we present in our paper regarding regularization and weak labeling to improve tail performance.

Bootleg Improves Tail Performance and Allows for Knowledge Transfer

Our simple insight of training a model to reason over types and relations provides state-of-the-art performance on three standard NED benchmarks – matching or exceeding SotA by up to 5.6 F1 points – and outperforms a BERT-based NED baseline by 5.4 F1 points over all entities and 40 F1 points over tail entities (see F1 versus entity occurrence plot above).

Benchmark         System              Precision  Recall  F1
KORE50            Hu et al., 2019     80.0       79.8    79.9
                  Bootleg             86.0       85.4    85.7
RSS500            Phan et al., 2019   82.3       82.3    82.3
                  Bootleg             82.5       82.5    82.5
AIDA CoNLL YAGO   Févry et al., 2020  –          –       96.7
                  Bootleg             96.9       96.7    96.8

We’ll now show how the entity knowledge encoded in Bootleg’s entity representations can transfer to non-NED tasks. We extract our entity representations and use them in a production task at a major technology company and in a relation extraction task. The Bootleg embeddings provide an 8% lift in performance on the production task and even improve quality in Spanish, French, and German. We repeat this experiment by adding Bootleg representations to a SotA model for the TACRED relation extraction task (see tutorial), and find that this Bootleg-enhanced model sets a new SotA by 1 F1 point.

Model             TACRED F1
Bootleg-Enhanced  80.3
KnowBERT          79.3
SpanBERT          78.0

These results suggest that Bootleg entity representations can transfer entity knowledge to other language tasks!

Recap

To recap, we described the problem of the tail of NED and showed that existing NED systems fall short at disambiguating these rare, yet important entities. We then introduced four reasoning patterns for NED and described how we trained Bootleg to learn these patterns through the use of embeddings and Transformer modules. We finally showed that Bootleg is a SotA NED system that better disambiguates rare entities than prior methods. Further, Bootleg learns representations that can transfer entity knowledge to non-NED tasks.

We are actively developing Bootleg and would love to hear your thoughts. See our website, source code, and paper.

Read More

Measuring Bias in NLP (with Confidence!)

Measuring Bias in NLP (with Confidence!)

Countless studies have found that “bias” – typically with respect to race and gender – pervades the embeddings and predictions of the black-box models that dominate natural language processing (NLP). For example, the language model GPT-3, of OpenAI fame, can generate racist rants when given the right prompt. Attempts to detect hate speech can themselves harm minority populations, whose dialect is more likely to be flagged as hateful.

This, in turn, has led to a wave of work on how to “debias” models, only for others to find ways in which debiased models are still biased, and so on.

But are these claims of NLP models being biased (or unbiased) being made with enough evidence?

Consider the sentence “The doctor gave instructions to the nurse before she left.” A co-reference resolution system, tasked with finding which person the pronoun “she” is referring to1, may incorrectly predict that it’s the nurse. Does this incorrect prediction – which conforms to gender stereotypes that doctors are usually male – mean that the system is gender-biased? Possibly – but it may also make mistakes in the other direction with equal frequency (e.g., thinking “he” refers to a nurse when it doesn’t). What if the system makes gender-stereotypical mistakes on not one sentence, but 100, or 1000? Then we could be more confident in claiming that it’s biased.

In my ACL 2020 paper, “Measuring Fairness under Uncertainty with Bernstein Bounds”, I go over how, in the haste to claim the presence or absence of bias, the inherent uncertainty in measuring bias is often overlooked in the literature:

  • Bias is not a single number. When we test how biased a model is, we are estimating its bias on a sample of the data; our estimate may suggest that the model is biased or unbiased, but the opposite could still be true.

  • This uncertainty can be captured using confidence intervals. Instead of reporting a single number for bias, practitioners should report an interval, based on factors such as the desired confidence and the proposed definition of “bias”.

  • Existing datasets are too small to conclusively identify bias. Existing datasets for measuring specific biases can only be used to make 95% confidence claims when the bias estimate is egregiously high; to catch more subtle bias, the NLP community needs bigger datasets.

Although this problem can exist with any kind of model, we focus on a remedy for classification models in particular.

Bernstein-Bounded Unfairness

A bias estimate, made using a small sample of data, likely differs from the true bias (i.e., at the population-level). How can we express our uncertainty about the estimate? We propose a method called Bernstein-bounded unfairness that translates this uncertainty into a confidence interval2.

Let’s say we want to measure whether some protected group $A$ – that is legally protected due to an attribute such as race or gender – is being discriminated against by some classifier, relative to some unprotected group $B$. The two groups occur in the population with frequencies $\gamma_A$ and $\gamma_B$ respectively. We need

  • An annotation function $g$ that maps each example to $A$, $B$, or neither. Note that the annotation function maps inputs to the protected/unprotected groups, not to the output space $Y$. For example, if we wanted to study how a sentiment classifier performed across different racial groups, then the inputs would be sentences, the labels would be the sentiment, and the annotation function might map each sentence to {white, non-white} depending on the racial group of the sentence author.

  • A cost function $c(\hat{y}, y) \in [0, C]$ that describes the cost of incorrectly predicting $\hat{y}$ when the true label is $y$, where $C$ is the maximum possible cost. Since a model making an incorrect prediction for $x$ is an undesirable outcome for the group that $x$ belongs to, we frame this as a cost that must be borne by the group.

We want to choose these functions such that our bias metric of choice – which we call the groupwise disparity $\delta$ – can be expressed as the difference in expected cost borne by the protected and unprotected groups. Given a model $f$ that makes predictions $\hat{y}_a$ for protected inputs $x_a$ and $\hat{y}_b$ for unprotected inputs $x_b$, we want to express the bias as:

$\delta(f) = \mathbb{E}_a[c(\hat{y}_a, y_a)] - \mathbb{E}_b[c(\hat{y}_b, y_b)]$

If the protected group $A$ is incurring higher costs in expectation, it is being biased against. For example, if we want to determine whether a classifier is more accurate on the unprotected group $B$, then we would set the cost function to be the 1-0 loss (1 for an incorrect prediction, 0 for a correct one). If $B$ has a lower cost on average than $A$, then it would mean that the classifier is more accurate on $B$.
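In code, the empirical estimate $\hat{\delta}$ is just the difference in average cost between the two groups. Here is a minimal sketch, where the annotation function g and the classifier predict are hypothetical stand-ins, and the cost is the 1-0 loss:

```python
# Minimal sketch of estimating the groupwise disparity delta_hat.
# `g` maps an input to "A" (protected), "B" (unprotected), or None;
# both it and `predict` are hypothetical stand-ins.
def zero_one_cost(y_pred, y_true):
    return 0.0 if y_pred == y_true else 1.0

def groupwise_disparity(examples, predict, g, cost=zero_one_cost):
    """Mean cost on the protected group minus mean cost on the unprotected group."""
    costs = {"A": [], "B": []}
    for x, y in examples:
        group = g(x)
        if group in costs:
            costs[group].append(cost(predict(x), y))
    mean = lambda values: sum(values) / len(values)
    return mean(costs["A"]) - mean(costs["B"])  # positive => A bears higher cost
```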

For a desired confidence level $\rho$, a dataset of $n$ examples, and the variance $\sigma^2$ of the amortized groupwise disparity across examples, the confidence interval would be

$t = \frac{BL}{3n} + \sqrt{\left(\frac{BL}{3n}\right)^2 + \frac{2\sigma^2 L}{n}}$, where $L = \log\frac{2}{1-\rho}$ and $B = \frac{C}{\min(\gamma_A, \gamma_B)}$.

If we set $\rho = 0.95$, we could claim with 95% confidence that the true bias experienced by the protected group lies in the interval $[\hat{\delta} - t, \hat{\delta} + t]$, where $\hat{\delta}$ is our bias estimate.

Why We Need Bigger Datasets

If we want to say with 95% confidence that a classifier is biased to some extent – but want to spend as little time annotating data as possible – we need to find the smallest $n$ such that $t < \hat{\delta}$. We can do this by working backwards from the formula for $t$ given above (see paper for details).

Let’s go back to our original example. Say we want to figure out whether a co-reference resolution system, tasked with matching pronouns to the nouns they refer to, is gender-biased or not. We have a dataset of 500 examples to test whether the model does better on gender-stereotypical examples (e.g., a female nurse) than non-gender-stereotypical examples (e.g., a male nurse). Since we are measuring the difference in accuracy, we set the cost function to be the 1-0 loss.

On this dataset, our bias estimate for the model we’re evaluating is $\hat{\delta} = 0.05$. Is this enough to claim with 95% confidence that the model is gender-biased?

In this scenario, $C = 1$ and $\rho = 0.95$. We assume that there are equally many stereotypical and non-stereotypical examples ($\gamma_A = \gamma_B = 0.5$) and that the variance is maximal, so $\sigma^2 = 4$.

With these settings, $n > 11903$; we would need a dataset of more than 11903 examples to claim with 95% confidence that the co-reference resolution system is gender-biased. This is roughly 3.8 times larger than WinoBias, the largest dataset currently available for this purpose. We could only use WinoBias if $\hat{\delta} \geq 0.0975$ – that is, if the sample bias were almost twice as high.
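As a sanity check, here is a minimal sketch of this calculation, using the interval formula reconstructed above with the settings from the worked example ($C = 1$, $\gamma_A = \gamma_B = 0.5$, $\sigma^2 = 4$, $\rho = 0.95$), and taking WinoBias to contain 3,160 examples (implied by the 3.8x figure above):

```python
import math

# Width t of the Bernstein-based confidence interval, as reconstructed above:
#   t = B*L/(3n) + sqrt((B*L/(3n))**2 + 2*sigma2*L/n),
# with L = log(2 / (1 - rho)) and B = C / min(gamma_A, gamma_B).
def interval_width(n, rho=0.95, C=1.0, min_gamma=0.5, sigma2=4.0):
    L = math.log(2 / (1 - rho))
    a = (C / min_gamma) * L / (3 * n)
    return a + math.sqrt(a**2 + 2 * sigma2 * L / n)

def min_n(delta_hat, **kwargs):
    """Smallest n for which the interval excludes zero (t < delta_hat)."""
    n = 1
    while interval_width(n, **kwargs) >= delta_hat:
        n += 1
    return n

print(min_n(0.05))           # ~11903 examples needed when delta_hat = 0.05
print(interval_width(3160))  # ~0.0975: the smallest bias WinoBias can certify
```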

As seen above, the WinoBias dataset cannot be used to make claims of bias with 95% confidence unless the sample bias is egregiously high.

Conclusion

In the haste to claim the presence or absence of bias in models, the uncertainty in estimating bias is often overlooked in the literature. A model’s bias is often thought of as a single number, even though this number is ultimately an estimate and not the final word on whether the model is or is not biased.

We proposed a method called Bernstein-bounded unfairness for capturing this uncertainty using confidence intervals. To faithfully reflect the range of possible conclusions, we recommend that NLP practitioners measuring bias not only report their bias estimate but also this confidence interval.

What if we want to catch more subtle bias? Although it may be possible to derive tighter confidence intervals, what we really need are larger bias-specific datasets. The datasets we currently have are undoubtedly helpful, but they need to be much larger in order to diagnose biases with confidence.

Acknowledgements

Many thanks to Krishnapriya Vishnubhotla, Michelle Lee, and Kaitlyn Zhou for their feedback on this blog post.

  1. The goal of coreference resolution more broadly is to find all expressions that refer to the same entity in a text. For example, in “I gave my mother Sally a gift for her birthday.”, the terms “my mother”, “Sally”, and “her” all refer to the same entity. 

  2. We use Bernstein’s inequality to derive the confidence intervals, hence the name Bernstein-bounded unfairness. This inequality tells us with what probability the average of independent random variables will be within a constant $t$ of their true mean $\mu$. 

Read More

Learning to Fix Programs from Error Messages

Learning to Fix Programs from Error Messages

Machine Learning for Program Repair

When writing programs, a lot of time is spent debugging or fixing source code errors, both for beginners (imagine the intro programming classes you took) as well as for professional developers (for example, this case study from Google 1). Automating program repair could dramatically enhance the productivity of both programming and learning programming. In our recent work published at ICML 2020, we study how to use machine learning to repair programs automatically.

Problem Setting

Programmers write programs incrementally: write code, compile or execute it, and if there are any errors, repair the program based on the received feedback. Can we model and solve this problem with machine learning?

Let’s say we have a broken C++ program (figure left), where the char in line 5 should actually be string. When we compile it, we get an error (figure top right), which says “line 9 is requesting for size in a which is of type char”. From this message, a programmer would notice that the error is related to the type of the variable a, trace how a has been used or declared in the source code to reach line 5, and then edit that line to fix the error. Thus, the concrete task we want our machine learning model to solve is: given broken code (figure left) and an error message (figure top right), localize the error line (line 5) and generate a repaired version of it (“string tmp, a, b;”) (figure bottom right).

Challenges:
This task poses two main challenges. First, on the modeling side, we need to connect and jointly reason over two modalities – the program and the error message – for instance, by tracking the variables that caused the error, as we saw in the example above. Second, on the training data side, we need an efficient source of supervision for correcting broken programs; unfortunately, existing labeled datasets with <broken code, fixed code> pairs are small, hard to come by, and don’t scale up. In this work, we introduce promising solutions to these two challenges: 1) modeling program repair with a program-feedback graph, and 2) introducing a self-supervised training scheme that uses unlabeled programs.

Modeling Approach: Program-Feedback Graph

How can we effectively connect the two modalities (programs and error messages) and perform the reasoning needed for repair? To achieve this, we introduce a program-feedback graph, a joint graph representation that connects symbols across the program and error message. For instance, the compiler message in the example mentions a, size, and char, so we connect these symbols to their occurrences in the source code, to capture semantic correspondence. This way, we treat the two modalities in a shared semantic space rather than separately. We then perform reasoning over the symbols in this space using graph attention 2.
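Below is a rough sketch of the graph-construction idea under simplified tokenization: identifiers in the error message are linked to their occurrences in the program, so the two modalities share one graph. The real model builds a richer graph and runs graph attention over it.

```python
import re

# A minimal sketch of building a program-feedback graph: we connect each
# identifier in the compiler message to its occurrences in the program.
# (Token handling is simplified; the real model also adds edges within
# each modality and reasons over the graph with graph attention.)
def build_graph(program_lines, error_message):
    token = re.compile(r"[A-Za-z_]\w*")
    nodes, edges = [], []
    index = {}  # symbol -> node ids of its occurrences in the program
    for i, line in enumerate(program_lines):
        for sym in token.findall(line):
            index.setdefault(sym, []).append(len(nodes))
            nodes.append(("code", i, sym))
    for sym in token.findall(error_message):
        msg_node = len(nodes)
        nodes.append(("msg", None, sym))
        for code_node in index.get(sym, []):
            edges.append((msg_node, code_node))  # cross-modal edge
    return nodes, edges

nodes, edges = build_graph(
    ["string tmp;", "char a, b;", "cout << a.size();"],
    "request for member 'size' in 'a', which is of non-class type 'char'",
)
# Edges now link 'a', 'size', and 'char' in the message to their code occurrences.
```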

Specifically, for the model architecture, we build on the encoder-decoder framework commonly used in NLP, which encodes input sequences (in our case, the program and error message; next figure bottom) and then decodes outputs (in our case, the localized line index, and the repaired version of the line; figure top), and we incorporate a graph attention module applied to the program-feedback graph in the intermediate layer of the architecture (figure middle).

Training Approach: Self-Supervised Learning

Our second technique is self-supervised learning. Labeled datasets of program repair are small, but there are vast amounts of unlabeled programs available online. For example, GitHub has more than 30M public repositories. Using this large amount of freely available code to improve learning program repair would significantly enhance the scalability and reliability of our system.

Our idea is as follows: we first collect unlabeled, working programs from online resources such as GitHub and codeforces.com (figure left). We then design randomized program corruption procedures (e.g. delete/insert/replace tokens) and corrupt the unlabeled programs (figure middle). As a result, the corrupted programs give us errors (figure right). This way, we can create many new examples of program repair: <broken code, error message, fixed code> triples. We can use this extra data to pre-train the program repair model, and then fine-tune on the labeled target dataset.
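A minimal sketch of such a corruption procedure, assuming whitespace-tokenized code (a real pipeline would lex the program properly and then compile the corrupted version to harvest the error message):

```python
import random

# Randomly delete, insert, or replace a token in a working program.
def corrupt(code, rng=random):
    tokens = code.split()
    op = rng.choice(["delete", "insert", "replace"])
    i = rng.randrange(len(tokens))
    if op == "delete":
        del tokens[i]
    elif op == "insert":
        tokens.insert(i, rng.choice([";", "}", ")", "int"]))
    else:  # replace
        tokens[i] = rng.choice([";", "}", ")", "int"])
    return " ".join(tokens)

working = "int main ( ) { int x = 0 ; return x ; }"
broken = corrupt(working)
# Compiling `broken` (e.g., with gcc via subprocess) yields an error message,
# giving a free <broken code, error message, fixed code> training triple.
```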

Let’s use our program repair model!

We apply and evaluate our repair model (which we call DrRepair) on two benchmark tasks:

Application to DeepFix (Correcting Student Programs)

In DeepFix, the task is to correct C programs written by students in an intro programming class so that they compile. The input programs may have multiple erroneous lines, so we apply the repair model iteratively, addressing one error at a time. For instance, the following figure shows an example program in DeepFix with a compiler error saying that “i is undeclared”. Our repair model, DrRepair, fixes this error by inserting a declaration of i in line 5. After this fix, we notice another error, which says “expected semicolon before brace”. We apply the repair model again – this time, the model inserts a semicolon in line 12, and the repaired program compiles successfully! This approach is conducive to iterative refinement: we can keep running the repair model to progressively fix errors.
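This loop is easy to express in code. Here is a minimal sketch, where compile_code and repair_model are hypothetical stand-ins for a compiler wrapper and the trained repair model:

```python
# Iterative refinement: repeatedly localize and repair one error at a time.
# `compile_code(code)` returns an error message, or None if compilation succeeds;
# `repair_model(code, error)` returns (line_index, repaired_line).
def iterative_repair(code, compile_code, repair_model, max_rounds=5):
    for _ in range(max_rounds):
        error = compile_code(code)
        if error is None:
            return code                 # success: all errors fixed
        line_idx, new_line = repair_model(code, error)
        lines = code.splitlines()
        lines[line_idx] = new_line      # apply the single-line fix
        code = "\n".join(lines)
    return code                         # best effort once the budget is exhausted
```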

What is the effect of using error messages, program-feedback graphs, and self-supervised pre-training? Existing repair systems studied on DeepFix did not use compiler error messages – they aimed to translate directly from broken code to fixed code. To see the effect of using error messages in the first place, we tried removing all of our techniques from the system: the use of compiler messages, program-feedback graphs, and pre-training. This version of our model (“ours: no compiler” in the figure below) achieves 34% repair accuracy on DeepFix, which is comparable to the existing systems. Now we add compiler messages to our input. We find that this model achieves much better performance and generalization (62.5% accuracy; “ours: base” in the figure). This suggests that with access to error messages, the model learns the right inductive bias to repair code based on the feedback. Next, we add program-feedback graphs and self-supervised pre-training. Both provide further improvements (“ours: base+graph” and “ours: base+graph+pretrain”), and our final system fixes 68.2% of the broken programs in DeepFix!

Application to SPoC (Natural Language to Code)

Program synthesis systems – in particular, those that translate natural language descriptions (e.g. English) into code (e.g. Python, C++) – are useful because they can help a wider range of people use programming languages. In SPoC (Pseudocode-to-Code), the task is to synthesize a C++ implementation from pseudocode, a natural language description of a program. One challenge faced by existing synthesizers (machine translation models applied to SPoC) is that they tend to output inconsistent code that does not compile – for instance, in the figure below, the variable i is declared twice in the synthesized code. We find that we can apply our program repair model to this invalid code and fix it into correct code, helping the program synthesis task. In the evaluation on SPoC, using our repair model improves the final synthesis success rate from the existing system’s 34% to 37.6%.

Conclusion

In this work, we studied how to use machine learning to repair programs from error messages, and developed three key insights:

  1. Error messages provide a crucial signal for learning program repair.
  2. Program-feedback graphs (joint representations of code & error messages) help model the reasoning of repair (e.g. tracking variables that caused the error).
  3. Self-supervised learning allows us to turn freely-available, unlabeled programs (e.g. GitHub code) into useful training examples of program repair.

This work also provides a general framework of “learning from feedback”, which has various applications: editing documents based on comments, learning from users in interactive dialog, etc.

You can check out our full paper (ICML 2020) here and our source code/data on GitHub. You can also find the presentation slides on this work here. If you have questions, please feel free to email us!

Acknowledgments

Many thanks to Percy Liang, as well as members of the P-Lambda lab and the Stanford NLP group for their valuable feedback, and to Sidd Karamcheti and Andrey Kurenkov for edits on this blog post!

  1. Programmers’ Build Errors: A Case Study (at Google). Hyunmin Seo, Caitlin Sadowski, Sebastian Elbaum, Edward Aftandilian, Robert Bowdidge. 2014 

  2. Graph Attention Networks. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengio. 2018. 

  3. DeepFix: Fixing common C language errors by deep learning. Rahul Gupta, Soham Pal, Aditya Kanade, Shirish Shevade. 2017. 

  4. SPoC: Search-based Pseudocode to Code. Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken and Percy Liang. 2019. 

Read More

The Coming Wave of ML Systems

AI and ML products now permeate every aspect of our digital lives – from recommending what to watch, to divining our search intent, to powering the increasingly present virtual assistants in consumer and enterprise settings. While quality improvements are the main focus of traditional ML and AI research, a second and arguably less well understood benefit of machine learning is that it can dramatically reshape the practice of building applications. Viewed with an eye toward generations of compiler, database, and operating systems work, these shifts may inspire new foundational questions for how to build the next generation of AI-powered systems.

Tools are important. They are the scaffolding of the machine learning revolution: the widespread adoption of tools like PyTorch and TensorFlow (building on earlier academic prototypes like Theano and Torch) enabled users to more easily assemble models, thanks to both well-suited domain-specific languages and a rich collection of building blocks. Supported by large companies, these tools have spawned a rich ecosystem to which new building blocks are contributed almost daily, and which even contains tools for deployment (e.g., TFX and TorchScript). Moving from the era of bespoke AI tools to a shared communal foundation has brought stunning productivity gains – on a personal note, it was wild to live through and modestly contribute to.

The stunning success of these platforms has moved the pain point for the engineers who build and maintain ML-powered products. To understand what might be next, perhaps we can take a page from computing history. One view is that the current generation of tools is akin to software libraries, but lacks some of the features that distinguish long-lived computing systems, such as:

  • monitoring and lifecycle management (most ML systems only deal with training monitoring),
  • support for collaboration among all stakeholders across the life of the product (most ML systems lack a model management solution),
  • end-to-end data flow debugging and monitoring (most ML systems don’t manage training-data production pipelines),
  • … and many more.

Understanding this gap has been a driving force behind our recent work. We presented some of our initial ideas in the MLSys keynote and described some of our thoughts for production and research systems.

While we contend that entirely new ways of building these systems are possible, we are at the start of this journey. There is preliminary evidence that there’s something here: this new breed of systems has found its way into industry products used by billions of people every day, including Google [data programming, information extraction], YouTube [multi-modal], multiple Apple products [Overton], Uber [customer support, food recommendation, Ludwig open-sourced], and many more.

The goal of this post is to introduce the Stanford MLSys Seminar Series, in the hope of engaging more of the community around ideas for building these systems. If you’re interested in this area or have a topic you’d like to see covered, let us know!
Please visit the webpage at mlsys.stanford.edu to see our preliminary thoughts and the schedule of our first speakers. We welcome your feedback!

One outcome of the course is to articulate the challenges that we’ve seen, solicit challenges from the community, and try to make the field more accessible for academic research. If we’re lucky, we may just help to spawn the next major subfield of computer science!

Read More