Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning

We consider a problem: Can a machine learn from a few labeled pixels to predict every pixel in a new image?
This task is extremely challenging (see Fig. 1): a single body part can contain visually distinct areas
(e.g., a head consists of eyes, a nose, and a mouth), while different body parts may look similar and indistinguishable
(e.g., upper arms vs. lower arms). It becomes even more difficult if we do not provide any precise locations,
but only the occurrence of body parts in the image. This problem is dubbed weakly-supervised segmentation, where
the goal is to classify every pixel into semantic categories using only partial / weak supervision. There are many
forms of weak annotations which are cheap but imperfect, e.g., image-level tags, bounding boxes, points, and scribbles.
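
The title refers to a pixel-to-segment contrastive objective; as a rough, assumption-laden sketch of what such a loss can look like (an InfoNCE-style formulation, not necessarily the authors' exact loss), each labeled pixel's embedding is pulled toward the embedding of its assigned segment and pushed away from other segments. The tensor names and temperature below are illustrative.

```python
import torch
import torch.nn.functional as F

def pixel_to_segment_contrastive_loss(pixel_emb, segment_emb, pixel_to_segment,
                                      temperature=0.1):
    """InfoNCE-style sketch: each labeled pixel's embedding should be closest
    to the embedding of the segment it is (weakly) assigned to.

    pixel_emb:        (P, D) embeddings of labeled pixels
    segment_emb:      (S, D) embeddings of candidate segments
    pixel_to_segment: (P,)   index of the positive segment for each pixel
    """
    pixel_emb = F.normalize(pixel_emb, dim=1)
    segment_emb = F.normalize(segment_emb, dim=1)
    # Cosine similarity between every pixel and every segment, scaled by temperature.
    logits = pixel_emb @ segment_emb.t() / temperature   # (P, S)
    # Cross-entropy treats the assigned segment as the positive class
    # and all other segments as negatives.
    return F.cross_entropy(logits, pixel_to_segment)
```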

The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games

Recent years have demonstrated the potential of deep multi-agent reinforcement
learning (MARL) to train groups of AI agents that can collaborate to solve complex
tasks – for instance, AlphaStar achieved professional-level performance in the
StarCraft II video game, and OpenAI Five defeated the world champion in Dota 2.
These successes, however, were powered by vast amounts of computational resources:
tens of thousands of CPUs, hundreds of GPUs, and even TPUs were used to collect and train on
a large volume of data. This has motivated the academic MARL community to develop
MARL methods which train more efficiently.



DeepMind’s AlphaStar attained professional level performance in StarCraft II, but required enormous amounts of
computational power to train.

Research on developing more efficient and effective MARL algorithms has focused on off-policy methods – which store and re-use data across multiple policy updates – rather than on-policy algorithms, which collect fresh training data for each update to the agents’ policies. This is largely due to the common belief that off-policy algorithms are much more sample-efficient than on-policy methods.

In this post, we outline our recent publication in which we re-examine many of these assumptions about on-policy algorithms. In particular, we analyze the performance of PPO, a popular single-agent on-policy RL algorithm, and demonstrate that with several simple modifications, PPO achieves strong performance in 3 popular MARL benchmarks while exhibiting a similar sample efficiency to popular off-policy algorithms in the majority of scenarios. We study the impact of these modifications through ablation studies and suggest concrete implementation and tuning practices which are critical for strong performance. We refer to PPO with these modifications as Multi-Agent PPO (MAPPO).
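
For readers unfamiliar with PPO's core update, a minimal sketch of its clipped surrogate objective is shown below. In the multi-agent MAPPO setting the value function typically conditions on a centralized/global state, but the function and variable names here are illustrative rather than the paper's implementation.

```python
import torch

def ppo_clipped_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective (returned as a loss to minimize).

    new_logp:   log pi_theta(a|s) under the current policy
    old_logp:   log pi_theta_old(a|s) under the policy that collected the data
    advantages: advantage estimates (e.g., from GAE)
    """
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Elementwise minimum: the update never benefits from pushing the
    # probability ratio outside the clipping range.
    return -torch.min(unclipped, clipped).mean()
```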

BASALT: A Benchmark for
Learning from Human Feedback

TL;DR: We are launching a NeurIPS competition and benchmark called BASALT: a
set of Minecraft environments and a human evaluation protocol that we hope will
stimulate research and investigation into solving tasks with no pre-specified
reward function, where the goal of an agent must be communicated through
demonstrations, preferences, or some other form of human feedback. Sign up to
participate in the competition!

Learning What To Do by Simulating the Past

Reinforcement learning (RL) has been used successfully for solving tasks which
have a well-defined reward function – think AlphaZero for Go, OpenAI Five for
Dota, or AlphaStar for StarCraft. However, in many practical situations you
don’t have a well-defined reward function. Even a task as seemingly
straightforward as cleaning a room has many subtle cases: should a business
card with a piece of gum be thrown away as trash, or might it have sentimental
value? Should the clothes on the floor be washed, or returned to the
closet? Where are notebooks supposed to be stored? Even when these aspects of
a task have been clarified, translating them into a reward is non-trivial: if you
provide rewards every time you sweep the trash, then the agent might dump the
trash back out so that it can sweep it up again.

Alternatively, we can try to learn a reward function from human feedback about
the behavior of the agent. For example, Deep RL from Human Preferences
learns a reward function from pairwise comparisons of video clips of the
agent’s behavior. Unfortunately, this approach can be very costly:
training a MuJoCo Cheetah to run forward requires a human to provide 750
comparisons.
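
As a reminder of how preference-based reward learning of this kind typically works (a generic Bradley–Terry formulation, not necessarily the exact details of Deep RL from Human Preferences), the reward model is fit so that the clip the human preferred receives higher total predicted reward:

```python
import torch

def preference_loss(reward_model, clip_a, clip_b, human_prefers_a):
    """Bradley-Terry style loss for learning a reward from pairwise comparisons.

    clip_a, clip_b:   tensors of observations for two behavior clips
    human_prefers_a:  float tensor, 1.0 if the human preferred clip A, else 0.0
    """
    # Total predicted reward of each clip under the current reward model.
    return_a = reward_model(clip_a).sum()
    return_b = reward_model(clip_b).sum()
    # Probability that clip A is preferred, under the Bradley-Terry model.
    prob_a = torch.sigmoid(return_a - return_b)
    # Cross-entropy against the human's judgement.
    return -(human_prefers_a * torch.log(prob_a)
             + (1 - human_prefers_a) * torch.log(1 - prob_a))
```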


Instead, we propose an algorithm that can learn a policy without any human
supervision or reward function, by using information implicitly available in
the state of the world. For example, we learn a policy that balances this
Cheetah on its front leg from a single state in which it is balancing.

An EPIC way to evaluate reward functions

Cross-posted from the DeepMind Safety blog.

In many reinforcement learning problems the objective is too complex to be specified procedurally, and a reward function must instead be learned from user data. However, how can you tell if a learned reward function actually captures user preferences? Our method, Equivalent-Policy Invariant Comparison (EPIC), allows one to evaluate a reward function by computing how similar it is to other reward functions. EPIC can be used to benchmark reward learning algorithms by comparing learned reward functions to a ground-truth reward.
It can also be used to validate learned reward functions prior to deployment, by comparing them against reward functions learned via different techniques or data sources.



Figure 1: EPIC compares reward functions $R_a$ and $R_b$ by first mapping them to canonical representatives and then computing the Pearson distance between the canonical representatives on a coverage distribution $\mathcal{D}$. Canonicalization removes the effect of potential shaping, and Pearson distance is invariant to positive affine transformations.
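
As a rough sketch of the second step, assuming both rewards have already been canonicalized and evaluated on a shared batch of transitions sampled from the coverage distribution (the sample-based setup and names are illustrative):

```python
import numpy as np

def pearson_distance(rewards_a, rewards_b):
    """Pearson distance between two (canonicalized) reward functions, evaluated
    on the same batch of transitions drawn from the coverage distribution.
    0 means equivalent up to shaping/scaling; 1 means perfectly anti-correlated.
    """
    rho = np.corrcoef(rewards_a, rewards_b)[0, 1]  # Pearson correlation
    return np.sqrt((1.0 - rho) / 2.0)

# Usage sketch: sample transitions (s, a, s') from the coverage distribution,
# evaluate both canonicalized rewards on them, then compare:
# epic_distance = pearson_distance(canon_reward_a(batch), canon_reward_b(batch))
```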

The Importance of Hyperparameter Optimization for Model-based Reinforcement Learning

Model-based reinforcement learning (MBRL) is a variant of the iterative
learning framework of reinforcement learning that includes a structured
component of the system solely optimized to model the environment
dynamics. Learning a model is broadly motivated by biology, optimal control,
and more – it is grounded in the natural human intuition of planning before acting. This intuitive
grounding, however, results in a more complicated learning process. In this
post, we discuss how model-based reinforcement learning is more susceptible to
hyperparameter tuning and how AutoML can help find well-performing
hyperparameter settings and schedules. Below, on the left is the expected behavior of an
agent maximizing velocity on a “Half Cheetah” robotic task, and on the right is
what our paper finds with hyperparameter tuning.
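
To make the tuning problem concrete, here is a minimal random-search sketch over a few MBRL hyperparameters. The parameter names, ranges, and the user-supplied `train_and_evaluate` callback are purely illustrative assumptions; the paper relies on considerably more sophisticated AutoML tooling than plain random search.

```python
import random

# Hypothetical search space over a few MBRL hyperparameters.
SEARCH_SPACE = {
    "model_lr":      [1e-4, 3e-4, 1e-3],  # dynamics-model learning rate
    "ensemble_size": [3, 5, 7],           # number of dynamics models in the ensemble
    "plan_horizon":  [10, 20, 30],        # planning horizon
}

def random_search(train_and_evaluate, num_trials=20, seed=0):
    """train_and_evaluate(config) -> average return, supplied by the user."""
    rng = random.Random(seed)
    best_config, best_return = None, float("-inf")
    for _ in range(num_trials):
        config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        score = train_and_evaluate(config)
        if score > best_return:
            best_config, best_return = config, score
    return best_config, best_return
```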





Pretrained Transformers as Universal Computation Engines

Transformers have been successfully applied to a wide variety of modalities:
natural language, vision, protein modeling, music, robotics, and more. A common
trend with large models is to train a transformer on a large amount of
training data, and then finetune it on a downstream task. This enables the
models to utilize generalizable high-level embeddings trained on a large
dataset to avoid overfitting to a small task-relevant dataset.

We investigate a new setting where, rather than transferring the high-level
embeddings, we transfer the intermediate computation modules – instead
of pretraining on a large image dataset and finetuning on a small image
dataset, we might pretrain on a large language dataset and finetune on
a small image dataset. Unlike conventional ideas that suggest the attention
mechanism is specific to the training modality, we find that the self-attention
layers can generalize to other modalities without finetuning.
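
The core recipe can be sketched in a few lines of generic PyTorch: keep the pretrained self-attention blocks frozen and train only a new input projection and output head on the target modality (the paper additionally finetunes small components such as layer norms). The module names below are placeholders, not the paper's code.

```python
import torch.nn as nn

class FrozenTransformerClassifier(nn.Module):
    """Reuse pretrained transformer blocks on a new modality, training only a
    new input projection and output head (a generic sketch; `pretrained_blocks`
    is assumed to map (batch, seq, hidden) -> (batch, seq, hidden))."""

    def __init__(self, pretrained_blocks, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.input_proj = nn.Linear(in_dim, hidden_dim)  # trained from scratch
        self.blocks = pretrained_blocks                  # pretrained; frozen below
        self.head = nn.Linear(hidden_dim, num_classes)   # trained from scratch
        for param in self.blocks.parameters():
            param.requires_grad = False                  # freeze self-attention layers

    def forward(self, x):                                # x: (batch, seq, in_dim)
        h = self.blocks(self.input_proj(x))
        return self.head(h.mean(dim=1))                  # pool over the sequence
```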

Maximum Entropy RL (Provably) Solves Some Robust RL Problems

Nearly all real-world applications of reinforcement learning involve some degree of shift between the training environment and the testing environment. However, prior work has observed that even small shifts in the environment cause most RL algorithms to perform markedly worse.
As we aim to scale reinforcement learning algorithms and apply them in the real world, it is increasingly important to learn policies that are robust to changes in the environment.

Broadly, prior approaches to handling distribution shift in RL aim to maximize performance in either the average case or the worst case. Methods such as domain randomization train a policy on a distribution of environments, and optimize the average performance of the policy on these environments. While this approach has been successfully applied to a number of areas
(e.g., self-driving cars, robot locomotion and manipulation),
its success rests critically on the design of the distribution of environments.
Moreover, policies that do well on average are not guaranteed to get high reward on every environment. The policy that gets the highest reward on average might get very low reward on a small fraction of environments.

The second set of approaches, typically referred to as robust RL, focuses on the worst-case scenarios. The aim is to find a policy that gets high reward on every environment within some set. Robust RL can equivalently be viewed as a two-player game between the policy and an environment adversary. The policy tries to get high reward, while the environment adversary tries to tweak the dynamics and reward function of the environment so that the policy gets lower reward. One important property of the robust approach is that, unlike domain randomization, it is invariant to the ratio of easy and hard tasks. Whereas robust RL always evaluates a policy on the most challenging tasks, domain randomization will predict that the policy is better if it is evaluated on a distribution of environments with more easy tasks.
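
In symbols (using generic notation rather than the paper's exact formulation), domain randomization maximizes expected return averaged over a distribution $q$ of environments, while robust RL maximizes the worst-case return over an uncertainty set $\mathcal{P}$:

$$\max_{\pi} \; \mathbb{E}_{p \sim q} \, \mathbb{E}_{\pi, p}\Big[\sum_t \gamma^t r_p(s_t, a_t)\Big] \qquad \text{vs.} \qquad \max_{\pi} \; \min_{p \in \mathcal{P}} \, \mathbb{E}_{\pi, p}\Big[\sum_t \gamma^t r_p(s_t, a_t)\Big],$$

where each environment $p$ comes with its own dynamics and reward $r_p$ that the adversary may tweak.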

Self-Supervised Policy Adaptation during Deployment






Our method learns a task in a fixed, simulated environment and quickly adapts
to new environments (e.g. the real world) solely from online interaction during
deployment.
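
A minimal sketch of the general idea, test-time adaptation of a shared encoder with a self-supervised auxiliary objective, is given below. The particular auxiliary loss, the gym-style discrete-action interface, and the update schedule are illustrative assumptions rather than the method's exact recipe.

```python
import torch

def adapt_during_deployment(encoder, policy_head, ss_head, ss_loss_fn,
                            env, optimizer, num_steps=1000):
    """Keep acting with the frozen policy head while updating the shared encoder
    using a self-supervised loss on observations seen during deployment."""
    obs = env.reset()
    for _ in range(num_steps):
        # Act without gradients; no reward signal is available or used here.
        with torch.no_grad():
            action = policy_head(encoder(obs)).argmax(dim=-1)

        # Self-supervised update on the observation we just received
        # (e.g., predicting an image transformation applied to it).
        loss = ss_loss_fn(ss_head, encoder, obs)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        obs, _, done, _ = env.step(action)
        if done:
            obs = env.reset()
```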

The ability of humans to generalize their knowledge and experiences to new
situations is remarkable, yet poorly understood. For example, imagine a human
driver that has only ever driven around their city in clear weather. Even
though they have never encountered true diversity in driving conditions, they have
acquired the fundamental skill of driving, and can adapt reasonably quickly to
driving in neighboring cities, in rainy or windy weather, or even to driving a
different car, without much practice or additional driving lessons. While
humans excel at adaptation, building intelligent systems with common-sense
knowledge and the ability to quickly adapt to new situations is a long-standing
problem in artificial intelligence.

The Successor Representation, $\gamma$-Models, and Infinite-Horizon Prediction




Standard single-step models have a horizon of one. This post describes a method for training predictive dynamics models in continuous state spaces with an infinite, probabilistic horizon.

Reinforcement learning algorithms are frequently categorized by whether they predict future states at any point in their decision-making process. Those that do are called model-based, and those that do not are dubbed model-free. This classification is so common that we mostly take it for granted these days; I am guilty of using it myself. However, this distinction is not as clear-cut as it may initially seem.

In this post, I will talk about an alternative view that emphasizes the mechanism of prediction instead of the content of prediction. This shift in focus brings into relief a space between model-based and model-free methods that contains exciting directions for reinforcement learning. The first half of this post describes some of the classic tools in this space, including
generalized value functions and the successor representation. The latter half is based on our recent paper about infinite-horizon predictive models, for which code is available here.
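
For reference, the tabular successor representation and the discounted state-occupancy distribution that generalizes it to continuous spaces (which a $\gamma$-model learns to sample from) can be written, using standard definitions whose indexing conventions may differ slightly from the paper's, as

$$M^{\pi}(s, s') = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{1}(s_t = s') \,\Big|\, s_0 = s\Big], \qquad \mu^{\pi}(s' \mid s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \Pr_{\pi}(s_t = s' \mid s_0 = s),$$

so that $\mu^{\pi}$ is a proper probability distribution and $M^{\pi}(s, s') = \mu^{\pi}(s' \mid s) / (1 - \gamma)$.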