Highlights from CHI 2023

The ways in which people are able to interact with technologies can have a profound effect on a technology’s utility and adoptability. Building computing tools and services around people’s natural styles of work, communication, and play can give technology the value it needs to have meaningful impact. For decades, human-computer interaction (HCI) has examined the relationship between people and computers to help maximize the capabilities of each across a range of experiences and situations.

The ACM CHI Conference on Human Factors in Computing Systems (CHI) is a renowned meeting ground for top talent in the HCI field and a showcase for some of its most compelling work. Hosted April 23 through April 28, this year’s conference drew more than 4,500 participants from 79 countries. Contributions from Microsoft researchers and their collaborators demonstrated the breadth of work inspired by the myriad and diverse ways people use computing today and will in the future.

Check out a few highlights from this year’s conference below, including researchers’ efforts to better understand the role of wellbeing in work, to augment memory through our sense of smell, and to bridge the gap between programmers and code-generating models, which received honorable mention at the conference.

“What It Wants Me To Say”: Bridging the Abstraction Gap Between End-User Programmers and Code-Generating Large Language Models
CHI 2023 Honorable Mention

Michael Xieyang Liu, Advait Sarkar, Carina Negreanu, Ben Zorn, Jack Williams, Neil Toronto, Andy Gordon

Programming languages are an extremely powerful form of user interface. They also happen to be extremely difficult to learn, especially for non-expert end-user programmers who lack training in computing. What if end-user programmers could instead use a natural language they already know? This prospect can be realized through large language models (LLMs): deep neural networks using the transformer architecture, trained on large corpora, and fine-tuned to generate code from natural language. Despite impressive benchmark performance, LLMs are beset with issues in practical use. Lab and field studies have shown that the mapping between natural language and code is poorly understood, that generated code can contain subtle bugs, and that generated code can be difficult to verify.

In their paper, researchers consider the specific problem of abstraction matching: when the user has well-formed intent, how do they select an utterance from the near infinite space of naturalistic utterances that they believe the system will reliably map to a satisfactory solution? This involves “matching” the utterance to the right level of “abstraction” by specifying the utterance at a level of granularity and detail that matches the set of actions the system can take and selecting suitable words and grammar.

Workplace Rhythm Variability and Emotional Distress in Information Workers

Subigya Kumar Nepal, Javier Hernandez, Judith Amores, Mehrab Bin Morshed, Robert Lewis, Hemma Prafullchandra, Mary Czerwinski

Regularity in daily activities has been linked to positive wellbeing outcomes, but previous studies have mainly focused on clinical populations and traditional daily activities such as sleep and exercise. This research extends prior work by examining the regularity of both self-reported and digital activities of 49 information workers in a four-week naturalistic study. Findings suggest that greater variability in self-reported mood, job demands, lunch time, and sleep quality may be associated with increased stress, anxiety, and depression. However, when it comes to digital activity–based measures, greater variability in rhythm is associated with reduced emotional distress. This study expands our understanding of workers and the potential insights that can be gained from analyzing technology interactions and wellbeing.

Olfactory Wearables for Targeted Memory Reactivation

Judith Amores, Nirmita Mehra, Bjoern Rasch, Pattie Maes

This paper investigates how a smartphone-controlled olfactory wearable might improve memory recall. Researchers conducted a within-subjects experiment with 32 participants using the device and not using the device (control). In the experimental condition, bursts of odor were released during visuo-spatial memory navigation tasks, which also had a language learning component, and rereleased during sleep the following night in the subjects’ home. The researchers found that compared with control, there was an improvement in memory performance when using the scent wearable in memory tasks that involved walking in a physical space. Furthermore, participants recalled more objects and translations when re-exposed to the same scent during the recall test in addition to during sleep. These effects were statistically significant, and in the object recall task, they also persisted for more than a week. This experiment demonstrates a potential practical application of olfactory interfaces that can interact with a user during wake, as well as sleep, to support memory.

AdHocProx: Sensing Mobile, Ad-Hoc Collaborative Device Formations using Dual Ultra-Wideband Radios

Richard Li, Teddy Seyed, Nicolai Marquardt, Eyal Ofek, Steve Hodges, Mike Sinclair, Hugo Romat, Michel Pahud, Jatin Sharma, William A. S. Buxton, Ken Hinckley, Nathalie Henry Riche

In their paper, researchers present AdHocProx, a system that uses device-relative, inside-out sensing to augment co-located collaboration across multiple devices without recourse to externally anchored beacons or even reliance on Wi-Fi connectivity.

AdHocProx achieves this via sensors, including dual ultra-wideband (UWB) radios for sensing distance and angle to other devices in dynamic, ad-hoc arrangements and capacitive grip to determine where the user’s hands hold the device and to partially correct for the resulting UWB signal attenuation. All spatial sensing and communication take place via the side-channel capability of the UWB radios, suitable for small-group collaboration across up to four devices (eight UWB radios).

Together, these sensors detect proximity and natural, socially meaningful device movements to enable contextual interaction techniques. Researchers find that AdHocProx can obtain 95 percent accuracy recognizing various ad-hoc device arrangements in an offline evaluation, with participants particularly appreciative of interaction techniques that automatically leverage proximity-awareness and relative orientation among multiple devices.

Escapement: A Tool for Interactive Prototyping with Video via Sensor-Mediated Abstraction of Time

Molly Jane Nicholas, Nicolai Marquardt, Michel Pahud, Nathalie Henry Riche, Hugo Romat, Christopher Collins, David Ledo, Rohan Kadekodi, Badrish Chandramouli, Ken Hinckley

This paper introduces Escapement, a video prototyping tool that introduces a powerful new concept for prototyping screen-based interfaces by flexibly mapping sensor values to dynamic playback control of videos. This recasts the time dimension of video mockups as sensor-mediated interaction.

This abstraction of time as interaction, which the researchers dub video-escapement prototyping, empowers designers to rapidly explore and viscerally experience direct touch or sensor-mediated interactions across one or more device displays. The system affords cross-device and bidirectional remote (telepresent) experiences via cloud-based state sharing across multiple devices. This makes Escapement especially potent for exploring multi-device, dual-screen, or remote-work interactions for screen-based applications. Researchers share the results of observations of long-term usage of video-escapement techniques with experienced interaction designers and articulate design choices for supporting a reflective, iterative, and open-ended creative design process.

Your Mileage May Vary: Case Study of a Robotic Telepresence Pilot Roll-out for a Hybrid Knowledge Work Organization

Andriana Boudouraki, Joel E. Fischer, Stuart Reeves, Sean Rintel

Organizations wishing to maintain employee satisfaction for hybrid collaboration need to explore flexible solutions that provide value for both remote and on-site employees. This case study reports on the roll-out of a telepresence robot pilot at Microsoft Research Cambridge to test whether robots would provide enjoyable planned and unplanned encounters between remote and on-site employees. Researchers describe the work that was undertaken to prepare for the roll-out, including the occupational health and safety assessment, systems for safety and security, and the information for employees on safe and effective use practices. The pilot ended after three months, and robot use has been discontinued after weighing the opportunities against low adoption and other challenges. The researchers discuss the pros and cons within this organizational setting and make suggestions for future work and roll-outs.

Focus Time for Wellbeing and Work Engagement of Information Workers 

Koustuv Saha, Shamsi Iqbal 

Having little time for focused work is a major challenge in information work. While research has explored computing-assisted user-facing solutions for protecting time for focused work, there’s limited empirical evidence about the effectiveness of these features on wellbeing and work engagement. To address this problem, researchers study the effects of automatically scheduling time for focused work on people’s work calendars using the “focus time” feature on Outlook calendars. The researchers conducted an experimental study over six weeks with 15 treatment and 10 control participants, who responded to survey questions on wellbeing and work engagement throughout the study. The researchers found that the treatment participants showed higher wellbeing, including increased excitement, relaxation, and satisfaction, and decreased anger, frustration, tiredness, and stress. The researchers study the needs, benefits, and challenges of scheduling focus time and discuss the importance of and design recommendations for enabling mechanisms and tools supporting focused work.

The post Highlights from CHI 2023 appeared first on Microsoft Research.

Research Focus: Week of May 8, 2023

Microsoft Research Focus 15 | Week of May 8, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

AWARD

Microsoft’s danah boyd awarded MIT’s Morison Prize

danah boyd, a partner researcher at Microsoft Research, has been awarded MIT’s Morison Prize in Science, Technology, and Society, for outstanding work combining humanistic values with effectiveness in the world of practical affairs, particularly in science and technology.

Dr. boyd, who is also a Distinguished Visiting Professor at Georgetown University, is currently conducting a multi-year ethnographic study of the U.S. census to understand how data are made legitimate. Her previous studies have focused on media manipulation, algorithmic bias, privacy practices, social media, and teen culture. 

To learn more, see the Microsoft Research Summit presentation Statistical Imaginaries: An Ode to Responsible Data Science or the publication Differential Perspectives: Epistemic Disconnects Surrounding the U.S. Census Bureau’s Use of Differential Privacy.


AWARD

Microsoft’s Nicole Immorlica receives 2023 SIGecom Test of Time Award

Nicole Immorlica, a Senior Principal Researcher with Microsoft Research New England, has been awarded the 2023 SIGecom Test of Time Award for her work on a 2005 paper on matching markets. The award from the Association for Computing Machinery (ACM) recognizes “an influential paper or series of papers published between ten and twenty-five years ago that has significantly impacted research or applications exemplifying the interplay of economics and computation.”

In the award-winning paper, “Marriage, honesty, and stability,” Immorlica and a co-author explored centralized two-sided markets, such as the medical residency market, matching participants by running a stable marriage algorithm. While no matching mechanism based on a stable marriage algorithm can guarantee ‘truthfulness’ as a dominant strategy, the paper showed that in certain probabilistic settings, truthfulness is the best strategy for the participants.

AWARD

Microsoft’s Lorin Crawford named 2023 COPSS Emerging Leader

Lorin Crawford, a principal researcher at Microsoft Research New England, has been named a 2023 COPSS Emerging Leader by the Committee of Presidents of Statistical Societies. The award announcement cited Crawford’s path-breaking research combining theory and methods of mathematics, statistics and computing to generate new knowledge and insight about the genetic basis of disease, and exceptional mentoring of students from multiple scientific disciplines.

The award recognizes the important role of early-career statistical scientists in shaping the future of their discipline. The selection criteria are designed to highlight contributions in areas not traditionally recognized by other early-career awards in the statistical sciences.

Crawford, who is also a faculty member at Brown University’s School of Public Health, focuses on developing novel and efficient algorithms that address complex problems in quantitative genetics, cancer pharmacology, molecular genomics, and geometric morphometrics.


AWARD

Microsoft researchers receive Test of Time award for personalized news recommendation work

A paper co-authored by two Microsoft researchers has received a 2023 Seoul Test of Time Award from the International World Wide Web Conference Committee (IW3C2). The 2010 paper, A Contextual-Bandit Approach to Personalized News Article Recommendation, was written by John Langford and Robert Schapire, along with two industry colleagues. The authors proposed a new approach for personalized recommendation using contextual bandit algorithms. According to the IW3C2, the paper now has more than 2,730 citations and has become foundational research in the area of recommendation systems.

The award announcement also states: “The paper addressed fundamental challenges in real-world recommendation systems via computationally efficient algorithms grounded in learning theory. It also showed that recommendation algorithms can be reliably evaluated offline, enabling algorithm selection without operational impact, and that contextual bandits can yield significant gains in user engagement.”


NEW RESEARCH

A Frequency Domain Approach to Predict Power System Transients

The dynamics of power grids are governed by a large number of nonlinear differential and algebraic equations (DAEs). To safely run the system, operators need to check that the states described by these DAEs stay within prescribed limits after various potential faults. However, current numerical solvers of DAEs are often too slow for real-time system operations. In addition, detailed system parameters are often not exactly known. Machine learning approaches have been proposed to reduce the computational efforts, but existing methods generally suffer from overfitting and failures to predict unstable behaviors.

In a new paper: A Frequency Domain Approach to Predict Power System Transients, Microsoft researchers propose a novel framework to predict power system transients by learning in the frequency domain. The intuition is that although the system behavior is complex in the time domain, relatively few dominant modes exist in the frequency domain. Therefore, the researchers learn to predict by constructing neural networks with Fourier transform and filtering layers. System topology and fault information are encoded by taking a multi-dimensional Fourier transform, allowing researchers to leverage the fact that the trajectories are sparse both in time and spatial frequencies. This research shows that the proposed approach does not need detailed system parameters, greatly speeds up prediction computations and is highly accurate for different fault types.
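The intuition that a trajectory has few dominant modes in the frequency domain can be illustrated with a minimal NumPy sketch. This is not the researchers' neural network: the toy two-mode trajectory, the sampling grid, and the helper name `keep_dominant_modes` are all assumptions made for illustration. The sketch only shows that keeping a handful of Fourier coefficients can reconstruct such a signal.

```python
import numpy as np


def keep_dominant_modes(signal, num_modes):
    """Zero all but the `num_modes` largest-magnitude real-FFT coefficients."""
    spectrum = np.fft.rfft(signal)
    # Indices of the strongest frequency components.
    keep = np.argsort(np.abs(spectrum))[-num_modes:]
    filtered = np.zeros_like(spectrum)
    filtered[keep] = spectrum[keep]
    return np.fft.irfft(filtered, n=len(signal))


# Hypothetical trajectory: two oscillation modes, 1000 samples over 10 seconds.
t = np.linspace(0.0, 10.0, 1000, endpoint=False)
trajectory = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.sin(2 * np.pi * 2.0 * t)

# Two frequency-domain coefficients are enough to rebuild all 1000 samples.
reconstruction = keep_dominant_modes(trajectory, num_modes=2)
relative_error = np.linalg.norm(trajectory - reconstruction) / np.linalg.norm(trajectory)
```

Real post-fault trajectories are damped and noisy, so their spectra are only approximately sparse; the number of modes to keep would then be a tuning choice.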


NEW RESEARCH

Inference with Reference: Lossless Acceleration of Large Language Models

The growing use of large foundation models like GPT-3.5/4 for real-world applications has raised concerns about high deployment costs. While general methodologies such as quantization, pruning, compression, and distillation help reduce costs, output tokens must still be decoded sequentially, one by one, at test time, which poses significant challenges for deploying LLMs at scale.

In a new paper: Inference with Reference: Lossless Acceleration of Large Language Models, Microsoft researchers study accelerating LLM inference by improving the efficiency of autoregressive decoding. In multiple real-world applications, this research shows that an LLM’s output tokens often come from its context. For example, in a retrieval-augmented generation scenario for a search engine, an LLM’s context usually includes relevant documents retrieved from an external corpus as reference according to a query, and its output usually contains many text spans found in the reference (i.e., retrieved documents). Motivated by this observation, the researchers propose an LLM accelerator (LLMA) to losslessly speed inference with references. Its improved computational parallelism allows LLMA to achieve over 2x speed-up for LLMs, with generation results identical to those of greedy decoding, in many practical generation scenarios where significant overlap between in-context reference and outputs exists. The researchers are collaborating with the Bing search team to explore integrating this technique into snippet/caption generation, Bing chat, and other potential scenarios.
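The copy-then-verify idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the LLMA implementation: `toy_next_token` stands in for a greedy LLM, and `find_draft`, `match_len`, and `copy_len` are hypothetical names. The inner loop emulates what a real model would score in a single parallel forward pass, so `steps` counts the decoding steps that a reference-aware decoder would need.

```python
PROMPT = "how tall is the eiffel tower ?".split()
REFERENCE = "the eiffel tower is 330 metres tall and stands in paris".split()
ANSWER = "the eiffel tower is 330 metres tall .".split()


def toy_next_token(tokens):
    """Stand-in for a greedy LLM: deterministically emits ANSWER token by token."""
    position = len(tokens) - len(PROMPT)
    return ANSWER[position] if position < len(ANSWER) else "<eos>"


def greedy_decode(next_token, prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def find_draft(tokens, reference, match_len, copy_len):
    """If the last `match_len` tokens occur in the reference, propose what follows."""
    suffix = tokens[-match_len:]
    for i in range(len(reference) - match_len + 1):
        if reference[i:i + match_len] == suffix:
            return reference[i + match_len:i + match_len + copy_len]
    return []


def decode_with_reference(next_token, prompt, reference, max_new_tokens,
                          match_len=2, copy_len=4):
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        draft = find_draft(tokens, reference, match_len, copy_len)
        steps += 1  # one (emulated) parallel verification pass
        # Keep the model's own prediction at every position; stop at the
        # first position where it disagrees with the copied draft.
        for candidate in draft + [None]:
            predicted = next_token(tokens)
            tokens.append(predicted)
            done = len(tokens) - len(prompt) >= max_new_tokens
            if predicted != candidate or done:
                break
    return tokens[len(prompt):], steps


greedy_output = greedy_decode(toy_next_token, PROMPT, len(ANSWER))
fast_output, steps = decode_with_reference(toy_next_token, PROMPT, REFERENCE, len(ANSWER))
```

Because every emitted token is the model's own greedy prediction, `fast_output` equals `greedy_output`; the draft only determines how many tokens can be confirmed per step.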


NEW RESEARCH

High-throughput ab initio reaction mechanism exploration in the cloud with automated multi-reference validation

Quantum chemical calculations on atomistic systems have evolved into a standard approach to studying molecular matter. But these calculations often involve a significant amount of manual input and expertise. Most of these calculations could be automated, alleviating the need for expertise in software and hardware accessibility.

In a new paper: High-throughput ab initio reaction mechanism exploration in the cloud with automated multi-reference validation, researchers from Microsoft present the AutoRXN workflow, an automated workflow for exploratory high-throughput electronic structure calculations of molecular systems.

This workflow (i) uses density functional theory methods to deliver minimum and transition-state structures and corresponding energies and properties, (ii) launches coupled cluster calculations for optimized structures to provide more accurate energy and property estimates, and (iii) evaluates multi-reference diagnostics to back check the coupled cluster results and subjects them to automated multi-configurational calculations for potential multi-configurational cases.

All calculations take place in a cloud environment and support massive computational campaigns. Key features of all components of the AutoRXN workflow are autonomy, stability, and minimum operator interference.

The paper was recently published in the Journal of Chemical Physics.

The post Research Focus: Week of May 8, 2023 appeared first on Microsoft Research.

Using generative AI to imitate human behavior

This research was accepted by the 2023 International Conference on Learning Representations (ICLR), which is dedicated to the advancement of the branch of artificial intelligence generally referred to as deep learning.

Figure 1: Overview of our method, providing a side-by-side comparison of text-to-image diffusion with observation-to-action diffusion. On the right are diagrams of the different denoising architectures tested, as well as an illustration of the sampling schemes explored.

Diffusion models have emerged as a powerful class of generative AI models. They have been used to generate photorealistic images and short videos, compose music, and synthesize speech. And their uses don’t stop there. In our new paper, Imitating Human Behaviour with Diffusion Models, we explore how they can be used to imitate human behavior in interactive environments.

This capability is valuable in many applications. For instance, it could help automate repetitive manipulation tasks in robotics, or it could be used to create humanlike AI in video games, which could lead to exciting new game experiences—a goal particularly dear to our team.

We follow a machine learning paradigm known as imitation learning (more specifically behavior cloning). In this paradigm, we are provided with a dataset containing observations a person saw, and the actions they took, when acting in an environment, which we would like an AI agent to mimic. In interactive environments, at each time step, an observation \(o_t\) is received (e.g. a screenshot of a video game), and an action \(a_t\) is then selected (e.g. the mouse movement). With this dataset of many \(o\)’s and \(a\)’s performed by some demonstrator, a model \(\pi\) could try to learn this mapping of observation-to-action, \(\pi(o) \to a\).

When the actions are continuous, training a model to learn this mapping introduces some interesting challenges. In particular, what loss function should be used? A simple choice is mean squared error, as often used in supervised regression tasks. In an interactive environment, this objective encourages an agent to learn the average of all the behaviors in the dataset.

If the goal of the application is to generate diverse human behaviors, the average might not be very useful. After all, humans are stochastic (they act on whims) and multimodal creatures (different humans might make different decisions). Figure 2 depicts the failure of mean squared error to mimic the true action distribution (marked in yellow) when it is multimodal. It also includes several other popular choices for the loss function when doing behavior cloning.

Figure 2: This toy example (based on an arcade claw game) shows an action space with two continuous action dimensions. Here the demonstration distribution is marked in yellow—it is both multimodal and has correlations between action dimensions. Diffusion models offer a good imitation of the full diversity in the dataset.
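The averaging effect of mean squared error can be reproduced with a small NumPy sketch. The one-dimensional, two-mode action data below is a hypothetical stand-in for demonstrations (it is not the claw-game data from the figure), and the prediction is a single constant rather than a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations for one observation:
# half of the demonstrators move left (-1), half move right (+1).
modes = np.repeat([-1.0, 1.0], 5_000)
actions = modes + 0.05 * rng.standard_normal(modes.size)

# The constant prediction that minimizes mean squared error is the sample mean.
mse_prediction = actions.mean()

# The MSE-optimal action sits between the modes, where no demonstrator ever acted.
distance_to_nearest_mode = min(abs(mse_prediction - 1.0), abs(mse_prediction + 1.0))
```

Here `mse_prediction` lands near zero, roughly one full unit away from either mode, which is the failure depicted for mean squared error when the action distribution is multimodal.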

Ideally, we’d like our models to learn the full variety of human behaviors. And this is where generative models help. Diffusion models are a specific class of generative model that are both stable to train and easy to sample from. They have been very successful in the text-to-image domain, which shares this one-to-many challenge—a single text caption might be matched by multiple different images.
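As a rough illustration of how diffusion sampling can recover more than one mode, here is a minimal NumPy sketch of standard DDPM ancestral sampling on a one-dimensional toy problem. It is not the paper's agent: the data distribution (actions of exactly -1 or +1 with equal probability), the linear noise schedule, and the function name are assumptions, and the learned denoising network is replaced by the analytic optimal denoiser for this toy distribution.

```python
import numpy as np


def sample_two_mode_diffusion(num_samples, num_steps=1000, seed=0):
    """DDPM ancestral sampling for data equally split between -1 and +1."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)  # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(num_samples)  # start from pure noise
    for t in range(num_steps - 1, -1, -1):
        a_bar = alpha_bars[t]
        # Analytic posterior mean E[x_0 | x_t] for the two-point distribution;
        # a trained denoising network would be used here in practice.
        x0_hat = np.tanh(np.sqrt(a_bar) * x / (1.0 - a_bar))
        eps_hat = (x - np.sqrt(a_bar) * x0_hat) / np.sqrt(1.0 - a_bar)
        mean = (x - betas[t] / np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(num_samples) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


samples = sample_two_mode_diffusion(1000)
fraction_right = float(np.mean(samples > 0))
```

Each sample ends up at one of the two modes rather than at their average, and both modes appear across the batch.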

Our work adapts ideas that have been developed for text-to-image diffusion models to this new paradigm of observation-to-action diffusion. Figure 1 highlights some differences. One obvious point is that the object we are generating is now a low-dimensional action vector (rather than an image). This calls for a new design for the denoising network architecture. In image generation, heavy convolutional U-Nets are in vogue, but these are less applicable for low-dimensional vectors. Instead, we designed and tested three different architectures shown in Figure 1.

In observation-to-action models, sampling a single bad action during an episode can throw an agent off course, and hence we were motivated to develop sampling schemes that would more reliably return good action samples (also shown in Figure 1). This problem is less severe in text-to-image models, since users often have the luxury of selecting a single image from among several generated samples and ignoring any bad images. Figure 3 shows an example of this, where a user might cherry-pick their favorite, while ignoring the one with nonsensical text.

Figure 3: Four samples from a text-to-image diffusion model from Bing (note this is not our own work), using the prompt “A cartoon style picture of people playing with arcade claw machine”.

We tested our diffusion agents in two different environments. The first, a simulated kitchen environment, is a challenging high-dimensional continuous control problem where a robotic arm must manipulate various objects. The demonstration dataset is collected from a variety of humans performing various tasks in differing orders. Hence there is rich multimodality in the dataset.

We found that diffusion agents outperformed baselines in two aspects. 1) The diversity of behaviors they learned was broader, and closer to the human demonstrations. 2) The rate of task completion (a proxy for reward) was better.

The videos below highlight the ability of diffusion to capture multimodal behavior. Starting from the same initial conditions, we roll out the diffusion agent eight times, and each time it selects a different sequence of tasks to complete.

Eight short clips, each showing a robotic arm interacting with a kitchen environment and performing a specific task.

The second environment tested was a modern 3D video game, Counter-Strike. We refer interested readers to the paper for results.

In summary, our work has demonstrated how exciting recent advances in generative modeling can be leveraged to build agents that can behave in humanlike ways in interactive environments. We’re excited to continue exploring this direction – watch this space for future work.

For more detail on our work, please see our paper and code repo.

The post Using generative AI to imitate human behavior appeared first on Microsoft Research.

Inferring rewards through interaction

This research was accepted by the 2023 International Conference on Learning Representations (ICLR), which is dedicated to the advancement of the branch of artificial intelligence generally referred to as deep learning.

A diagram in which five newspaper icons are lined up in the middle, the first of which is labeled a. An arrow points from the newspaper to an icon of a person above it. The person is labeled x and has a mouse click icon next to it and a thought bubble with the words “I like this!” that’s labeled r. An arrow points from the mouse click icon to a box labeled “recommender system” under the newspapers.

Reinforcement learning (RL) hinges on the power of rewards, driving agents (the models doing the learning) to explore and learn valuable actions. The feedback received through rewards shapes their behavior, culminating in effective policies. Yet, crafting reward functions is a complex, laborious task, even for experts. A more appealing option, particularly for the people ultimately using systems that learn from feedback over time, is an agent that can automatically infer a reward function. The interaction-grounded learning (IGL) paradigm from Microsoft Research enables agents to infer rewards through the very process of interaction, utilizing diverse feedback signals rather than explicit numeric rewards. Although there is no explicit reward signal, the feedback depends on a binary latent reward, and the agent learns a policy that maximizes this unseen latent reward using only environmental feedback.

In our paper “Personalized Reward Learning with Interaction-Grounded Learning,” which we’re presenting at the 2023 International Conference on Learning Representations (ICLR), we propose a novel approach to solve for the IGL paradigm: IGL-P. IGL-P is the first IGL strategy for context-dependent feedback, the first use of inverse kinematics as an IGL objective, and the first IGL strategy for more than two latent states. This approach provides a scalable alternative to current personalized agent learning methods, which can require expensive high-dimensional parameter tuning, handcrafted rewards, and/or extensive and costly user studies.

IGL-P in the recommender system setting

IGL-P is particularly useful for interactive learning applications such as recommender systems. Recommender systems help people navigate increasing volumes of content offerings by providing personalized content suggestions. However, without explicit feedback, recommender systems can’t detect for certain whether a person enjoyed the displayed content. To compensate, modern recommender systems equate implicit feedback signals with user satisfaction. Despite the popularity of this approach, implicit feedback is not the true reward. Even the click-through rate (CTR) metric, the gold standard for recommender systems, is an imperfect reward, and its optimization naturally promotes clickbait.

Interaction-grounded learning (IGL) for the recommender system setting. The recommender system receives features describing a person (x), recommends an item (a), and observes implicit user feedback (y), which is dependent on the latent reward (r) but not r itself, to learn how to better recommend personalized content to the individual.

This problem has led to the handcrafting of reward functions with various implicit feedback signals in modern recommender systems. Recommendation algorithms will use hand-defined weights for different user interactions, such as replying to or liking content, when deciding how to recommend content to different people. This fixed weighting of implicit feedback signals might not generalize across a wide variety of people, and thus a personalized learning method can improve user experience by recommending content based on user preferences.
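A fixed, hand-defined weighting of this kind might look like the following Python sketch. The signal names, weight values, and the two example users are hypothetical and are not taken from any real recommender system; the point is only that one weighting is applied to everyone.

```python
# Hypothetical hand-defined weights for implicit feedback signals.
HAND_DEFINED_WEIGHTS = {"click": 0.2, "like": 1.0, "reply": 0.5}


def handcrafted_reward(feedback, weights=HAND_DEFINED_WEIGHTS):
    """Weighted sum of implicit feedback counts, identical for every user."""
    return sum(weights.get(signal, 0.0) * count for signal, count in feedback.items())


# Two hypothetical users who both enjoyed the same item but show it differently.
frequent_liker = {"click": 1, "like": 1}
silent_reader = {"click": 1}

liker_reward = handcrafted_reward(frequent_liker)
reader_reward = handcrafted_reward(silent_reader)
```

Under these assumed weights the silent reader's satisfaction is scored far lower than the frequent liker's, even though both enjoyed the item, which is the kind of mismatch a personalized reward is meant to address.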

The choice of reward function is further complicated by differences in how people interact with recommender systems. A growing body of work shows that recommender systems don’t provide consistently good recommendations across demographic groups. Previous research suggests that this inconsistency has its roots in user engagement styles. In other words, a reward function that might work well for one type of user might (and often does) perform poorly for another type of user who interacts with the platform differently. For example, older adults have been found to click on clickbait more often. If the CTR is used as an objective, this group of users will receive significantly more clickbait recommendations than the general public, resulting in higher rates of negative user experiences and leading to user distrust in the recommender system.

IGL-P provides a novel approach to optimize content for latent user satisfaction—that is, rewards that a model doesn’t have direct access to—by learning personalized reward functions for different people rather than requiring a fixed, human-designed reward function. IGL-P learns representations of diverse user communication modalities and how these modalities depend on the underlying user satisfaction. It assumes that people may communicate their feedback in different ways but a given person expresses (dis)satisfaction or indifference to all content in the same way. This enables the use of inverse kinematics toward a solution for recovering the latent reward. With additional assumptions that rewards are rare when the agent acts randomly and some negatively labeled interactions are directly accessible to the agent, IGL-P recovers the latent reward function and leverages that to learn a personalized policy.

IGL-P successes

The success of IGL-P is demonstrated with experiments using simulations, as well as with real-world production traces. IGL-P is evaluated in three different settings:

  • A simulation using a supervised classification dataset shows that IGL-P can learn to successfully distinguish between different communication modalities.
  • A simulation for online news recommendation based on publicly available data from Facebook users shows that IGL-P leverages insights about different communication modalities to learn better policies and achieve consistent performance among diverse user groups (the dataset, created in 2016, consists of public posts from the official Facebook pages of news companies from 2012 to 2016 and aggregated user reactions; because of this aggregation, identifying information can’t be extracted).
  • A real-world experiment deployed in the Microsoft image recommendation product Windows Spotlight showcases that the proposed method outperforms the hand-engineered reward baseline and succeeds in a practical application serving millions of people.

The post Inferring rewards through interaction appeared first on Microsoft Research.

Collaborators: Gov4git with Petar Maymounkov and Kasia Sitkiewicz

GitHub Product Manager Kasia Sitkiewicz and Protocol Labs Research Scientist Petar Maymounkov discuss their collaboration on Gov4git on the Microsoft Research Podcast

Episode 139 | May 3, 2023

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a new Microsoft Research podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with. 

In this inaugural episode, host Dr. Gretchen Huizinga talks with GitHub Staff Product Manager Kasia Sitkiewicz and Protocol Labs Research Scientist Petar Maymounkov about how their collaboration on Gov4git, a governance tool for decentralized, open-source cooperation, is helping to lay the foundation for a future in which everyone can collaborate more efficiently, transparently, and easily and in the ways that meet the unique desires and needs of their respective communities. They discuss the governance features that make Gov4git more suitable for serving a broader range of communities than today’s public blockchains and the open-source book project allowing them to test the potential and limitations of the work.

Transcript

[MUSIC] 

GRETCHEN HUIZINGA: Every great idea at Microsoft Research is yearning to find its way into the hearts, minds, and hands of people. Microsoft researchers work with an amazing—and sometimes surprising—array of collaborators from across the sciences who are integral to the process of shepherding these ideas from lab to life. Welcome to Collaborators, a podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga, and in this series, we’ll dive deep into the collaboration process and illuminate how research ideas move from mind to market in our ongoing effort to enhance human abilities, strengthen human communities, and benefit human lives. 


[MUSIC ENDS] 

Welcome to Episode 1 of Collaborators. Today, I’m joined by our first two guests, Petar Maymounkov and Kasia Sitkiewicz. Petar and Kasia are working on a project that has collaboration in its DNA: Gov4git, a decentralized, transparent, and secure git-based protocol for governing open-source communities that they say circumvents more costly approaches to things like validation and dispute resolution. 

We’re going to unpack all of that in this episode. But before we do, let’s get to know our collaborators. Kasia, let’s start with you. You’re at GitHub, “an open-source platform for collaborative software development and version management.” This platform is well-known in the dev community but give us a brief elevator tour of GitHub and particularly what your role is there. 

KASIA SITKIEWICZ: Sure. So I’m happy to give an overview of GitHub. Uh, GitHub is primarily known to be a home for all developers and open-source communities. It’s one of the most popular resources for developers, as you mentioned, to share code and work on projects in collaboration. It makes [it] super easy for developers to share code files and collaborate with each other using GitHub issues, which we will be referencing in the podcast, and pull requests, uh, which we call PRs. So imagine GitHub issues being like a project description or some kind of information that what needs to be built, and PRs, um, are pretty much amendments to the code change that a community wants to merge with the main code branch, uh, and that’s very well known among developer community. So pretty much like that’s how we use version control. We know what needs to be changed, what needs to be merged, and community pretty much participates in all of those changes. And what I do at GitHub, uh, I work as a product manager. I oversee growth for GitHub Enterprise Cloud and GitHub Advanced Security, and on the side, I collaborate with Microsoft, Web3, and Microsoft Research team on, uh, working on projects like Gov4git or other Web3 partnerships where I represent GitHub and, um, trying to onboard and make those projects successful. 

HUIZINGA: So there’s meta-collaboration, and then there’s micro-collaboration, and collaboration all over the place in GitHub. 

SITKIEWICZ: Exactly. Yes, we, we do like to collaborate. 

HUIZINGA: [LAUGHS] Well, you’re perfect for this show. So, Petar, you’re at Protocol Labs, “an open-source research, development, and deployment laboratory.” And, and you say you’re “building the next generation of the internet and making human existence orders of magnitude better through technology.” No pressure, right? Briefly tell us about Protocol Labs and your role in taking the internet and humanity to the next level. 

PETAR MAYMOUNKOV: Yeah, um, first, thank you for having us. Since you’re asking about the North Star mission of Protocol Labs, so to speak, I think it’s quite simple. I think it’s really trying to sort of create a better world that is both, um, it’s sustainable, fair, and inclusive, and it’s trying to do this through decentralization as a concept and technologies, of course, in particular. Now this is a mighty goal, and in practice, it, um, comprises essentially three workstreams, if you will. Um, the first thing is decentralized infrastructure, because it’s not possible to, to build anything useful without the infrastructure, and in this regard, Protocol Labs is, um, essentially working on and stewarding, uh, two products Filecoin and IPFS, which provide decentralized infrastructure in a democratic way to the whole world essentially. Um, now the second workstream is, um—Protocol Labs was one of the companies to realize early on that, uh, whenever decentralized technologies are involved, um, they go hand in hand with, uh, enabling everybody to contribute, so this raises the question of decentralized development, which is how do people collaborate across country boundaries, backgrounds, different levels of experience, and so forth. So along with all the engineering efforts, Protocol Labs is also essentially innovating workflows and culture about being productive in a decentralized development kind of, um setting. And the final workstream, uh, which kind of shows you how long term the vision is in Protocol Labs, so we recognize that, um, we cannot have a sustainable, decentralized world unless we replicate some of the important, um, sort of processes that happen in the real world, in particular the research-to-development innovation pipeline. So in the real world, this goes from academia to industry, and so forth. 
And part of, um, why this question is new and not the same as in the real world is because, uh, decentralized products being a type of public good, um, do not succumb to the same incentive mechanisms that drive the conventional economy. So we, we have a department called network funding and funding of public goods, which is itself involved in thinking about new mechanisms and incentives for, for making this, this process work in a repeatable fashion, basically. And my, uh, my role currently in the company is, uh, to think about facilitating decentralized development through standardized tools and protocols. 

HUIZINGA: Gotcha. Well, as we’re talking about collaboration and collaborators and you two are at two different companies, I’m going to call this question “how I met your mother”! How did Gov4git come about, and what was the initial felt need that defined the purpose? And as you answer that, tell us who’s all involved and how you each got involved on the team. Kasia, I’ll let you take the lead on this one. 

SITKIEWICZ: Sure. So I guess on my end, it all started through the passion I have for open source and the idea of decentralized communities. As I mentioned, I’m part of a lot of, uh, projects here at Microsoft and GitHub, and one of them is Web3 and Plural Technology Collaboratory that is led by Glen Weyl, and a few months ago, Glen and I, we had a conversation about how amazing git is and how amazing our GitHub communities are and overall like the efforts that they are working on towards like better world, public goods, and so on, and I share my vision for GitHub to be a tool or platform that can be accessible by anyone around the world where people can collaborate, they can own, uh, share and like earn money pretty much because of those contributions that they have. So we talk about this vision and we share the same kind of like a passion for all of those different projects and, you know, aspects of like open source, and he mentioned like, “Hey, we’re actually working on this like open-source book, uh, that will be hosted on GitHub, and we would love to do some kind of collaboration here.” And then he introduced me to Petar and Protocol Labs, and we had our first intro call. Uh, we learned like what is the objective, what problems we are trying to solve, and we put a small team of GitHub, Microsoft, and folks from Protocol Labs and a few folks also from open source, like purely I put a tweet about like, “Hey, I’m looking for contributors to this amazing project that will help with governance for open source,” and few folks reach out, and that’s how we kind of put it together. 

HUIZINGA: Right. Petar, how do you see the, the thing coming around? 

MAYMOUNKOV: So I had been working for Protocol Labs for about three and a half years. The first couple of years, I spent most of my time engineering and sort of being in the real-world decentralized development kind of environment, so I saw lots of things that work well; I saw lots of things that need improving; and over time, I developed an interest to kind of address, uh, this question sort of systematically and head on, which is when I, um, started working specifically just on this problem. And about six months ago or so, when I was starting, I was initially researching the space and what’s known. This is how I ran into Glen Weyl’s work, so eventually, we, we connected, and, um, I read sort of most of the stuff that he’s been working on and tried to sort of find a connection between this and what I knew from, from the trenches, if you will, from the engineering department, and then—and then, you know, he connected us with, um, with, uh, with Kasia. But the thing that sparked it, though, so at some point, Glen did sort of point out the specific project that he was trying to initiate, the plurality book, and this was kind of the thing that put a shape to our efforts because it was a very concrete task that we needed to figure out how to like address and accomplish in like a reasonable time. 

HUIZINGA: Yeah, so, so let’s get sort of granular about Gov4git and what it is, because I don’t think we’ve defined that, uh, from the get-go here, so, Kasia, can you kind of explain what it is and why it’s different? 

SITKIEWICZ: Sure. So Gov4git is pretty much a tool that helps, uh, open-source community to govern their community members in a more efficient, transparent, and easy way. There is a lot of problems in traditional governance model for any communities, and the larger communities are, there, there are more problems. And Gov4git is trying to solve a very particular problem of giving autonomy and ownership to the community to make decision what needs to happen and what changes the community needs to prioritize in order to make the project more successful. So, it’s just a solution that helps you to govern your communities in an efficient way. 

HUIZINGA: Yeah, so even as we’re talking, I’m thinking, OK, you’ve got Microsoft Research, you’ve got GitHub, you’ve got Protocol Labs. But do you use this to govern the things that you guys are working through as a community collaboration? 

MAYMOUNKOV: The tool itself is essentially implementing processes that kind of have organically emerged both in, in the context of Protocol Labs, as well as even other organizations like Ethereum. Um, I mean, this is the process of people kind of collaborating on specifications for decentralized protocols and so forth. For the particular—for Gov4git specifically, since the tool is still, uh, in some sense under development, but it, but it is kind of approaching MVP, we have used it internally as, as dog food, um, but not at large scale yet. 

HUIZINGA: Right. Gotcha. 

SITKIEWICZ: Yeah. And I think the beauty of Gov4git is actually very useful when you have a bigger community. Right now, our team is very small. It’s just like, uh, six people working together, so—and this is something I want to elaborate a little bit more in our, later in the podcast—but the smaller community, there is less problems, and you kind of make a decision on the fly, on the go, like, “Hey, what are we going to build next? And should we, should we focus on this or that?” So you can actually make those decisions without really spending too much time. And that’s a beauty for all startups moving fast, but the moment the community grows, you have those constraints and problems. So Gov4git is precisely designed for those growing communities and making sure the communities grow in like a very healthy way versus like there is a stop at some point where, like, you cannot make a consensus because of, you know, this person is out, or I don’t have enough information, or I don’t have rights or permissions to make those changes. So, uh, we—like Petar said—we dogfood the code, but at the same time, the use cases are like for a little bit bigger groups and communities. 

HUIZINGA: Well, let’s get specific about the problems and solutions from a technical perspective. And, um, Petar, I’m going to ask you to take the lead on this. As I understand Gov4git from my non-technical perch, it’s a sort of sandbox for community governance mechanisms. How would you define the problems you’re trying to solve with Gov4git, and how are you going about solving them technically? 

MAYMOUNKOV: Yeah, this is a good way of putting it. It’s, it’s a sandbox for governance, um, solutions, so, um, indeed I have the, um, technical kind of part of this, um, project. And from, um, from a computer science point of view, governance is synonymous with trusted computation. So trusted computation is, is an abstraction or a notion whereby there is a public, uh, program or rules of governance and the community has a method of kind of—there is a, there is a device that, that executes and follows the rules of governance and the community members have, um, assurance that the rules are followed as advertised and that nobody can sidestep the system regardless of their role in the community. So governance is trusted computation to scientists, basically. Now, uh, trusted computation being a general abstraction is, is something that has various embodiments in the real world, and the most, uh, famously known currently embodiment of trusted computation are public blockchains such as Ethereum, Filecoin, and others. So we could have sort of chosen to use these existing solutions to how you build governance applications, um, but we ran into a number of practical issues with them that prevent us from delivering sort of practical results in a reasonable amount of time. And also, there are some shortcomings that prevent these solutions from reaching people in unprivileged parts of the world, so developing world, war zones, authoritarian countries. Uh, so effectively, Gov4git from a technical standpoint is a different embodiment, a different implementation, of trusted computation, which is not in competition with public blockchains. It captures a, a different tradeoff, so to speak. 

HUIZINGA: OK, talk a little bit more about the tradeoff. I mean, some of these things would represent to me a barrier to entry—I wouldn’t be able to, um, afford it. What are some of the, the upsides to Gov4git that, um, we don’t find in the other spaces? 

MAYMOUNKOV: Yeah, so to make a fair comparison, I should first give some context on the existing blockchains. Um, so the existing blockchain technologies are quite exciting, um, and they, they’re very promising. But currently, they’re in a state of having overshot in their level of ambition and slightly underdelivered, at least for the present time, and I’m sure they will eventually deliver, uh, sort of completely. So what do I mean by this? So they have overshot in the sense that they are—they provide so many features and, and they capture an extremely large set of applications, but at the same time, this of course involves a lot of complexity that they need to deal with, and this complexity hasn’t been fully sorted out yet to make them usable for sort of common cases. So what, what we’ve noticed here is that there is a large group of applications, in particular community governance, which does not need most of the features that are provided by public blockchains. And so once you realize that this is the case, you unlock much simpler solutions that have the same sort of outcome for the users. So public blockchains—let me be a little bit specific here for the technical listeners—so public blockchains, they’re global systems, so across the world. They’re capable of hosting multiple independent applications. Uh, you can think of this as independent communities which need to interact with each other at very high speeds and with a very high throughput. So the typical applications that you can think of is essentially high-volume, cross-community business or trade interactions. And, of course, this is a real use case, especially with financial systems and so forth. But, um, in contrast, community governance applications, which are sort of designed to serve humancentric deliberative processes within a community, they’re not global; they’re local to a community. They are not multiple applications; they are a single application that governs one community. 
And because they are human-deliberative applications, they don’t need high speeds and high throughput, so recognizing that these, um, this is the case, alternative designs for trusted computation, um, sort of emerge and this is what we’ve, what we went after. 

HUIZINGA: That’s, that’s awesome. Well, and so, Kasia, let’s go back to a little bit because we’re going to cross over here. There’s a couple of themes that are emerging that I think are really interesting. Um, you talk about, earlier, the issues in pull requests that you deal with and that Gov4git has some mechanisms to help address the tension between what I might call anarchy and dictatorship. Is there some kind of a, a mechanism that’s different that can help mitigate that? 

SITKIEWICZ: Yeah, absolutely. So, as I mentioned, there are different types of communities, and the bigger the community gets, the more issues you have. Within smaller community, you pretty much know who you’re interacting with; you know the contributors; you know who is the maintainer. And it’s actually quite fast to make those changes and like approving those pull requests and reviewing comments and issues and other activities that are happening around every project. With the bigger communities, there’s more, uh, logistics problem and governance problem, and many times, you truly don’t know who is contributing to your code source. You just know their handle. That can be anyone; that can be even some kind of like ChatGPT, especially with like right now like the generative foundation models. Like we’re going to see more problems of like interacting with non-humans, right? So I feel like communities will have more and more problems facing like, “OK, how do I manage my contributors, and, uh, how fast we want to move the project?” So Gov4git is using, uh, a lot of like beautiful features from Web3, which is quadratic voting. It’s, uh, pretty much collective decision-making procedures that involve individuals who are part of your community with allocating votes to express the degree of their preferences. So as you mention, in a traditional organization, there is one person or one dictator that tells you like, “Hey, you’re going to build that.” And once we have it, we’re going to like approve it, right? And we’re going to like ship it. With quadratic voting, the decision is made collectively. So we’re going to implement quadratic voting part of our governance model. Second feature that is also very nice is like the governance tokens. Right now, um, communities, there are few ways of like how they make decisions, either majority of the votes or through consensus. 
With this type of governance tokens, you will be able to see like how many people voted on a specific pull request or a feature, and the majority of the votes will be pretty much the decision-making. So community can use those governance tokens for making the decision. And lastly, uh, there is a concept of badges. So in the Web3 space, there are like NFTs, and one of the NFTs is a soulbound token, which is a token that you are given that you cannot transfer, and we believe that by implementing those soulbound tokens, you can authenticate the user, you can say, “Hey, I know you; you’re part of this community; you got this badge.” And that badge gives you, let’s say, right to receive those tokens and so on. So again, those are just like a few features that are actually like very nice in that decentralized communities that we want to bring into Gov4git so that the communities can benefit from having specific features like, uh, quadratic voting, governance tokens, or like those badges. And what I want to say is like, you know, GitHub or like other git platforms, they don’t support this type of governance features, and that’s the need from the users and customers being like, “Hey, I need something that will be very easy, efficient, and transparent,” and Gov4git provides all of it. 
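[Editor's illustration] The quadratic voting mechanism mentioned above can be sketched in a few lines. In the common formulation, casting n votes on a single proposal costs n squared voice credits, so expressing a stronger preference becomes progressively more expensive. The sketch below assumes that formulation; the function names, the per-voter budget, and the rule that over-budget ballots are ignored are hypothetical, and none of this is Gov4git's actual implementation.

```python
import math
from collections import defaultdict


def vote_cost(votes: int) -> int:
    """Credits needed to cast `votes` votes (for or against) on one proposal."""
    return votes * votes


def max_votes(credits: int) -> int:
    """Largest number of votes a voter can afford on a single proposal."""
    return math.isqrt(credits)


def tally(ballots, budget):
    """Sum signed votes per proposal.

    Hypothetical rule for this sketch: a ballot whose total quadratic
    cost exceeds the per-voter budget is ignored entirely.
    """
    totals = defaultdict(int)
    for voter, votes_by_proposal in ballots.items():
        spent = sum(vote_cost(v) for v in votes_by_proposal.values())
        if spent > budget:
            continue
        for proposal, v in votes_by_proposal.items():
            totals[proposal] += v
    return dict(totals)


# Hypothetical ballots: positive numbers are votes for, negative are against.
ballots = {
    "alice": {"pr-1": 3, "pr-2": -1},  # costs 9 + 1 = 10 credits
    "bob": {"pr-1": -2, "pr-2": 2},    # costs 4 + 4 = 8 credits
    "carol": {"pr-1": 4},              # costs 16 credits, over budget
}
result = tally(ballots, budget=10)
print(result)
```

With a budget of 10 credits, a voter can put at most 3 votes on any single proposal, which is the sense in which votes express the degree of a preference rather than a plain yes or no.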

HUIZINGA: Yeah. Well, and on that same topic, Petar, I always like to ask what could possibly go wrong, and even as Kasia’s talking, all kinds of things are coming into my head like, um, could a bot get an SBT or, I mean, do you have to be, provide validation to who you are and what you represent yourself as? 

MAYMOUNKOV: Yeah, so, um, let me answer the general question and the specific question. So I think the specific question about bots is that, has the following answer. So I think people in Microsoft Research in particular, but people in general, are realizing that identity is going to be much harder to, uh, prove and understand in the presence of AI. And so here we kind of—especially Glen, sort of leading with his paper on soulbound tokens, is essentially looking into something that we do in the real world, uh, which is that we have deep ways of verifying people’s identity by essentially, um, looking into their history with communities and within society. Uh, so the presence of these badges that Kasia is mentioning is essentially creating a system whereby people can collect certificates from different endeavors that they have participated in to build out a résumé that is verifiable by the communities where they participated that they are who they are. In some, in some sense, the person is the sum total of everything they’ve done for other people. And currently, a bot cannot accomplish as much as a person and get sort of, you know, certificates from other humans that this has been the case. So roughly, this addresses the question of, OK, can something go wrong with, with bots. In a sense, bot or not, to be acknowledged in a system, you have to have contributed verifiably to, to multiple communities eventually. Um, but there is a bigger sort of picture about what can possibly go wrong. And so in this regard, Gov4git kind of sits in a very standard situation with most, uh, very promising software tools, which is that it, it is, it is a powerful tool that can fall in the hands both of good and bad people, acknowledging the fact that good and bad are relative terms. 
And, and this is, this actually also plays on a, on a general theme in software and science, which is that software engineers and engineers, scientists and so forth, they design software which is symmetric, so the software from the start treats everybody in the same way. It doesn’t have a way of distinguishing, you know, who’s using it. And even though this sounds like the right place to be—it’s a neutral place to be—there are plenty of cases already in the real world where, um, it is unclear, you know, whether society wants symmetric treatment of everybody. The, the classical example here that I would give is, is Twitter. When it comes to the question of censorship on Twitter, there’s a few different alternative, um, kind of directions that people can think of, of taking. One direction is to say that, uh, no censorship should happen, uh, which is the symmetric treatment. So everybody gets the same agency within a system. But as you know, there’s plenty of people who don’t like this approach. There’s other approaches, such as “somebody should censor us.” But who’s, who’s the somebody? So, so these kinds of issues all apply in this case, as well, because if Gov4git is to be successful, what I hope, or, you know, cautiously hope, that it will result in, it’ll enable communities to form at a much larger speed and a much larger volume around the world. And usually, when things speed up for humans, just like Twitter sped up discourse between people, um, we tend to find ourselves in a situation where we are slightly unprepared to, to, to reason about where does this go. 

HUIZINGA: Right. Kasia, what do you have to add to Petar’s conversation there on the “what could go wrong” from your end? 

SITKIEWICZ: I think from the product side—and I can speak as a product manager—there might be a case where like the community will come back to us like, “Hey, this is not what we want. We want something different,” right. Which, it’s a hypothesis, and can, this can, this feedback can happen, right. But at the same time, I believe that the community will ask for more. So like we are building just a very simple MVP to pretty much let the community to make those decisions, but perhaps the direction might be like, “Hey, the value’s somewhere else.” Uh, because once we launch, we can learn like, OK, this is great, but it’s not enough. So I would speak from the product side and like the user testing that perhaps we might discover like, oh, the actually true value will be somewhere else, and perhaps it can be a quadratic voting; it can be those tokens or those badges, right. So from my end, I feel like that’s the biggest like unknown, and speaking about bots and, uh, all the AI work, I feel like there is a lot of value in that, as well. So it’s not just a negative aspect of like, “Hey, I don’t want automation to be part of my project.” I think we will see it more, and there will be a lot of benefits. It’s just there are a lot of things we do not know as of now, and we just have to make sure like we are very flexible in terms of like how we pivot and how we adapt to feedback. 

HUIZINGA: Right. But, but in other ways, GitHub itself and Gov4git is a platform for people to form their own communities and govern their own communities, right? So you’re not going to be sort of the 10,000-foot hall monitor and try to meta-govern the people that are governing their own communities, correct? 

MAYMOUNKOV: Yes. 

SITKIEWICZ: That’s correct, yes. 

HUIZINGA: They’re nodding their heads. It’s a podcast—you can’t see it! [LAUGHS] Well, and this, this discussion on the “what could possibly go wrong” is important for me because I think people who are going to use the technology want to know that people promoting it are aware of the potential for unforeseen and unintended consequences and have a plan for mitigating. But it’s such an interesting ramp up to this new kind of use case for collaborative, open-source governance that it’s really cool. Kasia, let’s talk specifically about some of those use cases from the product side that you’ve alluded to. Um, GitHub is well known in the developer community, but how’s the idea of decentralized open-source work moving into non-technical communities and applications? 

SITKIEWICZ: Yeah, absolutely. So in any open-source project, you will find very technical contributors and maintainers and also you will find people who just like want to like observe the project or perhaps help with like project management or translation and so on. So we already have a lot of non-technical contributors who perhaps are struggling when they first log in to GitHub and they learn about git. They were like, “What the heck is that?” It’s a black box. So we truly get that feedback from customers. It’s like a very overwhelming experience, and it takes some time to wrap up and kind of learn how to use it. So the idea for Gov4git is pretty much a very simple presentation, or UI, via extension, Chrome extension, where you will see something very familiar like you see on Twitter, where you have like a post that you need to vote on, and if you are eligible to vote, you will, you’ll be able to use your tokens, uh, and vote on the decision, and you will be able to comment and interact with the community, and so on. So the ultimate goal is to create something very simple, just like a Twitter, you know, is simple, so that community is like, “Hey, I can participate, and I can put my vote, and I can contribute to this project.” So ultimately that’s the case. And the way—how we will be testing it, we talked about this book. So the book is called Plurality: Technology for Collaborative Diversity and Democracy, and it’s led by Audrey Tang and Glen Weyl and with, along with the plurality community. So the Plurality, it’s an open git-based collective book project that aims to offer a vision for the future of technology focusing on empowering and bridging social differences. So that book is on GitHub, and collaborators and maintainers who are participating are writing this book in an open-source way. And as you can imagine, writing a book is not an easy or trivial thing. You have a lot of reviews; you have everyone looking and providing feedback. 
So we believe that they can benefit from, uh, using Gov4git, with like management of like PRs and issues and decision-making. And, um, the initiative is already like there, right; it’s started. So we are just like trying to see like how that book can be completely managed by a community versus like Audrey or Glen having to like spend a lot of hours to review all of those PRs. And it sometimes is very challenging, and it’s almost impossible to go through every single comment, so we believe that this can help and expedite the process and make it a very transparent and efficient way to write in open source. 

HUIZINGA: Petar, talk a little bit about the other applications, including this one, from a technical perspective. Um, what makes it easier to resolve arguments and make edits with Gov4git versus other mechanisms to do that? 

MAYMOUNKOV: Gov4git, being a sandbox, at least technologically, is not trying to be prescriptive about how people do this. We’re trying to enable people to, to, to pick the mechanisms that they want for themselves, for arbitrating conflicts, so, you know, starting with, with Glen’s project, of course, we are starting with quadratic voting, and we plan, um, the quadratic voting is a, is a large, at this point, field. There’s lots of different variants of it. So we, we build the product so that over time Glen and Audrey can experiment with, you know, different types of conflict resolution and, and so forth. What Gov4git provides is the ease of adding a new mechanism that the community wants. And of course, we plan to have a library of like mechanisms that people can choose from. One nice side benefit from this entire project is that Gov4git, uh, enables people to like reflect on what they’ve done and, and what is happening. So with Gov4git, you always have a complete history, both of the governance motions of, of the community, alongside with the actual open-source collaborative work, which in particular enables academics and researchers from organizations such as the Metagovernance Project being a good example to go in there and study what types of mechanisms make for better results, basically, and kind of improve iteratively over this. 

HUIZINGA: Yeah. So it sounds like there’s a spectrum of assessment or meta-governance testing with computer scientists, product managers, academics. Even there, you see this great collaboration happening. Go back to the, the academics and other, uh, collaborators that are coming in on this. Do you find a broad spectrum of disciplines involved, not just computer scientists in academia but perhaps social scientists, legal scholars, any of these kinds of things coming into this? 

MAYMOUNKOV: Um, it’s too early to tell, but, uh, but there has been indeed interest, so, so from a few places, right. So the, the academics are indeed interested to, to consume this data when it’s available from real-world communities, because the key thing for them is to have real-world data like sufficiently scaled communities, like the Plurality book would be a great example because it’s probably expecting to have thousands of contributors. And otherwise, um, in addition to, uh, the Plurality book as like a first customer, so to speak, uh, we already have lots of interest from AI companies. So these are AI companies that are currently building open-source AI models, and they want to experiment with attaching governance to their open-source work, which is already happening on gits and GitHub. And they want—uh, because once you have governance plus open source, then you, you have a holistically democratic development of something like an AI tool. 

HUIZINGA: Right. That just struck me that you say thousands of contributors to a book and you never [LAUGHS] think of that being the case. Um … 

MAYMOUNKOV: Well, that’s a special, that’s a special book because it’s, it’s going to have translations in multiple languages, and it, uh, also needs to be fact-checked, so there’s a lot of work on fact-checking that, that goes along with the writing process. 

HUIZINGA: Yeah. Sounds a bit like wiki in terms of contributors and checking and making decisions and so on. Um, is, is Gov4git even in beta yet, or is it still just, um, sandboxing itself? 

MAYMOUNKOV: Um, so the, the MVP—the first version, if you will—is, is ready and has been tested for a few months internally at Protocol Labs. What we’re missing and we’re still working on is like the user interface that brings in the non-technical users. So I guess you could say that it’s in beta. I think like our launch with the Plurality book would be the first kind of official introduction event. 

HUIZINGA: Right. Yeah, and that’s an interesting, you know, when the outsiders looking in going open source, you think software, you think developers, you think code, but there’s a lot of other applications, including writing a book, which is basically just text-based writing. So, Kasia, are there any other sort of cream-floating-to-the-top applications or products that you could see coming out of this? 

SITKIEWICZ: Technically, anyone who wants to start something new and is looking for collaborators, and it can be pretty much whatever you want to build. It doesn’t have to be like a big idea. It can be just, “Hey, I want to collaborate with someone, and I want to like figure out how to do things and how to practice.” It can be used by academics, as you mentioned. Like pretty much any, any, any person who wants to start with like building something in public, they can do it and use it. So there is no limits. It’s up to you if you want to build community around the project you’re working on. So we don’t have any restrictions, and I feel like, um, we are in the stage right now or like this AI revolution where we’re just entering this like open-source community’s growth because there is like a lot of hype right now and everybody’s interested in it. Oh, maybe I can build that. It’s just so much easier to do things right now. And, you know, if you want to grow, you have to have a community around you. Um, so I think this is just like a best practices for anyone who wants to start writing in public. Whatever is that is—it might be like just a book or a code or like learning or like sharing some information. It doesn’t really matter. And, you know, being at GitHub, we see a lot of like amazing projects regardless of the discipline and like the area, and communities are just fascinating. And I think that’s the future. Like pretty much a lot of companies will start doing open-source code, just [like] Twitter has done it, right, just to bring the transparencies, because in a decentralized world, that’s like the value proposition, like, hey, it’s a very transparent way of building, and you have a history being displayed of the decision-making. And there are a lot of companies started noticing the beauty of it, and they—I think the movement is just starting, so I see a huge growth. 

HUIZINGA: Yeah, and that leads into the last question I wanted to ask both of you, um, and you both alluded to some of this already in your answers, but just if you could encapsulate in your ideal preferred future, what does your work look like in five to 10 years? How have you changed the landscape of collaborative work, community governance, and even that concept of communities? 

MAYMOUNKOV: So I hope that well within 10 years, this tool becomes perceived as a somewhat go-to tool for building, you know, communities from scratch, and, in particular, I actually hope that the tool reaches a critical point which you can label the beginning of intersectionality, to borrow a term from Glen’s, um, Glen’s vocabulary. Um, and what this means, this is a point where there is enough deployments of Gov4git that you have a non-trivial amount of people that are members of more than one community. So in other words, communities are starting to overlap, and when, when we reach this critical point, there’s a whole new set of applications that open up because now communities can, uh, interact with each other, uh, and ask each other for various kinds of help. The classical example here is that, um, one community can ask another community whether a given member has had a long and productive career in the other community. And this kind of idea—also mostly coming from Glen—is actually a mirror image of what I mentioned earlier, what happens in the real world. So when you apply for a job with, uh, an employer, the employer being a community, this employer calls up your university to verify that you actually went there and you did a good job. So you have these two communities basically sharing information. Um, so there’s lots of applications of intersectionality, but the reason I call this a critical point is because once you get there, you actually expect the network effect that we know from social networks to start taking place. In particular, if the network of communities using Gov4git is, is, is large and there’s lots of intersection, then any new communities being formed would benefit a lot from reusing the same technologies because now they can benefit from all of these other communities that already exist and that they can interoperate with. 
This is sort of a critical point, because, uh, if we reach it, then the tool really has a chance of becoming like an international standard for like conceiving communities, basically. 

HUIZINGA: Yeah. Kasia, what would you add to that? 

SITKIEWICZ: So I will speak a little bit more high level on the data we are seeing at GitHub, and what we believe that will happen is last year we hit 100 million developers being on our platform … 

HUIZINGA: Wow. 

SITKIEWICZ: and there are like thousands and thousands of different open-source communities. And we, we see a huge growth, and especially with like the AI innovation that is happening in that space, I think this will like triple in the upcoming few years. So the more people start understanding the beauty of technology and collaboration and like writing in public, the more adoption we will have. So I think it’s just a matter of time how fast, uh, tools like Gov4git will grow and will be needed. We’re still early because there is, like we don’t know what we don’t know. We know the problem. But we don’t know how the problem will, um, intensify in the upcoming like months or years, right. So I truly believe that there is a need for it. There will be a huge growth in terms of like creating new communities, and people from around the world, they can unite through using platforms like GitHub or other services where they can actually engage with other people who are passionate about the same thing. So as you mentioned, the open-source concept is not new, but it’s actually gaining strength, and the value’s there. So in my eyes, it’s just a matter of time on like the scale and the growth, and features like, like prioritization or like quadratic funding will be just like more adopted by the community. So that’s my, uh, take and, uh, opinion about the space. 

[MUSIC] 

HUIZINGA: Petar and Kasia, thank you so much for coming on the show today and being our first guests on the Collaborators podcast. 

MAYMOUNKOV: It’s a pleasure. 

SITKIEWICZ: Thank you for having us. 

[MUSIC ENDS] 

The post Collaborators: Gov4git with Petar Maymounkov and Kasia Sitkiewicz appeared first on Microsoft Research.

AI self-play for algorithm design

A flow chart demonstrating the five steps in a self-play pipeline for a language model to improve itself automatically.
A self-play pipeline for a language model (LM) to improve itself in a fully automatic manner. First, the LM generates novel puzzles based on a training set of handwritten puzzles. Then, the LM attempts to solve each of these puzzles 100 times. In Step 3, the computer (specifically a Python interpreter) filters the candidate solutions for correctness. Finally, the LM is improved by further training on these verified correct solutions to synthetic puzzles, and the process repeats. This process leads to significant improvements as measured on held-out test puzzles, which were also handwritten.

Efficient algorithms are crucial for many purposes, including reducing energy consumption in digital devices. While humans outperform AI systems at designing such algorithms, we show how to improve AI programming abilities using self-play, a technique that has helped AI systems dominate in games such as chess and Go.

Designing fast and accurate algorithms requires high-level abstract reasoning, which remains difficult for AI systems. Our approach involves having the AI design and solve its own programming challenges, enabling practice on millions of artificial challenges and exploration of problem types not found in public repositories. We detail our work in a new paper, “Language Models Can Teach Themselves to Program Better,” which we’re presenting at the 2023 International Conference on Learning Representations (ICLR).

The key challenge and our solution

How can an AI system generate novel algorithmic programming problems without knowing the solution?

Our approach uses programming puzzles introduced by Microsoft Research in 2021. These puzzles, known in complexity theory as the class of “NP” decision problems, are easy to check for correctness (no hidden answer key) but often difficult to solve. In this way, they’re like a Rubik’s cube, where it’s trivial to recognize a solution but hard to find one. Three examples are illustrated below: a novel string challenge and the classic Towers of Hanoi and factoring problems. Programming puzzles can range from trivial to major open problems in algorithms and mathematics, and solving them requires all the major algorithmic techniques, such as dynamic programming and greedy algorithms. However, each puzzle checks only a single input. Standard algorithm problems, by contrast, require a solution that scales efficiently across all inputs, which is much harder to test.

Programming puzzle examples

Can computers generate valuable, novel challenges?

Surprisingly, language models such as Codex and GPT-Neo can indeed create novel puzzles when prompted to generate “more like these” on a set of example puzzles without solutions. You may wonder what makes a challenge good. Rather than asking whether a challenge is interesting, we ask whether it is useful. Our evaluation has the language model generate, solve, and train on its own puzzles; then we assess whether the training improved its performance on a hidden test set of puzzles. (By now, solutions to our puzzles may have leaked into AI training sets, but with the help of champion competitive programmers, we have created a secret test set that remains unpublished, which can be used for uncontaminated evaluation.) In our experiments with small- to medium-sized language models, which have a few billion parameters and are far smaller than the latest GPT models, self-training more than doubled success rates.

Risks and limitations

This research was conducted prior to GPT-4’s release. While we believe similar techniques may help GPT-4 self-improve in programming, this is an active area of research as we better understand the capabilities and limitations of these models. One key limitation of puzzles is that solutions might only work for the specific instance provided. However, this limitation also serves as an advantage in terms of human-AI alignment. Unlike other AI challenges with inherent ambiguities that could lead to unintended consequences if objectives are imprecisely defined (for example, an AI-designed math-tutor app that may unintentionally become addictive), our programming puzzles encompass exactly those standalone problems that can be perfectly verified for meeting a precise objective. As there remains a risk that any work that substantially advances AI programming capabilities can be used in other systems and with unintended consequences, we continue to encourage taking great care before deploying systems with artificially generated code.

Examples of programming puzzles for AI self-play

Each puzzle is specified by a short Python program that checks a possible answer. Each solution is a Python program that outputs an answer in a limited amount of time.
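
To make this format concrete, here is a minimal sketch of the check-and-filter step, in which candidate solutions are run and only those that pass the puzzle's check are kept. The toy puzzle and the names sat, filter_verified, sol_a, sol_b, and sol_c are illustrative assumptions, not the authors' code. A real pipeline would also run candidates in a sandbox with a time limit, which this sketch omits.

```python
from typing import Callable, List


def sat(x: int) -> bool:
    """Toy puzzle (hypothetical): accept any integer whose square is 144."""
    return x * x == 144


def filter_verified(puzzle: Callable[[int], bool],
                    candidates: List[Callable[[], int]]) -> List[Callable[[], int]]:
    """Keep only the candidate programs whose output passes the puzzle's check."""
    verified = []
    for candidate in candidates:
        try:
            answer = candidate()
            # Require an actual True so truthy non-boolean results are rejected.
            if puzzle(answer) is True:
                verified.append(candidate)
        except Exception:
            # A candidate that crashes is simply discarded.
            continue
    return verified


# Three hypothetical candidate solutions, as a language model might propose them.
def sol_a() -> int:
    return 12


def sol_b() -> int:
    return 14


def sol_c() -> int:
    return 1 // 0  # raises ZeroDivisionError


kept = filter_verified(sat, [sol_a, sol_b, sol_c])
print([f.__name__ for f in kept])
```

Only candidates that survive this filter would be used as further training data, which is why no answer key is needed.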

Example 1: Towers of Hanoi

A Towers of Hanoi puzzle in three steps: the first a picture with the puzzle’s seven disks on the first tower, the second a picture with the disks split among the three towers, and the third a picture of all the disks on the last tower.

The goal of the well-known Towers of Hanoi puzzle is to move all the disks from the first tower to the last tower, one by one, without ever putting a bigger disk on top of a smaller disk. It’s easy to check that a solution is correct but hard to find a correct solution. Even though the number of steps required to solve it is exponential in the number of disks, there’s a solution in the form of a short program that is often used to teach recursion. The clever solution program that outputs the moves is easier to find than the sequence of moves itself. Here are the programming puzzle and solution:
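
The original illustration is not reproduced in this text. As a stand-in, the following is a minimal sketch of what such a puzzle and solution could look like, assuming towers numbered 0 to 2 and moves written as [source, target] pairs. The function names and the exact encoding are assumptions and may differ from the authors' version.

```python
from typing import List

NUM_DISKS = 7  # matches the seven disks described in the illustration


def sat(moves: List[List[int]], num_disks: int = NUM_DISKS) -> bool:
    """Check that a list of [source, target] moves solves the puzzle."""
    # Each tower is a stack; the last element is the top. Larger number = bigger disk.
    towers = [list(range(num_disks, 0, -1)), [], []]
    for source, target in moves:
        if not towers[source]:
            return False  # no disk to move
        disk = towers[source].pop()
        if towers[target] and towers[target][-1] < disk:
            return False  # a bigger disk may not sit on a smaller one
        towers[target].append(disk)
    # Solved when the first two towers are empty, so all disks are on the last.
    return towers[0] == [] and towers[1] == []


def sol(num_disks: int = NUM_DISKS) -> List[List[int]]:
    """The classic recursive solution: it outputs the moves rather than listing them by hand."""
    def hanoi(n: int, source: int, spare: int, target: int) -> List[List[int]]:
        if n == 0:
            return []
        return (hanoi(n - 1, source, target, spare)    # clear the way
                + [[source, target]]                   # move the largest disk
                + hanoi(n - 1, spare, source, target)) # stack the rest on top
    return hanoi(num_disks, 0, 1, 2)


moves = sol()
print(len(moves), sat(moves))
```

The solution program is only a few lines long, while the move list it produces has 2^7 - 1 = 127 entries.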

Example 2: String challenge

This concise puzzle perplexes AI systems, although humans find it simple. The puzzle requires a string with 1,000 “A” characters but no two consecutive A’s. Most programmers devise solutions like “ABABAB …” (1,000 times), generated by the compact Python solution above. In contrast, AI systems usually need multiple attempts. Fortunately, AI systems can easily verify their attempts by running the checking program. This puzzle exemplifies a straightforward, unique problem specifically created for our dataset.
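
The figure containing the puzzle is not reproduced in this text. A minimal sketch consistent with the description follows; the exact wording of the check is an assumption.

```python
def sat(s: str) -> bool:
    """Accept a string with exactly 1,000 "A" characters and no two in a row."""
    return s.count("A") == 1000 and "AA" not in s


def sol() -> str:
    """Alternate "A" with another character, 1,000 times."""
    return "AB" * 1000


print(sat(sol()))
```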

Example 3: Integer factorization

Another classic example is integer factorization. The puzzle above requires a factor of a relatively small number so it can be solved quickly by a simple loop. However, our dataset also contains factoring challenges like the 309-digit RSA Factoring Challenge number, which was published in 1991 along with a $100,000 prize. The 309-digit number was never factored, and the challenge has since ended.
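
The figure containing the puzzle is not reproduced in this text. The sketch below shows the general shape of a factoring puzzle solved by a simple loop. The number 8051 (which equals 83 × 97) and the function names are assumptions chosen for illustration, not the number used in the original puzzle.

```python
def sat(d: int, n: int = 8051) -> bool:
    """Accept any nontrivial factor d of n."""
    return 1 < d < n and n % d == 0


def sol(n: int = 8051) -> int:
    """Trial division: fine for small n, hopeless for a 309-digit number."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    raise ValueError("n is prime, so the puzzle has no answer")


print(sol(), sat(sol()))
```

The same check works unchanged for a very large n, which is what makes the hard instances easy to verify yet impractical to solve this way.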

The post AI self-play for algorithm design appeared first on Microsoft Research.

Research Focus: Week of April 24, 2023

Microsoft Research Focus 14 edition, week of April 24, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

AWARD

Microsoft researcher Kalai awarded 2022 ACM Prize in Computing

Yael Tauman Kalai, a senior principal researcher at Microsoft Research, has been awarded the 2022 ACM Prize in Computing. Kalai was recognized for breakthroughs in verifiable delegation of computation and fundamental contributions to cryptography. According to the award announcement, “Kalai’s contributions have helped shape modern cryptographic practices and provided a strong foundation for further advancements.”

The ACM Prize in Computing recognizes early-to-mid-career computer scientists whose research contributions have fundamental impact and broad implications.

Among the multiple accomplishments cited for the award, Kalai has developed methods for producing succinct proofs that certify the correctness of any computation. These methods enable a weak device to offload any computation to a stronger device in a way that allows the results to be efficiently checked for correctness. Such succinct proofs have been used by blockchain companies to certify transaction validity, thereby overcoming key obstacles in blockchain scalability and enabling faster and more reliable transactions.

Kalai was also cited for her breakthrough work on the security of the “Fiat-Shamir paradigm,” a general technique for eliminating interaction from interactive protocols. This paradigm is extensively utilized in real-world applications, including the most prevalent digital signature scheme (ECDSA), which is used by all iOS and Android mobile devices.


NEW RESEARCH

Empowering Azure Storage with RDMA

High performance and highly reliable storage are fundamental requirements of public clouds. Given the wide adoption of disaggregated storage in the cloud, networking is essential for enabling high performance and high reliability. Microsoft’s Azure cloud service uses remote direct memory access (RDMA) as its transport and aims to enable it for both storage frontend traffic (between compute virtual machines and storage clusters) and backend traffic (within a storage cluster) to fully realize its benefits. As compute and storage clusters may be located in different datacenters within an Azure region, RDMA needs to be supported at regional scale.

In a new paper: Empowering Azure Storage with RDMA, Microsoft Azure and Microsoft Research report on their intra-region RDMA deployment to support storage workloads in Azure. The high complexity and heterogeneity of Azure infrastructure create challenges, such as the problem of interoperability between different types of RDMA network interface cards. Several changes were made to the network infrastructure to address these challenges. Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. This helps achieve significant disk I/O performance improvements and CPU core savings.


NEW RESEARCH

LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models

Systems that support users in the automatic creation of visualizations must address several subtasks: understanding the semantics of data, enumerating relevant visualization goals, and generating visualization specifications. In a new paper: LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models, researchers from Microsoft pose visualization generation as a multi-stage generation problem and argue that well-orchestrated pipelines based on large language models (LLMs) and image generation models (IGMs) are well suited to addressing these tasks.

LIDA is a novel tool for generating grammar-agnostic visualizations and infographics. It comprises four modules: a summarizer that converts data into a rich but compact natural language summary; a goal explorer that enumerates visualization goals given the data; a visgenerator that generates, evaluates, refines, executes, and filters visualization code; and an infographer module that yields data-faithful stylized graphics using IGMs. LIDA provides a Python API and a hybrid user interface (direct manipulation and multilingual natural language) for interactive chart, infographics, and data story generation.


NEW RELEASE

Announcing DeepSpeed-Chat: Easy, fast, affordable RLHF Training of ChatGPT-like models at all scales

Microsoft’s AI at Scale initiative has released DeepSpeed-Chat, an easy, fast, and low-cost open-source solution for reinforcement learning from human feedback (RLHF) training that can create high-quality ChatGPT-like models ranging in size from a few to hundreds of billions of parameters. DeepSpeed-Chat provides a complete RLHF training experience with a single click. It combines the prowess of DeepSpeed-Inference and DeepSpeed-Training to offer 15x faster throughput than the previous state of the art, while also supporting model sizes that are up to 8x larger on the same hardware. With DeepSpeed-Chat, practitioners can train an OPT-13B ChatGPT-like model in under 1.5 hours or a massive 175B model in a day on a modest GPU cluster. For those who don’t have a GPU cluster handy, DeepSpeed-Chat enables practitioners to train up to a 13B model on a single GPU, or to train on Azure Cloud for $300.


NEWS

Gov4git: Decentralized community governance to fuel open-source projects

Communal open-source projects have helped build countless applications for sourcing and sharing information like bug details and scientific data, as well as decentralized planning, design and policymaking. 

But the lack of a standardized and secure governance solution prevents many open-source projects from getting started—and holds them back when they get too big to be managed through ad-hoc methods. These small communities often resort to external mechanisms to manage their projects and protect them from malicious actors.

Microsoft Research and Protocol Labs, an open-source R&D company, are collaborating to develop Gov4git, a decentralized, git-native protocol with configurable governance rules to help launch more open-source projects and communities and support their growth.

Gov4git comes with many of the transparency, decentralization, and security benefits of blockchains while also harnessing the power of formal governance to avoid costly approaches to validation and dispute resolution. 

Git is the worldwide standard for version control and management of collaborative software development projects. Gov4git is designed as a secure and cost-effective framework that can be tailored to the specific needs of any one community and deployed by non-technical users anywhere git is accessible. Gov4git can strengthen the security of such communities against the risk of malicious actors posing as collaborators with the intent to negatively impact community maintenance.

The post Research Focus: Week of April 24, 2023 appeared first on Microsoft Research.

TLA+ Foundation aims to bring math-based software modeling to the mainstream

Leslie Lamport headshot in front of blurred code

TLA+ is a high-level, open-source, math-based language for modeling computer programs and systems, especially concurrent and distributed ones. It comes with tools to help eliminate fundamental design errors, which are hard to find and expensive to fix once they have been embedded in code or hardware.

The TLA language was first published in 1993 by the pioneering computer scientist Leslie Lamport, now a distinguished scientist with Microsoft Research. After years of Lamport’s stewardship and Microsoft’s support, TLA+ has found a new home. The TLA+ Foundation is launching this month as part of the Linux Foundation, with Microsoft, Amazon Web Services (AWS), and Oracle serving as founding members to help further refine the tools and spur commercial usage and additional research. 

“The foundation will help spread that work among more hands,” said Lamport. 

TLA+ is just one piece of Lamport’s impressive portfolio. He invented the document preparation system LaTeX and won the 2013 Turing Award for his work to clarify distributed systems, in which several autonomous computers communicate with each other by passing messages. 

Along the way he developed an idea to help programmers build systems more effectively by using algorithmic models to specify how the code should work. It’s the same idea as creating blueprints to guide the construction of a bridge. TLA+ (for Temporal Logic of Actions) comes with a model checker that will check whether satisfying a program’s specification implies that the code will do what it should.
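
To give a feel for what checking a design against its specification involves, here is a small brute-force sketch in Python. It is not TLA+ and not how the TLA+ tools work internally; it only illustrates the general idea of exploring every interleaving of a tiny concurrent design and testing a stated property. The design (two processes that each read a shared counter and then write back the value plus one) and all names are assumptions made for this example.

```python
from collections import deque
from typing import Iterator, Optional, Tuple

# State: (counter, (pc, tmp) of process 0, (pc, tmp) of process 1)
State = Tuple[int, Tuple[str, int], Tuple[str, int]]

INITIAL: State = (0, ("read", 0), ("read", 0))


def next_states(state: State) -> Iterator[State]:
    """Yield every state reachable by letting one process take one step."""
    counter, *procs = state
    for i, (pc, tmp) in enumerate(procs):
        if pc == "read":
            new_counter, new_proc = counter, ("write", counter)
        elif pc == "write":
            new_counter, new_proc = tmp + 1, ("done", tmp)
        else:
            continue  # this process has finished
        updated = list(procs)
        updated[i] = new_proc
        yield (new_counter, updated[0], updated[1])


def invariant(state: State) -> bool:
    """Specification: once both processes are done, the counter must be 2."""
    counter, p0, p1 = state
    both_done = p0[0] == "done" and p1[0] == "done"
    return (not both_done) or counter == 2


def check(initial: State) -> Optional[State]:
    """Breadth-first search over all reachable states; return a violating state, if any."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


violation = check(INITIAL)
print(violation)
```

The search finds the interleaving in which both processes read 0 before either writes, so the final counter is 1. This is the kind of design error that is easy to miss by inspection and cheap to catch before any production code exists.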

“When programmers write systems, they should start by defining what they are supposed to do and check that their work will do it. That’s a better way than just sitting down to write the code, based on some vague outline,” Lamport said. 

For simple tasks, a trial-and-error approach may be fine. But for more complicated projects, or those where mistakes are unacceptable, a systematic approach makes more sense.

The challenge with writing large programs isn’t necessarily their size; it’s their complexity. They are often distributed across multiple systems and involve multiple processes that need to interact. The number of possible executions becomes astronomical. To reason about and check such a system, it helps to have a mathematical way to think about it ahead of time. Yet engineers often balk at the idea.

“The difficulty that engineers have is more a fear of math than the math itself. The math, as math goes, is very basic,” Lamport said, though it’s worth noting he holds a PhD in mathematics. “I find that engineers, after using TLA+, understand the benefit.”

Leslie Lamport giving a talk on stage

In fact, TLA+ has been adopted for industrial use at semiconductor makers, companies that build distributed and database systems, other tech companies, and in more mainstream applications like payment systems in retail stores. It’s likely that some applications aren’t made public—most companies don’t publicly discuss their engineering process or proprietary technology.

That’s where the foundation comes in. A formal system for contributing to the tools and defining their future direction may spawn additional collaboration among engineers and facilitate commercial adoption. The foundation will create a steering committee, similar to other panels that look after public domain programming languages like C or Java.

“I would hope that the new stewards make more subtractions than additions to the language, to remove some things that aren’t needed,” Lamport said. 

Now 82 years old and nearing retirement, Lamport also hopes the foundation gets TLA+ closer to the mainstream of industrial and academic discussion.

“TLA+ is never going to be as popular as Java. And I’d be happy if someone else made it better at helping engineers think more mathematically,” Lamport said. “The ultimate goal is to get engineers to think rigorously at a higher level about what they are doing.”

The post TLA+ Foundation aims to bring math-based software modeling to the mainstream appeared first on Microsoft Research.

Unifying learning from preferences and demonstration via a ranking game for imitation learning

Rank Game diagram

For many people, opening door handles or moving a pen between their fingers is a movement that happens multiple times a day, often without much thought. For a robot, however, these movements aren’t always so easy.

In reinforcement learning, robots learn to perform tasks by exploring their environments, receiving signals along the way that indicate how good their behavior is compared to the desired outcome, or state. For the described movements, for example, we can specify a reward function that is +1 when the door is successfully opened or the pen is at the desired orientation and 0 otherwise. But this makes the learning task complicated for the robot since it has to try out various motions before stumbling on the successful outcome, or a reward of +1.
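
The sparse reward described above can be sketched in a few lines. The door-angle representation, the threshold value, and the function name are assumptions made for illustration only.

```python
def sparse_reward(door_angle_rad: float, open_threshold_rad: float = 1.0) -> float:
    """Return +1 once the door counts as open, and 0 otherwise (hypothetical threshold)."""
    return 1.0 if door_angle_rad >= open_threshold_rad else 0.0


# A hypothetical trajectory of door angles observed while the robot explores.
trajectory = [0.0, 0.2, 0.5, 0.9, 1.1]
rewards = [sparse_reward(angle) for angle in trajectory]
print(rewards)
```

Every step before the last one earns the same reward of 0, so the signal gives the robot no hint that 0.9 radians is closer to success than 0.2.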

The imitation learning (IL) paradigm was introduced to mitigate the amount of trial and error. In IL, the robot is provided with demonstrations of a given task performed by an expert from which it can try to learn the task and possibly gain information about the expert’s reward function, or the expert’s intent, similar to how people pick up various skills. Yet, learning remains difficult in instances where we only have access to the change enacted by the expert in the world, known as the expert observation, and not the precise actions the expert took to achieve the change. Another difficulty the robot faces is that even if it sees infinite expert demonstrations, it can’t fully reason about the intent of the expert—that is, compare whether one of its own learned behaviors is closer to the expert’s than another behavior—as it only knows the best behavior and has no notion of ordering over other behaviors.

In our paper “A Ranking Game for Imitation Learning,” published in Transactions on Machine Learning Research (TMLR) in 2023, we propose a simple and intuitive framework, \(\texttt{rank-game}\), that unifies learning from expert demonstrations and preferences by generalizing a key approach to imitation learning. Giving robots the ability to learn from preferences, obtained by having an expert rank which behavior aligns better with their objectives, allows the learning of more informative reward functions. Our approach, which enabled us to propose a new objective for training over behavior preferences, makes the learning process easier for a robot and achieves state-of-the-art results in imitation learning. It also enabled the training of a robot that can solve the tasks of opening a door and moving a pen between its fingers in simulation, a first in imitation learning with expert observations alone. The incorporation of preferences has also seen success in language modeling, where chatbots such as ChatGPT are improved by learning a reward function inferred from preferences over several samples of model responses, in addition to learning from desired human conversational data.

Robotics has found a place in controlled environments where the tasks at hand are well-defined and repeatable, such as on a factory floor. Our framework has the potential to help enable robot learning of tasks in more dynamic environments, such as helping people with daily chores around the home.

With \(\texttt{rank-game}\), which combines learning from preferences and demonstrations via a two-player ranking-based game, robots in simulation were trained to manipulate a pen with a dexterous hand (left) and open a door with a parallel jaw gripper (right). The successful completion of these tasks marked a first in imitation learning with expert observations alone.

A ranking game for imitation learning

Inverse reinforcement learning (IRL) is a popular and effective method for imitation learning. IRL learns by inferring the reward function, also referred to as the intent of the expert, and a policy, which specifies what actions the agent—or, in our case, the robot—should take in a given state to successfully mimic the expert.

Notation: We use \(\pi\) and \(\pi^E\) to denote the policy of the agent and the expert, respectively, and \(R_{gt}\) to be the reward function of the expert, which is unknown to the agent/robot. \(\rho^\pi\) denotes the state-action (or state-only) visitation distribution of policy \(\pi\), that is, the distribution over the states (and actions) the policy visits in the environment. We use \(J(R;\pi)\) to denote the \(\textit{cumulative reward}\), or the performance of policy \(\pi\) under a reward function \(R\). We assume policy \(\pi\) belongs to function class \(\Pi\) and reward function \(R\) belongs to function class \(\mathcal{R}\).

The goal of imitation learning is to make the agent have the same performance as the expert under the expert’s unknown reward function \(R_{gt}\). The classical IRL formulation tackles this by minimizing the imitation gap under a reward function that makes the performance gap the largest. We denote this framework by \(\texttt{imit-game}\) and write it below formally:

\(\texttt{imit-game}(\pi,\pi^E): \text{argmin}_{\pi\in\Pi}\text{max}_{R\in\mathcal{R}} [\mathbb{E}_{\rho^E(s,a)}[R(s,a)]-\mathbb{E}_{\rho^\pi(s,a)}[R(s,a)]]\)

Simply stated, the \(\texttt{imit-game}\) tries to find a policy that has the lowest worst-case performance difference with the expert policy. This classical IRL formulation learns from expert demonstrations but provides no mechanism to incorporate learning from preferences. In our work, we ask, does IRL really need to consider the worst-case performance difference? We find that relaxing this requirement allows us to incorporate preferences.
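To make the min-max structure concrete, here is a minimal sketch on a toy problem with finite policy and reward classes. The visitation distributions, policy names, and reward class below are hypothetical values chosen for illustration; they do not come from the paper.

```python
import numpy as np

# Hypothetical toy setup: 3 state-action pairs, behaviors given as
# visitation distributions (probability vectors over those pairs).
rho_expert = np.array([0.1, 0.2, 0.7])
candidate_policies = {
    "pi_a": np.array([0.6, 0.3, 0.1]),
    "pi_b": np.array([0.2, 0.2, 0.6]),
    "pi_c": np.array([0.3, 0.4, 0.3]),
}
# Hypothetical finite reward class: each row assigns a reward to every pair.
reward_class = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

def worst_case_gap(rho_pi, rho_e, rewards):
    """Inner max: largest expert-minus-agent expected reward over the class."""
    gaps = rewards @ (rho_e - rho_pi)
    return float(gaps.max())

def imit_game(policies, rho_e, rewards):
    """Outer argmin: the policy with the smallest worst-case gap."""
    gaps = {name: worst_case_gap(rho, rho_e, rewards)
            for name, rho in policies.items()}
    return min(gaps, key=gaps.get), gaps

best_policy, gaps = imit_game(candidate_policies, rho_expert, reward_class)
print(best_policy, gaps)
```

In this toy instance the selected policy is the one whose visitation distribution is closest to the expert’s under the most adversarial reward in the class.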

Our proposed method treats imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to map more preferred behaviors to a higher total reward for each of the pairwise preferences, while the policy agent learns to maximize the performance on this reward function by interacting with the environment. In contrast to the classical IRL framework, the reward function now only has to get the rankings correct rather than optimize for the worst case (see Figure 1).

A flow chart with, clockwise from top left, a green box labeled “policy agent,” a blue box labeled “reward agent,” and an orange box labeled “Dataset D,” which contains pairwise behavior rankings obtained from three sources. An arrow points from the policy agent to the dataset, indicating the policy’s contribution of rankings. An arrow pointing from the policy agent to the reward agent is labeled with the optimization strategy. An arrow pointing from the reward agent to the dataset is labeled with the ranking loss function.
Figure 1: The proposed \(\texttt{rank-game}\) method treats imitation learning as a two-player ranking-based game between a policy and a reward. The policy agent maximizes the reward function by interacting with the environment. The reward agent satisfies a set of behavior rankings obtained from various sources: generated by the policy agent, automatically generated via data augmentation, or expert-annotated rankings obtained from a human or offline dataset.

To incorporate preferences, we need to quantify the behaviors in order to compare them. In this work, we choose the behaviors (\(\rho\)) to be the state-action or state-only visitation distribution of the agent. A ranking between behaviors specifies that the expert would prefer one behavior over the other. A reward function that satisfies the behavior rankings ensures that the average return under a lower-ranked behavior is no larger than that under the higher-ranked behavior. More formally, the ranking game is defined as a game where the policy agent \(\pi\) maximizes the expected return \(J(R;\pi)\) of the policy under reward function \(R\) when deployed in the environment. The reward player takes the dataset of pairwise rankings \(D^p\) (rankings are denoted as \(\rho^i\preceq\rho^j\)) as an input and attempts to learn a reward function that satisfies those rankings using a ranking loss (denoted by \(L(D^p;R)\)).

\(\underbrace{\text{argmax}_{\pi\in\Pi}J(R;\pi)}_{\text{Policy Agent}}~~~~~~~~~~~~~~~\underbrace{\text{argmin}_{R\in\mathcal{R}}L(D^p;R)}_{\text{Reward Agent}}\)

The ranking loss induces a reward function \(R\) that attempts to satisfy each pairwise preference in the dataset as follows:

\(\mathbb{E}_{\rho^i}[R(s,a)]\le\mathbb{E}_{\rho^j}[R(s,a)]~~,~~\forall \rho^i\preceq\rho^j \in D^p\)
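The condition above can be checked empirically when each behavior is available as a set of sampled state-action pairs. The sketch below is illustrative only: the function names, the one-dimensional states, and the toy reward are assumptions, not part of the paper’s code.

```python
import numpy as np

def expected_reward(reward_fn, samples):
    """Monte Carlo estimate of E_rho[R(s, a)] from (state, action) samples."""
    return float(np.mean([reward_fn(s, a) for s, a in samples]))

def satisfies_rankings(reward_fn, ranking_dataset):
    """True if E_{rho^i}[R] <= E_{rho^j}[R] for every pair (rho^i, rho^j),
    where rho^i is the less preferred behavior."""
    return all(
        expected_reward(reward_fn, less) <= expected_reward(reward_fn, more)
        for less, more in ranking_dataset
    )

# Hypothetical 1-D example: the goal state is s = 1; actions are unused.
far_from_goal = [(0.0, 0), (0.2, 0)]
near_goal = [(0.9, 0), (1.0, 0)]
dataset = [(far_from_goal, near_goal)]  # far_from_goal is less preferred

good_reward = lambda s, a: -abs(s - 1.0)  # higher near the goal
bad_reward = lambda s, a: abs(s - 1.0)    # higher away from the goal

ok_good = satisfies_rankings(good_reward, dataset)
ok_bad = satisfies_rankings(bad_reward, dataset)
print(ok_good, ok_bad)
```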

Generalizing prior imitation learning approaches with \(\texttt{rank-game}\)

The \(\texttt{rank-game}\) framework neatly encapsulates both prior work in IRL and prior work in learning from preferences. First, let’s see how classical IRL is a part of this framework. Recall that the classical IRL/\(\texttt{imit-game}\) optimization can be written as:

\(\text{argmin}_{\pi\in\Pi}\text{max}_{R\in\mathcal{R}} [\mathbb{E}_{\rho^E(s,a)}[R(s,a)]-\mathbb{E}_{\rho^\pi(s,a)}[R(s,a)]]\)

The inner optimization learns a reward function that ensures that the return gap under the reward function is maximized between the current policy’s behavior and the expert behavior. Thus, \(\texttt{imit-game}\) can be seen to be a special case of \(\texttt{rank-game}\) with: (1) a ranking dataset that prefers expert behavior more than the current agent behavior and (2) a form of ranking loss that maximizes the performance gap (termed as \(\textit{supremum loss}\)). A number of prior methods in the imitation learning domain can be understood as special cases of \(\texttt{rank-game}\) under various ranking losses, classes of reward functions, and abilities to incorporate preferences (see Figure 2).
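One way to spell out this reduction, using the shorthand \(L_{\text{sup}}\) for the supremum loss (this symbol is introduced here for illustration and may not match the paper’s notation): take the ranking dataset to contain the single preference of the expert over the current policy,

\(D^p = \{\rho^\pi \preceq \rho^E\}\)

and define the supremum loss as the negative return gap on that pair:

\(L_{\text{sup}}(D^p;R) = \mathbb{E}_{\rho^\pi(s,a)}[R(s,a)] - \mathbb{E}_{\rho^E(s,a)}[R(s,a)]\)

Minimizing this loss over the reward class is then the same problem as the inner maximization of \(\texttt{imit-game}\):

\(\text{argmin}_{R\in\mathcal{R}} L_{\text{sup}}(D^p;R) = \text{argmax}_{R\in\mathcal{R}} [\mathbb{E}_{\rho^E(s,a)}[R(s,a)]-\mathbb{E}_{\rho^\pi(s,a)}[R(s,a)]]\)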

A table summarizing imitation learning (IL) methods: the data modalities they can handle (expert data and/or preferences), their ranking loss functions, the assumptions they make about the reward function, and whether they require an external agent to provide preferences during training.

  • MaxEntIRL, AdRIL, GAN-GCL, GAIL, f-MAX, and AIRL: no offline preferences or active human query; enable learning from demonstration (LfD) when incorporating expert data; supremum ranking loss function; non-linear reward function.
  • BCO, GAIfO, DACfO, OPOLO, and f-IRL: no offline preferences or active human query; enable learning from observation (LfO); supremum ranking loss function; non-linear reward function.
  • TREX and DREX: offline preferences; Bradley-Terry ranking loss function; non-linear reward function; no active human query; do not enable LfO or LfD.
  • BREX: offline preferences; Bradley-Terry ranking loss function; linear reward function; no active human query; does not enable LfO or LfD.
  • DemPref: offline preferences; Bradley-Terry ranking loss function; linear reward function; active human query; enables LfO and LfD.
  • Ibarz et al. (2018): offline preferences; Bradley-Terry ranking loss function; non-linear reward function; active human query; enables LfD.
  • Rank-game: offline preferences; a new principled ranking loss that can naturally incorporate rankings provided by diverse sources; non-linear reward function; no active human query; enables LfO and LfD.
Figure 2: Previous methods that learn from expert demonstrations or preferences form a special case of \(\texttt{rank-game}\) under a specific choice of ranking loss and reward function class. Also noted in the table is whether a method enables learning from demonstration (LfD), that is, learning from both expert states and actions, or learning from observations (LfO), where an agent learns from expert states alone.

Setting up the ranking game

To develop a framework that successfully combines learning from demonstrations and learning from preferences, we addressed several questions:

  1. What is the ranking loss function that allows for the reward to satisfy the preferences in the dataset?
  2. Where do we get the dataset of pairwise preferences?
  3. How can we effectively optimize this two-player game?

Step 1: A new ranking loss function for reward learning

Our proposed framework requires learning a reward function such that the rankings in the dataset are satisfied. While several loss functions exist in prior literature to enable this, such as Luce-Shepard, Lovász-Bregman divergences, and the earlier discussed supremum loss, we introduce a new loss function:

\(L_k(\mathcal{D}^p;R) = \mathbb{E}_{(\rho^{\pi^i},\rho^{\pi^j})\sim \mathcal{D}^p} \Big[\mathbb{E}_{s,a\sim\rho^{\pi^i}}{[(R(s,a)-0)^2]} + \mathbb{E}_{s,a\sim\rho^{\pi^j}}{[(R(s,a)-k)^2]}\Big]\)

The loss function is simple and intuitive: For all the preference pairs in the dataset, the reward on state-action pairs visited by the less preferred behavior is regressed to 0, and the reward on state-action pairs visited by the more preferred behavior is regressed to a user-defined parameter \(k\). This loss function allows us to learn a reward function with user-defined scale \(k\), which plays an important role in enabling better policy optimization; it’s principled and facilitates near-optimal imitation learning; and by design, it allows us to incorporate preferences.
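As a minimal sketch of this loss, assume each behavior is represented by a finite set of sampled state-action pairs, so the expectations become sample means. The function name `ranking_loss_k` and the toy data are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

def ranking_loss_k(reward_fn, ranking_dataset, k=1.0):
    """Sample-based estimate of L_k: regress rewards on the less preferred
    behavior to 0 and rewards on the more preferred behavior to k."""
    total = 0.0
    for less_preferred, more_preferred in ranking_dataset:
        r_less = np.array([reward_fn(s, a) for s, a in less_preferred])
        r_more = np.array([reward_fn(s, a) for s, a in more_preferred])
        total += np.mean((r_less - 0.0) ** 2) + np.mean((r_more - k) ** 2)
    return float(total / len(ranking_dataset))

# Hypothetical 1-D example with a single preference pair.
less = [(0.0, 0), (0.2, 0)]
more = [(0.9, 0), (1.0, 0)]
dataset = [(less, more)]
k = 2.0

perfect_reward = lambda s, a: k if s > 0.5 else 0.0  # hits both targets
zero_reward = lambda s, a: 0.0                       # ignores the ranking

loss_perfect = ranking_loss_k(perfect_reward, dataset, k=k)
loss_zero = ranking_loss_k(zero_reward, dataset, k=k)
print(loss_perfect, loss_zero)
```

Because the targets are fixed at 0 and \(k\), a reward function that drives this loss to zero separates each ranked pair by a margin of exactly \(k\) in expectation.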

Step 2: Getting the ranking dataset

Besides giving more information about the expert’s intent and being easy to obtain, another benefit of preferences is that they can also help learn a more informative, or shaped, reward function. This form of reward shaping can provide better guidance for policy optimization, reducing the burden of exploring the environment to find the optimal policy and increasing sample efficiency for IRL. Our initial ranking dataset is generated by the policy agent from its interactions with the environment; in these rankings, the expert’s behavior is always ranked at least as high as the current policy’s behavior. To further leverage the benefits of preferences, we consider two methods for augmenting this ranking dataset:

  • Expert-annotated rankings: In situations where we have access to additional rankings, provided by humans or obtained from reward-annotated datasets, we can simply add them to our ranking dataset.
  • Automatically generated rankings: It turns out we can improve learning efficiency for imitation by using the rankings already present in the dataset of pairwise preferences to generate more preferences in a procedure similar to Mixup regularization in trajectory space.
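The automatically generated rankings can be sketched roughly as follows. This is only one plausible reading of a Mixup-style procedure and is an assumption, not the paper’s exact method: behaviors are represented as arrays of state-action feature samples, intermediate behaviors are sampled as mixtures of a ranked pair, and a mixture that draws more heavily from the preferred behavior is ranked higher. The function name and parameters are hypothetical.

```python
import numpy as np

def mixup_rankings(less_preferred, more_preferred,
                   weights=(0.25, 0.5, 0.75), n_samples=100, seed=0):
    """Build a chain of rankings from one pair by sampling mixture behaviors.

    Each weight lam is the probability of drawing a sample from the more
    preferred behavior. Returns (lower, higher) pairs along the chain
    less_preferred -> mixtures in increasing lam -> more_preferred.
    """
    rng = np.random.default_rng(seed)
    less = np.asarray(less_preferred, dtype=float)
    more = np.asarray(more_preferred, dtype=float)

    def sample_mixture(lam):
        from_more = rng.random(n_samples) < lam
        idx_less = rng.integers(len(less), size=n_samples)
        idx_more = rng.integers(len(more), size=n_samples)
        # Pick each row from the preferred behavior with probability lam.
        return np.where(from_more[:, None], more[idx_more], less[idx_less])

    chain = [less] + [sample_mixture(lam) for lam in sorted(weights)] + [more]
    return [(chain[i], chain[i + 1]) for i in range(len(chain) - 1)]

# Hypothetical behaviors: 5 samples each of 2-D state-action features.
less = np.zeros((5, 2))
more = np.ones((5, 2))
pairs = mixup_rankings(less, more)
print(len(pairs), pairs[1][0].shape)
```

The resulting pairs can be appended to the ranking dataset alongside the original preference.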

Step 3: Improving optimization stability with a Stackelberg game

Prior work has found the Stackelberg game framework to be a strong candidate for optimizing two-player games in various applications. A Stackelberg game is a bi-level optimization problem:

\(\text{max}_x~f(x,y_x),~~~~\text{s.t.}~~y_x\in \text{argmin}_y~g(x,y)\)

In this optimization, we have two players, Leader \(x\) and Follower \(y\), that are trying to maximize and minimize their own payoffs \(f\) and \(g\), respectively. We cast \(\texttt{rank-game}\) as a Stackelberg game and propose two algorithms depending on which player is set to be the leader:

  • Policy as Leader (PAL): \(\text{max}_\pi J(R;\pi)~~~~~\text{s.t.}~~ R=\text{argmin}_R~L(D^p;R)\)
  • Reward as Leader (RAL): \(\text{min}_R L(D^p;R)~~~\text{s.t.}~~\pi = \text{argmax}_\pi~J(R;\pi)\)

Aside from improving training stability, both methods have complementary benefits in the non-stationary imitation learning setting. PAL can adjust more quickly when the intent of the expert changes, while RAL can handle environmental changes better.
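The leader/follower structure can be illustrated on a toy single-state problem. In the sketch below, the follower takes many gradient steps per leader step as a crude stand-in for a best response; the step sizes, the step counts, the bandit environment, and the choice to train the RAL reward on an average over past policies are all assumptions made for illustration and may differ from the algorithms in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train(leader, iters=200, follower_steps=20, k=1.0,
          n_actions=4, expert_action=2):
    """Toy PAL/RAL loop on a single-state bandit (hypothetical setup)."""
    rho_expert = np.eye(n_actions)[expert_action]  # expert always picks one action
    logits = np.zeros(n_actions)                   # policy parameters
    reward = np.zeros(n_actions)                   # tabular reward R(a)
    past_policies = []

    def policy_step():
        # Gradient ascent on J(R; pi) = sum_a pi(a) R(a) w.r.t. the logits.
        nonlocal logits
        pi = softmax(logits)
        logits = logits + 1.0 * pi * (reward - pi @ reward)

    def reward_step(rho_agent):
        # Gradient descent on L_k with the single ranking rho_agent <= rho_expert.
        nonlocal reward
        grad = 2 * rho_agent * reward + 2 * rho_expert * (reward - k)
        reward = reward - 0.25 * grad

    for _ in range(iters):
        if leader == "policy":    # PAL: reward (follower) approximates best response
            for _ in range(follower_steps):
                reward_step(softmax(logits))
            policy_step()
        elif leader == "reward":  # RAL: policy (follower) approximates best response
            for _ in range(follower_steps):
                policy_step()
            past_policies.append(softmax(logits))
            reward_step(np.mean(past_policies, axis=0))
        else:
            raise ValueError("leader must be 'policy' or 'reward'")
    return softmax(logits), reward

pal_policy, pal_reward = train("policy")
ral_policy, ral_reward = train("reward")
print(pal_policy.round(3), ral_policy.round(3))
```

In both variants of this toy problem, the learned policy concentrates on the expert’s action, while the learned reward assigns a positive value only to that action.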

How well does \(\texttt{rank-game}\) perform in practice?

In testing the capabilities of \(\texttt{rank-game}\), one of the scenarios we consider is the learning from observations alone (LfO) setting, in which only expert observations are provided with no expert actions. This more challenging setting better reflects the learning conditions robots will operate under if we want them to be more widely deployed in both controlled and dynamic environments. People can more naturally provide demonstrations by performing tasks themselves (observations only) than by performing the task indirectly by operating a robot (observations and precise actions). We investigate the LfO performance of \(\texttt{rank-game}\) on simulated locomotion tasks like hopping, walking, and running and benchmark it with respect to representative baselines. \(\texttt{Rank-game}\) approaches require fewer environment interactions to succeed and outperform recent methods in final performance and training stability.

Additionally, our experiments reveal that none of the prior LfO methods can solve complex manipulation tasks such as door opening with a parallel jaw gripper and pen manipulation with a dexterous hand. This failure is potentially a result of the exploration requirements of LfO, which are high because of the unavailability of expert actions coupled with the fact that in these tasks observing successes is rare.

In this setting, we show that using only a handful of expert-annotated preferences in the \(\texttt{rank-game}\) framework can allow us to solve these tasks. We cannot solve these tasks using only expert data; adding preferences is key.

Next steps

Equipping agents to learn from different sources of information present in the world is a promising direction toward more capable agents that can better assist people in the dynamic environments in which they live and work. The \(\texttt{rank-game}\) framework has the potential to be extended directly to the setting where humans present their preferences interactively as the robot is learning. There are some promising future directions and open questions for researchers interested in this work. First, preferences obtained in the real world are usually noisy, and one limitation of \(\texttt{rank-game}\) is that it does not suggest a way to handle noisy preferences. Second, \(\texttt{rank-game}\) proposes modifications to learn a reward function amenable to policy optimization, but these hyperparameters are set manually. Future work can explore methods to automate such learning of reward functions. Third, despite learning effective policies, we observed that \(\texttt{rank-game}\) did not learn reusable robust reward functions.

For additional details, including experiments in the learning from demonstration (LfD) setting, non-stationary imitation setting, and further framework analysis, check out the paper, project page, code, and video presentation.

Acknowledgments

This research was supported in part by the National Science Foundation, Air Force Office of Scientific Research, and Army Research Office.

The post Unifying learning from preferences and demonstration via a ranking game for imitation learning appeared first on Microsoft Research.

Read More

Unifying learning from preferences and demonstration via a ranking game for imitation learning

Unifying learning from preferences and demonstration via a ranking game for imitation learning

Rank Game diagram

For many people, opening door handles or moving a pen between their fingers is a movement that happens multiple times a day, often without much thought. For a robot, however, these movements aren’t always so easy.

In reinforcement learning, robots learn to perform tasks by exploring their environments, receiving signals along the way that indicate how good their behavior is compared to the desired outcome, or state. For the described movements, for example, we can specify a reward function that is +1 when the door is successfully opened or the pen is at the desired orientation and 0 otherwise. But this makes the learning task complicated for the robot since it has to try out various motions before stumbling on the successful outcome, or a reward of +1.

The imitation learning (IL) paradigm was introduced to mitigate the amount of trial and error. In IL, the robot is provided with demonstrations of a given task performed by an expert from which it can try to learn the task and possibly gain information about the expert’s reward function, or the expert’s intent, similar to how people pick up various skills. Yet, learning remains difficult in instances where we only have access to the change enacted by the expert in the world, known as the expert observation, and not the precise actions the expert took to achieve the change. Another difficulty the robot faces is that even if it sees infinite expert demonstrations, it can’t fully reason about the intent of the expert—that is, compare whether one of its own learned behaviors is closer to the expert’s than another behavior—as it only knows the best behavior and has no notion of ordering over other behaviors.

Spotlight: On-demand video

AI Explainer: Foundation models ​and the next era of AI

Explore how the transformer architecture, larger models and more data, and in-context learning have helped advance AI from perception to creation.

In our paper “A Ranking Game for Imitation Learning,” being presented at Transactions on Machine Learning Research 2023 (TMLR), we propose a simple and intuitive framework, (texttt{rank-game}), that unifies learning from expert demonstrations and preferences by generalizing a key approach to imitation learning. Giving robots the ability to learn from preferences, obtained by having an expert rank which behavior aligns better with their objectives, allows the learning of more informative reward functions. Our approach, which enabled us to propose a new objective for training over behavior preferences, makes the learning process easier for a robot and achieves state-of-the-art results in imitation learning. It also enabled the training of a robot that can solve the tasks of opening a door and moving a pen between its fingers in simulation, a first in imitation learning with expert observations alone. The incorporation of preferences has also seen success in language modeling, where chatbots such as ChatGPT are improving themselves by learning a reward function inferred via preferences over several samples of model responses in addition to learning from desired human conversational data.

Robotics has found a place in controlled environments where the tasks at hand are well-defined and repeatable, such as on a factory floor. Our framework has the potential to help enable robot learning of tasks in more dynamic environments, such as helping people with daily chores around the home.

With (texttt{rank-game}), which combines learning from preferences and demonstrations via a two-player ranking-based game, robots in simulation were trained to manipulate a pen with a dexterous hand (left) and open a door with a parallel jaw gripper (right). The successful completion of these tasks marked a first in imitation learning with expert observations alone.

A ranking game for imitation learning

Inverse reinforcement learning (IRL) is a popular and effective method for imitation learning. IRL learns by inferring the reward function, also referred to as the intent of the expert, and a policy, which specifies what actions the agent—or, in our case, the robot—should take in a given state to successfully mimic the expert.

Notation: We use (pi) and (pi^E) to denote the policy of the agent and the expert, respectively, and (R_{gt}) to be the reward function of the expert, which is unknown to the agent/robot. (rho^pi) denotes the state-action/state visitation distribution of policy (pi) in the environment—the probabilistic collection of states the policy visits in the environment. We use (J(R;pi)) to denote the (textit{cumulative reward}), or the performance of policy (pi) under a reward function (R). We assume policy (pi) belongs to function class (Pi) and reward function R belongs to function class (mathcal{R}).

The goal of imitation learning is to make the agent have the same performance as the expert under the expert’s unknown reward function (R_{gt}). The classical IRL formulation tackles this by minimizing the imitation gap under a reward function that makes the performance gap the largest. We denote this framework by (texttt{imit-game}) and write it below formally:

(texttt{imit-game}(pi,pi^E): text{argmin}_{piinPi}text{max}_{Rinmathcal{R}} [mathbb{E}_{rho^E(s,a)}[R(s,a)]-mathbb{E}_{rho^pi(s,a)}[R(s,a)]])

Simply stated, the (texttt{imit-game}) tries to find a policy that has the lowest worst-case performance difference with the expert policy. This classical IRL formulation learns from expert demonstrations but provides no mechanism to incorporate learning from preferences. In our work, we ask, does IRL really need to consider the worst-case performance difference? We find that relaxing this requirement allows us to incorporate preferences.

Our proposed method treats imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to map more preferred behaviors to a higher total reward for each of the pairwise preferences, while the policy agent learns to maximize the performance on this reward function by interacting with the environment. Contrary to the classical IRL framework, the reward function now has to get only the rankings correct and not optimize for the worst case (see Figure 1).

A flow chart with, clockwise from top left, a green box labeled “policy agent,” a blue box labeled “reward agent,” and an orange box label “Dataset D,” which contains pairwise behavior rankings obtained from three sources. An arrow points from the policy agent to the dataset, indicating the policy’s contribution of rankings. An arrow pointing from the policy agent to the reward is labeled with the optimization strategy. An arrow pointing from the reward agent to the dataset is labeled with the ranking loss function.
Figure 1: The proposed (texttt{rank-game}) method treats imitation learning as a two-player ranking-based game between a policy and a reward. The policy agent maximizes the reward function by interacting with the environment. The reward agent satisfies a set of behavior rankings obtained from various sources: generated by the policy agent, automatically generated via data augmentation, or expert-annotated rankings obtained from a human or offline dataset.

To incorporate preferences, we need to quantify the behaviors in order to compare them. In this work, we choose the behaviors ((rho)) to be the state-action or state-only visitation distribution of the agent. A ranking between behaviors is used to specify that the expert would prefer one behavior over the other. A reward function that satisfies the behavior rankings ensures that the average return under a lower-ranked behavior is smaller than the higher-ranked behavior. More formally, the ranking game is defined as a game where the policy agent (pi) maximizes the expected return (J(R;pi)) of the policy under reward function (R) when deployed in the environment. The reward player takes the dataset of pairwise rankings (D^p) (rankings are denoted as (rho^ipreceqrho^j)) as an input and attempts to learn a reward function that satisfies those rankings using a ranking loss (denoted by (L(D^p;R))).

(underbrace{text{argmax}_{piinPi}J(R;pi)}_{text{Policy Agent}}~~~~~~~~~~~~~~~underbrace{text{argmin}_{Rinmathcal{R}}L(D^p;R)}_{text{Reward Agent}})

The ranking loss induces a reward function (R) that attempts to satisfy each pairwise preference in the dataset as follows:

(mathbb{E}_{rho^i}[R(s,a)]lemathbb{E}_{rho^j}[R(s,a)]~~,~~forall rho^ipreceqrho^j in D^p)

Generalizing prior imitation learning approaches with (texttt{rank-game})

The (texttt{rank-game}) framework neatly encapsulates prior work in IRL and prior work in learning from preferences, respectively. First, let’s see how classical IRL is a part of this framework. Recall that the classical IRL/(texttt{imit-game}) optimization can be written as:

(text{argmin}_{piinPi}text{max}_{Rinmathcal{R}} [mathbb{E}_{rho^E(s,a)}[R(s,a)]-mathbb{E}_{rho^pi(s,a)}[R(s,a)]])

The inner optimization learns a reward function that ensures that the return gap under the reward function is maximized between the current policy’s behavior and the expert behavior. Thus, (texttt{imit-game}) can be seen to be a special case of (texttt{rank-game}) with: (1) a ranking dataset that prefers expert behavior more than the current agent behavior and (2) a form of ranking loss that maximizes the performance gap (termed as (textit{supremum loss})). A number of prior methods in the imitation learning domain can be understood as special cases of (texttt{rank-game}) under various ranking losses, classes of reward functions, and abilities to incorporate preferences (see Figure 2).

A table with a summary of imitation learning (IL) methods demonstrating the data modalities they can handle (expert data and/or preferences), their ranking-loss functions, the assumptions they make on reward function, and whether they require availability of an external agent to provide preferences during training.  

  

The IL methods MaxEntIRL, AdRIL, GAN-GCL, GAIL, f-MAX, and AIRL don’t use offline preferences or active human query, enable Learning from Demonstration (LfD) when incorporating expert data, and use the supremum ranking loss function and a non-linear reward function. 

  

BCO, GAIfO, DACfO, OPOLO, and f-IRL don’t use offline preferences or active human query, enable Learning from Observation (LfO), and use the supremum ranking loss function and a non-linear reward function. 

  

TREX and DREX use offline preferences, the Bradley-Terry ranking loss function and a non-linear reward function; they don’t use active human query or enable LfO or LfD. 

  

BREX uses offline preferences, the Bradley-Terry ranking loss function, and a linear reward function; it doesn’t use active human query or enable LfO or LfD. 

  

DemPref uses offline preferences, the Bradley-Terry ranking loss function, a linear reward function, and active human query; it enables LfO and LfD. 

  

Ibarz et al. (2018) uses offline preferences, the Bradley-Terry ranking loss function, a non-linear reward function, and active human query; it enables LfD. 

  

Rank-game uses offline preferences, a new principled ranking loss that can naturally incorporate rankings provided by diverse sources, and a non-linear reward function; it enables LfO and LfD and doesn’t use active human query.
Figure 2: Previous methods that learn from expert demonstrations or preferences form a special case of (texttt{rank-game}) under a specific choice of ranking loss and a reward function class. Also noted in the table is whether a method enables learning from demonstration (LfD)—that is, learning from both expert states and actions—or learning from observations (LfO), where an agent learns from expert states alone.

Setting up the ranking game

To develop a framework that successfully combines learning from demonstrations and learning from preferences, we addressed several questions:

  1. What is the ranking loss function that allows for the reward to satisfy the preferences in the dataset?
  2. Where do we get the dataset of pairwise preferences?
  3. How can we effectively optimize this two-player game?

Step 1: A new ranking loss function for reward learning

Our proposed framework requires learning a reward function such that the rankings in the dataset are satisfied. While several loss functions exist in prior literature to enable this, such as Luce Shepard, Lovász-Bregman divergences, and the earlier discussed supremum loss, we introduce a new loss function:

(L_k(mathcal{D}^p;R) = mathbb{E}_{(rho^{pi^i},rho^{pi^j})sim mathcal{D}^p} Big[mathbb{E}_{s,asimrho^{pi^i}}{[(R(s,a)-0)^2]} + mathbb{E}_{s,asimrho^{pi^j}}{[(R(s,a)-k)^2]}Big])

The loss function is simple and intuitive: For all the preference pairs in the dataset, the less preferred behavior is regressed to a return of 0 and more preferred behavior is regressed to a return of user-defined parameter (k). This loss function allows us to learn a reward function with user-defined scale (k), which plays an important role in enabling better policy optimization; it’s principled and facilitates near-optimal imitation learning; and by design, it allows us to incorporate preferences.

Step 2: Getting the ranking dataset

Besides giving more information about the expert’s intent and being easy to obtain, another benefit of preferences is that they can also help learn a more informative, or shaped, reward function. This form of reward shaping can provide better guidance for policy optimization, reducing the burden of exploring the environment to find the optimal policy and increasing sample efficiency for IRL. Our initial ranking dataset is generated by the policy agent from its interactions with the environment; we always prefer expert’s behavior to be better or equal to current policy’s behavior in the rankings. To further leverage the benefits of preferences, we consider two methods for augmenting this ranking dataset:

  • Expert-annotated rankings: In situations where we have access to additional rankings, provided by humans or obtained from reward-annotated datasets, we can simply add them to our ranking dataset.
  • Automatically generated rankings: It turns out we can improve learning efficiency for imitation by using the rankings already present in the dataset of pairwise preferences to generate more preferences in a procedure similar to Mixup regularization in trajectory space.

Step 3: Improving optimization stability with Stackelberg game

Prior work has found the Stackelberg game framework to be a strong candidate for optimizing two-player games in various applications. A Stackelberg game is a bi-level optimization problem:

(text{max}_x (f(x,y_x)),~~~~text{s.t}~~y_xin text{min}_x(g(x,y)))

In this optimization, we have two players, the leader \(x\) and the follower \(y\). The leader tries to maximize its payoff \(f\), while the follower tries to minimize its payoff \(g\). We cast \(\texttt{rank-game}\) as a Stackelberg game and propose two algorithms depending on which player is set to be the leader:

  • Policy as Leader (PAL): \(\max_\pi J(R;\pi) \quad \text{s.t.} \quad R = \arg\min_R L(D^p;R)\)
  • Reward as Leader (RAL): \(\min_R L(D^p;R) \quad \text{s.t.} \quad \pi = \arg\max_\pi J(R;\pi)\)

Aside from improving training stability, both methods have complementary benefits in the non-stationary imitation learning setting. PAL can adjust more quickly when the intent of the expert changes, while RAL can handle environmental changes better.
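To make the leader and follower structure concrete, here is a toy bi-level problem solved numerically. It is a hedged sketch, not the PAL or RAL algorithm: the quadratic payoffs, the closed-form follower response, and all function names are assumptions chosen so the answer can be checked by hand. The point it illustrates is that the leader differentiates through the follower's best response rather than treating the follower's move as fixed.

```python
def follower_best_response(x: float) -> float:
    """Follower minimizes g(x, y) = (y - x)^2, so its best response is y = x."""
    return x


def leader_payoff(x: float, y: float) -> float:
    """Leader maximizes f(x, y) = -(x - 1)^2 - y^2 (toy payoff, assumed)."""
    return -((x - 1.0) ** 2) - y ** 2


def solve_stackelberg(x0: float = 0.0, lr: float = 0.1,
                      steps: int = 200, eps: float = 1e-6):
    """Gradient ascent on x -> f(x, y_x), using a central finite difference
    so the follower's reaction to x is included in the gradient."""
    x = x0
    for _ in range(steps):
        up = leader_payoff(x + eps, follower_best_response(x + eps))
        down = leader_payoff(x - eps, follower_best_response(x - eps))
        x += lr * (up - down) / (2.0 * eps)
    return x, follower_best_response(x)


x_star, y_star = solve_stackelberg()
print(round(x_star, 4), round(y_star, 4))
```

Substituting y = x into the leader's payoff gives -(x - 1)^2 - x^2, which is maximized at x = 0.5. A leader that ignored the follower's reaction would instead drift toward x = 1.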

How well does \(\texttt{rank-game}\) perform in practice?

In testing the capabilities of \(\texttt{rank-game}\), one of the scenarios we consider is learning from observations alone (LfO), a setting in which only expert observations are provided, with no expert actions. This more challenging setting better reflects the conditions robots will operate under if we want them to be more widely deployed in both controlled and dynamic environments: people can more naturally provide demonstrations by performing tasks themselves (observations only) than by performing the task indirectly by operating a robot (observations and precise actions). We investigate the LfO performance of \(\texttt{rank-game}\) on simulated locomotion tasks such as hopping, walking, and running and benchmark it against representative baselines. \(\texttt{Rank-game}\) approaches require fewer environment interactions to succeed and outperform recent methods in final performance and training stability.

Additionally, our experiments reveal that none of the prior LfO methods can solve complex manipulation tasks such as door opening with a parallel jaw gripper and pen manipulation with a dexterous hand. This failure is potentially a result of the exploration requirements of LfO, which are high because of the unavailability of expert actions coupled with the fact that in these tasks observing successes is rare.

In this setting, we show that using only a handful of expert-annotated preferences in the \(\texttt{rank-game}\) framework allows us to solve these tasks. Expert data alone is not enough to solve them; adding preferences is key.

Next steps

Equipping agents to learn from different sources of information present in the world is a promising direction toward more capable agents that can better assist people in the dynamic environments in which they live and work. The \(\texttt{rank-game}\) framework has the potential to be extended directly to the setting where humans present their preferences interactively as the robot is learning. There are some promising future directions and open questions for researchers interested in this work. First, preferences obtained in the real world are usually noisy, and one limitation of \(\texttt{rank-game}\) is that it does not suggest a way to handle noisy preferences. Second, \(\texttt{rank-game}\) proposes modifications to learn a reward function amenable to policy optimization, but the hyperparameters behind these modifications are set manually. Future work can explore methods to automate such learning of reward functions. Third, despite learning effective policies, we observed that \(\texttt{rank-game}\) did not learn reusable, robust reward functions.

For additional details, including experiments in the learning from demonstration (LfD) setting, non-stationary imitation setting, and further framework analysis, check out the paper, project page, code, and video presentation.

Acknowledgments

This research was supported in part by the National Science Foundation, Air Force Office of Scientific Research, and Army Research Office.

The post Unifying learning from preferences and demonstration via a ranking game for imitation learning appeared first on Microsoft Research.
