
Using reinforcement learning to identify high-risk states and treatments in healthcare

Figure at the start of a maze showing several paths. Four paths include a medical dead-end, and each stops before reaching the end. Only one path does not include a medical dead-end, and this one goes clear through to the end.

As the pandemic overburdens medical facilities and clinicians become increasingly overworked, the ability to quickly decide on the best possible treatment is even more critical. In urgent health situations, such decisions can mean life or death. However, certain treatment protocols can pose a considerable risk to patients with serious medical conditions and can potentially contribute to unintended outcomes.

In this research project, we built a machine learning (ML) model that works with scenarios where data is limited, such as healthcare. This model was developed to recognize treatment protocols that could contribute to negative outcomes and to alert clinicians when a patient’s health could decline to a dangerous level. You can explore the details of this research project in our research paper, “Medical Dead-ends and Learning to Identify High-risk States and Treatments,” which was presented at the 2021 Conference on Neural Information Processing Systems (NeurIPS 2021).

Reinforcement learning for healthcare

To build our model, we decided to use reinforcement learning—an ML framework that’s uniquely well-suited for advancing safety-critical domains such as healthcare. This is because at its core, healthcare is a sequential decision-making domain, and reinforcement learning is the formal paradigm for modeling and solving problems in such domains. In healthcare, clinicians base their treatment decisions on an overall understanding of a patient’s health; they observe how the patient responds to this treatment, and the process repeats. Likewise, in reinforcement learning, an algorithm, or agent, interprets the state of its environment and takes an action, which, coupled with the internal dynamics of the environment, causes it to transition to a new state, as shown in Figure 1. A reward signal is then assigned to account for the immediate impact of this change. For example, in a healthcare scenario, if a patient recovers or is discharged from the intensive care unit (ICU), the agent may receive a positive reward. However, if the patient does not survive, the agent receives a negative reward, or penalty.
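To make the analogy concrete, the following is a minimal, schematic sketch of the agent-environment loop described above. The environment, the state variables, and the reward values are hypothetical placeholders for illustration only; they are not the clinical model used in this work.

```python
# A minimal, schematic sketch of the reinforcement learning loop described above.
# The environment, state variables, and reward values are hypothetical placeholders.

class ToyICUEnvironment:
    """Toy stand-in for patient dynamics; real transition probabilities are unknown
    and only observed through previously collected patient trajectories."""

    def reset(self):
        # Initial state s: a simplified snapshot of the patient's condition.
        return {"heart_rate": 95, "blood_pressure": 110, "sofa": 4}

    def step(self, state, action):
        # Returns (next_state, reward, terminal). A positive reward is given on
        # recovery/discharge and zero reward at all intermediate transitions.
        next_state = dict(state)
        next_state["sofa"] = max(0, state["sofa"] - 1)  # placeholder dynamics
        if next_state["sofa"] == 0:
            return next_state, +1.0, True
        return next_state, 0.0, False


def run_episode(env, policy, max_steps=18):  # 18 four-hour steps = 72 hours
    state, total_reward = env.reset(), 0.0
    for _ in range(max_steps):
        action = policy(state)                             # agent/clinician selects a treatment a
        state, reward, terminal = env.step(state, action)  # observe the next state and reward R
        total_reward += reward
        if terminal:
            break
    return total_reward


# Example usage with a trivial placeholder policy:
print(run_episode(ToyICUEnvironment(), policy=lambda s: "iv_fluid_low"))
```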

Figure 1: Sequential decision-making in healthcare: Clinicians or AI agents observe the state of the patient \(s\), select a treatment \(a\), and monitor the next state. The process then repeats. As a result of each such transition of the patient’s state (whose probability is denoted by \(T\)), a reward signal \(R\) is observed, which accounts for the immediate consequence of the applied treatment.

Reinforcement learning is widely used in gaming, for example, to determine the best sequence of chess moves and maximize an AI system’s chances of winning. Over time, due to trial-and-error experimentation, the desired actions are maximized and the undesired ones are minimized until the optimal solution is identified. Normally, this experimentation is made possible by the proactive collection of extensive amounts of diverse data. However, unlike in gaming, exploratory data collection and experimentation are not possible in healthcare, and our only option in this realm is to work with previously collected datasets, providing very limited opportunities to explore alternative choices. This is where offline reinforcement learning comes into focus. A subarea of reinforcement learning, offline reinforcement learning works only with data that already exists—instead of proactively taking in new data, we’re using a fixed dataset. Even so, to propose the best course of action, an offline reinforcement learning algorithm still requires sufficient trial-and-error with alternatives, and this necessitates a very large dataset, something not feasible in safety-critical domains with limited data, like healthcare.

In the current research literature, when reinforcement learning is applied to healthcare, the focus is on what to do to support the best possible patient outcome, an objective that is often infeasible to pursue reliably with limited offline data. In our paper, we propose inverting this paradigm in offline settings to investigate high-risk treatments and to identify when the state of a patient’s health reaches a critical point. To enable this approach, we developed a methodology called Dead-end Discovery (DeD), which identifies treatments to avoid in order to prevent a medical dead-end—the point at which the patient is most likely to die regardless of future treatment. DeD provably requires exponentially less data than standard methods, making it significantly more reliable in limited-data situations. By identifying known high-risk treatments, DeD could assist clinicians in making trustworthy decisions in highly stressful situations, where minutes count. Moreover, this methodology could raise an early warning flag and alert clinicians when a patient’s condition carries significant risk, often before that risk becomes obvious. We go into more detail on the DeD methodology later in this post.

Medical dead-ends and rescue states

In the ICU, each patient experiences a trajectory that sequentially tracks the state of their health. It starts with the patient’s condition upon admission, followed by the administration of treatment and then by their response to the treatment. This sequence repeats until the patient reaches a terminal state—the final observation of the patient’s condition that’s still relevant within the ICU. To learn what treatments to avoid, we focus on two types of terminal states: patient recovery and patient death. Other terminal states can also exist. For example, when playing chess, wins and losses are not the only possible outcomes; draws can also occur. While our framework can encompass additional terminal states, this work focuses on only two possibilities: positive outcomes and negative outcomes.

Building on these two terminal states, we define medical dead-ends as patient states from which all possible future trajectories will lead to the terminal state of the patient’s death. In acute care settings, it’s critical both to avoid medical dead-ends and to identify the probability with which any selected treatment will lead to them. It’s also important to note that medical dead-ends can occur considerably earlier than clinicians are able to observe them. This makes DeD particularly valuable, as every hour counts when it comes to critical conditions.

To contrast with medical dead-ends, we also propose the concept of rescue states, from which recovery remains fully reachable. At each rescue state, there exists at least one treatment that would lead, with probability 1, either to another rescue state or to recovery. In most cases, a patient’s condition is neither a medical dead-end nor a rescue state, as the relevant probabilities of future mortality and recovery are not exactly 0 or 1 but somewhere in between. Therefore, it’s important to have an alert when a patient is likely to enter a medical dead-end.

Figure 2: Using sepsis as an example use case, this diagram shows simplified possible trajectories for a single patient upon admission to the ICU. Each branch represents the septic patient’s trajectory in response to a sample sequence of treatments, represented by a black dot (VP = vasopressor + IV = intravenous fluid). Avatars with blue borders and “RS” above them represent rescue states. Avatars with red borders and “MD” above them represent medical dead-ends. The shading of each avatar roughly indicates the state of the patient’s condition in response to treatment. More shading represents an improving condition and less shading represents a worsening condition. No shading represents the terminal state where the patient does not survive. The slumping avatar represents a medical dead-end, which is significantly far from the terminal state and may not be observable by the clinicians. A critical point here is one step before this medical dead-end, represented by the grey avatar, where there is still a chance to save the patient.  
Patient vital signs taken at the ICU: HR=heart rate; BP=blood pressure; RR=respiration rate; SOFA=sequential organ failure assessment score  

Treatment security: How to help doctors

To develop our model, we considered a generic condition that guarantees the merit and reliability of a given treatment-selection policy. In particular, we postulated the following condition we called treatment security:

If at state \(s\), treatment \(a\) causes transitioning to a medical dead-end with any given level of certainty, then the policy must refrain from selecting \(a\) at \(s\) with the same level of certainty.

For example, if a certain treatment leads to a medical dead-end or immediate death with a probability of more than 80 percent, that treatment should be selected for administration no more than 20 percent of the time.
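Read this way, treatment security is simply a cap on how often a treatment may be selected. The check below is a minimal illustration of that cap, using the hypothetical numbers from the example above.

```python
def is_secure_selection(selection_prob, dead_end_prob):
    """Treatment security as a cap on selection probability: if a treatment leads to
    a medical dead-end (or immediate death) with probability `dead_end_prob`, the
    policy may select it with probability at most 1 - dead_end_prob."""
    return selection_prob <= 1.0 - dead_end_prob


# The example from the text: a treatment with an 80 percent chance of leading to a
# medical dead-end should be selected no more than 20 percent of the time.
assert is_secure_selection(selection_prob=0.15, dead_end_prob=0.80)       # secure
assert not is_secure_selection(selection_prob=0.35, dead_end_prob=0.80)   # not secure
```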

While treatment security is a desired property, it’s not easy to enforce directly because the required probabilities are not known a priori, nor are they directly measurable from the data. Therefore, at the core of our method, we developed a theoretical framework that enables treatment security to be learned from data by mapping it to appropriate learning problems.

DeD: Dead-end Discovery methodology

To precisely define the learning problems, we based our DeD methodology on three core ideas: 1) separating the outcomes, 2) learning the optimal value function of each outcome in isolation without discounting, and 3) proving important properties for these particular value functions, which enable treatment security.

We constructed two simple reward signals for independent learning problems:

  1. -1 in the case of a negative outcome; 0 at all other transitions
  2. +1 in the case of a positive outcome; 0 at all other transitions

Next, we learned their corresponding optimal value functions, \(Q_{D}^{*}(s, a)\) and \(Q_{R}^{*}(s, a)\), both with no discounting. It turns out that these value functions are intrinsically important. In fact, we show that:

\(-Q_{D}^{*}(s, a)\) corresponds to the minimum probability of a future negative outcome if treatment \(a\) is selected at state \(s\). Equivalently, \(1 + Q_{D}^{*}(s, a)\) corresponds to the maximum hope of a positive outcome.

Moreover, the quantity \(1 + Q_{D}^{*}(s, a)\) provides a meaningful threshold for making a policy secure. We formally show that, for treatment security, it is sufficient to abide by the maximum hope of recovery.

We further proved that if the probability of selecting each treatment is kept no higher than \(Q_{R}^{*}(s, a)\), the patient is guaranteed to remain in rescue states when possible. Finally, we also showed that such thresholds for limiting the treatment selection probabilities exist.
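To make these quantities concrete, here is a minimal sketch of how the two reward signals can be attached to recorded transitions and how the corresponding undiscounted value functions might be fit offline. The network architecture, dimensions, and update rule are illustrative assumptions and not the exact training code released with the paper.

```python
import torch
import torch.nn as nn

# Reward construction for the two independent learning problems (no discounting, gamma = 1):
def reward_d(outcome):   # negative-outcome problem: -1 on death, 0 at all other transitions
    return -1.0 if outcome == "negative" else 0.0

def reward_r(outcome):   # positive-outcome problem: +1 on recovery, 0 at all other transitions
    return 1.0 if outcome == "positive" else 0.0


class QNetwork(nn.Module):
    """Illustrative state-action value network over 25 discrete treatment options,
    taking a 44-dimensional observation vector as the state (sizes match the cohort
    described later in this post; the architecture itself is an assumption)."""

    def __init__(self, state_dim=44, n_treatments=25):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_treatments))

    def forward(self, state):
        return self.net(state)


def bellman_target(q_net, reward, next_state, terminal):
    """Undiscounted Bellman-optimality target computed purely from logged transitions.
    One network is trained with reward_d (giving Q_D) and another with reward_r (giving Q_R)."""
    if terminal:
        return torch.tensor(reward)
    return reward + q_net(next_state).max().detach()
```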

Building from these results, we defined a training and deployment pipeline, illustrated in Figure 3.

Figure 3: The DeD pipeline: section a illustrates the training process, resulting in the learned optimal value functions, and section b shows the deployment of the pipeline, which ends with providing critical information to the human decision-maker.

Applying the DeD methodology to sepsis

To demonstrate the utility of DeD in safety-critical domains and to honor the underlying healthcare motivations behind its development, we applied DeD to publicly available real-world medical data. Specifically, our data pertained to critically ill patients who had developed sepsis and were treated in an ICU.

Sepsis is a syndrome characterized by organ dysfunction due to a patient’s dysregulated response to an infection. In the United States alone, sepsis is responsible for more than 200,000 deaths each year, contributing to over 10 percent of in-hospital mortality, and accounting for over $23 billion in hospitalization costs. Globally, sepsis is a leading cause of mortality, with an estimated 11 million deaths each year, accounting for almost 20 percent of all deaths. It’s also an end-stage to many health conditions. In a recent retrospective study of hospitalized COVID-19 patients, all the fatal cases and more than 40 percent of survivors were septic.

In our study, we envisioned a way to help clinicians identify which subset of treatments could statistically cause further health deterioration so that they could eliminate those treatments when deciding on the next steps. To estimate the value functions of possible treatments, we used the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) dataset (v1.4), sourced from the Beth Israel Deaconess Medical Center in Boston, Massachusetts. MIMIC-III comprises deidentified electronic health records (EHR) of consenting patients admitted to critical care units, collected from 53,423 distinct hospital admissions between 2001 and 2012. Following standard extraction and preprocessing methods, we derived an experimental cohort of 19,611 patients who are presumed to have developed sepsis during their initial admission to the ICU, with an observed mortality rate of approximately 10 percent. We studied 72 hours of each patient’s stay in the ICU—24 hours before the presumed onset of sepsis and 48 hours afterwards. We used 44 observation variables, including various health records and demographic information, and 25 distinct treatment options (five discrete levels for IV fluid and vasopressor volumes in combination), aggregated over four-hour windows.

With this dataset, we sought to demonstrate that medical dead-ends exist in medical data and show the effect of treatment selection on the development of medical dead-ends. We also sought to identify whether alternative treatments were available that could have prevented the occurrence of a medical dead-end.

To flag potentially nonsecure treatments, we examined whether the estimated values, \(Q_{D}(s, a)\) and \(Q_{R}(s, a)\), for each treatment passed certain thresholds. To flag potential medical dead-end states, we looked at the median values of available treatments against these same thresholds. Using the median helped mitigate approximation errors due to generalization from potentially insufficient data and extrapolations made by the reinforcement learning formulation. With the specified thresholds, DeD identified increasing percentages of patients raising fatal flags, particularly among the subpopulation that died in the hospital. In Figure 4, note the distinctive difference between the trend of estimated values for surviving and non-surviving patients. Over the course of 72 hours in the ICU, surviving patients rarely raised a flag, while flags were raised at an increasing rate for patients who did not survive as they proceeded toward the final observations of their time in the ICU.
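As an illustration, the flagging rule just described can be sketched as follows. The threshold values here are hypothetical placeholders; the paper specifies the settings actually used.

```python
import numpy as np

def flag_treatments(q_d, q_r, delta_d=-0.15, delta_r=0.85):
    """Flag potentially nonsecure treatments whose estimated values cross the thresholds.
    `q_d` and `q_r` hold the estimated Q_D(s, a) and Q_R(s, a) for all treatment options
    at the current state; the thresholds are hypothetical placeholders."""
    q_d, q_r = np.asarray(q_d), np.asarray(q_r)
    return np.flatnonzero((q_d < delta_d) | (q_r < delta_r))

def flag_state(q_d, q_r, delta_d=-0.15, delta_r=0.85):
    """Flag a potential medical dead-end state by comparing the median value of the
    available treatments against the same thresholds; the median mitigates
    approximation errors from limited data."""
    return (np.median(q_d) < delta_d) or (np.median(q_r) < delta_r)

# Example usage with made-up value estimates for five treatments:
print(flag_treatments([-0.1, -0.3, -0.05, -0.2, -0.5], [0.9, 0.7, 0.95, 0.8, 0.6]))
print(flag_state([-0.1, -0.3, -0.05, -0.2, -0.5], [0.9, 0.7, 0.95, 0.8, 0.6]))
```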

Figure 4: Histograms of the flag status for both surviving and non-surviving patients, according to the rescue state and medical dead-end values. The bars are plotted according to the time prior to the recorded terminal state and measure the percentage of patients whose states did not raise any flags. There is a clear worsening trend for non-surviving patients as they approached a terminal state, beginning as early as 48 hours prior to expiration.

To further support our hypothesis that medical dead-ends exist among septic patients and may be preventable, we aligned patients according to the point in their care when a flag was first raised by our DeD framework. As shown in Figure 5, we selected all trajectories with at least 24 hours prior to and 16 hours after this flag. The DeD estimates of the \(V\) and \(Q\) values for administered treatments had similar behavior in both the surviving and non-surviving subpopulations prior to this first flag, but the values quickly diverged afterwards. We observed that the advent of this first flag also corresponded to a similar divergence among various clinical measures and vital signs, shown in Figure 5, sections a and b.

DeD identified a clear critical point in these patients’ care, where non-surviving patients experienced an irreversible negative change to their health, as shown in Figure 5, section c. Additionally, there was a significant gap in the estimated value between the treatments administered to the non-surviving patients and those treatments deemed to be more secure by DeD, shown in Figure 5, section e. There was a clear inflection in the estimated values four to eight hours before this first flag was raised, shown in Figure 5, section c.

Figure 5: Trend of measures around the first raised flag: Various measures are shown 24 hours (6 steps, 4 hours each) before the first flag is raised and 16 hours (4 steps) afterwards for non-surviving (blue) and surviving (green) patients. The shaded areas represent the standard deviation. Section a shows selected key vital measures and lab tests, section b shows established clinical measures, and section c shows DeD value estimates of health state (V) and administered treatment (Q). Section d shows the administered treatments. Finally, the last column, e, illustrates value trends for the selected treatments as well as the most secure ones.

Further analysis of our results, which we describe in detail in our paper, indicates that more than 12 percent of treatments given to non-surviving patients could be detrimental 24 hours before death. We also identified that 2.7 percent of non-surviving patients entered medical dead-end trajectories, with a sharply increasing rate up to 48 hours before death, and close to 10 percent when we slightly relaxed our thresholds for predicting medical dead-ends. While these percentages may seem small, more than 200,000 patients die of sepsis every year in US hospitals alone, and any reduction of this rate could translate into tens of thousands of individuals who might not otherwise survive. We’re excited about the possibility that DeD could help clinicians provide their patients with the best care and that many more patients could potentially survive sepsis.

Looking ahead: Further uses of DeD and offline reinforcement learning

We view DeD as a powerful tool that could magnify human expertise in healthcare by supporting clinicians with predictive models as they make critical decisions. There is significant potential for researchers to use the DeD method to expand on this research and look at other measures, such as the relationship between patient demographics and sepsis treatment, with the goal of preventing certain treatment profiles for particular subgroups of patients.

The principles of offline reinforcement learning and the DeD methodology can also be applied to other clinical conditions, as well as to safety-critical areas beyond healthcare that also rely on sequential decision-making. For example, the domain of finance entails similar core concepts as it is analogously based on sequential decision-making processes. DeD could be used to alert financial professionals when specific actions, such as buying or selling certain assets, are likely to result in unavoidable future loss, or a financial dead-end. We hope our work will inspire active research and discussion in the community. You can learn more about the research and access the code here.

Disclaimer: The research presented here, including the referenced paper, code, and models, is shared for research purposes only. It is not to be used in clinical settings, as a stand-alone tool, or as a replacement for the decisions of expert medical professionals. The algorithm and technology presented here, and any derivatives of it, should not be used to make clinical decisions, including, but not limited to, decisions about the medical treatment of patients. In addition, further testing and validation are required before the DeD framework may be used in any clinical setting, including, but not limited to, understanding how the information provided by the DeD framework affects clinician care and patient outcomes over time, neither of which have been studied here.




Advancing AI trustworthiness: Updates on responsible AI research

blue graphic with a light honeycomb pattern background featuring a lightbulb in the middle and various icons around it: handshake, eye, connected people, balanced scale, lock, and shield

Editor’s note: This year in review is a sampling of responsible AI research compiled by Aether, a Microsoft cross-company initiative on AI Ethics and Effects in Engineering and Research, as outreach from their commitment to advancing the practice of human-centered responsible AI. Although each paper includes authors who are participants in Aether, the research presented here expands beyond, encompassing work from across Microsoft, as well as with collaborators in academia and industry. 

Chief Scientific Officer Eric Horvitz: Efforts to make AI systems worthy of trust are a critical part of building valuable AI applications

Inflated expectations around the capabilities of AI technologies may lead people to believe that computers can’t be wrong. The truth is AI failures are not a matter of if but when. AI is a human endeavor that combines information about people and the physical world into mathematical constructs. Such technologies typically rely on statistical methods, with the possibility for errors throughout an AI system’s lifespan. As AI systems become more widely used across domains, especially in high-stakes scenarios where people’s safety and wellbeing can be affected, a critical question must be addressed: how trustworthy are AI systems, and how much and when should people trust AI? 

As part of their ongoing commitment to building AI responsibly, research scientists and engineers at Microsoft are pursuing methods and technologies aimed at helping builders of AI systems cultivate appropriate trust—that is, building trustworthy models with reliable behaviors and clear communication that set proper expectations. When AI builders plan for failures, work to understand the nature of the failures, and implement ways to effectively mitigate potential harms, they help engender trust that can lead to a greater realization of AI’s benefits. 

Pursuing trustworthiness across AI systems captures the intent of multiple projects on the responsible development and fielding of AI technologies. Numerous efforts at Microsoft have been nurtured by its Aether Committee, a coordinative cross-company council composed of working groups focused on technical leadership at the frontiers of innovation in responsible AI. The effort is led by researchers and engineers at Microsoft Research and from across the company and is chaired by Chief Scientific Officer Eric Horvitz. Beyond research, Aether has advised Microsoft leadership on responsible AI challenges and opportunities since the committee’s inception in 2016.


  • Explore the HAX Toolkit: The Human-AI eXperience (HAX) Toolkit helps builders of AI systems create fluid, responsible human-AI experiences.

  • Explore the Responsible AI Toolbox: Customizable dashboards that help builders of AI systems identify, diagnose, and mitigate model errors, as well as debug models and understand causal relationships in data.

The following is a sampling of research from the past year representing efforts across the Microsoft responsible AI ecosystem that highlight ways for creating appropriate trust in AI. Facilitating trustworthy measurement, improving human-AI collaboration, designing for natural language processing (NLP), advancing transparency and interpretability, and exploring the open questions around AI safety, security, and privacy are key considerations for developing AI responsibly. The goal of trustworthy AI requires a shift in perspective at every stage of the AI development and deployment life cycle. We’re actively developing a growing number of best practices and tools to help with the shift to make responsible AI more available to a broader base of users. Many open questions remain, but as innovators, we are committed to tackling these challenges with curiosity, enthusiasm, and humility. 

Facilitating trustworthy measurement

Emre Kiciman, co-chair of the Aether Security working group: Ensuring our measurements capture what we think they’re capturing

AI technologies influence the world through the connection of machine learning models—that provide classifications, diagnoses, predictions, and recommendations—with larger systems that drive displays, guide controls, and activate effectors. But when we use AI to help us understand patterns in human behavior and complex societal phenomena, we need to be vigilant. By creating models for assessing or measuring human behavior, we’re participating in the very act of shaping society. Guidelines for ethically navigating technology’s impacts on society—guidance born out of considering technologies for COVID-19—prompt us to start by weighing a project’s risk of harm against its benefits. Sometimes an important step in the practice of responsible AI may be the decision to not build a particular model or application. 

Human behavior and algorithms influence each other in feedback loops. In a recent Nature publication, Microsoft researchers and collaborators emphasize that existing methods for measuring social phenomena may not be up to the task of investigating societies where human behavior and algorithms affect each other. They offer five best practices for advancing computational social science. These include developing measurement models that are informed by social theory and that are fair, transparent, interpretable, and privacy preserving. For trustworthy measurement, it’s crucial to document and justify the model’s underlying assumptions, plus consider who is deciding what to measure and how those results will be used.

5 Best practices for measuring algorithmically infused societies
Source: Adapted from Nature

In line with these best practices, Microsoft researchers and collaborators have proposed measurement modeling as a framework for anticipating and mitigating fairness-related harms caused by AI systems. This framework can help identify mismatches between theoretical understandings of abstract concepts—for example, socioeconomic status—and how these concepts get translated into mathematics and code. Identifying mismatches helps AI practitioners to anticipate and mitigate fairness-related harms that reinforce societal biases and inequities. A study applying a measurement modeling lens to several benchmark datasets for surfacing stereotypes in NLP systems reveals considerable ambiguity and hidden assumptions, demonstrating (among other things) that datasets widely trusted for measuring the presence of stereotyping can, in fact, cause stereotyping harms.

Flaws in datasets can lead to AI systems with unfair outcomes, such as poor quality of service or denial of opportunities and resources for different groups of people. AI practitioners need to understand how their systems are performing for factors like age, race, gender, and socioeconomic status so they can mitigate potential harms. In identifying the decisions that AI practitioners must make when evaluating an AI system’s performance for different groups of people, researchers highlight the importance of rigor in the construction of evaluation datasets. 

Making sure that datasets are representative and inclusive means facilitating data collection from different groups of people, including people with disabilities. Mainstream AI systems are often non-inclusive. For example, speech recognition systems do not work for atypical speech, while input devices are not accessible for people with limited mobility. In pursuit of inclusive AI, a study proposes guidelines for designing an accessible online infrastructure for collecting data from people with disabilities, one that is built to respect, protect, and motivate those contributing data. 


Improving human-AI collaboration

Ece Kamar, Aether technical advisor and co-chair of the Aether Reliability and Safety working group: Investing in research and new techniques for effective human-AI partnership

When people and AI collaborate on solving problems, the benefits can be impressive. But current practice can be far from establishing a successful partnership between people and AI systems. A promising advance and direction of research is developing methods that learn about ideal ways to complement people with problem solving. In the approach, machine learning models are optimized to detect where people need the most help versus where people can solve problems well on their own. We can additionally train the AI systems to make decisions as to when a system should ask an individual for input and to combine the human and machine abilities to make a recommendation. In related work, studies have shown that people will too often accept an AI system’s outputs without question, relying on them even when they are wrong. Exploring how to facilitate appropriate trust in human-AI teamwork, experiments with real-world datasets for AI systems show that retraining a model with a human-centered approach can better optimize human-AI team performance. This means taking into account human accuracy, human effort, the cost of mistakes—and people’s mental models of the AI. 

In systems for healthcare and other high-stakes scenarios, a break with the user’s mental model can have severe impacts. An AI system can compromise trust when, after an update for better overall accuracy, it begins to underperform in some areas. For instance, an updated system for predicting cancerous skin moles may have an increase in accuracy overall but a significant decrease for facial moles. A physician using the system may either lose confidence in the benefits of the technology or, with more dire consequences, may not notice this drop in performance. Techniques for forcing an updated system to be compatible with a previous version produce tradeoffs in accuracy. But experiments demonstrate that personalizing objective functions can improve the performance-compatibility tradeoff for specific users by as much as 300 percent.

System updates can have grave consequences when it comes to algorithms used for prescribing recourse, such as how to fix a bad credit score to qualify for a loan. Updates can lead to people who have dutifully followed a prescribed recourse being denied their promised rights or services and damaging their trust in decision makers. Examining the impact of updates caused by changes in the data distribution, researchers expose previously unknown flaws in the current recourse-generation paradigm. This work points toward rethinking how to design these algorithms for robustness and reliability. 

Complementarity in human-AI performance, where the human-AI team performs better together by compensating for each other’s weaknesses, is a goal for AI-assisted tasks. You might think that if a system provided an explanation of its output, this could help an individual identify and correct an AI failure, producing the best of human-AI teamwork. Surprisingly, and in contrast to prior work, a large-scale study shows that explanations may not significantly increase human-AI team performance. People often over-rely on recommendations even when the AI is incorrect. This is a call to action: we need to develop methods for communicating explanations that increase users’ understanding rather than to just persuade. 


Designing for natural language processing 

Hanna Wallach, Aether technical advisor and co-chair of the Aether Fairness and Inclusiveness working group: Developing natural language processing models in a responsible manner

The allure of natural language processing’s potential, including rash claims of human parity, raises questions of how we can employ NLP technologies in ways that are truly useful, as well as fair and inclusive. To further these and other goals, Microsoft researchers and collaborators hosted the first workshop on bridging human-computer interaction and natural language processing, considering novel questions and research directions for designing NLP systems to align with people’s demonstrated needs. 

Language shapes minds and societies. Technology that wields this power requires scrutiny as to what harms may ensue. For example, does an NLP system exacerbate stereotyping? Does it exhibit the same quality of service for people who speak the same language in different ways? A survey of 146 papers analyzing “bias” in NLP observes rampant pitfalls of unstated assumptions and conceptualizations of bias. To avoid these pitfalls, the authors outline recommendations based on the recognition of relationships between language and social hierarchies as fundamentals for fairness in the context of NLP. We must be precise in how we articulate ideas about fairness if we are to identify, measure, and mitigate NLP systems’ potential for fairness-related harms. 

The open-ended nature of language—its inherent ambiguity, context-dependent meaning, and constant evolution—drives home the need to plan for failures when developing NLP systems. Planning for NLP failures with the AI Playbook introduces a new tool for AI practitioners to anticipate errors and plan human-AI interaction so that the user experience is not severely disrupted when errors inevitably occur. 


Improving transparency

Jenn Wortman Vaughan, co-chair of the Aether Transparency working group: Providing stakeholders with an appropriate understanding of how AI systems work

To build AI systems that are reliable and fair—and to assess how much to trust them—practitioners and those using these systems need insight into their behavior. If we are to meet the goal of AI transparency, the AI/ML and human-computer interaction communities need to integrate efforts to create human-centered interpretability methods that yield explanations that can be clearly understood and are actionable by people using AI systems in real-world scenarios. 

As a case in point, experiments investigating whether simple models that are thought to be interpretable achieve their intended effects rendered counterintuitive findings. When participants used an ML model considered to be interpretable to help them predict the selling prices of New York City apartments, they had difficulty detecting when the model was demonstrably wrong. Providing too many details of the model’s internals seemed to distract and cause information overload. Another recent study found that even when an explanation helps data scientists gain a more nuanced understanding of a model, they may be unwilling to make the effort to understand it if it slows down their workflow too much. As both studies show, testing with users is essential to see if people clearly understand and can use a model’s explanations to their benefit. User research is the only way to validate what is or is not interpretable by people using these systems.

Explanations that are meaningful to people using AI systems are key to the transparency and interpretability of black-box models. Introducing a weight-of-evidence approach to creating machine-generated explanations that are meaningful to people, Microsoft researchers and colleagues highlight the importance of designing explanations with people’s needs in mind and evaluating how people use interpretability tools and what their understanding is of the underlying concepts. The paper also underscores the need to provide well-designed tutorials.

Traceability and communication are also fundamental for demonstrating trustworthiness. Both AI practitioners and people using AI systems benefit from knowing the motivation and composition of datasets. Tools such as datasheets for datasets prompt AI dataset creators to carefully reflect on the process of creation, including any underlying assumptions they are making and potential risks or harms that might arise from the dataset’s use. And for dataset consumers, seeing the dataset creators’ documentation of goals and assumptions equips them to decide whether a dataset is suitable for the task they have in mind.


Advancing algorithms for interpretability

Rich Caruana, co-chair of the Aether Transparency working group: Demonstrating how interpretability shows how much trust to put in your AI models

Interpretability is vital to debugging and mitigating the potentially harmful impacts of AI processes that so often take place in seemingly impenetrable black boxes—it is difficult (and in many settings, inappropriate) to trust an AI model if you can’t understand the model and correct it when it is wrong. Advanced glass-box learning algorithms can enable AI practitioners and stakeholders to see what’s “under the hood” and better understand the behavior of AI systems. And advanced user interfaces can make it easier for people using AI systems to understand these models and then edit the models when they find mistakes or bias in them. Interpretability is also important to improve human-AI collaboration—it is difficult for users to interact and collaborate with an AI model or system if they can’t understand it. At Microsoft, we have developed glass-box learning methods that are now as accurate as previous black-box methods but yield AI models that are fully interpretable and editable. 


  • GAM Changer demo (video): Editing GAMs with interactive visualization. Machine learning interpretability techniques reveal that many accurate models learn some problematic and dangerous patterns from the training data. GAM Changer helps address these issues.

Recent advances at Microsoft include a new neural GAM (generalized additive model) for interpretable deep learning, a method for using dropout rates to reduce spurious interaction, an efficient algorithm for recovering identifiable additive models, the development of glass-box models that are differentially private, and the creation of tools that make editing glass-box models easy for those using them so they can correct errors in the models and mitigate bias. 
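As one concrete example of working with a glass-box model, the sketch below fits an explainable boosting machine with the open-source InterpretML package and inspects its learned shape functions. The dataset is synthetic and purely illustrative, and the snippet assumes the `interpret` package is installed; it is not the exact tooling described above.

```python
# A minimal, illustrative sketch using the open-source InterpretML package
# (pip install interpret); the dataset below is synthetic.
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                          # four made-up features
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.25).astype(int)   # synthetic labels

ebm = ExplainableBoostingClassifier(feature_names=["f0", "f1", "f2", "f3"])
ebm.fit(X, y)

# Every prediction decomposes into per-feature contributions, so the learned shape
# functions can be inspected (and, with tools like GAM Changer, edited).
show(ebm.explain_global())
```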


Exploring open questions for safety, security, and privacy in AI

Ben Zorn, co-chair of the Aether Reliability and Safety working group: Considering AI’s significant new challenges to reliability, security, and privacy

When considering how to shape appropriate trust in AI systems, there are many open questions about safety, security, and privacy. How do we stay a step ahead of attackers intent on subverting an AI system or harvesting its proprietary information? How can we avoid a system’s potential for inferring spurious correlations? 

With autonomous systems, it is important to acknowledge that no system operating in the real world will ever be complete. It’s impossible to train a system for the many unknowns of the real world. Unintended outcomes can range from annoying to dangerous. For example, a self-driving car may splash pedestrians on a rainy day or erratically swerve to localize itself for lane-keeping. An overview of emerging research in avoiding negative side effects due to AI systems’ incomplete knowledge points to the importance of giving users the means to avoid or mitigate the undesired effects of an AI system’s outputs as essential to how the technology will be viewed or used. 

When dealing with data about people and our physical world, privacy considerations take a vast leap in complexity. For example, it’s possible for a malicious actor to isolate and re-identify individuals from information in large, anonymized datasets or from their interactions with online apps when using personal devices. Developments in privacy-preserving techniques face challenges in usability and adoption because of the deeply theoretical nature of concepts like homomorphic encryption, secure multiparty computation, and differential privacy. Exploring the design and governance challenges of privacy-preserving computation, interviews with builders of AI systems, policymakers, and industry leaders reveal confidence that the technology is useful, but the challenge is to bridge the gap from theory to practice in real-world applications. Engaging the human-computer interaction community will be a critical component.


A call to personal action

AI is not an end-all, be-all solution; it’s a powerful, albeit fallible, set of technologies. The challenge is to maximize the benefits of AI while anticipating and minimizing potential harms.

Admittedly, the goal of appropriate trust is challenging. Developing measurement tools for assessing a world in which algorithms are shaping our behaviors, exposing how systems arrive at decisions, planning for AI failures, and engaging the people on the receiving end of AI systems are important pieces. But what we do know is change can happen today with each one of us as we pause and reflect on our work, asking: what could go wrong, and what can I do to prevent it? 




DeepSpeed: Advancing MoE inference and training to power next-generation AI scale

DeepSpeed shares findings and innovations for MoE models and systems that 1) reduce training cost by 5x, 2) reduce MoE parameter size by up to 3.7x and 3) reduce MoE inference latency by 7.3x at an unprecedented scale and offer up to 4.5x faster and 9x cheaper inference for MoE models compared to quality-equivalent dense models.

In the last three years, the largest trained dense models have increased in size by over 1,000 times, from a few hundred million parameters to over 500 billion parameters in Megatron-Turing NLG 530B (MT-NLG). Improvements in model quality with size suggest that this trend will continue, with larger model sizes bringing better model quality. However, sustaining the growth in model size is getting more difficult due to the increasing compute requirements.

There have been numerous efforts to reduce compute requirements to train large models without sacrificing model quality. To this end, architectures based on Mixture of Experts (MoE) have paved a promising path, enabling sub-linear compute requirements with respect to model parameters and allowing for improved model quality without increasing training cost.

However, MoE models have their own challenges. First, the scope of MoE models has primarily been limited to encoder-decoder models and sequence-to-sequence tasks. Second, MoE models require more parameters to achieve the same model quality as their dense counterparts, which requires more memory for training and inference even though MoE models require less compute. Lastly, a critical consideration is that MoE models’ large size makes inference difficult and costly.

To address these challenges, the DeepSpeed team, as part of Microsoft’s AI at Scale initiative, has been exploring new applications and optimizations for MoE models at scale. These can lower the training and inference cost of large models, while also enabling the ability to train and serve the next generation of models affordably on today’s hardware. Here, we are happy to share our findings and innovations for MoE models and systems that 1) reduce training cost by 5x, 2) reduce MoE parameter size by up to 3.7x, and 3) reduce MoE inference latency by 7.3x at an unprecedented scale and offer up to 4.5x faster and 9x cheaper inference for MoE models compared to quality-equivalent dense models:

  1. 5x reduction in training cost for natural language generation (NLG) models: We extend the scope of MoE models to beyond just encoder-decoder models and sequence-to-sequence tasks, demonstrating that MoE can reduce the training cost of NLG models like those in the GPT family or MT-NLG by 5x while obtaining the same model quality. Data scientists can now train models of superior quality previously only possible with 5x more hardware resources.
  2. Reduced model size and improved parameter efficiency with Pyramid-Residual-MoE (PR-MoE) Architecture and Mixture-of-Students (MoS): The training cost reduction of MoE is not free and comes at the expense of increasing the total number of parameters required to achieve the same model quality as dense models. PR-MoE is a hybrid dense and MoE model created using residual connections, applying experts only where they are most effective. PR-MoE reduces MoE model parameter size by up to 3x with no change to model quality. In addition, we leverage staged knowledge distillation to learn a Mixture-of-Students model that further leads to up to 3.7x model size reduction while retaining similar model quality.
  3. Fast and economical MoE inference at unprecedented scale: The DeepSpeed-MoE (DS-MoE) inference system enables efficient scaling of inference workloads on hundreds of GPUs, providing up to 7.3x reduction in inference latency and cost when compared with existing systems. It offers ultra-fast inference latencies (25 ms) for trillion-parameter MoE models. DS-MoE also offers up to 4.5x faster and 9x cheaper inference for MoE models compared to quality-equivalent dense models by combining both system and model optimizations.

Each of these advances is explored further in the blog post below. For more about the technical details, please read our paper.

DeepSpeed-MoE for NLG: Reducing the training cost of language models by five times

While recent works like GShard and Switch Transformers have shown that the MoE model structure can reduce large model pretraining cost for encoder-decoder model architecture, their impact on the much more compute-intensive transformer-based autoregressive NLG models has been mostly unknown.

Given the tremendous compute and energy requirements for training NLG models, we explore opportunities where MoE can reduce their training cost. We show that MoE can be applied to NLG models to significantly improve their model quality at the same training cost. Moreover, MoE can achieve the same model quality as a dense NLG model with a 5x reduction in training cost. For example, we achieved the quality of a 6.7B-parameter dense NLG model at the cost of training a 1.3B-parameter dense model. Our observation about MoE training cost savings aligns with parallel explorations from Du et al. and Artetxe et al., where they also demonstrated the savings for models with bigger sizes.

Our MoE-based NLG model architecture

To create an MoE-based NLG model, we studied a transformer-based NLG model similar to those of the GPT family. To complete training in a reasonable timeframe, the following models were selected: 350M (24 layers, 1024 hidden size, 16 attention heads), 1.3B (24 layers, 2048 hidden size, 16 attention heads), and 6.7B (32 layers, 4096 hidden size, 32 attention heads). We use “350M+MoE-128” to denote an MoE model that uses a 350M dense model as the base model and adds 128 experts on every other feedforward layer.
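To make the “350M+MoE-128” notation concrete, the sketch below shows a schematic top-1 gated expert feedforward layer of the kind that replaces every other dense feedforward layer in such a model. It is an illustrative stand-in only, not DeepSpeed’s optimized implementation, which adds expert parallelism, capacity limits, and load-balancing terms.

```python
import torch
import torch.nn as nn

class Top1MoEFeedForward(nn.Module):
    """Schematic top-1 gated mixture-of-experts feedforward layer (illustrative only,
    not DeepSpeed's optimized implementation)."""

    def __init__(self, hidden_size=1024, ffn_size=4096, num_experts=128):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.GELU(),
                          nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (tokens, hidden_size)
        scores = self.gate(x).softmax(dim=-1)    # routing probabilities per token
        top_prob, top_idx = scores.max(dim=-1)   # each token is routed to a single expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Example usage on a batch of 8 token embeddings:
layer = Top1MoEFeedForward(hidden_size=1024, ffn_size=4096, num_experts=128)
print(layer(torch.randn(8, 1024)).shape)
```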

MoE training infrastructure and dataset

We pretrained both the dense and MoE versions of the above models using DeepSpeed on 128 NVIDIA Ampere A100 GPUs (Azure ND A100 instances). These Azure instances are powered by the latest Azure HPC docker images that provide a fully optimized environment and best performing library versions of NCCL, Mellanox OFED, Sharp, and CUDA. DeepSpeed uses a combination of data-parallel and expert-parallel training to effectively scale MoE model training and is capable of training MoE models with trillions of parameters on hundreds of GPUs.

We used the same training data as described in the MT-NLG blog post. For a fair comparison, we use 300 billion tokens to train both dense and MoE models.

MoE leads to better quality for NLG models

Figure 1 shows that the validation loss for the MoE versions of the models is significantly better than that of their dense counterparts. Furthermore, the validation loss of the 350M+MoE-128 model is on par with the validation loss of the 1.3B dense model, which has a 4x larger base. The same is true for 1.3B+MoE-128 in comparison with the 6.7B dense model, which has a 5x larger base. Furthermore, the model quality is on par not only in terms of validation loss but also on the six zero-shot evaluation tasks shown in Table 1, demonstrating that these models have very similar model quality.

| Case | Model size | LAMBADA: completion prediction | PIQA: commonsense reasoning | BoolQ: reading comprehension | RACE-h: reading comprehension | TriviaQA: question answering | WebQs: question answering |
|---|---|---|---|---|---|---|---|
| Dense NLG: | | | | | | | |
| (1) 350M | 350M | 0.5203 | 0.6931 | 0.5364 | 0.3177 | 0.0321 | 0.0157 |
| (2) 1.3B | 1.3B | 0.6365 | 0.7339 | 0.6339 | 0.3560 | 0.1005 | 0.0325 |
| (3) 6.7B | 6.7B | 0.7194 | 0.7671 | 0.6703 | 0.3742 | 0.2347 | 0.0512 |
| Standard MoE NLG: | | | | | | | |
| (4) 350M+MoE-128 | 13B | 0.6270 | 0.7459 | 0.6046 | 0.3560 | 0.1658 | 0.0517 |
| (5) 1.3B+MoE-128 | 52B | 0.6984 | 0.7671 | 0.6492 | 0.3809 | 0.3129 | 0.0719 |
| PR-MoE NLG: | | | | | | | |
| (6) 350M+PR-MoE-32/64 | 4B | 0.6365 | 0.7399 | 0.5988 | 0.3569 | 0.1630 | 0.0473 |
| (7) 1.3B+PR-MoE-64/128 | 31B | 0.7060 | 0.7775 | 0.6716 | 0.3809 | 0.2886 | 0.0773 |
| PR-MoE NLG + MoS: | | | | | | | |
| (8) 350M+PR-MoE-32/64 + MoS-21L | 3.5B | 0.6346 | 0.7334 | 0.5807 | 0.3483 | 0.1369 | 0.0522 |
| (9) 1.3B+PR-MoE-64/128 + MoS-21L | 27B | 0.7017 | 0.7769 | 0.6566 | 0.3694 | 0.2905 | 0.0822 |
Table 1: Zero-shot evaluation results (last six columns) for different dense and MoE NLG models. All zero-shot evaluation results use the accuracy metric.
Figure 1: Token-wise validation loss curves for dense and MoE NLG models with different model sizes.

Same quality with 5x less training cost

As shown in the results above, adding MoE with 128 experts to the NLG model significantly improves its quality. However, these experts do not change the compute requirements of the model as each token is only processed by a single expert. Therefore, the compute requirements for a dense model and its corresponding MoE models with the same base are similar.

More concretely, training 1.3B+MoE-128 requires roughly the same amount of compute operations as a 1.3B dense model while offering much better quality. Our results show that by applying MoE, the model quality of a 6.7B-parameter dense model can be achieved at the training cost of a 1.3B-parameter dense model, resulting in an effective training compute reduction of 5x.

This compute cost reduction can directly be translated into throughput gain, training time and training cost reduction by leveraging the efficient DeepSpeed MoE training system. Table 2 shows the training throughput of 1.3B+MoE-128 compared with the 6.7B dense model on 128 NVIDIA A100 GPUs.

| Model | Training samples per sec | Throughput gain / cost reduction |
|---|---|---|
| 6.7B dense | 70 | 1x |
| 1.3B+MoE-128 | 372 | 5x |
Table 2: Training throughput (on 128 A100 GPUs) of an MoE-based model versus a dense model, where both achieve the same model quality.

PR-MoE and Mixture-of-Students: Reducing the model size and improving parameter efficiency

While MoE-based models achieve the same quality with 5x training cost reduction in the NLG example, the resulting model has roughly 8x the parameters of the corresponding dense model. For example, a 6.7B dense model has 6.7 billion parameters and 1.3B+MoE-128 has 52 billion parameters. Training such a massive MoE model requires significantly more memory; inference latency and cost could also increase since the primary inference bottleneck is often the memory bandwidth needed to read model weights.

To reduce model size and improve parameter efficiency, we’ve made innovations in the MoE model architecture that reduce the overall model size by up to 3 times without affecting model quality. We also leverage knowledge distillation to learn a Mixture-of-Students (MoS) model that has a smaller model capacity than the teacher PR-MoE while preserving the teacher model’s accuracy.

Two intuitions for improving MoE architecture

Intuition-I: The standard MoE architecture has the same number and structure of experts in all MoE layers. This relates to a fundamental question in the deep learning community, one that has been well studied in computer vision: do all the layers in a deep neural network learn the same representation? Shallow layers learn general representations and deep layers learn more objective-specific representations. This is also why transfer learning in computer vision often freezes shallow layers during fine-tuning. This phenomenon, however, has not been well explored in natural language processing (NLP), particularly for MoE.

To investigate this question, we compare the performance of two different half-MoE architectures. More specifically, we put MoE layers in the first half of the model and leave the second half’s layers identical to the dense model. We then switch the MoE layers to the second half and use dense layers in the first half. The results show that deeper layers benefit more from a large number of experts. This confirms that not all MoE layers learn the same level of representations.

Intuition-II: There are two common methods to improve the generalization performance of MoE models: 1) increasing the number of experts while keeping the capacity (that is, the number of experts each token goes through) the same; or 2) doubling the capacity at the expense of slightly more computation (33%) while keeping the same number of experts. However, method 1 increases the memory required for training due to the larger number of experts, and method 2 doubles the communication volume, which can significantly slow down training and inference. Is there a way to keep the training and inference efficiency while still gaining generalization performance?

One intuition for why larger capacity helps accuracy is that the extra experts can help correct the “representation” of the first expert. However, does this first expert need to change for every token? Or can we fix the first expert and only assign different extra experts to different tokens?

To investigate this, we compare two designs: doubling the capacity versus fixing the first expert and varying only the second expert across tokens. In the latter, a token always passes through a dense multilayer perceptron (MLP) module plus one expert from the MoE module, so we get the benefit of two experts per layer while still requiring only one all-to-all communication. We find that the generalization performance of the two designs is on par, while the training and inference speed of our new design is faster.

New MoE Architecture: Pyramid-Residual MoE

We propose a novel MoE architecture, Pyramid-Residual MoE (PR-MoE). Figure 2 (right) shows its architecture. Following Intuition-I, PR-MoE utilizes more experts in the last few layers as compared to previous layers, which gives a reverse pyramid design. Following Intuition II, we propose a Residual-MoE structure, where each token separately passes one fixed MLP layer and one chosen expert. Combining them results in the PR-MoE model, where all standard MoE layers are replaced by the new PR-MoE layer.

Figure 2: Illustration of standard MoE (left) and PR-MoE (right). Two noticeable differences: (1) PR-MoE has more experts in the last few layers, whereas standard MoE has the same number of experts in every MoE layer; (2) in PR-MoE, a token always passes through an MLP module plus one selected expert.
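To make the two ideas concrete, below is a minimal, single-GPU PyTorch sketch of a Residual-MoE layer with top-1 gating, plus an illustrative pyramid of expert counts. The class, its parameter names, and the specific expert counts are hypothetical; the production implementation lives in the DeepSpeed library and differs in detail (for example, it uses expert parallelism rather than a Python loop over experts).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualMoELayer(nn.Module):
    """Illustrative Residual-MoE layer: fixed dense MLP + one gated expert per token."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        # Fixed dense MLP that every token always passes through.
        self.shared_mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        # Pool of experts; each token is routed to exactly one of them (top-1 gating).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        gate_probs = F.softmax(self.gate(x), dim=-1)        # [tokens, num_experts]
        expert_weight, expert_id = gate_probs.max(dim=-1)   # top-1 expert per token
        expert_out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):           # loop is for clarity, not speed
            mask = expert_id == e
            if mask.any():
                expert_out[mask] = expert(x[mask])
        # Residual-MoE: dense MLP output plus the gated output of the selected expert.
        return self.shared_mlp(x) + expert_weight.unsqueeze(-1) * expert_out

# Pyramid design: more experts in deeper layers (counts are illustrative, e.g., a "32/64" layout).
experts_per_layer = [32] * 10 + [64] * 2
layers = nn.ModuleList(
    [ResidualMoELayer(d_model=1024, d_ff=4096, num_experts=n) for n in experts_per_layer]
)
```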

Same quality as standard models with up to 3x model size reduction: We evaluate PR-MoE on two model sizes, with bases of 350M and 1.3B parameters, and compare the performance with larger standard MoE architectures. The results are shown in Table 1 above. In both cases, PR-MoE uses far fewer experts but achieves accuracy comparable to the standard MoE models. In the 350M case, PR-MoE uses less than one-third of the parameters of the standard MoE; in the 1.3B case, it uses about 60 percent.

Mixture-of-Students: Distillation for even smaller model size and faster inference

Model compression and distillation present additional opportunities to improve inference performance further. While there are many approaches to model compression, such as quantization and pruning, we focus on reducing the number of layers in each expert of the MoE and using knowledge distillation so that the resulting student model achieves performance similar to the teacher MoE.

Since the MoE structure brings significant benefits by enabling sparse training and inference, our task-agnostic distilled MoE model, which we call Mixture of Students (MoS), inherits these benefits while still providing the flexibility to compress into a dense model. We note that while existing work primarily considers small transformers (a few hundred million parameters) and dense encoder-based LM models (like BERT), we focus on studying knowledge distillation for sparse MoE-based generative language models at the multi-billion parameter scale. Furthermore, given the excellent performance of PR-MoE, we combine PR-MoE with MoS to further reduce the MoE model size.

To apply knowledge distillation to MoE, we first train a teacher MoE model using the same training hyperparameters and datasets as in the previous section. The teacher models are 350M+PR-MoE-32/64 and 1.3B+PR-MoE-64/128, respectively. We reduce the depth of the teacher model to 21 layers (a 12.5% reduction) to obtain the student model, and we train the student to imitate the outputs of the teacher MoE on the training dataset.

In particular, we take the knowledge distillation loss to be a weighted sum of the cross-entropy loss between the predictions and the given hard labels and the Kullback–Leibler (KL) divergence loss between the predictions and the teacher’s soft labels. In practice, we observe that distillation may adversely affect MoS accuracy: while the knowledge distillation loss improves validation accuracy initially, it begins to hurt accuracy toward the end of training.

We hypothesize that because PR-MoE already reduces capacity compared with the standard MoE by exploiting the architecture change (for example, using fewer experts in lower layers), further reducing the depth of the model leaves the student with insufficient capacity, pushing it into the underfitting regime. Therefore, we take a staged distillation approach, gradually decaying the influence of knowledge distillation over the course of training.
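The following is a minimal sketch of such a staged distillation objective: a cross-entropy term on the hard labels plus a KL term against the teacher’s soft labels, with the KL weight decayed as training progresses. The linear decay schedule, the temperature, and the function and argument names are illustrative assumptions, not the exact recipe used in the paper.

```python
import torch
import torch.nn.functional as F

def mos_distillation_loss(student_logits, teacher_logits, labels,
                          step, total_steps,
                          kd_weight_start=1.0, kd_end_fraction=0.5, temperature=1.0):
    """Staged KD loss sketch: CE on hard labels + decayed KL against teacher soft labels."""
    ce = F.cross_entropy(student_logits, labels)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    # Staged schedule (assumed): full KD weight early, linearly decayed to zero
    # by a chosen fraction of training, so KD cannot hurt accuracy late in training.
    progress = min(step / (kd_end_fraction * total_steps), 1.0)
    kd_weight = kd_weight_start * (1.0 - progress)
    return ce + kd_weight * kl
```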

Our study shows that it is possible to reach similar performance, such as in zero-shot evaluation on many downstream tasks, with a smaller MoE model pretrained with knowledge distillation. The MoS models achieve accuracy comparable to their teacher MoE models, retaining 99.3% and 99.1% of the performance despite having 12.5% fewer layers. This enables an additional 12.5% model size reduction and, when combined with PR-MoE, leads to up to a 3.7x model size reduction.

DeepSpeed-MoE inference: Serving MoE models at unprecedented scale and speed

Optimizing MoE inference latency and cost is crucial for MoE models to be useful in practice. During inference, the batch size is generally small, so the inference latency of an MoE model depends primarily on the time it takes to load the model parameters from main memory, in contrast with the conventional belief that less compute should lead to faster inference. Inference performance therefore mainly depends on two factors: the overall model size and the overall achievable memory bandwidth.
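To make the memory-bandwidth argument concrete, here is a rough lower-bound estimate of per-token latency at small batch sizes. The fp16 weights and the approximate A100 memory bandwidth are assumptions for illustration only, not measurements from our system.

```python
# Rough latency model (an illustrative assumption, not DeepSpeed's profiler): at small
# batch sizes, generating one token requires streaming the active parameters through
# the GPU's memory system once.
def est_token_latency_ms(active_params: float,
                         bytes_per_param: float = 2.0,          # fp16 weights (assumed)
                         mem_bandwidth_gb_s: float = 1500.0):   # ~A100-40GB HBM bandwidth (approximate)
    bytes_read = active_params * bytes_per_param
    return bytes_read / (mem_bandwidth_gb_s * 1e9) * 1e3

print(est_token_latency_ms(6.7e9))  # 6.7B dense critical path: ~9 ms lower bound on one GPU
print(est_token_latency_ms(1.3e9))  # 1.3B critical path of 1.3B+MoE-128: ~1.7 ms lower bound
```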

In the previous section, we presented PR-MoE and distillation to optimize the model size. This section presents our solution to maximize the achievable memory bandwidth by creating a multi-GPU MoE inferencing system that can leverage the aggregated memory bandwidth across dozens of distributed GPUs to speed up inference. Together, DeepSpeed offers an unprecedented scale and efficiency to serve massive MoE models with 7.3x better latency and cost compared to baseline MoE systems, and up to 4.5x faster and 9x cheaper MoE inference compared to quality-equivalent dense models.

MoE inference performance is an interesting paradox

From the best-case view, each token of an MoE model only activates a single expert at each MoE layer, resulting in a critical data path that is equivalent to the base model size, orders-of-magnitude smaller than the actual model size. For example, when inferencing with a 1.3B+MoE-128 model, each input token needs just 1.3 billion parameters, even though the overall model size is 52 billion parameters.

From the worst-case view, the aggregate parameters needed to process a group of tokens can be as large as the full model size (in this example, the entire 52 billion parameters), making it challenging to achieve low latency and high throughput.

Design goals for the DS-MoE inference system

The design goal of our optimizations is to steer the performance toward the best-case view. This requires careful orchestration and partitioning of the model to group and route all tokens with the same critical data path together to reduce data access per device and achieve maximum aggregate bandwidth. An overview of how DS-MoE tackles this design goal by embracing multi-dimensional parallelism inherent in MoE models is illustrated in Figure 3.

Figure 3: DS-MoE design that embraces the complexity of multi-dimensional parallelism for different partitions (expert and non-expert) of the model.

The DS-MoE inference system is centered around three well-coordinated optimizations:

The DS-MoE inference system is designed to minimize the critical data path per device and maximize the achievable aggregate memory bandwidth across devices. It achieves this through: 1) expert parallelism and expert-slicing for expert parameters, and 2) data parallelism and tensor-slicing for non-expert parameters.

Expert parallelism and expert-slicing for expert parameters: We partition experts across devices, group all tokens using the same experts under the same critical data path, and parallelize the processing of token groups with different critical paths across devices using expert parallelism.

In the example of 1.3B+MoE-128, when expert parallelism is equal to 128, each GPU only processes a single token group corresponding to the experts on that device. This results in a sequential path that is 1.3 billion parameters per device, 5x smaller than its quality-equivalent dense model with 6.7B parameters. Therefore, in theory, an MoE-based model has the potential to run up to 5x faster than its quality-equivalent dense model using expert parallelism assuming no communication overhead, a topic we discuss in the next section.

In addition, we propose “expert-slicing,” which applies the concept of tensor-slicing to the parameters within an expert. This additional dimension of parallelism is helpful for latency-stringent scenarios where we scale to more devices than the number of experts.

Data parallelism and tensor-slicing for non-expert parameters: Within a node, we use tensor-slicing to partition the non-expert parameters, leveraging the aggregate GPU memory bandwidth of all GPUs to accelerate processing. While it is possible to perform tensor-slicing across nodes, the communication overhead of tensor-slicing along with the reduced compute granularity generally makes inter-node tensor-slicing inefficient. To scale non-expert parameters across multiple nodes, we instead use data parallelism, creating replicas of the non-expert parameters that process different batches across nodes, which incurs no communication overhead or reduction in compute granularity.

Figure 3 above shows an example scenario for distributed MoE inference highlighting different parts of the MoE model, how the model and data are partitioned, and what form of parallelism is used to deal with each piece.

Expert parallelism requires all-to-all communication among all expert-parallel devices. By default, DS-MoE uses NCCL for this communication via the torch.distributed interface, but we observe major overhead when it is used at scale. To optimize this, we develop a custom communication interface that uses Microsoft SCCL and achieves better performance than NCCL. Even with these optimizations, it is difficult to scale expert parallelism to many devices, as the latency increases linearly with the number of devices. To address this critical scaling challenge, we design two new communication optimization strategies that exploit the underlying point-to-point NCCL operations and custom CUDA kernels to perform the necessary data-layout transformations.

Hierarchical All-to-All: We implement a hierarchical all-to-all as a two-step process: a data-layout transformation followed by an intra-node all-to-all, and then a second data-layout transformation followed by an inter-node all-to-all. This reduces the communication hops from O(p) to O(G + p/G), where G is the number of GPUs per node and p is the total number of GPU devices. Figure 4 shows the design overview of this implementation. Despite the 2x increase in communication volume, this hierarchical implementation scales better for small batch sizes because communication at this message size is more latency-bound than bandwidth-bound.

Figure 4: Illustration of the proposed hierarchical all-to-all design
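To make the hop count concrete, here is a small illustrative calculation; the cluster size of 128 GPUs and the node size of 8 GPUs are assumptions for the example, not a statement about any particular evaluation setup.

```python
# Peers contacted per rank: flat all-to-all vs. the two-step hierarchical scheme.
def flat_hops(p: int) -> int:
    return p                 # O(p): every rank exchanges directly with all p ranks

def hierarchical_hops(p: int, G: int) -> int:
    return G + p // G        # O(G + p/G): intra-node step plus inter-node step

p, G = 128, 8                # assumed: 128 GPUs total, 8 GPUs per node
print(flat_hops(p), hierarchical_hops(p, G))   # 128 vs. 24 communication hops per rank
```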

Parallelism Coordinated Communication Optimization: Combining expert parallelism and tensor-slicing with data parallelism within a single model is non-trivial. Tensor-slicing splits individual operators across GPUs and requires all-reduce between them, while expert parallelism places expert operators across GPUs without splitting them and requires all-to-all between them. By design, a naïve approach to handle these communication steps will be inefficient.

Illustration of the parallelism coordinated all-to-all optimization
Figure 5: Illustration of the parallelism coordinated communication

To this end, we propose a novel design, as shown in Figure 5, that performs all-to-all only on a subset of devices that share the same tensor-slicing rank instead of all expert-parallel processes. As a result, the latency of all-to-all can be reduced to O(p/L) instead of O(p) where L is the tensor-slicing parallelism degree. This reduced latency enables us to scale inference to hundreds of GPU devices.
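The sketch below illustrates, under assumed conventions, how such groups could be constructed with torch.distributed: ranks that share the same tensor-slicing rank form one all-to-all group of size p/L, which is what reduces the all-to-all latency from O(p) to O(p/L). The rank-to-group mapping and the helper name are hypothetical; DeepSpeed’s actual process-group management differs in detail.

```python
import torch.distributed as dist

def build_expert_all_to_all_groups(world_size: int, tensor_slicing_degree: int):
    """Sketch: one all-to-all group per tensor-slicing rank (assumes an initialized process group).

    Assumes the common convention that a rank's tensor-slicing rank is rank % L,
    so ranks with the same tensor-slicing rank are strided by L across the world.
    Every process must call this function so that new_group() is invoked collectively.
    """
    groups = []
    for ts_rank in range(tensor_slicing_degree):
        ranks = list(range(ts_rank, world_size, tensor_slicing_degree))
        groups.append(dist.new_group(ranks=ranks))
    return groups

# Example: 128 GPUs with 8-way tensor slicing -> 8 groups of 16 ranks each;
# each expert-parallel all-to-all then involves only 16 peers instead of 128.
```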

The DS-MoE inference system also consists of highly optimized kernels targeting both transformer and MoE-related operations. These kernels aim to maximize bandwidth utilization by fusing operations that work in a producer-consumer fashion. In addition to the computation required for the transformer layers (explained in this blog post), MoE models require the following additional operations:

  1. a gating function that determines the assignment of tokens to experts, where the result is represented as a sparse tensor.
  2. a sparse einsum operator, between the one-hot tensor and all the tokens, which sorts the ordering of the tokens based on the assigned expert ID.
  3. a final einsum that scales and re-sorts the tokens back to their original ordering.

The gating function includes numerous operations to create token masks, select the top-k experts, and perform cumulative sums and sparse matrix multiplications, all of which are not only wasteful due to the sparse tensor representation but also extremely slow due to the many kernel invocations. Moreover, the sparse einsums have a complexity of S x E x M x c (number of tokens S, number of experts E, model dimension M, and expert capacity c, which is typically 1), but E-1 out of every E operations for each token are multiplications and additions with zeros.

We optimize these operators using dense representation and kernel-fusion. First, we fuse the gating function into a single kernel, and use a dense token-to-expert mapping table to represent the assignment from tokens to experts, greatly reducing the kernel launch overhead, as well as memory and compute overhead from the sparse representation.

Second, to optimize the remaining two sparse einsums, we implement them as data-layout transformations using the above-mentioned mapping table: tokens are first sorted by their assigned expert ID and later restored to their original ordering, without requiring any sparse einsum. This reduces the complexity of these operations from S x E x M x c to S x M x c. Combined, these optimizations result in over a 6x reduction in MoE kernel-related latency.
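The sketch below illustrates the idea in plain PyTorch: a dense token-to-expert mapping table plus sort/unsort layout transformations in place of the sparse einsums. The function names are hypothetical, and the real implementation fuses these steps into custom CUDA kernels rather than relying on framework-level ops.

```python
import torch

def route_tokens(tokens: torch.Tensor, gate_logits: torch.Tensor):
    """Top-1 routing with a dense token-to-expert mapping and a sort-based layout transform."""
    # tokens: [S, M], gate_logits: [S, E]
    gate_probs = torch.softmax(gate_logits, dim=-1)
    gate_weight, expert_id = gate_probs.max(dim=-1)   # dense mapping table: expert ID per token, shape [S]
    order = torch.argsort(expert_id)                  # sort tokens by assigned expert ID
    sorted_tokens = tokens[order]                     # O(S*M) gather instead of an S*E*M sparse einsum
    return sorted_tokens, expert_id[order], gate_weight, order

def unroute_tokens(expert_outputs: torch.Tensor, gate_weight: torch.Tensor, order: torch.Tensor):
    """Restore the original token ordering (inverse permutation) and apply the gate scaling."""
    restored = torch.empty_like(expert_outputs)
    restored[order] = expert_outputs                  # scatter back to the original positions
    return restored * gate_weight.unsqueeze(-1)
```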

Low latency and high throughput at unprecedented scale

In modern production environments, powerful DL models are often served using hundreds of GPU devices to meet traffic demands and deliver low latency. Here we demonstrate the performance of the DS-MoE inference system on 256 NVIDIA A100 GPUs with 40 GB of memory each. Table 3 shows the model configurations used for the performance comparisons in this section.

Model          Size (billions)   Number of layers   Hidden size   Model-parallel degree   Expert-parallel degree
2.4B+MoE-128   107.7             16                 3,584         1                       128
8B+MoE-128     349.0             40                 4,096         4                       128
24B+MoE-128    1,046.9           30                 8,192         8                       128
47B+MoE-128    2,024.0           58                 8,192         8                       128
Table 3: The configuration of the different MoE models used for the performance evaluation in Figure 6.

We scale MoE models from 107 billion parameters to 2 trillion parameters. To offer a strong baseline for comparison, we utilize a full-featured distributed PyTorch implementation that is capable of both tensor-slicing and expert-parallelism. Figure 6 shows the results for all these model configurations:

  • DeepSpeed MoE achieves up to 7.3x reduction in latency while achieving up to 7.3x higher throughput compared to the baseline.
  • By effectively exploiting hundreds of GPUs in parallel, DeepSpeed MoE achieves an unprecedented scale for inference at incredibly low latencies: a staggering trillion-parameter MoE model can be inferenced in under 25 ms.
Figure 6: Latency and throughput improvement offered by DeepSpeed-Inference-MoE (Optimized) over PyTorch (Baseline) for different model sizes (107 billion to 2 trillion parameters). We use 128 GPUs for all baseline configurations and 128/256 GPUs for DeepSpeed (256 GPUs for the trillion-scale models). The throughputs shown are per GPU and should be multiplied by the number of GPUs to get the aggregate throughput of the cluster.

By combining the system optimizations offered by the DS-MoE inference system and model innovations of PR-MoE and MoS, DeepSpeed MoE delivers two more benefits:

  1. Reducing the minimum number of GPUs required to perform inference on these models. Figure 7 shows a comparison of three model variants along with the baseline: 1) the standard MoE model (8B+MoE-128), 2) the PR-MoE model, and 3) the PR-MoE+MoS model. The PR-MoE+MoS model performs best, as expected. The key observation is that the PR-MoE and MoS optimizations allow us to use 16 GPUs instead of 32 GPUs to perform this inference.
  2. Further improving both the latency and throughput of various MoE model sizes (as shown in Figure 8).
Figure 7: 2x fewer resources needed for MoE inference when using PR-MoE+MoS; these optimizations allow us to use 16 GPUs instead of 32 GPUs to perform this inference.
Figure 8: Inference latency comparing standard MoE with PR-MoE and PR-MoE+MoS compression for different GPU counts and model sizes. PR-MoE+MoS achieves up to a 10x latency improvement compared to the baseline.

Better inference latency and throughput than quality-equivalent dense models

To better understand the inference performance of MoE models compared to quality-equivalent dense models, it is important to note that although MoE models are 5x faster and cheaper to train, the same may not hold for inference. Inference has different bottlenecks: its performance is primarily determined by the amount of data read from memory rather than by the amount of computation.

We show inference latency and throughput for two MoE models compared to their quality-equivalent dense models: a) 52 billion-parameter MoE (1.3B-MoE-128) model compared to a 6.7 billion-parameter dense model and b) 1.5 trillion-parameter MoE model compared to a 175 billion-parameter dense model in Figures 9 and 10, respectively.

When using PyTorch, MoE model inference is slower and more expensive than inference for its quality-equivalent dense models. This is true for both model sizes. However, the optimizations in DS-MoE reverse this trend and make MoE model inference both faster and cheaper than that of quality-equivalent dense models. This is a critical result, showing that MoE’s benefits over dense models extend beyond training to inference latency and cost, which is important for real-world deployments.

When comparing the results of Figure 9 with Figure 10, we observe that the benefits of MoE models over dense models become even larger as model size increases. While the 52 billion-parameter MoE model is 2.4x faster and cheaper than the 6.7 billion-parameter dense model, the 1.5 trillion-parameter MoE model is 4.5x faster and 9x cheaper than the 175 billion-parameter dense model. The benefits increase for larger models because DS-MoE leverages parallelism-coordinated optimization to reduce communication overhead when using tensor-slicing on the non-expert part of the model. Furthermore, we can take advantage of expert-slicing at this scale, which enables us to scale to a higher number of GPUs than the PyTorch baseline. In addition, for the larger 1.5 trillion-parameter MoE model, we observe an additional 2x improvement in throughput on top of the latency improvement, as shown in Figure 10. This is because the MoE model can run with half the tensor-slicing degree of the dense model (8-way versus 16-way) and thus a two-times-larger batch size.

Overall, DeepSpeed MoE delivers up to 4.5x faster and up to 9x cheaper MoE model inference compared to serving quality-equivalent dense models using PyTorch. With benefits that scale with model size and hardware resources, as shown in these results, we believe that MoE models will be crucial to bringing about the next generation of advances in AI scale.

Figure 9: Inference latency comparison of a 52 billion-parameter MoE model and its quality-equivalent 6.7 billion-parameter dense model; the MoE model is 2.4x faster and cheaper. We use 1 GPU for the 6.7 billion-parameter model, as it offers the lowest latency, and 128 GPUs for the 52 billion-parameter model. The quality equivalence has been verified by the experiments presented in the training section.
Figure 10: Measured inference latency comparison of a 1.5 trillion-parameter MoE model and its quality-equivalent 175 billion-parameter dense model; the MoE model is 4.5x faster and 9x cheaper. We assume the quality equivalence of these two models under the hypothesis that the scaling behavior of the smaller-scale experiments in Figure 9 holds, as well as from observations in the published literature.

Looking forward to the next generation of AI Scale

With the exponential growth of model size recently, we have arrived at the boundary of what modern supercomputing clusters can do to train and serve large models. It is no longer feasible to achieve better model quality by simply increasing the model size due to insurmountable requirements on hardware resources. The choices we have are to wait for the next generation of hardware or to innovate and improve the training and inference efficiency using current hardware.

We, along with recent literature, have demonstrated how MoE-based models can reduce the training cost of even the largest NLG models by several times compared to their quality-equivalent dense counterparts, offering the possibility to train the next scale of AI models on the current generation of hardware. However, prior to this blog post, to our knowledge there has been no existing work on how to serve MoE models (which have many more parameters) with better latency and cost than dense models. This is a challenging issue that blocks their practical use.

To enable practical and efficient inference for MoE models, we offer the novel PR-MoE model architecture and the MoS distillation technique to significantly reduce the memory requirements of these models. We also offer an MoE inference framework that achieves incredibly low latency and cost at an unprecedented model scale. Combining these innovations, we make MoE models not just feasible to serve but usable for inference at lower latency and cost than their quality-equivalent dense counterparts.

As a whole, the new innovations and infrastructures offer a promising path towards training and inference of the next generation of AI scale, without requiring an increase in compute resources. A shift from dense to sparse MoE models can open a path to new directions in the large model landscape, where deploying higher-quality models is widely possible with fewer resources and is more sustainable by reducing the environmental impact of large-scale AI.

Software: The best place to train and serve models using DeepSpeed is the Microsoft Azure AI platform. To get started with DeepSpeed on Azure, follow the tutorial and experiment with different models using our Azure ML examples. You can also measure your model’s energy consumption using the latest Azure Machine Learning resource metrics.

With this release of DeepSpeed, we are releasing a generic end-to-end framework for training and inference of MoE-based models. The MoE training support and optimizations are available in full. The MoE inference optimizations will be released in two phases: the generic, flexible parallelism framework for MoE inference is being released today, and optimizations related to computation kernels and communication will be released in the future.

  • GitHub: DeepSpeed, a deep learning optimization library that makes distributed training easy, efficient, and effective.

To enable experimentation with DeepSpeed MoE optimizations, we are also releasing two extensions of the NLG example that enable the 5x reduction in training cost for MT-NLG-like models: 1) a PR-MoE model extension that enables the 3x improvement in parameter efficiency and model size reduction, and 2) model code extensions so users can easily experiment with MoE inference at scale. Please find the code, tutorials, and documentation at the DeepSpeed GitHub and website.

About our great collaborators

This work was done in collaboration with Brandon Norick, Zhun Liu, and Xia Song from the Turing Team, Young Jin Kim, Alex Muzio, and Hany Hassan Awadalla from the Z-Code Team, and both Saeed Maleki and Madan Musuvathi from the SCCL team.

About the DeepSpeed Team

We are a group of system researchers and engineers—Samyam Rajbhandari, Ammar Ahmad Awan, Jeff Rasley, Reza Yazdani Aminabadi, Minjia Zhang, Zhewei Yao, Conglong Li, Olatunji Ruwase, Elton Zheng, Shaden Smith, Cheng Li, Du Li, Yang Li, Xiaoxia Wu, Jeffery Zhu (PM), Yuxiong He (team lead)—who are enthusiastic about performance optimization of large-scale systems. We have recently focused on deep learning systems, optimizing deep learning’s speed to train, speed to convergence, and speed to develop! If this type of work interests you, the DeepSpeed team is hiring both researchers and engineers! Please visit our careers page.

The post DeepSpeed: Advancing MoE inference and training to power next-generation AI scale appeared first on Microsoft Research.



EzPC: Increased data security in the AI model validation process

EzPC provides secure AI model validation. The diagram poses the following question: is the accuracy of the AI model on the test dataset greater than 70%? First, an AI vendor provides model weights, and a modular compiler takes as input the AI model structure, written in ONNX code for ML inference. From this, it automatically generates MPC protocol code, which is then compiled into various MPC protocols. Additionally, a suite of highly performant cryptographic protocols securely computes complex ML functions on an organization’s test dataset. The MPC protocol exchanges only random bits, keeping the data of both parties secure.

From manufacturing and logistics to agriculture and transportation, the expansion of artificial intelligence (AI) in the last decade has revolutionized a multitude of industries—examples include enhancing predictive analytics on the manufacturing floor and making microclimate predictions so that farmers can respond and save their crops in time. The adoption of AI is expected to accelerate in the coming years, underscoring the need for an efficient adoption process that preserves data privacy.

Currently, organizations that want to adopt AI into their workflow go through the process of model validation, in which they test, or validate, AI models from multiple vendors before selecting the one that best fits their needs. This is usually done with a test dataset that the organization provides. Unfortunately, the two options that are currently available for model validation are insufficient; both risk the exposure of data.

One of these options entails the AI vendor sharing their model with the organization, which can then validate the model on its test dataset. However, by doing this, the AI vendor risks exposing its intellectual property, which it undoubtedly wants to protect. The second option, equally risky, involves the organization sharing its test dataset with the AI vendor. This is problematic on two fronts. First, it risks exposing a dataset with sensitive information. Additionally, there’s the risk that the AI vendor will use the test dataset to train the AI model, thereby “over-fitting” the model to the test dataset to show credible results. To accurately assess how an AI model performs on a test dataset, it’s critical that the model not be trained on it. Currently, these concerns are addressed by complex legal agreements, often taking several months to draft and execute, creating a substantial delay in the AI adoption process.

The risk of data exposure and the need for legal agreements are compounded in the healthcare domain, where patient data, which makes up the test dataset, is incredibly sensitive, and there are strict privacy regulations with which both organizations must comply. Additionally, not only does the vendor’s AI model contain proprietary intellectual property, but it may also include sensitive patient information as part of the training data that was used to develop it. This makes for a challenging predicament. On one hand, healthcare organizations want to quickly adopt AI due to its enormous potential in such applications as understanding health risks in patients, predicting and diagnosing diseases, and developing personalized health interventions. On the other hand, there’s a fast-growing list of AI vendors in the healthcare space to choose from (currently over 200), making the cumulative legal paperwork of AI validation daunting.

EzPC: Easy Secure Multi-Party Computation

We’re very interested in accelerating the AI model validation process while also ensuring dataset and model privacy. For this reason, we built Easy Secure Multi-party Computation (EzPC). This open-source framework is the result of a collaboration among researchers with backgrounds in cryptography, programming languages, machine learning (ML), and security. At its core, EzPC is based on secure multiparty computation (MPC), a suite of cryptographic protocols that enable multiple parties to collaboratively compute a function on their private data without revealing that data to one another or to any other party. This functionality makes AI model validation an ideal use case for MPC.
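As a toy illustration of the idea behind MPC (and emphatically not EzPC’s actual protocols), the sketch below shows additive secret sharing: two parties can jointly compute the sum of their private values while each party sees only random-looking shares. Real MPC frameworks such as EzPC additionally support multiplications, comparisons, and full ML inference, which require much more sophisticated cryptographic machinery.

```python
import secrets

# Toy additive secret sharing over a large prime field. Seeing a single share
# reveals nothing about the underlying value; only the sum of all shares does.
P = 2**61 - 1  # a Mersenne prime used as the modulus

def share(x: int):
    """Split x into two additive shares: share_0 + share_1 = x (mod P)."""
    r = secrets.randbelow(P)
    return r, (x - r) % P

def add_shares(a, b):
    """Each party adds its local shares; no communication of private data is needed."""
    return tuple((ai + bi) % P for ai, bi in zip(a, b))

def reconstruct(shares):
    return sum(shares) % P

# Example: parties jointly compute x + y without revealing x or y to each other.
x_shares, y_shares = share(42), share(100)
z_shares = add_shares(x_shares, y_shares)
print(reconstruct(z_shares))  # 142
```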

However, while MPC has been around for almost four decades, it’s rarely deployed because building scalable and efficient MPC protocols requires deep cryptography expertise. Additionally, while MPC performs well when computing small or simple stand-alone functions, combining several different kinds of functions—which is fundamental to ML applications—is much harder and inefficient if done without a specialized skillset.

EzPC solves these problems, making it easy for all developers, not just cryptography experts, to use MPC as a building block in their applications while providing high computational performance. Two innovations are at the core of EzPC. First, a modular compiler called CrypTFlow takes as input TensorFlow or Open Neural Network Exchange (ONNX) code for ML inference and automatically generates C-like code, which can then be compiled into various MPC protocols. This compiler is both “MPC-aware” and optimized, ensuring that the MPC protocols are efficient and scalable. The second innovation is a suite of highly performant cryptographic protocols for securely computing complex ML functions.

The EzPC system provides usability, security, and performance. For usability, EzPC provides automatic compilation from TensorFlow or ONNX code to MPC protocols, and no cryptography expertise is required. For security, mathematical guarantees ensure that only random bits are exchanged and that sensitive data is demonstrably secured. For performance, real-world benchmarks show that EzPC can run on million-parameter networks and execute in minutes.

EzPC in practice: Multi-institution medical imaging AI validation

In a recent collaboration with researchers at Stanford University and the Centre for Advanced Research in Imaging, Neuroscience & Genomics (CARING), the EzPC team built a system using EzPC to address the need for secure and performant AI model validation. The team from Stanford University had developed a widely acclaimed 7-million parameter DenseNet-121 AI model trained on the CheXpert dataset to predict certain lung diseases from chest X-rays, while a team from CARING created a labeled test dataset of five hundred patient images. The goal was to test the accuracy of the CheXpert model on CARING’s test dataset while preserving the privacy of both the model and the test data.

With this test, EzPC enabled the first-ever secure validation of a production-grade AI model, proving that it’s not necessary to share data to accurately perform AI model validation. Additionally, the performance overheads of the secure validation were reasonable and practical for the application. In particular, it took 15 minutes to perform secure inference on a single image from the test data between two standard cloud virtual machines, about 3,000x longer than the time needed to test an image without the added security that EzPC provides. Running all the images from the test data took a total of five days at a nominal overall cost (see “Multi-institution encrypted medical imaging AI validation without data sharing”).

Looking ahead: Standardizing privacy technology and applications beyond healthcare

With EzPC, MPC technology is now practical and accessible enough to be run on complex AI workloads, making it a game-changer in data collaboration and enabling organizations in all industries, not only healthcare, to select the best AI models for their use cases while simultaneously protecting data and model confidentiality. We want to encourage the use of EzPC with the awareness that it’s possible to validate AI models without sharing data. In doing so, we can prevent the risk of data exposure and potentially overcome current barriers in data collaboration.

Moreover, this technology has the potential to impact the negotiation of complex legal agreements required for the AI model validation process. It’s our hope that these types of legal agreements as well as legislation that aims to protect sensitive and proprietary information can incorporate the understanding that—when using the latest privacy-preserving technology—it’s not necessary to share this type of information to compute functions on joint data.

In addition to AI model validation, EzPC can be applied to a number of different scenarios where it’s essential to maintain data privacy. We’ve successfully evaluated EzPC to securely compute a variety of algorithms across such domains as phishing detection, personalized radiotherapy, speech to keywords, and analytics.

EzPC is open source under MIT license on GitHub. Discover the latest developments on the EzPC research project page, where you can read our publications and watch videos to learn more.

The post EzPC: Increased data security in the AI model validation process appeared first on Microsoft Research.
