Research Focus: Week of September 26, 2022
Clifford neural layers for PDE modeling
Johannes Brandstetter, Rianne van den Berg, Max Welling, and Jayesh K. Gupta
Partial differential equations (PDEs) are widely used to describe physical processes as scalar and vector fields that interact and coevolve over time.
Recent research has focused on using neural surrogates to accelerate such simulations. However, current methods do not explicitly model relationships between fields and their correlated internal components, such as the scalar temperature field and the vector wind velocity field in weather simulations.
In this work, we view the time evolution of such correlated fields through the lens of multivector fields, which consist of scalar, vector, and higher-order components, such as bivectors. An example of a bivector is the cross product of two vectors in 3-D: it corresponds to the plane segment spanned by those two vectors and is often represented as a vector itself. However, unlike a true vector, the cross product flips sign under reflection, which is what makes it a bivector.
The algebraic properties of multivectors, such as multiplication and addition, can be described by Clifford algebras, leading to Clifford neural layers such as Clifford convolutions and Clifford Fourier transforms. These layers are universally applicable and will find direct use in the areas of fluid dynamics, weather forecasting, and the modeling of physical systems in general.
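To make the multivector idea concrete, here is a minimal sketch (not the paper's implementation) of the geometric product of two 3-D vectors, which splits into a scalar part (the dot product) and a bivector part (the wedge product); this kind of mixed-grade object is exactly what Clifford neural layers operate on.

```python
# Illustrative sketch only: the geometric product of two 3-D vectors decomposes into
# a scalar (grade-0) part and a bivector (grade-2) part, the basic mixed-grade
# "multivector" that Clifford algebra operations act on.
import numpy as np

def geometric_product(a, b):
    """Return (scalar, bivector) parts of the geometric product of 3-D vectors a and b."""
    scalar = float(np.dot(a, b))          # grade-0 part: the dot product
    bivector = np.array([                 # grade-2 part: e12, e13, e23 components
        a[0] * b[1] - a[1] * b[0],
        a[0] * b[2] - a[2] * b[0],
        a[1] * b[2] - a[2] * b[1],
    ])
    return scalar, bivector

# Hypothetical example: a wind-velocity vector and a temperature-gradient vector.
wind = np.array([1.0, 2.0, 0.5])
gradient = np.array([0.3, -1.0, 2.0])
s, B = geometric_product(wind, gradient)
print(s, B)  # one scalar plus three bivector components
```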
InAs-Al Hybrid Devices Passing the Topological Gap Protocol
While the promise of quantum is great, we are still in the early days of what is possible. Today’s quantum computers enable interesting research; however, innovators find themselves limited by the inadequate scale of these systems and are eager to do more. Microsoft is taking a more challenging, but ultimately more promising, approach to scaled quantum computing. We are engineering a full-stack quantum machine powered by topological qubits. We theorize that this type of qubit will be inherently more stable than qubits produced with existing methods without sacrificing size or speed. Earlier this year we had a major scientific breakthrough that cleared a significant hurdle: we discovered that we could produce the topological superconducting phase and its concomitant Majorana zero modes. In essence, we now have the building block for a topological qubit and quantum at scale. Learn more about this discovery from Dr. Chetan Nayak, Distinguished Engineer of Quantum at Microsoft, who just released a preprint to arXiv and met with leaders in the quantum computing industry.
DaVinci Image and Video Enhancement Toolkit
To help users enhance low-quality videos in real time, on edge devices with limited computing power, researchers from Microsoft Research Asia launched a set of intelligent image and video enhancement tools called DaVinci. This toolkit aims to solve the pain points of existing video enhancement and restoration tools, take full advantage of AI technology, and lower the barrier for users who need to process video footage.
The targeted features of the DaVinci toolkit include low-level image and video enhancement tasks, real-time image and video filters, and visual quality enhancements such as super-resolution, denoising, and video frame interpolation. The backend of the DaVinci toolkit is built on industry-leading, large-scale, low-level vision pre-training technology, supplemented by extensive training data. To maximize the robustness of the model, researchers used four million publicly available images and videos, with content covering landscapes, buildings, people, and more. To ensure an adequate amount of training data and rich data types, researchers synthesized data with various degradations, so that model training could cover more real-world user scenarios. The DaVinci toolkit has been packaged and released on GitHub.
RetrieverTTS: Modeling decomposed factors for text-based speech insertion
Dacheng Yin, Chuanxin Tang, Yanqing Liu, Xiaoqiang Wang, Zhiyuan Zhao, Yucheng Zhao, Zhiwei Xiong, Sheng Zhao, Chong Luo
In the post-pandemic world, a large portion of meetings, conferences, and training sessions have moved online, leading to a sharp increase in demand for audio and video creation. However, making zero-mistake audio/video recordings is time consuming, with re-recordings required to fix even the smallest mistakes. Therefore, a simple technique for making partial corrections to recordings is urgently needed. Researchers from Microsoft Research Asia’s Intelligent Multimedia Group and the Azure Speech Team developed a text-based speech editing system to address this problem. The system supports cutting, copying, and pasting operations, as well as inserting synthesized speech segments based on text transcriptions.
AI4Science expands with new research lab in Berlin
Deep learning is set to transform the natural sciences, dramatically improving our ability to model and predict natural phenomena over widely varying scales of space and time. In fact, this may be the dawn of a new paradigm of scientific discovery. To help make this new paradigm a reality, Microsoft Research recently established AI4Science, a new global organization that will bring together experts in machine learning, quantum physics, computational chemistry, molecular biology, fluid dynamics, software engineering, and other disciplines.
This month, the AI4Science organization expanded with a new presence in Berlin, to be led by Dr. Frank Noé. As a “bridge professor,” Noé specializes in work at the interfaces between the fields of mathematics and computer science as well as physics, biology, chemistry and pharmacy. For his pioneering work in developing innovative computational methods in biophysics, he has received awards and funding from the American Chemical Society and the European Research Council, among others.
In the video below, Noé discusses the new lab in Berlin, and the potential for machine learning to advance the natural sciences, with Chris Bishop, Microsoft Technical Fellow and director of AI4Science.
The post Research Focus: Week of September 26, 2022 appeared first on Microsoft Research.
Assessing AI system performance: thinking beyond models to deployment contexts
AI systems are becoming increasingly complex as we move from visionary research to deployable technologies such as self-driving cars, clinical predictive models, and novel accessibility devices. Compared with individual AI models, it is more difficult to assess whether these more complex AI systems are performing consistently and as intended to realize human benefit.
This increased complexity arises from several factors:
- Real-world contexts in which the data may be noisy or differ from the training data;
- Multiple AI components that interact with each other, creating unanticipated dependencies and behaviors;
- Human-AI feedback loops that come from repeated engagements between people and the AI system;
- Very large AI models (e.g., transformer models);
- AI models that interact with other parts of a system (e.g., a user interface or a heuristic algorithm).
How do we know when these more advanced systems are ‘good enough’ for their intended use? When assessing the performance of AI models, we often rely on aggregate performance metrics such as percentage accuracy. But this ignores the many, often human, elements that make up an AI system.
Our research on what it takes to build forward-looking, inclusive AI experiences has demonstrated that getting to ‘good enough’ requires multiple performance assessment approaches at different stages of the development lifecycle, based upon realistic data and key user needs (figure 1).
Shifting emphasis gradually from iterative adjustments in the AI models themselves toward approaches that improve the AI system as a whole has implications not only for how performance is assessed, but also for who should be involved in the performance assessment process. Engaging (and training) non-technical domain experts earlier (i.e., for choosing test data or defining experience metrics) and in a larger capacity throughout the development lifecycle can enhance the relevance, usability, and reliability of the AI system.
Performance assessment best practices from the PeopleLens
The PeopleLens (figure 2) is a new Microsoft technology designed to enable children who are born blind to experience social agency and build up the range of social attention skills needed to initiate and maintain social interactions. Running on smart glasses, it provides the wearer with continuous, real-time information about the people around them through spatial audio, helping them build up a dynamic map of the whereabouts of others. Its underlying technology is a complex AI system that uses several computer vision algorithms to calculate pose, identify registered people, and track those entities over time.
The PeopleLens offers a useful illustration of the wide range of performance assessment methods and people necessary to comprehensively gauge its efficacy.
Getting started: AI model or AI system performance?
Calculating aggregate performance metrics on open-source benchmarked datasets may demonstrate the capability of an individual AI model, but may be insufficient when applied to an entire AI system. It can be tempting to believe a single aggregate performance metric (such as accuracy) can be sufficient to validate multiple AI models individually. But the performance of two AI models in a system cannot be comprehensively measured by simple summation of each model’s aggregate performance metric.
We used two AI models to test the accuracy of the PeopleLens to locate and identify people: the first was a benchmarked, state-of-the-art pose model used to indicate the location of people in an image. The second was a novel facial recognition algorithm previously demonstrated to have greater than 90% accuracy. Despite strong historical performance of these two models, when applied to the PeopleLens, the AI system recognized only 10% of people from a realistic dataset in which people were not always facing the camera.
This finding illustrates that multi-algorithm systems are more than a sum of their parts, requiring specific performance assessment approaches.
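As a toy illustration of this point, the sketch below chains two hypothetical models whose individual benchmark-style accuracies are high, yet whose end-to-end identification rate collapses on a realistic data distribution in which people are often not facing the camera. The models, accuracy numbers, and data are invented for illustration and are not PeopleLens components.

```python
# A minimal, hypothetical sketch of why stage-wise benchmark accuracy does not predict
# end-to-end system accuracy: each stage is evaluated here on a *realistic* data
# distribution, not on its own curated benchmark. All numbers are illustrative.
import random

random.seed(0)
frames = [{"facing_camera": random.random() < 0.3} for _ in range(10_000)]

def pose_model(frame):
    # Roughly 95% accurate, largely independent of head orientation.
    return random.random() < 0.95

def face_recognizer(frame):
    # Above 90% on a frontal-face benchmark, but it degrades sharply
    # when the person is not facing the camera.
    return random.random() < (0.92 if frame["facing_camera"] else 0.05)

end_to_end = sum(pose_model(f) and face_recognizer(f) for f in frames) / len(frames)
print(f"end-to-end identification rate: {end_to_end:.1%}")  # far below 0.95 * 0.92
```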
Connecting to the human experience: Metric scorecards and realistic data
Metrics scorecards, calculated on a realistic reference dataset, offer one way to connect to the human experience while the AI system is still undergoing significant technical iteration. A metrics scorecard can combine several metrics to measure aspects of the system that are most important to users.
We used ten metrics in the development of the PeopleLens. The two most valuable were time-to-first-identification, which measured how long it took from the moment a person appeared in a frame to the user hearing that person’s name, and number of repeat false positives, which measured how often a false positive occurred in three or more consecutive frames within the reference dataset.
The first metric captured the core value proposition for the user: having the social agency to be the first to say hello when someone approaches. The second was important because the AI system would self-correct single misidentifications, while repeated mistakes would lead to a poor user experience. This measured the ramifications of that accuracy throughout the system, rather than just on a per-frame basis.
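A hypothetical sketch of how such scorecard metrics might be computed from per-frame logs is shown below; the log schema (timestamps, visible people, announced names) is an assumption for illustration and not the PeopleLens data format.

```python
# Hypothetical per-frame log schema and the two scorecard metrics described above.
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class FrameLog:
    t: float               # seconds since session start
    visible: Set[str]      # people actually present in the frame
    announced: Set[str]    # names the system announced for this frame

def time_to_first_identification(logs: List[FrameLog], person: str) -> Optional[float]:
    """Seconds from the first frame in which `person` appears to the first announcement."""
    first_seen = next((f.t for f in logs if person in f.visible), None)
    first_heard = next((f.t for f in logs if person in f.announced), None)
    if first_seen is None or first_heard is None or first_heard < first_seen:
        return None
    return first_heard - first_seen

def repeat_false_positives(logs: List[FrameLog], min_run: int = 3) -> int:
    """Count runs of at least `min_run` consecutive frames announcing someone not present."""
    runs, streak = 0, {}
    for f in logs:
        wrong = f.announced - f.visible
        streak = {p: streak.get(p, 0) + 1 for p in wrong}        # streaks reset when a name stops being wrong
        runs += sum(1 for n in streak.values() if n == min_run)  # count each qualifying run once
    return runs
```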
Beyond metrics: Using visualization tools to fine-tune the user experience
While metrics play a critical role in the development of AI systems, a wider range of tools is needed to fine-tune the intended user experience. It is essential for development teams to test on realistic datasets to understand how the AI system generates the actual user experience. This is especially important with complex systems, where multiple models, human-AI feedback loops, or unpredictable data (e.g., user-controlled data capture) can cause the AI system to respond unpredictably.
Visualization tools can enhance the top-down statistical tools of data scientists, helping domain experts contribute to system development. In the PeopleLens, we used custom-built visualization tools to compare side-by-side renditions of the experience with different model parameters (figure 3). We leveraged these visualizations to enable domain experts—in this case parents and teachers—to spot patterns of odd system behavior across the data.
AI system performance in the context of the user experience
A user experience can only be as good as the underlying AI system. Testing the AI system in a realistic context, measuring the things that matter to users, is a critical stage before widespread deployment. We know, for example, that improving AI system performance does not necessarily correspond to improved performance of human-AI teams (reference).
We also know that human-AI feedback loops can make it difficult to measure an AI system’s performance. These feedback loops are essentially repeated interactions between the AI system and the user; they can surface (and intensify) errors, but those errors can also, through good intelligibility, be repaired by the user.
The PeopleLens system gave users feedback about people’s locations and faces. A missed identification (e.g., because the person is looking at a chest rather than a face) can be resolved once the user responds to feedback (e.g., by looking up). This example shows that we do not need to focus on missed identifications, as they will be resolved by the human-AI feedback loop. However, users were very perplexed by the identification of people who were no longer present, so performance assessments needed to focus on these false positive misidentifications.
This work suggests several best practices for assessing the performance of complex AI systems:
- Multiple performance assessment methods should be used in AI system development. In contrast to developing individual AI models, general aggregate performance metrics are a small component, relevant primarily in the earliest stages of development.
- Documenting AI system performance should include a range of approaches, from metrics scorecards to system performance metrics for a deployed user experience, to visualization tools.
- Domain experts play an important role in performance assessment, beginning early in the development lifecycle. However, domain experts are often not prepared for the depth of participation that is optimal in AI system development.
- Visualization tools are as important as metrics in creating and documenting an AI system for a particular intended use. It is critical that domain experts have access to these tools as key decision-makers in AI system deployment.
Bringing it all together
For complex AI systems, performance assessment methods change across the development lifecycle in ways that differ from individual AI models. At the beginning of the development process, rapid technical innovation calls for easy-to-calculate aggregate metrics; toward the end, assessment should shift to performance metrics that reflect the critical AI system attributes that make up the user experience. Shifting techniques in this way helps every type of stakeholder precisely and collectively define what is ‘good enough’ to achieve the intended use.
It is useful for developers to remember performance assessment is not an end goal in itself; it is a process that defines how the system has reached its best state and whether that state is ready for deployment. The performance assessment process must include a broad range of stakeholders, including domain experts, who may need new tools to fulfill critical (sometimes unexpected) roles in the development and deployment of an AI system.
The post Assessing AI system performance: thinking beyond models to deployment contexts appeared first on Microsoft Research.
AI Models vs. AI Systems: Understanding Units of Performance Assessment
As AI becomes more deeply integrated into every aspect of our lives, it is essential that AI systems perform appropriately for their intended use. We know AI models can never be perfect, so how do we decide when AI performance is ‘good enough’ for use in a real life application? Is level of accuracy a sufficient gauge? What else matters? These are questions Microsoft Research tackles every day as part of our mission to follow a responsible, human-centered approach to building and deploying future-looking AI systems.
To answer the question, “what is good enough?”, it becomes necessary to distinguish between an AI model and an AI system as the unit of performance assessment. An AI model typically involves some input data, a pattern-matching algorithm, and an output classification. For example, a radiology scan of the patient’s chest might be shown to an AI model to predict whether a patient has COVID-19. An AI system, by contrast, would evaluate a broader range of information about the patient, beyond the COVID-19 prediction, to inform a clinical decision and treatment plan.
Research has shown that human-AI collaboration can achieve greater accuracy than AI models alone (reference). In this blog, we share key learnings from the recently retired Project Talia, a prior collaboration between Microsoft Research and SilverCloud Health, to understand how thinking about the AI system as a whole—beyond the AI model—can help to more precisely define and enumerate ‘good enough’ for real-life application.
In Project Talia, we developed two AI models to predict treatment outcomes for patients receiving human-supported, internet-delivered cognitive behavioral treatment (iCBT) for symptoms of depression and anxiety. These AI models have the potential to assist the work practices of iCBT coaches: practicing behavioral health professionals specifically trained to guide patients in the use of the treatment platform, recommend specific treatments, and help patients work through identified difficulties.
Project Talia offers an illustration of the distinction between the AI model produced during research and a resulting AI system that could potentially get implemented to support real-life patient treatment. In this scenario, we demonstrate every system element that must be considered to ensure effective system outcomes, not just AI model outcomes.
Project Talia: Improving Mental Health Outcomes
SilverCloud Health (acquired by Amwell in 2021) is an evidence-based, digital, on-demand mental health platform that delivers iCBT-based programs to patients in combination with limited but regular contact from the iCBT coach. The platform offers more than thirty iCBT programs, predominantly for treating mild-to-moderate symptoms of depression, anxiety, and stress.
Patients work through the program(s) independently and with regular assistance from the iCBT coach, who provides guidance and encouragement through weekly reviews and feedback on the treatment journey.
Previous research (reference) has shown that involving a human coach within iCBT leads to more effective treatment outcomes for patients than unsupported interventions. To maximize the effects and outcomes of human support in this format, we developed AI models that dynamically predict the likelihood of a patient achieving a reliable improvement[1] in their depression and anxiety symptoms by the end of the treatment program (typically 8 to 14 weeks in length).
Existing literature on feedback-informed therapy (reference) and Project Talia research (reference) suggest that access to these predictions could provide reassurance for patients who are ‘on track’ toward achieving a positive outcome from treatment, or prompt iCBT coaches to make appropriate adjustments for those who are not, to better meet those patients’ needs.
AI Model vs. AI System
The figure above illustrates the distinction between the AI model and AI system in this example (figure 1). The AI model takes in a clinical score calculated by a patient’s responses to standardized clinical questionnaires that assess symptoms of depression and anxiety at each treatment session. After three treatment sessions, the AI model predicts whether or not the patient will achieve a clinically significant reduction in mental health symptoms at completion of the treatment program. The AI model itself is trained on fully anonymized clinical scores of nearly 50,000 previous SilverCloud Health patients and achieved an acceptable accuracy of 87% (reference).
The outcome prediction could then be embedded into the clinical management interface that guides iCBT coaches in their efforts to make more informed decisions about that patient’s treatment journey (i.e., increase level and frequency of support from the coach).
When AI models are introduced into human contexts such as this, they rarely work in isolation. In this case, the clinical score is entered into a model with parameters tuned to a particular healthcare context. This example illustrates how AI model performance metrics are not sufficient to determine whether an AI system is ‘good enough’ for real-life application. We must examine challenges that arise throughout every element of the AI system.
Following are two specific examples of how AI system performance can be altered while retaining the same model: contextual model parameters and user interface and workflow integration.
Contextual Model Parameters: Which Error Type Is Most Costly?
Examining overall performance metrics exclusively can limit visibility into the different types of errors an AI model can make, which can have (potentially negative) implications on the AI system as a whole. For example, an AI model’s false positive and false negative errors can impact the AI system differently. A false positive error could mean a patient who needed extra help might not receive it; a false negative would mean a patient may receive unnecessary care. In this case, false positive errors would have a much bigger impact on a patient than false negative errors. But false negative errors can also be problematic when they cause unnecessary resource allocation.
Contextual model parameters can be tuned to change the balance between error types while maintaining the overall accuracy of the model. The clinical team could define these contextual model parameters to minimize false positive errors, which would be more detrimental to patients, by specifying that the model produce only 5% false positive errors. Choosing this parameter could come, however, at the expense of a higher false negative rate, which would require monitoring how AI model performance might then impact service costs or staff burnout.
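As a minimal sketch of this kind of parameter choice, the snippet below uses synthetic data to pick a decision threshold that keeps the false positive rate at roughly 5% and then reports the false negative rate that choice implies. The data, the scores, and the threshold-selection recipe are illustrative assumptions, not the Project Talia model.

```python
# Illustrative only: choose a decision threshold so the system produces at most ~5%
# false positive errors ("predicted to improve" when the patient will not), then
# inspect the false negative rate that this choice implies.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Synthetic validation set: 1 = patient will reliably improve, 0 = will not.
y_true = rng.integers(0, 2, size=5_000)
# Synthetic model scores that are informative but imperfect.
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, size=y_true.shape), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
idx = np.searchsorted(fpr, 0.05, side="right") - 1   # largest threshold with FPR <= 5%
threshold = thresholds[idx]
false_negative_rate = 1 - tpr[idx]
print(f"threshold={threshold:.3f}  FPR={fpr[idx]:.1%}  FNR={false_negative_rate:.1%}")
```

In practice this is exactly the kind of trade-off a clinical team would weigh with a visualization tool: tightening the false positive constraint pushes the false negative rate up, with downstream consequences for service costs and staff workload.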
This example illustrates the challenging decisions domain experts, who may know little about the details of AI, must make and the implications these decisions can have on AI system performance. In this example, we provided a prototype visualization tool to help the clinical team in their understanding of the implications of their choices across different patient groups.
We are moving into a world in which domain experts and business decision makers, who embed AI into their work practices, will bear increasing responsibility in assessing and debugging AI systems to improve quality, functionality, and relevance to their work.
User Interface and Workflow Integration
AI model predictions need to be contextualized for a given workflow. Research on iCBT coaches has shown that listing the predictions for all of a coach’s patients on a single screen, outside the normal patient-centered workflow, can be demotivating (reference). If a coach saw that the majority of their patients were predicted not to improve, or that their patients’ outcomes were predicted to be worse than those of their colleagues, this could lead coaches to question their own competence or invite competitive thoughts about their colleagues’ performance—both unhelpful in this context.
Displaying the AI model prediction inside the individual patient’s profile, as in the illustration below (figure 2), provides a useful indicator of how well the person is doing and therefore can guide clinical practice. It also deliberately encourages the use of the AI model prediction within the context of other relevant clinical information.
Situating the AI output with other patient information can nurture a more balanced relationship between AI-generated insight and coaches’ own patient assessments, which can counterbalance effects of over-reliance and over-trust in potentially fallible AI predictions (also referred to as automation bias).
This example illustrates how user interface design and workflow integration shape how well AI model predictions are understood, and how they can contribute to the success or failure of an AI system as a whole. Domain experts, user researchers, and service designers come to play a far more important role in the development of AI systems than the typical focus on data scientists would suggest.
Final Thoughts
Aggregate performance metrics, such as accuracy, area-under-the-curve (AUC) scores, or mean square error, are easy to calculate on an AI model, but they indicate little about the utility or function of the entire AI system in practice. So, how do we decide when AI system performance is ‘good enough’ for use in real-life application? It is clear that high levels of AI model performance alone are not sufficient—we must consider every element of the AI system.
Contextual model parameters and interface and workflow design are just two examples of how preparing domain experts with the right expectations, skills, and tools is necessary to gain optimal benefit from incorporating AI systems into human contexts.
[1] Defined as an improvement of 6 or more points on the PHQ-9 depression scale, or 4 or more points on the GAD-7 anxiety scale.
The post AI Models vs. AI Systems: Understanding Units of Performance Assessment appeared first on Microsoft Research.
CCF: Bringing efficiency and usability to a decentralized trust model
Online trust has come a long way since the time of centralized databases, where information was concentrated in one location and the security and validation of that information relied on a core set of people and systems. While convenient, this model of centralized management and oversight had a number of drawbacks. Trust depended on how the workflows of those systems were established and the skillset and integrity of the people involved. It created opportunities for such issues as duplicate digital transactions, human error, and bias, as witnessed in recent history in the financial industry. In response to these systemic issues, a now-famous paper published in late 2008 proposed a distributed ledger, where new transactions could be added and validated only through participant consensus. This model of decentralized trust and execution would become known as distributed ledger technology, or blockchain, and it offered a more trustworthy alternative to centrally managed databases and a new way to store and decentralize data.
In a distributed trust model, network participants validate transactions over a network by performing computation on those transactions themselves and comparing the outputs. While their identities are private and those performing the transactions typically have pseudonyms, the transactions themselves are public, greatly limiting the use cases for decentralized computation systems. One use case where decentralized computation doesn’t work involves handling financial transactions so that they’re compliant with Know Your Client (KYC) standards and anti-money laundering (AML) regulations while also respecting privacy laws. Another involves managing medical records, where multiple organizations, such as healthcare providers and insurers, jointly govern the system.
Distributed trust with centralized confidential computation
While blockchain provided a more reliable alternative to centralized databases, it isn’t a perfect solution. The Confidential Computing team at Microsoft Research wanted to build a system that retained the advantages of decentralized trust while keeping transactions confidential. This meant we had to develop a way to centralize computation. At the time, no system offered these capabilities.
To tackle this issue, we developed Confidential Consortium Framework (CCF), a framework for building highly available stateful services that require centralized computation while providing decentralized trust. CCF is based on a distributed trust model like that of blockchain while maintaining data confidentiality through secure centralized computation. This centralized confidential computation model also provides another benefit—it addresses the substantial amount of energy used in blockchain and other distributed computation environments.
As widely reported in the media, blockchain comes at a great environmental cost. Cryptocurrency—the most widespread implementation of blockchain—requires a significant amount of computing power to verify transactions. According to the Cambridge Center for Alternative Finance (CCAF), bitcoin, the most common cryptocurrency, as of this writing, currently consumes slightly over 92 terawatt hours per year—0.41 percent of global electricity production, more than the annual energy draw of countries like Belgium or the Philippines.
Our goal was to develop a framework that reduces the computing power required to run a distributed system, making it much more efficient and requiring no more energy than the cost of running the actual computation.
To apply the technology in a way that people can use, we worked with the Azure Security team to build Azure confidential ledger, an Azure service developed on CCF that manages sensitive data records in a highly secure way. In this post, we discuss the motivations behind CCF, the problems we set out to solve, and the approaches we took to solve them. We also explain our approach in supporting the development of Azure confidential ledger using CCF.
Overcoming a bias for blockchain
We discovered a strong bias for blockchain as we explained our research to different groups that were interested in this technology, including other teams at Microsoft, academic researchers exploring blockchain consensus, and external partners looking for enterprise-ready blockchain solutions. This bias was in the form of certain assumptions about what was needed to build a distributed ledger: that all transactions had to be public, that computation had to be geographically distributed, and that it had to be resilient to Byzantine faults from executors. First recognizing these biases and then countering them were some of the biggest challenges we had to surmount.
We worked to show how CCF broke from each of these assumptions while still providing an immutable ledger with distributed trust. We also had to prove that there were important use cases for maintaining confidentiality in a distributed trust system. We went through multiple rounds of discussion, explaining how the technology we wanted to build was different from traditional blockchains, why it was a worthwhile investment, and what the benefits were. Through these conversations, we discovered that many of our colleagues were just as frustrated as we were by the very issues in blockchain we were setting out to solve.
Additionally, we encountered skepticism from internal partner teams, who needed more than a research paper to be convinced that we could successfully accomplish our research goals and support our project. There were healthy doubts about the performance that was possible when executing inside an encrypted and isolated memory space, the ability to build a functional and usable system with a minimal set of trusted components, and how much of the internal complexity could be hidden from operators and users. Early versions of CCF and sample apps focused on proving we could overcome those risks. We built basic proofs of concept and gave numerous demonstrations showing how we could implement distributed trust with centralized confidential computation. In the end, it was the strength of these demos that helped us get the resources we needed to pursue our research.
Building the compute stack
Another challenge involved reimagining a secure compute stack for an enclave—the secured portion of the hardware’s processor and memory. At the time, enclaves were very resource constrained compared with traditional hardware, and we could run only small amounts of code on very little memory.
In addition, capabilities are limited when performing computation in an enclave. For example, the code can’t access anything outside the enclave, and it’s difficult to get the code to communicate with an external system. This challenge required us to design and build an entire compute stack from scratch with all the elements needed to establish consensus, implement transactional storage, establish runtimes for user languages, and so on.
Another consideration was the need to build a system that people could use. As researchers, we wanted our work to have real impact, but it was tempting to push the state of the art in the area of confidential computing research and develop very elaborate technology in these enclaves. However, these types of innovations cannot be deployed in actual products because they’re exceedingly difficult to explain and apply. We had committed to creating something that product teams could implement and use as a foundation for building real systems and products, so we worked to calibrate the guarantees and threat model so that our system could be used in actual products.
Establishing a root of trust with CCF
CCF strengthens the trust boundary in scenarios in which both distributed trust and data confidentiality are needed by decreasing the size of the trusted computing base (TCB)—the components of a computing environment that must be trusted for the appropriate level of security to be applied—reducing the attack surface. Specifically, CCF allows operators to greatly decrease or even eliminate their presence in the TCB, depending on the governance configuration.
Instead of a social root of trust—such as a cloud service provider or the participant consensus used in blockchain networks—CCF relies on trusted hardware to enforce transaction integrity and confidentiality, which creates a trusted execution environment (TEE). These TEEs are isolated memory spaces that are kept encrypted at all times, even when data is executing. The memory chip itself strictly enforces this memory encryption. Data in TEEs is never readable.
Decentralized trust is underpinned by remote attestation, providing the guarantee to a remote entity that all computation of user data takes place in a publicly verifiable TEE. The combination of this attestation with the isolated and encrypted TEE creates a distributed trust environment. Nodes in the network establish mutual trust by verifying their respective attestations, which affirm that they’re running the expected code in a TEE. The operator starting the nodes, which can be automated or manual, indicates where in the network they can find each other.
Service governance is performed by a flexible consortium, which is separate from the operator. CCF uses a ledger to provide offline trust. All transactions are reflected in a tamper-protected ledger that users can review to audit service governance and obtain universally verifiable transaction receipts, which can verify the consistency of the service and prove the execution of transactions to other users. This is particularly valuable for users who need to comply with specific laws and regulations.
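As a rough illustration of the idea behind tamper-protected ledgers and verifiable receipts, the sketch below chains entry digests so that a client holding a receipt (here, just an entry's position and digest) can detect later modification. This is a conceptual toy only; CCF's actual ledger format, Merkle-tree receipts, and signing scheme differ.

```python
# Conceptual sketch of a tamper-evident ledger: each entry's digest commits to the
# previous entry, so recomputing the chain exposes any modification of earlier data.
# Not CCF's actual format (which uses Merkle trees and signed receipts).
import hashlib
import json

def entry_digest(prev_digest: bytes, payload: dict) -> bytes:
    data = prev_digest + json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(data).digest()

def build_ledger(payloads):
    digests, prev = [], b"\x00" * 32
    for p in payloads:
        prev = entry_digest(prev, p)
        digests.append(prev)
    return digests

payloads = [{"tx": i, "value": v} for i, v in enumerate(["a", "b", "c"])]
ledger = build_ledger(payloads)
receipt = {"index": 1, "digest": ledger[1]}          # what a client would keep

# Later, anyone can recompute the chain from the stored payloads and check the receipt.
assert build_ledger(payloads)[receipt["index"]] == receipt["digest"]
payloads[1]["value"] = "tampered"
assert build_ledger(payloads)[receipt["index"]] != receipt["digest"]  # tampering detected
print("receipt verification behaves as expected")
```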
Laying the foundation for Azure confidential ledger
We collaborated with the Azure Security team to refine and improve CCF so that it could be used as a foundation for building new Azure services for confidential computing. We applied Azure API standards and ensured that CCF complied with Azure best practices, including enabling it to log operations and perform error reporting and long-running queries. We then developed a prototype of an Azure application, and from this, the Azure Security team developed Azure confidential ledger, the first generally available managed service built on CCF, which provides tamper-protected audit logging that can be cryptographically verified.
Looking forward
We were pleasantly surprised by how quickly we discovered new use cases for CCF and Azure confidential ledger, both within Microsoft and with third-party users. Now, most of the use cases are those we had not initially foreseen, from atmospheric carbon removal to securing machine learning logs. We’re extremely excited by the potential for CCF to have much more impact than we had originally planned or expected when we first started on this journey, and we’re looking forward to discovering some of the countless ways in which it can be applied.
The post CCF: Bringing efficiency and usability to a decentralized trust model appeared first on Microsoft Research.
Microsoft Research Summit 2022: What’s Next for Technology and Humanity?
Today, we are experiencing waves of breakthroughs in computing that are transforming just about every aspect of our lives. Artificial intelligence is changing the way we develop and create. Human language technologies are revolutionizing the workflows of healthcare professionals. Deep learning is accelerating our ability to understand and predict natural phenomena, from atomic to galactic scales. Meanwhile, the foundations of cloud computing are undergoing a reinvention from the atoms up.
Realizing the benefits of these new breakthroughs demands that we come together in new ways across the global research community. The vibrancy of invention and innovation increasingly lies at the intersections among traditional research disciplines, from the highly theoretical to the immediately applicable. Ensuring that the continuing advancement of technology is beneficial to all requires communication, collaboration and co-innovation across the communities that create new technologies and those that aim to use them to improve their lives.
That’s why I’m excited to invite you to join us for this year’s Microsoft Research Summit, which will take place on October 18-20, 2022. This virtual event is where the global research community convenes to explore how emerging research might best address societal challenges and have significant impact on our lives in the coming years. This year’s event will feature over 120 speakers, including researchers and leaders from across the research community at Microsoft, alongside partners and collaborators from industry, academia and government who are advancing the frontiers of research in computing and across the sciences.
Each of our three days will begin with a plenary session during which we’ll explore the potential impact of deep learning on scientific discovery, the opportunity to use technology to make healthcare more precise and accessible, and the re-invention of foundational technologies to enable the cloud of the future. These plenaries will lead into tracks that dive deeper into research that spans from more efficient and adaptable AI, to technologies that amplify human creativity and help foster a more sustainable society.
For further details – and to register to attend – check out the Microsoft Research Summit website.
We hope you will join us.
The post Microsoft Research Summit 2022: What’s Next for Technology and Humanity? appeared first on Microsoft Research.
A game-theoretic approach to provably correct and scalable offline RL
Robot Learning Group
Despite increasingly widespread use of machine learning (ML) in all aspects of our lives, a broad class of scenarios still relies on automation designed by people, not artificial intelligence (AI). In real-world applications that involve making sequences of decisions with long-term consequences, from allocating beds in an intensive-care unit to controlling robots, decision-making strategies to this day are carefully crafted by experienced engineers. But what about reinforcement learning (RL), which gave machines supremacy in games as distinct as Ms. Pac-Man and Go? For all its appeal, RL – specifically, its most famous flavor, online RL – has a significant drawback outside of scenarios that can be simulated and have well-defined behavioral rules. Online RL agents learn by trial and error. They need opportunities to try various actions, observe their consequences, and improve as a result. Making wildly suboptimal decisions just for learning’s sake is acceptable when the biggest stake is the premature demise of a computer game character or showing an irrelevant ad to a website visitor. For tasks such as training a self-driving car’s AI, however, it is clearly not an option.
Offline reinforcement learning (RL) is a paradigm for designing agents that can learn from large existing datasets – possibly collected by recording data from existing reliable but suboptimal human-designed strategies – to make sequential decisions. Unlike the conventional online RL, offline RL can learn policies without collecting online data and even without interacting with a simulator. Moreover, since offline RL does not blindly mimic the behaviors seen in data, like imitation learning (an alternate strategy of RL), offline RL does not require expensive expert-quality decision examples, and the learned policy of offline RL can potentially outperform the best data-collection policy. This means, for example, an offline RL agent in principle can learn a competitive driving policy from logged datasets of regular driving behaviors. Therefore, offline RL offers great potential for large-scale deployment and real-world problem solving.
However, offline RL faces a fundamental challenge: the data we can collect in large quantity lacks diversity, so it is impossible to use it to estimate how well a policy would perform in the real world. While we often associate the term “Big Data” with diverse datasets in ML, this is no longer true when the data concerns real-world “sequential” decision making. In fact, curating diverse datasets for these problems can range from difficult to nearly impossible, because it would require running unacceptable experiments in extreme scenarios (like staging the moments just before a car crash, or conducting unethical clinical trials). As a result, the data that gets collected in large quantity, counterintuitively, lacks diversity, which limits its usefulness.
In this post, we introduce a generic game-theoretic framework for offline RL. We frame the offline RL problem as a two-player game where a learning agent competes with an adversary that simulates the uncertain decision outcomes due to missing data coverage. By this game analogy, we provide a systematic and provably correct way to design offline RL algorithms that can learn good policies with state-of-the-art empirical performance. Finally, we show that this framework provides a natural connection between offline RL and imitation learning through the lens of generative adversarial networks (GANs). This connection ensures that the policies learned by this game-theoretic framework are always guaranteed to be no worse than the data collection policies. In other words, with this framework, we can use existing data to robustly learn policies that improve upon the human-designed strategies currently running in the system.
The content of this post is based on our recent papers Bellman-consistent Pessimism for Offline Reinforcement Learning (Oral Presentation, NeurIPS 2021) and Adversarially Trained Actor Critic for Offline Reinforcement Learning (Outstanding Paper Runner-up, ICML 2022).
Fundamental difficulty of offline RL and version space
A major limitation of making decisions with only offline data is that existing datasets do not include all possible scenarios in the real world. Hypothetically, suppose that we have a large dataset of how doctors treated patients in the past 50 years, and we want to design an AI agent to make treatment recommendations through RL. If we run a typical RL algorithm on this dataset, the agent might come up with absurd treatments, such as treating a patient with pneumonia through amputation. This is because the dataset does not have examples of what would happen to pneumonia after amputating patients, and “amputation cures pneumonia” may seem to be a plausible scenario to the learning agent, as no information in the data would falsify such a hypothesis.
To address this issue, the agent needs to carefully consider uncertainties due to missing data. Rather than fixating on a particular data-consistent outcome, the agent should be aware of different possibilities (i.e., while amputating the leg might cure the pneumonia, it also might not, since we do not have sufficient evidence for either scenario) before committing to a decision. Such deliberate conservative reasoning is especially important when the agent’s decisions can cause negative consequences in the real world.
A formal way to express the idea above is through the concept of version space. In machine learning, a version space is a set of hypotheses consistent with data. In the context of offline RL, we will form a version space for each candidate decision, describing the possible outcomes if we are to make the decision. Notably, the version space may contain multiple possible outcomes for a candidate decision whose data is missing.
To understand better what a version space is, we use the figure below to visualize the version space of a simplified RL problem where the decision horizon is one. Here we want to select actions that can obtain the highest reward (such as deciding which drug treatment has the highest chance of curing a disease). Suppose we have data of action-reward pairs, and we want to reason about the reward obtained by taking an action. If the underlying reward function is linear, then we have a version space shown in red, which is a subset of (reward) hypotheses that are consistent with the data. In general, for a full RL problem, a version space can be more complicated than just reward functions, e.g., we would need to start thinking about hypotheses about world models or value functions. Nonetheless, similar concepts like the one shown in the figure below apply.
Thinking of offline RL as two-player games
If we think about uncertainties as hypotheses in a version space, then a natural strategy for designing offline RL agents is to optimize for the worst-case scenario among all hypotheses in the version space. In this way, the agent considers every scenario in the version space as plausible, avoiding decisions that chase a single, possibly delusional, outcome.
Below we show that this approach of considering worst cases for offline RL can be elegantly described by a two-player game, called a Stackelberg game. To this end, let us first introduce some mathematical notation and explain what a Stackelberg game is.
Notation: We use \(\pi\) to denote a decision policy and \(J(\pi)\) to denote the performance of \(\pi\) in the application environment. We use \(\Pi\) to denote the set of policies that the learning agent is considering, and we use \(\mathcal{H}\) to denote the set of all hypotheses. We also define a loss function \(\psi:\Pi\times\mathcal{H}\to[0,\infty)\) such that if \(\psi(\pi,H)\) is small, \(H\) is a data-consistent hypothesis with respect to \(\pi\); conversely, \(\psi(\pi,H)\) gets larger for data-inconsistent ones (e.g., we can treat each \(H\) as a potential model of the world and \(\psi(\pi,H)\) as the modeling error on the data). Consequently, the version space above is defined as \(\mathcal{V}_\pi = \{H : \psi(\pi,H) \leq \varepsilon\}\) for some small \(\varepsilon\).
Given a hypothesis \(H \in \mathcal{H}\), we use \(H(\pi)\) to denote the performance of \(\pi\) predicted by \(H\), which may be different from \(\pi\)’s true performance \(J(\pi)\). As a standard assumption, we suppose that for every \(\pi \in \Pi\) there is some \(H_{\pi}^* \in \mathcal{H}\) that describes the true outcome of \(\pi\), that is, \(J(\pi) = H_{\pi}^*(\pi)\) and \(\psi(\pi,H_{\pi}^*) = 0\).
Stackelberg game: In short, a Stackelberg game is a bilevel optimization problem,
\(\displaystyle \max_{x} f(x, y_x), \quad \mathrm{s.t.} ~~ y_x \in \arg\min_{y} g(x,y)\)
In this two-player game, the Leader \(x\) and the Follower \(y\) maximize and minimize the payoffs \(f\) and \(g\), respectively, under the rule that the Follower plays after the Leader. (This is reflected in the subscript of \(y_x\): \(y\) is decided based on the value of \(x\).) In the special case of \(f = g\), a Stackelberg game reduces to a zero-sum game \(\displaystyle \max_{x} \min_{y} f(x,y)\).
We can think about offline RL as a Stackelberg game: we let the learning agent be the Leader and introduce a fictitious adversary as the Follower, which chooses hypotheses from the version space \(\mathcal{V}_{\pi}\) based on the Leader’s policy \(\pi\). Then we define the payoffs above as performance estimates of a policy in view of a hypothesis. By this construction, solving the Stackelberg game would mean finding policies that maximize the worst-case performance, which is exactly our starting goal.
Now we give details on how this game can be designed, using absolute pessimism or relative pessimism, so that we can have performance guarantees in offline RL.
A two-player game based on absolute pessimism
Our first instantiation is a two-player game based on the concept of absolute pessimism, introduced in the paper Bellman-consistent Pessimism for Offline Reinforcement Learning (NeurIPS 2021).
\(\displaystyle \max_{\pi \in \Pi} H_\pi(\pi), \quad \mathrm{s.t.} ~~ H_\pi \in \arg\min_{H \in \mathcal{H}} H(\pi) + \beta\, \psi(\pi,H)\)
where \(\beta \geq 0\) is a hyperparameter that controls how strongly we want the hypothesis to be data consistent. That is, the larger \(\beta\) is, the smaller the version space \(\mathcal{V}_\pi = \{H : \psi(\pi,H) \leq \varepsilon\}\) is (since \(\varepsilon\) is smaller), so we can think of the value of \(\beta\) as trading off conservatism and generalization in learning.
This two-player game aims to optimize the worst-case absolute performance of the agent, as we can treat \(H_\pi\) as the most pessimistic hypothesis in \(\mathcal{H}\) that is data consistent. Specifically, we can show \(H_\pi(\pi) \leq J(\pi)\) for any \(\beta \geq 0\); as a result, the learner is always optimizing a performance lower bound.
In addition, we can show that, by the design of \(\psi\), this lower bound is tight when \(\beta\) is chosen well (that is, it only underestimates the performance when the policy runs into situations for which we lack data). As a result, with a properly chosen \(\beta\) value, the policy found by the above game formulation is optimal if the data covers the scenarios that an optimal policy would visit, even when the data was collected by a suboptimal policy.
In practice, we can define the hypothesis space \(\mathcal{H}\) as a set of value critics (model-free) or world models (model-based). For example, if we set the hypothesis space \(\mathcal{H}\) as candidate Q-functions and \(\psi\) as the Bellman error of a Q-function with respect to a policy, then we get the actor-critic algorithm called PSPI. If we set \(\mathcal{H}\) as candidate models and \(\psi\) as the model-fitting error, then we get the CPPO algorithm.
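To make the game concrete, here is a toy, model-free sketch for a one-step (bandit) problem like the one in the figure above: hypotheses are per-action reward estimates, \(\psi\) is the squared error on the logged data, and the learner plays mirror ascent against the adversary's closed-form pessimistic best response. The data, the hyperparameters, and the solver are illustrative assumptions, not the PSPI or CPPO implementations.

```python
# Toy absolute-pessimism game for a one-step offline RL (bandit) problem.
import numpy as np

rng = np.random.default_rng(0)
n_actions, beta = 4, 10.0
reward_range = (0.0, 1.0)

# Logged data from a suboptimal behavior policy that never tries action 3.
true_rewards = np.array([0.3, 0.5, 0.7, 0.9])
logged_actions = rng.choice([0, 1, 2], size=200, p=[0.5, 0.3, 0.2])
logged_rewards = np.clip(true_rewards[logged_actions] + rng.normal(0, 0.1, 200), *reward_range)

n = len(logged_actions)
counts = np.bincount(logged_actions, minlength=n_actions)
sums = np.bincount(logged_actions, weights=logged_rewards, minlength=n_actions)
means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)

policy = np.full(n_actions, 1.0 / n_actions)
avg_policy = np.zeros(n_actions)
for _ in range(2000):
    # Adversary's best response: argmin_H  E_{a~pi}[H(a)] + beta * psi(H),
    # available in closed form here and clipped to the known reward range.
    H = np.where(counts > 0,
                 means - policy * n / (2 * beta * np.maximum(counts, 1)),
                 reward_range[0])            # uncovered actions get the worst-case value
    H = np.clip(H, *reward_range)
    # Learner's update: exponentiated-gradient (mirror ascent) on the pessimistic values.
    policy = policy * np.exp(0.05 * H)
    policy /= policy.sum()
    avg_policy += policy

avg_policy /= avg_policy.sum()
print("pessimistic policy:", np.round(avg_policy, 3))
# The learner puts most of its probability on action 2, the best well-covered action,
# and avoids the uncovered action 3 even though its true reward is highest.
```

Even in this tiny example the conservatism that the game enforces is visible: the learner improves on the behavior policy using only well-covered actions, rather than gambling on the action the data says nothing about.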
A two-player game based on relative pessimism
While the above game formulation guarantees learning optimality for a well-tuned hyperparameter \(\beta\), the policy performance can be arbitrarily bad if the \(\beta\) value is off. To address this issue, we introduce an alternative two-player game based on relative pessimism, proposed in the paper Adversarially Trained Actor Critic for Offline Reinforcement Learning (ICML 2022).
\(\displaystyle \max_{\pi \in \Pi} H_\pi(\pi) \color{red}{-\, H_\pi(\mu)}, \quad \mathrm{s.t.} ~~ H_\pi \in \arg\min_{H \in \mathcal{H}} H(\pi) \color{red}{-\, H(\mu)} + \beta\, \psi(\pi,H)\)
where \(\mu\) denotes the data collection policy.
Unlike the absolute pessimism version above, this relative pessimism version is designed to optimize the worst-case performance relative to the behavior policy \(\mu\). Specifically, we can show \(H_\pi(\pi) - H_\pi(\mu) \leq J(\pi) - J(\mu)\) for all \(\beta \geq 0\). And again, we can instantiate the hypotheses in model-free or model-based manners, as in the discussion above. When instantiated model-free, we get the ATAC (Adversarially Trained Actor Critic) algorithm, which achieves state-of-the-art empirical results on offline RL benchmarks.
An important benefit of this relative pessimism version is its robustness to the choice of \(\beta\). As a result, we can guarantee that the learned policy is always no worse than the behavior policy, despite data uncertainty and regardless of the \(\beta\) choice – a property we call robust policy improvement. The intuition behind this is that \(\pi = \mu\) achieves a zero objective in this game, so the agent has an incentive to deviate from \(\mu\) only if it finds a policy \(\pi\) that is uniformly better than \(\mu\) across all data-consistent hypotheses.
At the same time, this relative pessimism game also guarantees learning optimality when \(\beta\) is chosen correctly, just like the absolute pessimism game above. Therefore, in some sense, we can view the relative pessimism game (e.g., ATAC) as a robust version of the absolute pessimism game (e.g., PSPI).
A connection between offline RL and imitation learning
An interesting takeaway from the discussion above is a clean connection between offline RL and imitation learning (IL) based on GANs or integral probability metrics. IL, like offline RL, tries to learn good policies from offline data. But IL does not use reward information, so the best strategy for IL is to mimic the data collection policy. Among modern IL algorithms, one effective strategy is to use GANs, where we train the policy (the generator) using an adversarial discriminator to separate the actions generated by the policy from the actions in the data.
Now that we understand how GAN-based IL works, we can see the connection between offline RL and IL through the model-free version of the relative pessimism game, that is, ATAC. We can view the actor and the critic of ATAC as the generator and the discriminator in GAN-based IL. By choosing different \(\beta\) values, we can control the strength of the discriminator; when \(\beta = 0\), the discriminator is the strongest, and the best generator is naturally the behavior policy, which recovers imitation learning behavior. On the other hand, using a larger \(\beta\) weakens the discriminator through Bellman regularization (i.e., \(\psi\)) and leads to offline RL.
In conclusion, our game-theoretic framework shows
Offline RL + Relative Pessimism = IL + Bellman Regularization.
We can view both as solving a version of the GAN problem! The only difference is that offline RL uses more restricted discriminators that must be consistent with the observed rewards, since the extra information offline RL has compared with imitation learning is the reward labels.
The high-level takeaway from this connection is that the policies learned by offline RL with the relative pessimism game are guaranteed to be no worse than the data collection policy. In other words, we look forward to exploring possible applications that robustly improve upon existing human-designed strategies running in the system, by just using existing data despite the lack of data diversity.
The post A game-theoretic approach to provably correct and scalable offline RL appeared first on Microsoft Research.
MoCapAct: Training humanoid robots to “Move Like Jagger”
What would it take to get humanoid, bipedal robots to dance like Mick Jagger? Or, for something more mundane, what does it take to get them to simply stand still? Sit down? Walk? Move in myriad other ways that many people take for granted? Bipedalism provides unparalleled versatility in an environment designed for and by humans. By mixing and matching a wide range of basic motor skills, from walking to jumping to balancing on one foot, people routinely dance, play soccer, carry heavy objects, and perform other complex high-level motions. If robots are ever to reach their full potential as an assistive technology, mastery of diverse bipedal motion is a requirement, not a luxury. However, even the simplest of these skills can require a fine orchestration of dozens of joints. Sophisticated engineering can rein in some of this complexity, but endowing bipedal robots with the generality to cope with our messy, weakly structured world, or a metaverse that takes after it, requires learning. Training AI agents with humanoid morphology to match human performance across the entire diversity of human motion is one of the biggest challenges of artificial physical intelligence. Due to the vagaries of experimentation on physical robots, research in this direction is currently done mostly in simulation.
Unfortunately, it involves computationally intensive methods, effectively restricting participation to research institutions with large compute budgets. In an effort to level the playing field and make this critical research area more inclusive, Microsoft Research’s Robot Learning group is releasing MoCapAct, a large library of pre-trained humanoid control models along with enriched data for training new ones. This will enable advanced research on artificial humanoid control at a fraction of the compute resources currently required.
The reason humanoid control research has been so computationally demanding is subtle and, at first glance, paradoxical. The prominent avenue for learning locomotive skills relies on motion capture (MoCap) data. MoCap is an animation technique that has been used in the entertainment industry for decades. It involves recording the motion of several keypoints on a human actor’s body, such as the elbows, shoulders, and knees, while the actor performs a task of interest, such as jogging. A MoCap clip can thus be thought of as a very concise and precise summary of an activity’s video clip. Thanks to this, useful information can be extracted from MoCap clips with much less computation than from the far more high-dimensional, ambiguous training data used in other major areas of machine learning, such as videos, images, and text. On top of this, MoCap data is widely available: repositories such as the CMU Motion Capture Dataset contain hours of clips for just about any common motion of the human body, with visualizations of several examples shown below. Why, then, is it so hard to make physical and simulated humanoid robots mimic a person’s movements?
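To make “concise and precise” concrete, a MoCap clip is essentially a small array of keypoint positions over time; the frame rate and keypoint count below are illustrative, as they vary across datasets.

```python
import numpy as np

# A MoCap clip is a time series of 3-D keypoint positions:
# clip[t, k] holds the (x, y, z) position of keypoint k (elbow, knee, ...)
# at frame t. Ten seconds of jogging captured at 120 Hz with 31 keypoints
# fits in a tiny array, which is why MoCap is so much cheaper to learn from
# than the equivalent video footage.
num_frames, num_keypoints = 1200, 31     # illustrative values
clip = np.zeros((num_frames, num_keypoints, 3))
print(clip.shape)                        # (1200, 31, 3)
```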
The caveat is that MoCap clips don’t contain all the information necessary to imitate the demonstrated motions on a physical robot or in a simulation that models physical forces. They only show us what a motion skill looks like, not the underlying muscle activations that produced it. Even if MoCap systems recorded these signals, they wouldn’t be of much help: simulated humanoids and real robots typically use motors instead of muscles, a dramatically different form of articulation. Nonetheless, actuation in artificial humanoids is also driven by a type of control signal. MoCap clips are a valuable aid in computing these control signals when combined with additional learning and optimization methods that use MoCap data as guidance. The computational bottleneck that our MoCapAct release aims to remove is created exactly by these methods, collectively known as reinforcement learning (RL). In simulation, where much of AI locomotion research is currently focused, RL can recover the sequence of control inputs that takes a humanoid agent through the sequence of poses in a given MoCap clip, producing a locomotion behavior indistinguishable from the clip’s. The availability of control policies for individual basic behaviors, each learned from a separate MoCap clip, opens the door to fascinating locomotion research, e.g., methods for combining these behaviors into a single “multi-skilled” neural network and training higher-level locomotion capabilities by switching among them. However, with thousands of basic locomotion skills to learn, RL’s expensive trial-and-error approach creates a massive barrier to entry on this research path. It is this scalability issue that our dataset release aims to address.
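A common way for such methods to use a MoCap clip as guidance is to turn it into a per-step tracking reward: the closer the simulated humanoid’s pose is to the clip’s pose at the current frame, the higher the reward. The sketch below illustrates that idea in its simplest form; it is not the specific reward used to train the MoCapAct experts.

```python
import numpy as np

def tracking_reward(sim_pose, clip_pose, scale=10.0):
    """Reward the simulated humanoid for matching the clip's pose at the
    current frame. Both poses are arrays of keypoint positions, e.g. of
    shape (num_keypoints, 3); the exponential maps pose error into (0, 1]."""
    error = np.sum((sim_pose - clip_pose) ** 2)
    return float(np.exp(-scale * error))

# An RL algorithm then searches, by trial and error, for the sequence of
# control inputs whose resulting poses keep this reward high over the clip.
```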
Our MoCapAct dataset, designed to be compatible with the highly popular dm_control humanoid simulation environment and the extensive CMU Motion Capture Dataset, serves the research community in two ways:
- For each of over 2500 MoCap clip snippets from the CMU Motion Capture Dataset, it provides an RL-trained “expert” control policy (represented as a PyTorch model) that enables dm_control’s simulated humanoid to faithfully recreate the skill depicted in that clip snippet, as shown in these videos of the experts’ behaviors:
Training this model zoo has taken the equivalent of 50 years over many GPU-equipped Azure NC6v2 virtual machines (excluding hyperparameter tuning and other required experiments) – a testament to the computational hurdle MoCapAct removes for other researchers.
- For each of the trained skill policies above, MoCapAct supplies a set of recorded trajectories generated by executing that skill’s control policy on dm_control’s humanoid agent. These trajectories can be thought of as MoCap clips of the trained experts but, in a crucial difference from the original MoCap data, they contain both low-level sensory measurements (e.g., touch measurements) and control signals for the humanoid agent. Unlike typical MoCap data, these trajectories are suitable for learning to match and improve on skill experts via direct imitation – a much more efficient class of techniques than RL. A sketch of how such trajectories can be recorded follows this list.
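The sketch below shows, in broad strokes, how a recorded trajectory of this kind can be produced. It assumes a gym-style environment wrapper and a clip-specialized PyTorch policy; the actual MoCapAct loading utilities and dm_control interfaces differ, so treat the names and interfaces here as assumptions.

```python
import torch

def record_expert_trajectory(expert_policy, env, max_steps=1000):
    """Roll out a clip-specialized expert in a gym-style environment and
    record (observation, action, reward) tuples, mirroring the kind of
    trajectories included in the MoCapAct release."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        with torch.no_grad():
            action = expert_policy(torch.as_tensor(obs, dtype=torch.float32))
        next_obs, reward, done, info = env.step(action.numpy())
        # Unlike a raw MoCap clip, the recording keeps both the sensory
        # observations and the control signals, which is what enables
        # direct imitation later.
        trajectory.append((obs, action.numpy(), float(reward)))
        obs = next_obs
        if done:
            break
    return trajectory
```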
We give two examples of how we used the MoCapAct dataset.
First, we train a hierarchical policy based on neural probabilistic motor primitives. To achieve this, we combine MoCapAct’s thousands of clip-specialized policies into a single policy capable of executing many different skills. This agent has a high-level component that takes MoCap frames as input and outputs a learned skill. The low-level component takes the learned skill and sensory measurements from the humanoid as input and outputs the motor action.
This hierarchical structure offers an appealing benefit. If we keep only the low-level component, we can control the humanoid by feeding it different skills (e.g., “walk”) instead of specifying the corresponding motor actions directly. We can therefore re-use the low-level policy to efficiently learn new tasks, as in the sketch below.
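Here is a minimal PyTorch sketch of this two-level decomposition, assuming simple feed-forward networks and an illustrative skill dimension; the released models differ in architecture and training details.

```python
import torch
import torch.nn as nn

class HighLevelEncoder(nn.Module):
    """Maps a window of upcoming MoCap frames to a skill vector."""
    def __init__(self, mocap_dim, skill_dim=60):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(mocap_dim, 256), nn.ReLU(),
                                 nn.Linear(256, skill_dim))

    def forward(self, mocap_frames):
        return self.net(mocap_frames)

class LowLevelPolicy(nn.Module):
    """Maps (skill vector, sensory observation) to a motor action."""
    def __init__(self, skill_dim, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(skill_dim + obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))

    def forward(self, skill, obs):
        return self.net(torch.cat([skill, obs], dim=-1))

# The low-level policy can later be reused: a new "task policy" simply
# produces skill vectors in place of the MoCap-conditioned encoder.
```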
In light of that, we replace the high-level policy with a task policy that is then trained to steer the low-level policy towards achieving some task. As an example, we train a task policy to have the humanoid reach a target. Notice that the humanoid uses many low-level skills, like running, turning, and side-stepping.
Our second example centers on motion completion, which is inspired by the task of sentence completion. Here, we use the GPT architecture, which accepts a sequence of sensory measurements (the “context”) and outputs a motor action. We train the control policy to take one second of sensory measurements from the dataset and output the corresponding motor actions of the specialized expert. Then, before executing the policy on our humanoid, we first generate a “prompt” (the red humanoid in the videos) by executing a specialized expert for one second. Afterwards, we let the policy control the humanoid (the bronze humanoid in the videos): at each time step, it takes the previous second of sensory measurements and predicts the motor actions. We find that this policy can reliably repeat the underlying motion of the clip, as demonstrated in the first two videos. On other MoCap clips, the policy can deviate from the underlying clip in a plausible way, such as in the third video, where the humanoid transitions from side-stepping to walking backwards.
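For readers who want the shape of this setup in code, here is a minimal PyTorch sketch of a causal-transformer policy over a fixed-length observation window; the context length, layer sizes, and input encoding are illustrative assumptions, not the released model’s configuration.

```python
import torch
import torch.nn as nn

class MotionCompletionPolicy(nn.Module):
    """GPT-style policy sketch: a causal transformer reads a short window
    of sensory observations and predicts the next motor action."""
    def __init__(self, obs_dim, act_dim, context_len=30, d_model=256):
        super().__init__()
        self.context_len = context_len
        self.embed = nn.Linear(obs_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(context_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, obs_window):                  # (batch, context_len, obs_dim)
        x = self.embed(obs_window) + self.pos       # add learned positions
        mask = nn.Transformer.generate_square_subsequent_mask(
            self.context_len).to(obs_window.device)  # causal attention mask
        x = self.transformer(x, mask=mask)
        return self.head(x[:, -1])                  # action for the latest step
```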
On top of the dataset, we also release the code used to generate the policies and results. We hope the community will build on our dataset and work to do exciting research in the control of humanoid robots.
Our paper is available here. You can read more at our website.
The data used in this project was obtained from mocap.cs.cmu.edu.
The database was created with funding from NSF EIA-0196217.
The post MoCapAct: Training humanoid robots to “Move Like Jagger” appeared first on Microsoft Research.