CCF: Bringing efficiency and usability to a decentralized trust model

Key features of Confidential Consortium Framework: distributed trust, secure enclaves, no admin access, ledger approval by consensus, and flexible authentication and authorization.

Online trust has come a long way since the time of centralized databases, where information was concentrated in one location and the security and validation of that information relied on a core set of people and systems. While convenient, this model of centralized management and oversight had a number of drawbacks. Trust depended on how the workflows of those systems were established and the skillset and integrity of the people involved. It created opportunities for such issues as duplicate digital transactions, human error, and bias, as witnessed in recent history in the financial industry. In response to these systemic issues, a now-famous paper published in late 2008 proposed a distributed ledger, where new transactions could be added and validated only through participant consensus. This model of decentralized trust and execution would become known as distributed ledger technology, or blockchain, and it offered a more trustworthy alternative to centrally managed databases and a new way to store and decentralize data.

In a distributed trust model, network participants validate transactions by performing the computation on those transactions themselves and comparing the outputs. While participant identities are protected behind pseudonyms, the transactions themselves are public, which greatly limits the use cases for decentralized computation systems. One use case where decentralized computation doesn’t work involves handling financial transactions so that they’re compliant with Know Your Client (KYC) standards and anti-money laundering (AML) regulations while also respecting privacy laws. Another involves managing medical records, where multiple organizations, such as healthcare providers and insurers, jointly govern the system.

Distributed trust with centralized confidential computation 

While blockchain provided a more reliable alternative to centralized databases, it isn’t a perfect solution. The Confidential Computing team at Microsoft Research wanted to build a system that retained the advantages of decentralized trust while keeping transactions confidential. This meant we had to develop a way to centralize computation. At the time, no system offered these capabilities.

To tackle this issue, we developed Confidential Consortium Framework (CCF), a framework for building highly available stateful services that require centralized computation while providing decentralized trust. CCF is based on a distributed trust model like that of blockchain while maintaining data confidentiality through secure centralized computation. This centralized confidential computation model also provides another benefit—it addresses the substantial amount of energy used in blockchain and other distributed computation environments.

As widely reported in the media, blockchain comes at a great environmental cost. Cryptocurrency—the most widespread implementation of blockchain—requires a significant amount of computing power to verify transactions. According to the Cambridge Centre for Alternative Finance (CCAF), bitcoin, the most common cryptocurrency, consumes slightly over 92 terawatt-hours per year as of this writing—0.41 percent of global electricity production and more than the annual energy draw of countries like Belgium or the Philippines.

Our goal was to develop a framework that reduces the computing power needed to run a distributed system, making it far more efficient and requiring no more energy than the actual computation itself.

To apply the technology in a way that people can use, we worked with the Azure Security team to build Azure confidential ledger, an Azure service developed on CCF that manages sensitive data records in a highly secure way. In this post, we discuss the motivations behind CCF, the problems we set out to solve, and the approaches we took to solve them. We also explain our approach to supporting the development of Azure confidential ledger using CCF.

Icon depicting three separate nodes on a network that can communicate with one another but cannot read each other's data.

Overcoming a bias for blockchain 

We discovered a strong bias for blockchain as we explained our research to different groups that were interested in this technology, including other teams at Microsoft, academic researchers exploring blockchain consensus, and external partners looking for enterprise-ready blockchain solutions. This bias was in the form of certain assumptions about what was needed to build a distributed ledger: that all transactions had to be public, that computation had to be geographically distributed, and that it had to be resilient to Byzantine faults from executors. First recognizing these biases and then countering them were some of the biggest challenges we had to surmount.

We worked to show how CCF broke from each of these assumptions while still providing an immutable ledger with distributed trust. We also had to prove that there were important use cases for maintaining confidentiality in a distributed trust system. We went through multiple rounds of discussion, explaining how the technology we wanted to build was different from traditional blockchains, why it was a worthwhile investment, and what the benefits were. Through these conversations, we discovered that many of our colleagues were just as frustrated as we were by the very issues in blockchain we were setting out to solve.

Additionally, we encountered skepticism from internal partner teams, who needed more than a research paper to be convinced that we could successfully accomplish our research goals and support our project. There were healthy doubts about the performance that was possible when executing inside an encrypted and isolated memory space, the ability to build a functional and usable system with minimal components that needed to be trusted, and how much of the internal complexity it was possible to hide from operators and users. Early versions of CCF and sample apps were focused on proving we could overcome those risks. We built basic proofs of concept and gave numerous demonstrations showing how we could implement distributed trust with centralized confidential computation. In the end, it was the strength of these demos that helped us get the resources we needed to pursue our research.

Building the compute stack

Another challenge was reimagining a secure compute stack for an enclave—the secured portion of the hardware’s processor and memory. At the time, enclaves were very resource constrained compared with traditional hardware, and we could run only small amounts of code on very little memory.

In addition, capabilities are limited when performing computation in an enclave. For example, the code can’t access anything outside the enclave, and it’s difficult to get the code to communicate with an external system. This challenge required us to design and build an entire compute stack from scratch with all the elements needed to establish consensus, implement transactional storage, establish runtimes for user languages, and so on.

Another consideration was the need to build a system that people could use. As researchers, we wanted our work to have real impact, but it was tempting to push the state of the art in the area of confidential computing research and develop very elaborate technology in these enclaves. However, these types of innovations cannot be deployed in actual products because they’re exceedingly difficult to explain and apply. We had committed to creating something that product teams could implement and use as a foundation for building real systems and products, so we worked to calibrate the guarantees and threat model so that our system could be used in actual products.

Establishing a root of trust with CCF

CCF strengthens the trust boundary in scenarios in which both distributed trust and data confidentiality are needed by decreasing the size of the trusted computing base (TCB)—the components of a computing environment that must be trusted for the appropriate level of security to be applied—which in turn reduces the attack surface. Specifically, CCF allows operators to greatly decrease or even eliminate their presence in the TCB, depending on the governance configuration.

Instead of a social root of trust—such as a cloud service provider or the participant consensus used in blockchain networks—CCF relies on trusted hardware to enforce transaction integrity and confidentiality by creating a trusted execution environment (TEE). TEEs are isolated memory spaces that are kept encrypted at all times, even while code is executing on the data. The hardware itself strictly enforces this memory encryption, so data inside a TEE is never readable from outside it.

Decentralized trust is underpinned by remote attestation, which guarantees to a remote entity that all computation on user data takes place in a publicly verifiable TEE. The combination of this attestation with the isolated and encrypted TEE creates a distributed trust environment. Nodes in the network establish mutual trust by verifying their respective attestations, which affirm that they’re running the expected code in a TEE. The operator that starts the nodes, whether automated or human, only indicates where on the network they can find one another.

Service governance is performed by a flexible consortium, which is separate from the operator. CCF uses a ledger to provide offline trust. All transactions are reflected in a tamper-protected ledger that users can review to audit service governance and obtain universally verifiable transaction receipts, which can verify the consistency of the service and prove the execution of transactions to other users. This is particularly valuable for users who need to comply with specific laws and regulations.
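A universally verifiable receipt is, at heart, an inclusion proof against a ledger root signed from inside the TEE. As a rough illustration, here is a minimal Python sketch of how a client might check such a receipt; the field names and helper functions are assumptions for illustration, not CCF's actual receipt format or API.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_receipt(leaf: bytes, proof: list, root_signature: bytes, verify_signature) -> bool:
    """Recompute a Merkle root from a transaction leaf and an audit path, then
    check the service's signature over that root. `proof` is a list of
    (side, sibling_hash) pairs and `verify_signature` wraps the service
    identity's public key. Illustrative only: CCF's real receipt format differs."""
    current = sha256(leaf)
    for side, sibling in proof:
        current = sha256(sibling + current) if side == "left" else sha256(current + sibling)
    # The receipt is valid only if the root recomputed by the client matches the
    # root that the service (running inside a TEE) signed.
    return verify_signature(root=current, signature=root_signature)
```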

A circular flowchart connecting three ledgers, each marked with a padlock around a circle representing a confidential network.
Figure 1: In a confidential network, data is encrypted at rest, in transit, and in use because it’s run in a trusted execution environment. All network administration occurs outside the trust boundary. The network constitution governs participants, configuration, and code, making it resilient to fraud, theft, or unintended data manipulation.

Laying the foundation for Azure confidential ledger

We collaborated with the Azure Security team to refine and improve CCF so that it could be used as a foundation for building new Azure services for confidential computing. We applied Azure API standards and ensured that CCF complied with Azure best practices, including enabling it to log operations and perform error reporting and long-running queries. We then developed a prototype of an Azure application, and from this, the Azure Security team developed Azure confidential ledger, the first generally available managed service built on CCF, which provides tamper-protected audit logging that can be cryptographically verified.

Looking forward

We were pleasantly surprised by how quickly we discovered new use cases for CCF and Azure confidential ledger, both within Microsoft and with third-party users. Now, most of the use cases are those we had not initially foreseen, from atmospheric carbon removal to securing machine learning logs. We’re extremely excited by the potential for CCF to have much more impact than we had originally planned or expected when we first started on this journey, and we’re looking forward to discovering some of the countless ways in which it can be applied.

The post CCF: Bringing efficiency and usability to a decentralized trust model appeared first on Microsoft Research.


Microsoft Research Summit 2022: What’s Next for Technology and Humanity?


Today, we are experiencing waves of breakthroughs in computing that are transforming just about every aspect of our lives. Artificial intelligence is changing the way we develop and create. Human language technologies are revolutionizing the workflows of healthcare professionals. Deep learning is accelerating our ability to understand and predict natural phenomena, from atomic to galactic scales. Meanwhile, the foundations of cloud computing are undergoing a reinvention from the atoms up. 

Realizing the benefits of these new breakthroughs demands that we come together in new ways across the global research community. The vibrancy of invention and innovation increasingly lies at the intersections among traditional research disciplines, from the highly theoretical to the immediately applicable. Ensuring that the continuing advancement of technology is beneficial to all requires communication, collaboration and co-innovation across the communities that create new technologies and those that aim to use them to improve their lives. 

That’s why I’m excited to invite you to join us for this year’s Microsoft Research Summit, which will take place on October 18-20, 2022. This virtual event is where the global research community convenes to explore how emerging research might best address societal challenges and have significant impact on our lives in the coming years. This year’s event will feature over 120 speakers, including researchers and leaders from across the research community at Microsoft, alongside partners and collaborators from industry, academia and government who are advancing the frontiers of research in computing and across the sciences. 

Each of our three days will begin with a plenary session during which we’ll explore the potential impact of deep learning on scientific discovery, the opportunity to use technology to make healthcare more precise and accessible, and the re-invention of foundational technologies to enable the cloud of the future. These plenaries will lead into tracks that dive deeper into research that spans from more efficient and adaptable AI, to technologies that amplify human creativity and help foster a more sustainable society.

For further details – and to register to attend – check out the Microsoft Research Summit website.

We hope you will join us. 

The post Microsoft Research Summit 2022: What’s Next for Technology and Humanity? appeared first on Microsoft Research.


A game-theoretic approach to provably correct and scalable offline RL


Despite increasingly widespread use of machine learning (ML) in all aspects of our lives, a broad class of scenarios still rely on automation designed by people, not artificial intelligence (AI). In real-world applications that involve making sequences of decisions with long-term consequences, from allocating beds in an intensive-care unit to controlling robots, decision-making strategies to this day are carefully crafted by experienced engineers. But what about reinforcement learning (RL), which gave machines supremacy in games as distinct as Ms. Pac-Man and Pokémon Go? For all its appeal, RL – specifically, its most famous flavor, online RL – has a significant drawback outside of scenarios that can be simulated and have well-defined behavioral rules. Online RL agents learn by trial and error. They need opportunities to try various actions, observe their consequences, and improve as a result. Making wildly suboptimal decisions just for learning’s sake is acceptable when the biggest stake is a premature demise of a computer game character or showing an irrelevant ad to a website visitor. For tasks such as training a self-driving car’s AI, however, it is clearly not an option.

Offline reinforcement learning (RL) is a paradigm for designing agents that can learn from large existing datasets – possibly collected by recording data from existing reliable but suboptimal human-designed strategies – to make sequential decisions. Unlike conventional online RL, offline RL can learn policies without collecting online data and even without interacting with a simulator. Moreover, since offline RL does not blindly mimic the behaviors seen in the data, as imitation learning (an alternative offline learning strategy) does, it does not require expensive expert-quality decision examples, and the learned policy can potentially outperform the best data-collection policy. This means, for example, an offline RL agent in principle can learn a competitive driving policy from logged datasets of regular driving behaviors. Therefore, offline RL offers great potential for large-scale deployment and real-world problem solving.

Two figures and two arrows connecting them. The left figure shows examples of real-world sequential decision-making problems such as robotic manipulation, health care, and autonomous driving. The right figure shows that these problems only have non-exploratory logged data. An arrow pointing from the left figure to the right figure shows data collection is costly and risky in these applications. Another arrow pointing from the left figure to the right figure highlights the question “How to make decisions under systematic uncertainty due to missing data coverage?”.

However, offline RL faces a fundamental challenge: the data we can collect in large quantity lacks diversity, so it is impossible to use it to estimate how well a policy would perform in the real world. While we often associate the term “Big Data” with diverse datasets in ML, this is no longer true when the data concerns real-world “sequential” decision making. In fact, curating diverse datasets for these problems can range from difficult to nearly impossible, because it would require running unacceptable experiments in extreme scenarios (like staging the moments just before a car crash, or conducting unethical clinical trials). As a result, the data that gets collected in large quantity, counterintuitively, lacks diversity, which limits its usefulness.

In this post, we introduce a generic game-theoretic framework for offline RL. We frame the offline RL problem as a two-player game where a learning agent competes with an adversary that simulates the uncertain decision outcomes due to missing data coverage. Through this game analogy, we obtain a systematic and provably correct way to design offline RL algorithms that can learn good policies with state-of-the-art empirical performance. Finally, we show that this framework provides a natural connection between offline RL and imitation learning through the lens of generative adversarial networks (GANs). This connection ensures that the policies learned by this game-theoretic framework are always guaranteed to be no worse than the data collection policies. In other words, with this framework, we can use existing data to robustly learn policies that improve upon the human-designed strategies currently running in the system.

The content of this post is based on our recent papers Bellman-consistent Pessimism for Offline Reinforcement Learning (Oral Presentation, NeurIPS 2021) and Adversarially Trained Actor Critic for Offline Reinforcement Learning (Outstanding Paper Runner-up, ICML 2022).

Fundamental difficulty of offline RL and version space

A major limitation of making decisions with only offline data is that existing datasets do not include all possible scenarios in the real world. Hypothetically, suppose that we have a large dataset of how doctors treated patients in the past 50 years, and we want to design an AI agent to make treatment recommendations through RL. If we run a typical RL algorithm on this dataset, the agent might come up with absurd treatments, such as treating a patient with pneumonia through amputation. This is because the dataset does not have examples of what would happen to pneumonia after amputating patients, and “amputation cures pneumonia” may seem to be a plausible scenario to the learning agent, as no information in the data would falsify such a hypothesis.

To address this issue, the agent needs to carefully consider uncertainties due to missing data. Rather than fixating on a particular data-consistent outcome, the agent should be aware of different possibilities (i.e., while amputating the leg might cure the pneumonia, it also might not, since we do not have sufficient evidence for either scenario) before committing to a decision. Such deliberate conservative reasoning is especially important when the agent’s decisions can cause negative consequences in the real world.

A formal way to express the idea above is through the concept of version space. In machine learning, a version space is a set of hypotheses consistent with data. In the context of offline RL, we will form a version space for each candidate decision, describing the possible outcomes if we are to make the decision. Notably, the version space may contain multiple possible outcomes for a candidate decision whose data is missing.

To understand better what a version space is, we use the figure below to visualize the version space of a simplified RL problem where the decision horizon is one. Here we want to select actions that can obtain the highest reward (such as deciding which drug treatment has the highest chance of curing a disease). Suppose we have data of action-reward pairs, and we want to reason about the reward obtained by taking an action. If the underlying reward function is linear, then we have a version space shown in red, which is a subset of (reward) hypotheses that are consistent with the data. In general, for a full RL problem, a version space can be more complicated than just reward functions, e.g., we would need to start thinking about hypotheses about world models or value functions. Nonetheless, similar concepts like the one shown in the figure below apply.
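As a concrete companion to the figure, the sketch below (not from the papers) enumerates a discretized grid of linear reward hypotheses and keeps those whose error on the logged action-reward pairs is below a threshold, which is exactly a finite approximation of the version space. The data points and grid are made up for illustration.

```python
import numpy as np

# Logged bandit data: (action, observed reward) pairs.
data = [(0.2, 0.25), (0.4, 0.35), (0.5, 0.55)]

# Hypothesis class: linear reward functions r(a) = w * a + b,
# discretized over a grid of (w, b) values.
candidates = [(w, b) for w in np.linspace(-1, 2, 31) for b in np.linspace(-1, 1, 21)]

def loss(hypothesis, data):
    w, b = hypothesis
    return sum((w * a + b - r) ** 2 for a, r in data) / len(data)

# Version space: hypotheses whose data loss is within a small tolerance.
epsilon = 0.01
version_space = [h for h in candidates if loss(h, data) <= epsilon]

# For an action far from the data (say a = 1.0), the surviving hypotheses
# can predict very different rewards -- this spread is the uncertainty the
# offline RL agent must reason about.
predictions = [w * 1.0 + b for w, b in version_space]
print(min(predictions), max(predictions))
```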

A figure that illustrates the concept of the version space in a bandit example. It is a 2D plot where the x-axis denotes actions, and the y-axis denotes reward. It shows data of sampled reward values of different actions as dots, and different hypotheses of how reward depends on action as a function. The functions that are consistent with the observed data form the version space.
Figure 1: Illustration of version space and hypotheses in a bandit example.

Thinking of offline RL as a two-player game

If we think about uncertainties as hypotheses in a version space, then a natural strategy for designing offline RL agents is to optimize for the worst-case scenario among all hypotheses in the version space. In this way, the agent treats every scenario in the version space as possible, avoiding decisions that follow a single, potentially delusional outcome.

Below we show that this approach of thinking about worst cases for offline RL can be elegantly described by a two-player game, called the Stackelberg game. To this end, let us first introduce some math notations and explain what a Stackelberg game is.

Notation: We use \(\pi\) to denote a decision policy and \(J(\pi)\) to denote the performance of \(\pi\) in the application environment. We use \(\Pi\) to denote the set of policies that the learning agent is considering, and \(\mathcal{H}\) to denote the set of all hypotheses. We also define a loss function \(\psi:\Pi \times \mathcal{H} \to [0, \infty)\) such that if \(\psi(\pi,H)\) is small, \(H\) is a data-consistent hypothesis with respect to \(\pi\); conversely, \(\psi(\pi,H)\) gets larger for data-inconsistent ones (e.g., we can treat each \(H\) as a potential model of the world and \(\psi(\pi,H)\) as its modeling error on the data). Consequently, the version space above is defined as \(\mathcal{V}_\pi = \{ H : \psi(\pi,H) \leq \varepsilon \}\) for some small \(\varepsilon\).

Given a hypothesis \(H \in \mathcal{H}\), we use \(H(\pi)\) to denote the performance of \(\pi\) as predicted by \(H\), which may differ from \(\pi\)’s true performance \(J(\pi)\). As a standard assumption, we suppose that for every \(\pi \in \Pi\) there is some \(H_{\pi}^* \in \mathcal{H}\) that describes the true outcome of \(\pi\), that is, \(J(\pi) = H_{\pi}^*(\pi)\) and \(\psi(\pi,H_{\pi}^*) = 0\).

Stackelberg Game: In short, a Stackelberg game is a bilevel optimization problem,

\[\max_{x} f(x, y_x), \quad \text{s.t.} \ \ y_x \in \arg\min_{y} g(x, y).\]

In this two-player game, the Leader \(x\) and the Follower \(y\) maximize and minimize the payoffs \(f\) and \(g\), respectively, under the rule that the Follower plays after the Leader. (This is reflected in the subscript of \(y_x\): \(y\) is decided based on the value of \(x\).) In the special case of \(f = g\), a Stackelberg game reduces to a zero-sum game \(\max_{x} \min_{y} f(x,y)\).
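For intuition, here is a tiny, purely illustrative example that solves a discrete Stackelberg game by enumeration: the Leader picks \(x\) anticipating that the Follower will best-respond by minimizing \(g\). The payoffs are made up for the example.

```python
# A tiny, purely illustrative Stackelberg game solved by enumeration.
X = [0, 1, 2]            # Leader's choices
Y = [0, 1, 2]            # Follower's choices

def f(x, y):             # Leader's payoff (maximized)
    return 3 * x - 2 * y

def g(x, y):             # Follower's payoff (minimized), played after the Leader
    return (x - y) ** 2

best_x, best_value = None, float("-inf")
for x in X:
    y_x = min(Y, key=lambda y: g(x, y))   # Follower best-responds to x
    if f(x, y_x) > best_value:
        best_x, best_value = x, f(x, y_x)

print(best_x, best_value)  # the Leader's optimal commitment and its payoff
```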

A figure shows a learner that tries to compete with an adversary in a two player game, which resembles a chess game. The learner is thinking about whether to use absolute or relative pessimism as the strategy to choose the policy. The adversary is thinking about which hypothesis from the version space to choose.
Figure 2: Offline RL as two-player Stackelberg game.

We can think about offline RL as a Stackelberg game: We let the learning agent be the Leader and introduce a fictitious adversary as the Follower, which chooses hypotheses from the version space \(\mathcal{V}_\pi\) based on the Leader’s policy \(\pi\). Then we define the payoffs above as performance estimates of a policy in view of a hypothesis. By this construction, solving the Stackelberg game means finding policies that maximize the worst-case performance, which is exactly our starting goal.

Now we give details on how this game can be designed, using absolute pessimism or relative pessimism, so that we can have performance guarantees in offline RL.

A two-player game based on absolute pessimism

Our first instantiation is a two-player game based on the concept of absolute pessimism, introduced in the paper Bellman-consistent Pessimism for Offline Reinforcement Learning (NeurIPS 2021).

\[\max_{\pi \in \Pi} H_\pi(\pi), \quad \text{s.t.} \ \ H_\pi \in \arg\min_{H \in \mathcal{H}} H(\pi) + \beta\, \psi(\pi, H),\]

where \(\beta \geq 0\) is a hyperparameter that controls how strongly we want the hypothesis to be data consistent. That is, the larger \(\beta\) is, the smaller the version space \(\mathcal{V}_\pi = \{ H : \psi(\pi,H) \leq \varepsilon \}\) is (since \(\varepsilon\) is smaller), so we can think of the value of \(\beta\) as trading off conservatism and generalization in learning.

This two-player game aims to optimize for the worst-case absolute performance of the agent, as we can treat \(H_\pi\) as the most pessimistic hypothesis in \(\mathcal{H}\) that is data consistent. Specifically, we can show \(H_\pi(\pi) \leq J(\pi)\) for any \(\beta \geq 0\); as a result, the learner is always optimizing for a performance lower bound.

In addition, we can show that, by the design of \(\psi\), this lower bound is tight when \(\beta\) is chosen well (that is, it only underestimates the performance when the policy runs into situations for which we lack data). As a result, with a properly chosen \(\beta\) value, the policy found by the above game formulation is optimal if the data covers the relevant scenarios that an optimal policy would visit, even for data collected by a sub-optimal policy.

In practice, we can define the hypothesis space \(\mathcal{H}\) as a set of value critics (model-free) or world models (model-based). For example, if we set the hypothesis space \(\mathcal{H}\) as candidate Q-functions and \(\psi\) as the Bellman error of a Q-function with respect to a policy, then we will get the actor-critic algorithm called PSPI. If we set \(\mathcal{H}\) as candidate models and \(\psi\) as the model fitting error, then we get the CPPO algorithm.
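To make the bilevel structure concrete, here is a minimal enumeration-style sketch of the absolute-pessimism objective over finite policy and hypothesis sets. The `value` method and `psi` callable are illustrative assumptions; PSPI and CPPO solve the same problem with function approximation rather than enumeration.

```python
def pessimistic_value(pi, hypotheses, psi, beta):
    """Inner problem: the adversary picks the data-consistent hypothesis that is
    most pessimistic about pi, i.e., minimizes H(pi) + beta * psi(pi, H)."""
    H_pi = min(hypotheses, key=lambda H: H.value(pi) + beta * psi(pi, H))
    return H_pi.value(pi)

def solve_absolute_pessimism(policies, hypotheses, psi, beta):
    """Outer problem: the learner picks the policy with the best worst-case
    (pessimistic) performance estimate."""
    return max(policies, key=lambda pi: pessimistic_value(pi, hypotheses, psi, beta))
```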

A two-player game based on relative pessimism

While the above game formulation guarantees learning optimality for a well-tuned hyperparameter \(\beta\), the policy performance can be arbitrarily bad if the \(\beta\) value is off. To address this issue, we introduce an alternative two-player game based on relative pessimism, proposed in the paper Adversarially Trained Actor Critic for Offline Reinforcement Learning (ICML 2022).

\[\max_{\pi \in \Pi} H_\pi(\pi) \,{\color{red} -\, H_\pi(\mu)}, \quad \text{s.t.} \ \ H_\pi \in \arg\min_{H \in \mathcal{H}} H(\pi) \,{\color{red} -\, H(\mu)} + \beta\, \psi(\pi, H),\]

where we use \(\mu\) to denote the data collection policy.

Unlike the absolute pessimism version above, this relative pessimism version is designed to optimize for the worst-case performance relative to the behavior policy \(\mu\). Specifically, we can show \(H_\pi(\pi) - H_\pi(\mu) \leq J(\pi) - J(\mu)\) for all \(\beta \geq 0\). And again, we can instantiate the hypotheses in a model-free or model-based manner, as in the discussion above. When instantiated model-free, we get the ATAC (Adversarially Trained Actor Critic) algorithm, which achieves state-of-the-art empirical results on offline RL benchmarks.
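The enumeration sketch above adapts directly to the relative-pessimism game; the only change is that both players' payoffs are measured relative to the behavior policy \(\mu\). As before, the interfaces are illustrative assumptions rather than the ATAC implementation.

```python
def relative_pessimistic_value(pi, mu, hypotheses, psi, beta):
    """Adversary: choose the data-consistent hypothesis most pessimistic about
    how much pi improves on the behavior policy mu."""
    H_pi = min(hypotheses, key=lambda H: H.value(pi) - H.value(mu) + beta * psi(pi, H))
    return H_pi.value(pi) - H_pi.value(mu)

def solve_relative_pessimism(policies, mu, hypotheses, psi, beta):
    # pi = mu always attains a relative value of 0 under every hypothesis, which
    # underlies the robust policy improvement property discussed below.
    return max(policies, key=lambda pi: relative_pessimistic_value(pi, mu, hypotheses, psi, beta))
```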

An important benefit of this relative pessimism version is its robustness to the choice of \(\beta\). As a result, we can guarantee that the learned policy is always no worse than the behavior policy, regardless of data uncertainty and the choice of \(\beta\) – a property we call robust policy improvement. The intuition behind this is that \(\pi = \mu\) achieves a zero objective in this game, so the agent has an incentive to deviate from \(\mu\) only if it finds a policy \(\pi\) that is uniformly better than \(\mu\) under all possible data-consistent hypotheses.

At the same time, the relative pessimism game also guarantees learning optimality when \(\beta\) is chosen correctly, just like the absolute pessimism game above. Therefore, in some sense, we can view the relative pessimism game (e.g., ATAC) as a robust version of the absolute pessimism game (e.g., PSPI).

A connection between offline RL and imitation learning

A figure that shows a spectrum unifying offline reinforcement learning and imitation learning through the lens of ATAC’s Stackelberg game. The left end of the spectrum is offline reinforcement learning, which corresponds to ATAC with a well-chosen 𝛽 value; the right end is imitation learning, which corresponds to ATAC with 𝛽 equal to zero. Above the spectrum, figures visualize how the hypothesis space and the learned policy behave in each regime: when 𝛽 is larger, the hypothesis space is smaller; when 𝛽 is smaller, the learned policy is closer to the behavior policy that collected the data.
Figure 3: ATAC based on a Stackelberg game of relative pessimism provides a natural connection between offline RL and imitation learning.

An interesting takeaway from the discussion above is a clean connection between offline RL and imitation learning (IL) based on GANs or integral probability metrics. IL, like offline RL, also tries to learn good policies from offline data. But IL does not use reward information, so the best strategy for IL is to mimic the data collection policy. Among modern IL algorithms, one effective strategy is to use GANs, where we train the policy (the generator) against an adversarial discriminator that tries to separate the actions generated by the policy from the actions in the data.

Now that we understand how GAN-based IL works, we can see the connection between offline RL and IL through the model-free version of the relative pessimism game, that is, ATAC. We can view the actor and the critic of ATAC as the generator and the discriminator in GAN-based IL. By choosing different \(\beta\) values, we can control the strength of the discriminator: when \(\beta = 0\), the discriminator is at its strongest, and the best generator is naturally the behavior policy, which recovers imitation learning behavior. On the other hand, a larger \(\beta\) weakens the discriminator through Bellman regularization (i.e., \(\psi\)) and leads to offline RL.

In conclusion, our game-theoretic framework shows

Offline RL + Relative Pessimism = IL + Bellman Regularization.

We can view both as solving a version of the GAN problem! The only difference is that offline RL uses a more restricted class of discriminators (ones consistent with the observed rewards), since the extra information offline RL has over imitation learning is the reward labels.

The high-level takeaway from this connection is that the policies learned by offline RL with the relative pessimism game are guaranteed to be no worse than the data collection policy. Looking ahead, we're excited to explore applications that robustly improve upon existing human-designed strategies running in the system by using only existing data, despite its lack of diversity.

The post A game-theoretic approach to provably correct and scalable offline RL appeared first on Microsoft Research.


MoCapAct: Training humanoid robots to “Move Like Jagger”

A montage of four animated figures completing humanoid actions: standing up, walking, running, and jumping.

What would it take to get humanoid, bipedal robots to dance like Mick Jagger? Or, for something more mundane, what does it take to get them to simply stand still? Sit down? Walk? Move in the myriad other ways many people take for granted? Bipedalism provides unparalleled versatility in an environment designed for and by humans. By mixing and matching a wide range of basic motor skills, from walking to jumping to balancing on one foot, people routinely dance, play soccer, carry heavy objects, and perform other complex high-level motions. If robots are ever to reach their full potential as an assistive technology, mastery of diverse bipedal motion is a requirement, not a luxury. However, even the simplest of these skills can require a fine orchestration of dozens of joints. Sophisticated engineering can rein in some of this complexity, but endowing bipedal robots with the generality to cope with our messy, weakly structured world, or a metaverse that takes after it, requires learning. Training AI agents with humanoid morphology to match human performance across the entire diversity of human motion is one of the biggest challenges of artificial physical intelligence. Due to the vagaries of experimentation on physical robots, research in this direction is currently done mostly in simulation. 

Unfortunately, such research involves computationally intensive methods, effectively restricting participation to research institutions with large compute budgets. In an effort to level the playing field and make this critical research area more inclusive, Microsoft Research’s Robot Learning group is releasing MoCapAct, a large library of pre-trained humanoid control models along with enriched data for training new ones. This will enable advanced research on artificial humanoid control at a fraction of the compute resources currently required. 

The reason why humanoid control research has been so computationally demanding is subtle and, at first glance, paradoxical. The prominent avenue for learning locomotive skills is based on using motion capture (MoCap) data. MoCap is an animation technique that has been widely used in the entertainment industry for decades. It involves recording the motion of several keypoints on a human actor’s body, such as their elbows, shoulders, and knees, while the actor is performing a task of interest, such as jogging. Thus, a MoCap clip can be thought of as a very concise and precise summary of an activity’s video clip. Thanks to this, useful information can be extracted from MoCap clips with much less computation than from the much more high-dimensional, ambiguous training data in other major areas of machine learning, which comes in the form of videos, images, and text. On top of this, MoCap data is widely available. Repositories such as the CMU Motion Capture Dataset contain hours of clips for just about any common motion of a human body, with visualizations of several examples shown below. Why, then, is it so hard to make physical and simulated humanoid robots mimic a person’s movements? 

The caveat is that MoCap clips don’t contain all the information necessary to imitate the demonstrated motions on a physical robot or in a simulation that models physical forces. They only show us what a motion skill looks like, not the underlying muscular movements that caused the actor’s muscles to yield that motion. Even if MoCap systems recorded these signals, it wouldn’t be of much help: simulated humanoids and real robots typically use motors instead of muscles, which is a dramatically different form of articulation. Nonetheless, actuation in artificial humanoids is also driven by a type of control signal. MoCap clips are a valuable aid in computing these control signals, if combined with additional learning and optimization methods that use MoCap data as guidance. The computational bottleneck that our MoCapAct release aims to remove is created exactly by these methods, collectively known as reinforcement learning (RL). In simulation, where much of AI locomotion research is currently focused, RL can recover the sequence of control inputs that takes a humanoid agent through the sequence of poses from a given MoCap clip. What results is a locomotion behavior that is indistinguishable from the clip’s. The availability of control policies for individual basic behaviors learned from separate MoCap clips can open the doors for fascinating locomotion research, e.g., in methods for combining these behaviors into a single “multi-skilled” neural network and training higher-level locomotion capabilities by switching among them. However, with thousands of basic locomotion skills to learn, RL’s expensive trial-and-error approach creates a massive barrier to entry on this research path. It is this scalability issue that our dataset release aims to address. 

A flowchart showing motion capture clips producing clip-tracking agents via reinforcement learning. The agents then generate data using the simulated humanoid. The MoCapAct dataset consists of the agents and corresponding data.
Figure 1: The MoCapAct dataset consists of policies that track individual MoCap clips and data from these agents.

Our MoCapAct dataset, designed to be compatible with the highly popular dm_control humanoid simulation environment and the extensive CMU Motion Capture Dataset, serves the research community in two ways: 

  1. For each of over 2500 MoCap clip snippets from the CMU Motion Capture Dataset, it provides an RL-trained “expert” control policy (represented as a PyTorch model) that enables dm_control’s simulated humanoid to faithfully recreate the skill depicted in that clip snippet, as shown in the accompanying videos of the experts’ behaviors. 

Training this model zoo has taken the equivalent of 50 years over many GPU-equipped Azure NC6v2 virtual machines (excluding hyperparameter tuning and other required experiments) – a testament to the computational hurdle MoCapAct removes for other researchers. 

  2. For each of the trained skill policies above, MoCapAct supplies a set of recorded trajectories generated by executing that skill’s control policy on dm_control’s humanoid agent. These trajectories can be thought of as MoCap clips of the trained experts but, in a crucial difference from the original MoCap data, they contain both low-level sensory measurements (e.g., touch measurements) and control signals for the humanoid agent. Unlike typical MoCap data, these trajectories are suitable for learning to match and improve on skill experts via direct imitation – a much more efficient class of techniques than RL (see the sketch after this list). 
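Because each trajectory pairs low-level observations with the expert's actions, a new policy can be trained by plain supervised imitation rather than RL. The sketch below shows that behavioral-cloning step on random stand-in tensors; the actual dataset loading utilities, observation sizes, and network architecture are not shown here and are documented in the MoCapAct release.

```python
import torch
import torch.nn as nn

# Stand-in for one expert's rollout data: observations and the expert's actions.
# In the real dataset these come from executing a clip expert in dm_control.
obs = torch.randn(10_000, 600)             # low-level sensory measurements (placeholder size)
act = torch.rand(10_000, 56) * 2 - 1       # the expert's motor actions (placeholder size)

policy = nn.Sequential(nn.Linear(600, 512), nn.ReLU(), nn.Linear(512, 56), nn.Tanh())
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Behavioral cloning: regress the expert's actions from its observations.
for step in range(1_000):
    idx = torch.randint(0, obs.shape[0], (256,))
    loss = nn.functional.mse_loss(policy(obs[idx]), act[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```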

We give two examples of how we used the MoCapAct dataset. 

First, we train a hierarchical policy based on neural probabilistic motor primitives. To achieve this, we combine the thousands of MoCapAct’s clip-specialized policies into a single policy that is capable of executing many different skills. This agent has a high-level component that takes MoCap frames as input and outputs a learned skill. The low-level component takes the learned skill and sensory measurements from the humanoid as input and outputs the motor action. 

Two graphics of the hierarchical policy. The first graphic shows a MoCap clip of walking being fed into a high-level policy, which outputs a prediction of “walk forward.” This prediction and the humanoid observation are fed into the low-level policy, which then predicts the motor actions to execute the walking motion. The second graphic is similar to the first, with the only difference being that the MoCap clip shows a “run and jump” motion, and the predicted skill is “run and jump.”
Figure 2: The hierarchical policy consists of a high-level policy and low-level policy. The high-level policy maps the given MoCap frames to a learned skill. The low-level policy takes the skill and the humanoid observation and outputs an action that best realizes the skill. 

This hierarchical structure offers an appealing benefit. If we keep the low-level component, we can instead control the humanoid by inputting different skills to the low-level policy (e.g., “walk” instead of the corresponding motor actions). Therefore, we can re-use the low-level policy to efficiently learn new tasks. 
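A minimal PyTorch sketch of this two-level structure is below: a high-level module maps MoCap reference frames to a skill embedding, and a low-level module maps the skill plus the humanoid's observation to motor actions. The dimensions and layer choices are placeholders for illustration, not the architecture used in MoCapAct.

```python
import torch
import torch.nn as nn

class HighLevelPolicy(nn.Module):
    """Maps upcoming MoCap reference frames to a latent 'skill' vector."""
    def __init__(self, mocap_dim=300, skill_dim=60):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(mocap_dim, 256), nn.ReLU(), nn.Linear(256, skill_dim))

    def forward(self, mocap_frames):
        return self.net(mocap_frames)

class LowLevelPolicy(nn.Module):
    """Maps (humanoid observation, skill) to motor actions."""
    def __init__(self, obs_dim=600, skill_dim=60, act_dim=56):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + skill_dim, 512), nn.ReLU(),
                                 nn.Linear(512, act_dim), nn.Tanh())

    def forward(self, obs, skill):
        return self.net(torch.cat([obs, skill], dim=-1))

# Keeping the low-level policy fixed, a different high-level (or task) policy
# can feed it skills to reuse the same motor repertoire for new tasks.
high, low = HighLevelPolicy(), LowLevelPolicy()
mocap_frames, obs = torch.randn(1, 300), torch.randn(1, 600)
action = low(obs, high(mocap_frames))
```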

Graphic of a task policy feeding into a low-level policy. The task policy takes an observation from the humanoid as input, and outputs a “skill.” The skill and humanoid observation are fed into a low-level policy, which outputs the motor action.
Figure 3: We can replace the high-level policy with a task policy that is trained to output skills required to achieve some new task, such as running to a target. 

In light of that, we replace the high-level policy with a task policy that is then trained to steer the low-level policy towards achieving some task. As an example, we train a task policy to have the humanoid reach a target. Notice that the humanoid uses many low-level skills, like running, turning, and side-stepping. 

Graphic of the GPT policy. A sequence of humanoid observations is fed into the GPT module, which outputs the motor action.
Figure 4: Our GPT model takes in a sequence of observations from the humanoid (called the “context”) and outputs an action that it thinks best continues the observed motion. 

Our second example centers on motion completion, which is inspired by the task of sentence completion. Here, we use the GPT architecture, which accepts a sequence of sensory measurements (the “context”) and outputs a motor action. We train a control policy to take one second of sensory measurements from the dataset and output the corresponding motor actions from the specialized expert. Then, before executing the policy on our humanoid, we first generate a “prompt” (red humanoid in the videos) by executing a specialized expert for one second. Afterwards, we let the policy control the humanoid (bronze humanoid in the videos): at each time step, it takes the previous second of sensory measurements and predicts the motor actions. We find that this policy can reliably repeat the underlying motion of the clip, which is demonstrated in the first two videos. On other MoCap clips, we find that the policy can deviate from the underlying clip in a plausible way, such as in the third video, where the humanoid transitions from side-stepping to walking backwards.
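The inference loop just described can be sketched as a sliding window over the most recent observations: query the model for the next action, step the environment, and slide the window forward. The policy and environment interfaces below are hypothetical stand-ins, not the MoCapAct code.

```python
from collections import deque
import torch

CONTEXT_STEPS = 30   # roughly one second of observations at the control rate (assumed)

def motion_completion_rollout(gpt_policy, env, prompt_observations, num_steps=500):
    """Roll out a GPT-style motion-completion policy.
    `gpt_policy` maps a stacked context of observations to a motor action, and
    `prompt_observations` are observation tensors recorded while a clip expert
    drove the humanoid for one second (the 'prompt'). Both interfaces are hypothetical."""
    context = deque(prompt_observations, maxlen=CONTEXT_STEPS)
    for _ in range(num_steps):
        stacked = torch.stack(list(context)).unsqueeze(0)   # shape (1, context, obs_dim)
        with torch.no_grad():
            action = gpt_policy(stacked)
        obs = env.step(action)        # hypothetical env returning the next observation
        context.append(obs)           # slide the context window forward one step
    return context
```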

On top of the dataset, we also release the code used to generate the policies and results. We hope the community can build off of our dataset and work to do incredible research in the control of humanoid robots. 

Our paper is available here. You can read more at our website.

The data used in this project was obtained from mocap.cs.cmu.edu.
The database was created with funding from NSF EIA-0196217. 

The post MoCapAct: Training humanoid robots to “Move Like Jagger” appeared first on Microsoft Research.


DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization

Three bar plots. The first plot shows that the model size of XTC-BERT is 32 times smaller than that of BERT, and two dots show the accuracy of BERT and XTC-BERT, which are 83.95 and 83.44, respectively. The second one shows that INT8 using ZeroQuant can be 2.6 times faster than Baseline with FP16 using PyTorch and ZeroQuant can reduce the number of GPUs for inference from 2 to 1, which in total provides 5.2 times efficiency. It also shows that ZeroQuant has 50.4 accuracy compared to 50.5 using Baseline PyTorch. The third plot shows that ZeroQuant is more than 5000 times cheaper than baseline to compress a model, and the accuracy of ZeroQuant is 42.26 compared to 42.35 of baseline.

Large-scale models are revolutionizing deep learning and AI research, driving major improvements in language understanding, creative text generation, multilingual translation, and much more. But despite their remarkable capabilities, the models’ large size creates latency and cost constraints that hinder the deployment of applications on top of them. In particular, increased inference time and memory consumption inhibit deployment of models in latency-sensitive and resource-constrained applications on both server and client devices.

To address these deployment challenges, the DeepSpeed team, as part of Microsoft’s AI at Scale initiative, has been exploring innovations in system optimization and model compression. On the former, we released the DeepSpeed inference system, which consists of a diverse set of optimizations, such as highly optimized CUDA kernels and inference-adapted parallelism to accelerate model inference speed, as well as ZeRO-Inference, which breaks the GPU memory wall and fits large models across heterogeneous memories to address hardware accessibility limitations. These optimizations improve inference system efficiency while preserving model size, the amount of computation, and model accuracy: the total work remains the same, but the processing capability and speed are higher. On the latter, emerging compression algorithms show great potential in reducing model size and inference computation. These algorithms use condensed formats to represent, store, communicate, and compute DNN models, reducing the total work needed for inference with little or no loss in accuracy.

System optimizations and model compression are very much complementary, and they can be synergistically combined to provide a multiplicative reduction in inference latency and cost. Motivated by combining the best of both worlds, we are proud to announce DeepSpeed Compression—a composable library that combines novel compression technologies and highly efficient system optimizations to make DL model size smaller and inference speed faster, all with much lower compression cost.

Challenges of compressing large deep learning models

Although there have been numerous efforts to compress model sizes and reduce inference computation, applying existing compression techniques to large scale models still has many challenges in practice:

Complex pipeline for achieving a high compression ratio. Various strategies have been proposed to overcome optimization difficulty and accuracy degradation when compressing large models. However, no systematic study on best practices for extreme compression, such as using aggressive quantization methods and layer reduction, exists. This leaves the underlying question unanswered: do we really need those ad-hoc tricks to recover the accuracy loss, or do simpler yet more effective methods exist?

High compression cost. Existing methods for compressing large models incur high training costs. For example, popular compression methods such as quantization-aware training (QAT) and multi-stage distillation lead to long training times and large hardware resource requirements as models grow to multi-billion parameters or even larger, making compressing these models costly and difficult. For example, the 20B GPT-NeoX model was pre-trained using 96 NVIDIA A100 GPUs in three months. Performing QAT even with 10% of the training samples would still require a large amount of computational resources, which many practitioners cannot afford.

Lack of tailored system optimizations for compressed models. To maximize the benefits of compressed models, specialized system optimizations are often required, e.g., quantized and sparsified models need optimized low-bit arithmetic computation and sparse matrix multiplication to boost the inference speed on commodity hardware. Existing methods often focus on reducing theoretical computation overhead but miss the opportunities to offer the best inference latency reduction via tailored system optimizations for the compressed models.

Limited composability. Existing methods have limited composability from two aspects. First, there is limited composability among multiple compression methods. Although well-performing compression solutions have been proposed independently, combining multiple methods together for the best outcome is still a laborious process, requiring building a complex compression pipeline. Second, there is a lack of composability between compression techniques and system optimizations. As we just mentioned, compressed models require specialized system optimizations to maximize latency and cost reduction. However, few existing methods take an end-to-end approach of composing compressions with system optimizations, as it requires significant efforts to bring modeling, algorithm, and system areas of deep learning to work synergistically together.

DeepSpeed Compression overcomes these challenges by offering novel state-of-the-art compression techniques, such as XTC for 32x smaller model size and ZeroQuant for a 5,000x reduction in compression cost. It also takes an end-to-end approach to improve the computation efficiency of compressed models via a highly optimized inference engine. Furthermore, our library has multiple built-in state-of-the-art compression methods and supports synergistic composition of these methods together with the system optimizations, offering the best of both worlds while allowing a seamless and easy-to-use pipeline for efficient DL model inference. Each of these features is explained further below.

Smaller model size: 32x smaller transformer models via simple yet effective binarized extreme compression

Reducing the size of large models is critical when deploying them on both servers and client devices. In DeepSpeed Compression, we provide extreme compression techniques to reduce model size by 32x with almost no accuracy loss or to achieve 50x model size reduction while retaining 97% of the accuracy. We do this through two main techniques: extreme quantization and layer reduction. Extreme quantization via ternarization/binarization reduces the model size significantly but is considered a particularly challenging task due to the large quantization error resulting in model performance degradation. To improve the accuracy of binarized/ternarized models, existing methods often adopt complicated and computationally expensive compression pipelines, such as multi-stage distillation. However, it remains unclear how different components in extreme quantization affect the resulting performance. To tease apart their effects, we perform a systematic study on the impacts of various techniques currently used for extreme compression.

In this process, we have identified several best practices for extreme compression:

  1. A longer training iteration with learning rate decay is highly preferred for closing the accuracy gap of extreme quantization;
  2. Single-stage knowledge distillation with more training budgets is sufficient to match or even exceed accuracy from multi-stage ones;
  3. Training without data augmentation hurts performance on downstream tasks for various compression tasks, especially on smaller tasks;
  4. Lightweight layer reduction matches or even exceeds expensive pre-training distillation for task-specific compression.

Based on these findings, we greatly simplify the procedure of extreme compression and propose a new extreme compression technique, XTC, that compresses a model to its limit with lightweight layer reduction and robust binarization. XTC produces models with little loss in accuracy yet up to 50x model size reduction, as shown in Figure 1. XTC reduces the model size by 32x with almost no loss in the average score on the GLUE tasks via simple yet effective binarization technique. By combining extreme quantization and lightweight layer reduction, we can further improve the binarized model, achieving 50x model size reduction while retaining 97% of the accuracy. Given that transformers are becoming the standard architecture choice for AI, we believe the investigation and the proposed solution could be highly impactful to power large-scale models on resource-constrained devices. If you are interested in XTC, you can also find more details in our technical report “Extreme Compression for Pre-trained Transformers Made Simple and Efficient.”
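To give a rough sense of what robust binarization boils down to mechanically, here is a minimal sketch of binarizing a weight matrix with a per-row scaling factor (training details such as the straight-through estimator and knowledge distillation are omitted). This illustrates the general idea rather than the exact XTC recipe.

```python
import torch

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    """Approximate a full-precision weight matrix with {-alpha, +alpha} per row:
    alpha is the mean absolute value of the row, which minimizes the L2 error
    of a sign-based binarization."""
    alpha = w.abs().mean(dim=1, keepdim=True)   # per-row scaling factor
    return alpha * torch.sign(w)

w = torch.randn(768, 768)
w_bin = binarize_weights(w)
# Storage drops from 32 bits to 1 bit per weight (plus one scale per row),
# which is the source of the ~32x model-size reduction discussed above.
print((w - w_bin).pow(2).mean())
```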

A Pareto frontier plot showing multiple compression methods, including XTC-BERT, BinaryBERT with TWN, BinaryBERT with BWN, TernaryBERT, TernaryTinyBERT, 3-bit BERT, 3-bit TinyBERT, 8-bit BERT, 8-bit TinyBERT, and the original BERT teacher. The x-axis shows the model size, and the y-axis shows the GLUE score. Different settings of the proposed XTC-BERT sit in the top-left corner of the plot, advancing the Pareto frontier with smaller model sizes and better GLUE scores.
Figure 1: Comparison between XTC and other state-of-the-art compression results on BERT

Lower compression cost: Quantizing models with >5000x compression cost reduction and no training data

Large-scale transformer models with hundreds of billions of parameters are usually challenging to quantize due to the lack of training resources and/or data access. To resolve those issues, we propose a method called ZeroQuant, which quantizes large-scale models with little or no fine-tuning cost on limited resources. Under the hood, ZeroQuant contains two major parts: 1) a hardware friendly fine-grained quantization scheme that allows us to quantize weights and activations into low-bit values with minimal errors while still empowering fast inference speed on commodity hardware with low quantization/dequantization cost; and 2) a layer-by-layer knowledge distillation pipeline, which fine-tunes the quantized model to close the accuracy gap from low-precision (e.g., INT4) quantization.

The benefits of ZeroQuant are threefold: First, unlike previous quantization-aware training that requires expensive retraining and parameter tuning, ZeroQuant enables quantizing BERT and GPT-style models from FP32/FP16 into INT8 weights and activations while retaining accuracy and without incurring any retraining cost, as shown in Figure 2. Second, by loading only one layer at a time for low-precision (e.g., INT4) quantization, the maximum memory footprint required to quantize the model depends solely on the size of an individual layer rather than the entire model, allowing one to quantize gigantic models with as little as one GPU. Third, our quantization method is data-free, which means that it does not require the original training data of the model to obtain a quantized model. This is especially useful when the data is not available due to privacy-related reasons, for example.
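The sketch below shows group-wise symmetric INT8 quantization of a weight tensor, the kind of fine-grained scheme described above, with one scale per small group of weights rather than per tensor. The group size and function signatures are illustrative; see the DeepSpeed Compression library for the actual implementation.

```python
import torch

def quantize_int8_groupwise(w: torch.Tensor, group_size: int = 64):
    """Symmetric per-group INT8 quantization: each group of `group_size`
    consecutive weights gets its own scale, reducing quantization error
    compared with a single per-tensor scale."""
    flat = w.reshape(-1, group_size)
    scale = flat.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(flat / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(768, 768)
q, scale = quantize_int8_groupwise(w)
w_hat = dequantize(q, scale, w.shape)
print((w - w_hat).abs().max())   # fine-grained groups keep this error small
```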

A graph demonstrating two ways to convert an FP16 or FP32 model to INT8. On top, it shows quantization aware training with a crying emoji, which needs training data and training GPUs to perform a retraining-evaluation-parameter tuning loop in order to get the INT8 model. At the bottom, it shows ZeroQuant (with a smile emoji) which does not require training data and GPUs, and it can directly convert the FP16 or FP32 model to INT8.
Figure 2: Comparison between ZeroQuant and standard Quantization Aware Training. ZeroQuant can significantly reduce training resources and time cost, without requiring the original training data.

We demonstrated the scalability of ZeroQuant on a GPT-3-style model with 1.3B parameters (GPT-3-1.3B) and one of the largest open-source language models, GPT-NeoX (20B). Particularly, thanks to the fine-grained quantization scheme, ZeroQuant can convert GPT-3-1.3B (trained on 128 NVIDIA A100 GPUs for five days) and GPT-NeoX (trained on 96 A100 GPUs for three months) to INT8 without any retraining cost or training data while delivering comparable accuracy. Furthermore, with the lightweight layer-by-layer knowledge distillation, ZeroQuant can quantize GPT-3-1.3B with mixed INT4/INT8 precision in three hours on a single GPU, which leads to a 5,000x compression cost reduction compared to quantization-aware training. To find more details about ZeroQuant, refer to “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”.

Faster inference speed: Latency reduction via highly optimized DeepSpeed Inference system

System optimizations play a key role in efficiently utilizing the available hardware resources and unleashing their full capability through inference optimization libraries like ONNX Runtime and DeepSpeed. We build our work on top of DeepSpeed Inference, which provides high-performance model serving with inference-optimized kernels, parallelism, and memory optimizations, covering a wide variety of models for both latency-sensitive and throughput-oriented applications. Beyond leveraging these, we also extend the inference capability to support models in compressed formats. For example, we developed variations of efficient low-bit computation, such as INT8 GeMM kernels. These kernels load INT8 parameters and activations from GPU device memory into registers and use a customized INT8 GeMM implemented on top of CUTLASS, tuned for different batch sizes, to deliver faster GeMM computation. The kernels also fuse the quantization and dequantization operations before and after the GeMM, further reducing kernel invocation overhead and improving memory bandwidth utilization.

Furthermore, our inference engine supports many-GPU transformer layers for serving transformer models across GPUs using inference-adapted parallelism strategies. For compressed models with a smaller memory footprint, the inference engine can automatically shrink the number of GPUs required to serve a model, reducing cross-GPU communication and hardware cost. For example, DeepSpeed Compression leverages INT8 for GPT-NeoX (20B) and reduces the GPU requirement for serving the model from two to one, cutting latency from 65ms to 25ms and achieving a 5.2x cost reduction. As shown in Figure 3, DeepSpeed INT8 kernels can boost performance by up to 2x compared to our own FP16 kernels, and they achieve a 2.8-5.2x latency cost reduction compared to the baseline FP16 in PyTorch, significantly reducing the latency and cost of large-scale model inference.
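The arithmetic that such a fused INT8 GeMM kernel performs can be written down in a few lines. The NumPy sketch below quantizes activations per token, accumulates the INT8 product in INT32, and folds both scale factors into the output in one step; the real DeepSpeed kernels do all of this inside a single CUTLASS-based CUDA kernel, so this reference only illustrates the math, not the kernel.

```python
# Reference-level sketch of a "fused" INT8 GeMM: per-token activation
# quantization, INT32 accumulation, and dequantization folded into the output.
import numpy as np

def quantize_sym_int8(x, axis):
    scale = np.maximum(np.abs(x).max(axis=axis, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def int8_gemm_reference(x_fp32, w_int8, w_scale):
    x_int8, x_scale = quantize_sym_int8(x_fp32, axis=-1)        # per-token scales
    acc = x_int8.astype(np.int32) @ w_int8.T.astype(np.int32)   # INT32 accumulation
    return acc.astype(np.float32) * x_scale * w_scale           # fused dequantization

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 256)).astype(np.float32)            # 8 tokens
w = rng.standard_normal((512, 256)).astype(np.float32)          # 512 output features
w_int8, w_scale = quantize_sym_int8(w, axis=-1)                 # per-output-channel scales
out = int8_gemm_reference(x, w_int8, w_scale.T)
print(np.abs(out - x @ w.T).mean())                             # small error vs. FP32 GeMM
```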

A bar plot comparing the inference efficiency among three methods (PyTorch FP16, DeepSpeed Inference FP16, and DeepSpeed Inference INT8) on four models (GPT-2 XL with 1.5 billion parameters, GPT-Neo with 2.7 billion parameters, GPT-J with 6 billion parameters, and GPT-NeoX with 20 billion parameters). Overall, it shows that DeepSpeed Inference FP16 is 2-4x more efficient than PyTorch FP16, and DeepSpeed Inference INT8 is 3-5x more efficient than PyTorch FP16.
Figure 3: Inference efficiency improvements for large-scale transformer models with publicly available checkpoints using optimized DeepSpeed Inference engine. Inference efficiency is calculated by the inference latency speedup divided by the hardware cost reduction rate. DeepSpeed Inference achieves 2.8-4.8x latency reduction and up to 5.2x inference efficiency improvements.

A library that synergistically composes compression algorithms and system optimizations

DeepSpeed Compression provides a seamless pipeline to address these compression composability challenges, as shown in Figure 4. The core piece of DeepSpeed Compression is a component called the compression composer, which includes several significant features:

A graph about the DeepSpeed Compression library. It has two levels. On top, it shows a trained model (with certain accuracy, latency, and model size) getting passed into the compression composer and becoming the compressed model. Then the compressed model gets into the optimized inference engine to get the final faster or smaller model depending on the requirement on accuracy and latency. At the bottom, it shows how the compression composer makes compression decisions based on the network architecture of the model and composition of compression techniques, including distillation, pruning, and quantization; and it shows optimized inference engine improving inference performance by parallelization and efficient kernels.
Figure 4: The DeepSpeed Compression library
  1. It offers multiple cutting-edge compression methods, as shown in Table 1, including extreme quantization, head/row/channel pruning, and knowledge distillation, that can effectively reduce model size and inference cost. The list will expand as we continually integrate more state-of-the-art compression methods.
Category         Methods                                  Targets
Quantization     INT8/INT4                                Activations
                 INT8/INT4/Ternary/Binary                 Weights
Sparsification   Head pruning                             Attention head (Transformer)
                 Sparse/Row pruning                       Weights
                 Channel pruning                          Conv2D weights
Layer Reduction  Arbitrary subset of network layers       Layers
Distillation     Output logits, feature map, attn. map    Layers
Table 1: Compression techniques supported in the DeepSpeed Compression composer.
  2. It offers an easy-to-use API that automatically takes care of the complexities of assembling different compression techniques to deliver the compound benefits of multiple compression methods. For example, XTC requires the composition of lightweight layer reduction, binarization, and knowledge distillation, and composing them is non-trivial. With our compression composer, applying extreme compression is as easy as adding two new API calls to enable compression and clean the compressed model.
  3. It is designed in a modular way so that it is easy for users to add new compression schemes. For example, additional compression methods can be added through custom compression layers and, by registering them with the compression composer, the new methods can be composed with the existing methods the composer already manages.
  4. It seamlessly works with the existing DeepSpeed library. This has two benefits. First, DeepSpeed Compression can be specified and enabled the same way as DeepSpeed training and inference, via a JSON file, where enabling a different combination of compression techniques requires only a few lines of modification (a hypothetical configuration sketch follows this list). Second, once the compression schemes have been configured, the compression composer automatically modifies the model layers and training to enable the compression process and requires no additional changes from the user to the model structure or the training procedure.
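As a rough illustration of that JSON-driven workflow, the fragment below shows what such a configuration might look like. The section and key names are assumptions made for this sketch rather than the exact DeepSpeed Compression schema; the authoritative keys and the accompanying API calls are documented in the DeepSpeed tutorials.

```python
# Hypothetical sketch of a JSON-style compression configuration. Key names are
# illustrative assumptions, not the exact DeepSpeed Compression schema.
import json

compression_config = {
    "compression_training": {
        "weight_quantization": {          # e.g., INT8 weights for Linear layers
            "enabled": True,
            "bits": 8,
            "group_size": 64,             # fine-grained, group-wise quantization
        },
        "activation_quantization": {
            "enabled": True,
            "bits": 8,
        },
        "layer_reduction": {              # lightweight layer reduction (used by XTC)
            "enabled": False,
            "keep_number_layer": 6,
        },
        "head_pruning": {
            "enabled": False,
            "dense_ratio": 0.5,
        },
    }
}

# In a real run, a dictionary like this would live in the DeepSpeed JSON config
# file; switching techniques on or off is then a few-line JSON change.
print(json.dumps(compression_config, indent=2))
```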

After the DNN model has been compressed, DeepSpeed Compression replaces the compressed layers with highly optimized kernels in the DeepSpeed Inference engine to maximize hardware efficiency. Together, the compression composer and inference engine achieve the best of both worlds of compression and system optimization, delivering a compound effect of inference cost reduction.

Use Cases of DeepSpeed Compression

Although we started DeepSpeed Compression quite recently, we have successfully leveraged it to optimize several large-scale open-source models and Microsoft production workloads. It delivers significant latency and cost reductions and is widely applicable to a variety of NLP and CV tasks.

We applied the INT8 quantization of DeepSpeed Compression to optimize two large-scale open-source GPT-3-style models, GPT-J (6B) and GPT-NeoX (20B), on the Azure AI platform. As shown in Figure 5, our quantized models achieve accuracy similar to the original models on 19 zero-shot evaluation tasks and WikiText, while achieving 3.67x and 5.2x inference cost savings, respectively, compared with the PyTorch FP16 baseline on ND A100 v4 Azure instances. Importantly, we quantize these models without requiring any training data, expensive compression time, or large GPU resources, bringing huge cost savings compared with quantization-aware training (QAT).

Figure 5: DeepSpeed Compression results of GPT-J (6B)/GPT-NeoX (20B) using ZeroQuant. Left table shows the results of model quality and inference latency for the FP16 baseline and ZeroQuant; Right figures show the compression cost comparison between Quantization-aware Training (QAT, estimated) and ZeroQuant for INT8 quantization.

Beyond open-source models, DeepSpeed Compression has also demonstrated its effectiveness in optimizing production workloads at Microsoft:

  • It reduces the size of the Microsoft Turing Image Super Resolution (T-ISR) model by 3.1x, together with a 1.85x latency reduction, by composing different compression schemes like pruning and distillation with efficient system optimizations. The model has been deployed in Bing Maps and Microsoft Edge, where it automatically derives high-resolution images from lower-resolution images, as described in this blog post.
  • It also successfully compresses the Microsoft Relevance Fusion models, a family of Transformer-based ranking models used in Bing’s core search stack. Without DeepSpeed Compression, it took three days to quantize the model using QAT. With DeepSpeed Compression, we can quantize the model in a few minutes with improved accuracy and reduced latency compared to QAT.

DeepSpeed Compression release plan

DeepSpeed Compression is still in its early stage and under active development, but we’d like to share the results and tools with DeepSpeed users as soon as possible. In this first release, we open-source the core DeepSpeed Compression components, including the compression composer, which supports various compression methods: INT8/INT4/ternary/binary quantization, lightweight layer reduction, pretraining and task-specific knowledge distillation, head pruning, row pruning, and channel pruning, for compressing both NLP and computer vision models. Together with the compression composer, we are releasing the two novel technologies introduced in this blog, XTC and ZeroQuant, as part of the library.

We hope you will try DeepSpeed Compression. Please find the code, tutorials, and documentation on the DeepSpeed GitHub and website. We highly value your feedback and comments, so let us know what you think and how we can improve. As for next steps, we plan to extend our offerings with more compression methods, broader coverage of specialized kernels for compressed models, and an optimization module that automatically finds the best compression schemes. We believe that our composable library and new innovations will help close the gap between what is possible in AI and what is deployable, as well as make DL inference faster, cheaper, and simpler.

Acknowledgement

We are a group of system and modeling researchers—Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Conglong Li, Reza Yazdani Aminabadi, Elton Zheng, Samyam Rajbhandari, Ammar Ahmad Awan, Jeff Rasley, Cheng Li, Olatunji Ruwase, Shaden Smith, Du Li, Michael Wyatt, Arash Bakhtiari, Guanhua Wang, Connor Holmes, Sam Ade Jacobs, Martin Cai, Yuxiong He (team lead)—who are enthusiastic about performance optimization of large-scale systems. We have recently focused on deep learning systems, optimizing deep learning’s speed to train, speed to convergence, and speed to develop.

The post DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization appeared first on Microsoft Research.


Confidential Containers: Verifiably secure computation in the cloud

White lock within a geometric circle over top a blue to orange color gradient background

For many organizations, trusting their data to the cloud requires having a complete understanding of and control over the environment in which that data resides and how it’s being processed. Microsoft understands this, and we are committed to building a trustworthy cloud—one in which security, privacy, and transparency are built into its core. A key part of this vision is confidential computing—a set of hardware and software capabilities that give data owners visibility into the data environment and verifiable security protection of their data in use. 

The Confidential Computing team at Microsoft Research is collaborating with hardware developers to create trusted execution environments (TEEs), where data stays encrypted not just when stored (encryption at rest) and in transit, but also during use. This work underpins the Azure confidential cloud platform, where users can upload encrypted code and data and get encrypted results back with strong privacy. 

At Microsoft Build 2022, the company announced serverless confidential containers with lift-and-shift support, the next step in the evolution of confidential computing. This service builds on the Confidential Containers work conducted at Microsoft Research. Confidential Containers offers a verifiably secure container environment in Azure where users can confirm that the software performing computations on their data is exactly the software they expect to be running, that it will do what they want it to do with their data, and that they can trust the results it returns. Confidential Containers enables users to take existing container workloads, and with a small amount of configuration, use them in a confidential environment.

Smaller trusted computing base 

Confidential Containers decreases the size of the trusted computing base (TCB)—the totality of elements in a computing environment that must be trusted not to violate the confidentiality of computation. The TCB can include software, hardware, and human administrators, among other things. Removing elements from the TCB reduces the number of components that can be compromised, decreasing the attack surface. Confidential Containers removes Microsoft administrators from the TCB, minimizing it as much as possible while still enabling customers to run existing workloads without modification.

This reduced TCB provides an option for organizations that currently run computations on their data on premises because they are concerned about the security of their data in the cloud. Even though setting up a computation environment in the cloud offers flexibility, data can be exposed to anyone who operates the servers on which the system runs. With Confidential Containers, the individuals who can access the data can be tightly controlled. This can be a single designated employee of the organization that owns the data or the business partner that is processing the data. It is never a Microsoft employee or another third party. 

Encrypted, policy-constrained computing environment 

A secure hardware environment enables data protection in use. Confidential Containers runs on AMD processors backed by AMD Secure Encrypted Virtualization-Secure Nested Paging (SEV-SNP), which provides a TEE. This hardware-enforced security boundary provides a shield so that nothing outside the encrypted memory space can read the data.

Users of Confidential Containers create a policy defining precisely what can run in the confidential container environment and how. The AMD SEV-SNP hardware produces an attestation report, which provides a succinct representation of everything in the confidential environment, including information about the code that will be enforcing the policy. Users can request this attestation report any time before providing the container with a key to unlock the encrypted dataset for processing. 
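Conceptually, the flow looks like the sketch below. Every function and service name in it (get_attestation_report, verify, release_key, and so on) is a hypothetical placeholder rather than an actual Confidential Containers, AMD SEV-SNP, or Azure API; the sketch only captures the order of operations: attest first, verify against the approved policy and measurement, and release the data key only on success.

```python
# Schematic sketch of the attest-then-release-key flow described above.
# All names here are hypothetical placeholders, not real APIs.

def run_confidential_job(policy_hash: str, expected_measurement: str,
                         key_release_service, container_env) -> bytes:
    # 1. The container environment asks the SEV-SNP hardware for an attestation
    #    report binding the launch measurement and the policy being enforced.
    report = container_env.get_attestation_report(report_data=policy_hash)

    # 2. The data owner's key-release service checks the hardware signature and
    #    that the measurement and policy match what the owner approved.
    if not key_release_service.verify(report,
                                      expected_measurement=expected_measurement,
                                      expected_policy_hash=policy_hash):
        raise PermissionError("attestation failed: refusing to release the data key")

    # 3. Only after successful verification is the decryption key released into
    #    the enclave, where the dataset is decrypted and processed.
    data_key = key_release_service.release_key(report)
    plaintext = container_env.decrypt_dataset(data_key)
    return container_env.process(plaintext)      # results returned to the data owner
```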

A cloud outline within a security shield over top a blue to orange color gradient background.

Sensitive data handling in the cloud 

Before the development of HTTPS, businesses could not securely run a storefront on the public web because communication over the internet was not secure. In the same way, individuals and organizations today cannot securely run containerized computation over sensitive data in the public cloud. Confidential Containers addresses this need. 

This is a game-changer for organizations that must comply with local and international regulations on how sensitive data is handled. For example, healthcare organizations that store encrypted patient information in the cloud are required by HIPAA regulations to download that data to perform computations on premises. This multistep process entails decrypting the data once it has been downloaded to an organization’s servers, performing the required computations, and then re-encrypting the data before re-uploading it to the cloud. It also requires ensuring that the on-premises environment contains the security architecture necessary to comply with HIPAA and other regulations. 

Because Confidential Containers provides advanced security safeguards for data in use in Azure, organizations no longer need to perform these time-consuming steps. This also means they no longer need to maintain servers on premises. Moreover, Azure users can define even stricter policies for their container environment in the cloud than they have in place in their on-premises environment.

Secure multiparty computations 

Another benefit of Confidential Containers is that it enables secure multiparty computation. A single organization can securely process multiple datasets that contain sensitive information, or multiple organizations with datasets that must remain secure can share those datasets with the assurance that their data will not leak. Organizations can perform computations on multiple datasets, such as for training a machine learning model, and gain better results than they would by computing on a single dataset, all without knowing what is in those datasets. 

Easy deployment and lift-and-shift of Linux containers 

Creating a confidential container is straightforward for Azure users who are currently using or getting ready to use containers, requiring a small amount of configuration to move existing workloads. Linux users can easily lift-and-shift their Linux containers to Confidential Containers on Azure. 

Unlimited potential with Confidential Containers 

We believe that in the future, all computing in the cloud will be confidential, and we’re excited to share Confidential Containers—a technology that plays a role in making this happen. The capabilities it provides will have implications that we have yet to imagine. We’re particularly excited by the potential of multiparty computations. The ability to perform computations in a protected environment on multiple datasets brings limitless possibilities, unlocking great value to Azure users. 

Confidential Containers is currently available for limited preview and will be available for public preview later this year. Sign up for the Confidential Containers preview. 

The post Confidential Containers: Verifiably secure computation in the cloud appeared first on Microsoft Research.


AI4Science to empower the fifth paradigm of scientific discovery

Christopher Bishop, Distinguished Scientist, Managing Director, Microsoft Research Cambridge Lab

Over the coming decade, deep learning looks set to have a transformational impact on the natural sciences. The consequences are potentially far-reaching and could dramatically improve our ability to model and predict natural phenomena over widely varying scales of space and time. Could this capability represent the dawn of a new paradigm of scientific discovery?

Jim Gray, a Turing Award winner and former Microsoft Technical Fellow, characterised the historical evolution of scientific discovery through four paradigms. With origins dating back thousands of years, the first paradigm was purely empirical and based on direct observation of natural phenomena. While many regularities were apparent in these observations, there was no systematic way to capture or express them. The second paradigm was characterised by theoretical models of nature, such as Newton’s laws of motion in the seventeenth century, or Maxwell’s equations of electrodynamics in the nineteenth century. Derived by induction from empirical observation, such equations allowed generalization to a much broader range of situations than those observed directly. While these equations could be solved analytically for simple scenarios, it was not until the development of digital computers in the twentieth century that they could be solved in more general cases, leading to a third paradigm based on numerical computation. By the dawn of the twenty-first century, computation was again transforming science, this time through the ability to collect, store and process large volumes of data, leading to the fourth paradigm of data-intensive scientific discovery. Machine learning forms an increasingly important component of the fourth paradigm, allowing the modelling and analysis of large volumes of experimental scientific data. These four paradigms are complementary and coexist. 

The pioneering quantum physicist Paul Dirac commented in 1929 that “The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble.” For example, Schrödinger’s equation describes the behaviour of molecules and materials at the subatomic level with exquisite precision, and yet numerical solution with high accuracy is only possible for very small systems consisting of a handful of atoms. Scaling to larger systems requires increasingly drastic approximations leading to a challenging trade-off between scale and accuracy. Even so, quantum chemistry calculations are already of such high practical value that they form one of the largest supercomputer workloads. 

However, over the last year or two, we have seen the emergence of a new way to exploit deep learning as a powerful tool to address this speed-versus-accuracy trade-off for scientific discovery. This is a very different use of machine learning from the modelling of data that characterizes the fourth paradigm, because the data used to train the neural networks itself comes from the numerical solution of the fundamental equations of science rather than from empirical observation. We can view the numerical solutions of scientific equations as simulators of the natural world that can be used, at high computational cost, to compute quantities of interest in applications such as forecasting the weather, modelling the collision of galaxies, optimizing the design of fusion reactors, or calculating the binding affinities of candidate drug molecules to a target protein. From a machine learning perspective, however, the intermediate details of the simulation can be viewed as training data which can be used to train deep learning emulators. Such data is perfectly labelled, and the quantity of data is limited only by the computational budget. Once trained, the emulator can perform new calculations with high efficiency, achieving significant improvements in speed, sometimes by several orders of magnitude. 
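As a toy illustration of this emulator idea (not any specific AI4Science system), the sketch below treats a small numerical integration as the ‘expensive’ simulator, uses it to generate perfectly labelled training data, and then fits a cheap surrogate that predicts the quantity of interest directly.

```python
# Toy illustration of the emulator idea: an "expensive" simulator generates
# perfectly labelled training data, and a cheap surrogate is fitted to predict
# the quantity of interest, trading a one-off simulation budget for fast reuse.
import numpy as np

def simulate_period(theta0: float, dt: float = 1e-3) -> float:
    """'Expensive' ground truth: integrate a pendulum (L = g = 1) with
    semi-implicit Euler and return its period for initial angle theta0 (radians)."""
    theta, omega, t = theta0, 0.0, 0.0
    while True:
        prev_omega = omega
        omega -= np.sin(theta) * dt
        theta += omega * dt
        t += dt
        if prev_omega > 0.0 and omega <= 0.0:   # omega returns to zero: one full period
            return t

# "Training data" from the simulator: labels are exact up to solver error.
angles = np.linspace(0.1, 2.5, 40)
periods = np.array([simulate_period(a) for a in angles])

# Cheap emulator: a low-order polynomial fit standing in for a neural network.
emulator = np.poly1d(np.polyfit(angles, periods, deg=4))

query = 1.7
print(f"simulator: {simulate_period(query):.4f}  emulator: {emulator(query):.4f}")
```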

This ‘fifth paradigm’ of scientific discovery represents one of the most exciting frontiers for machine learning as well as for the natural sciences. While there is a long way to go before these emulators are sufficiently fast, robust, and general-purpose to become mainstream, the potential for real-world impact is clear. For example, the number of small-molecule drug candidates alone is estimated at 10^60, while the total number of stable materials is thought to be around 10^180 (roughly the square of the number of atoms in the known universe). Finding more efficient ways to explore these vast spaces would transform our ability to discover new substances such as better drugs to treat disease, improved substrates for capturing atmospheric carbon dioxide, better materials for batteries, new electrodes for fuel cells to power the hydrogen economy, and myriad others.

“AI4Science is an effort deeply rooted in Microsoft’s mission, applying the full breadth of our AI capabilities to develop new tools for scientific discovery so that we and others in the scientific community can confront some of humanity’s most important challenges. Microsoft Research has a 30+ year legacy of curiosity and discovery, and I believe that the AI4Science team – spanning geographies and scientific fields – has the potential to yield extraordinary contributions to that legacy.”

– Kevin Scott, Executive Vice President and Chief Technology Officer, Microsoft

I’m delighted to announce today that I will be leading a new global team in Microsoft Research, spanning the UK, China and the Netherlands, to focus on bringing this fifth paradigm to reality. Our AI4Science team encompasses world experts in machine learning, quantum physics, computational chemistry, molecular biology, fluid dynamics, software engineering, and other disciplines who are working together to tackle some of the most pressing challenges in this field.

An example project is Graphormer, led by my colleague Tie-Yan Liu in our China team. This is a deep learning package that allows researchers and developers to train custom models for molecule modelling tasks, such as materials science or drug discovery. Recently, Graphormer won the Open Catalyst Challenge, a molecular dynamics competition that aims to model the catalyst-adsorbate reaction system with AI and includes more than 660,000 catalyst-adsorbate relaxation systems (144 million structure-energy frames) simulated with density functional theory (DFT) software. Another project, from our team in Cambridge in collaboration with Novartis, is Generative Chemistry, where together we are empowering scientists with AI to speed up the discovery and development of breakthrough medicines.

As Iya Khalil, Global Head of the AI Innovation Lab at Novartis, recently noted, the work is no longer science fiction but science-in-action:

“Not only can AI learn from our past experiments, but, with each new iteration of designing and testing in the lab, the machine learning algorithms can identify new patterns and help guide the early drug discovery and development process. Hopefully in doing this we can augment our human scientists’ expertise so they can design better molecules faster.”

The team has since used the platform to generate several promising early-stage molecules which have been synthesised for further exploration.

Alongside our teams in China and the UK, we have been growing a team in the Netherlands, including hiring the world-renowned machine learning expert, Max Welling. I am also excited to be able to announce today that our brand-new Lab in Amsterdam will be housed in Matrix One, which is currently under construction on the Amsterdam Science Park. This purpose-built space is in close proximity to the University of Amsterdam and the Vrije Universiteit Amsterdam, and we will maintain strong affiliations with both institutions through the co-supervision of PhD students.

Image of Amsterdam office
Matrix One building in Amsterdam

It is with pride and excitement that we take this next step to come together as a cross-geographical team and follow in the footsteps of pioneers before us, to contribute to this next paradigm of scientific discovery, and in doing so impact many important societal challenges. If you share our excitement and ambition, and would like to join us, I encourage you to look at our open positions or get in touch to talk to anyone on the team.

The post AI4Science to empower the fifth paradigm of scientific discovery appeared first on Microsoft Research.
