
Using reinforcement learning to identify high-risk states and treatments in healthcare

Figure at the start of a maze showing several paths. Four paths include a medical dead-end, and each stops before reaching the end. Only one path does not include a medical dead-end, and this one goes clear through to the end.

As the pandemic overburdens medical facilities and clinicians become increasingly overworked, the ability to quickly decide on the best possible treatment is even more critical. In urgent health situations, such decisions can mean life or death. However, certain treatment protocols can pose a considerable risk to patients with serious medical conditions and can potentially contribute to unintended outcomes.

In this research project, we built a machine learning (ML) model that works with scenarios where data is limited, such as healthcare. This model was developed to recognize treatment protocols that could contribute to negative outcomes and to alert clinicians when a patient’s health could decline to a dangerous level. You can explore the details of this research project in our research paper, “Medical Dead-ends and Learning to Identify High-risk States and Treatments,” which was presented at the 2021 Conference on Neural Information Processing Systems (NeurIPS 2021).

Reinforcement learning for healthcare

To build our model, we decided to use reinforcement learning—an ML framework that’s uniquely well-suited for advancing safety-critical domains such as healthcare. This is because at its core, healthcare is a sequential decision-making domain, and reinforcement learning is the formal paradigm for modeling and solving problems in such domains. In healthcare, clinicians base their treatment decisions on an overall understanding of a patient’s health; they observe how the patient responds to this treatment, and the process repeats. Likewise, in reinforcement learning, an algorithm, or agent, interprets the state of its environment and takes an action, which, coupled with the internal dynamics of the environment, causes it to transition to a new state, as shown in Figure 1. A reward signal is then assigned to account for the immediate impact of this change. For example, in a healthcare scenario, if a patient recovers or is discharged from the intensive care unit (ICU), the agent may receive a positive reward. However, if the patient does not survive, the agent receives a negative reward, or penalty.
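To make the analogy concrete, the following is a minimal, schematic sketch of the agent-environment loop described above. The environment, the state variables, and the reward values are hypothetical placeholders for illustration only; they are not the clinical model used in this work.

```python
# A minimal, schematic sketch of the reinforcement learning loop described above.
# The environment, state variables, and reward values are hypothetical placeholders.

class ToyICUEnvironment:
    """Toy stand-in for patient dynamics; real transition probabilities are unknown
    and only observed through previously collected patient trajectories."""

    def reset(self):
        # Initial state s: a simplified snapshot of the patient's condition.
        return {"heart_rate": 95, "blood_pressure": 110, "sofa": 4}

    def step(self, state, action):
        # Returns (next_state, reward, terminal). A positive reward is given on
        # recovery/discharge and zero reward at all intermediate transitions.
        next_state = dict(state)
        next_state["sofa"] = max(0, state["sofa"] - 1)  # placeholder dynamics
        if next_state["sofa"] == 0:
            return next_state, +1.0, True
        return next_state, 0.0, False


def run_episode(env, policy, max_steps=18):  # 18 four-hour steps = 72 hours
    state, total_reward = env.reset(), 0.0
    for _ in range(max_steps):
        action = policy(state)                             # agent/clinician selects a treatment a
        state, reward, terminal = env.step(state, action)  # observe the next state and reward R
        total_reward += reward
        if terminal:
            break
    return total_reward


# Example usage with a trivial placeholder policy:
print(run_episode(ToyICUEnvironment(), policy=lambda s: "iv_fluid_low"))
```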

Figure 1: Sequential decision-making in healthcare: Clinicians or AI agents observe the state of the patient \(s\), select a treatment \(a\), and monitor the next state. The process then repeats. As a result of each such transition of the patient’s state (whose probability is denoted by \(T\)), a reward signal \(R\) is observed, which accounts for the immediate consequence of the applied treatment.

Reinforcement learning is widely used in gaming, for example, to determine the best sequence of chess moves and maximize an AI system’s chances of winning. Over time, due to trial-and-error experimentation, the desired actions are maximized and the undesired ones are minimized until the optimal solution is identified. Normally, this experimentation is made possible by the proactive collection of extensive amounts of diverse data. However, unlike in gaming, exploratory data collection and experimentation are not possible in healthcare, and our only option in this realm is to work with previously collected datasets, providing very limited opportunities to explore alternative choices. This is where offline reinforcement learning comes into focus. A subarea of reinforcement learning, offline reinforcement learning works only with data that already exists—instead of proactively taking in new data, we’re using a fixed dataset. Even so, to propose the best course of action, an offline reinforcement learning algorithm still requires sufficient trial-and-error with alternatives, and this necessitates a very large dataset, something not feasible in safety-critical domains with limited data, like healthcare.

In the current research literature, when reinforcement learning is applied to healthcare, the focus is on what to do to support the best possible patient outcome, an objective that is often infeasible to pursue reliably with limited offline data. In our paper, we propose inverting this paradigm in offline settings to investigate high-risk treatments and to identify when the state of a patient’s health reaches a critical point. To enable this approach, we developed a methodology called Dead-end Discovery (DeD), which identifies treatments to avoid in order to prevent a medical dead-end—the point at which the patient is most likely to die regardless of future treatment. DeD provably requires exponentially less data than standard methods, making it significantly more reliable in limited-data situations. By identifying known high-risk treatments, DeD could assist clinicians in making trustworthy decisions in highly stressful situations, where minutes count. Moreover, this methodology could raise an early warning flag and alert clinicians when a patient’s condition carries significant risk, often before that risk becomes obvious. We go into more detail on the DeD methodology later in this post.

Medical dead-ends and rescue states

In the ICU, each patient experiences a trajectory that sequentially tracks the state of their health. It starts with the patient’s condition upon admission, followed by the administration of treatment and then by their response to the treatment. This sequence repeats until the patient reaches a terminal state—the final observation of the patient’s condition that’s still relevant within the ICU. To learn what treatments to avoid, we focus on two types of terminal states: patient recovery and patient death. Other terminal states can also exist. For example, when playing chess, wins and losses are not the only possible outcomes; draws can also occur. While our framework can encompass additional terminal states, this work focuses on only two possibilities: positive outcomes and negative outcomes.

Building on these two terminal states, we define medical dead-ends as patient states from which all possible future trajectories will lead to the terminal state of the patient’s death. In acute care settings, it’s critical both to avoid medical dead-ends and to identify the probability with which any selected treatment will lead to them. It’s also important to note that medical dead-ends can occur considerably earlier than clinicians are able to observe them. This makes DeD particularly valuable, as every hour counts when it comes to critical conditions.

To contrast with medical dead-ends, we also propose the concept of rescue states, from which recovery remains fully reachable. At each rescue state, there exists at least one treatment that would lead, with probability 1, either to another rescue state or to recovery. In most cases, a patient’s condition is neither a medical dead-end nor a rescue state, as the relevant probabilities of future mortality and recovery are not exactly 0 or 1 but somewhere in between. Therefore, it’s important to have an alert when a patient is likely to enter a medical dead-end.

Figure 2: Using sepsis as an example use case, this diagram shows simplified possible trajectories for a single patient upon admission to the ICU. Each branch represents the septic patient’s trajectory in response to a sample sequence of treatments, represented by a black dot (VP = vasopressor + IV = intravenous fluid). Avatars with blue borders and “RS” above them represent rescue states. Avatars with red borders and “MD” above them represent medical dead-ends. The shading of each avatar roughly indicates the state of the patient’s condition in response to treatment. More shading represents an improving condition and less shading represents a worsening condition. No shading represents the terminal state where the patient does not survive. The slumping avatar represents a medical dead-end, which is significantly far from the terminal state and may not be observable by the clinicians. A critical point here is one step before this medical dead-end, represented by the grey avatar, where there is still a chance to save the patient.  
Patient vital signs taken at the ICU: HR=heart rate; BP=blood pressure; RR=respiration rate; SOFA=sequential organ failure assessment score  

Treatment security: How to help doctors

To develop our model, we considered a generic condition that guarantees the merit and reliability of a given treatment-selection policy. In particular, we postulated the following condition we called treatment security:

If at state \(s\), treatment \(a\) causes transitioning to a medical dead-end with any given level of certainty, then the policy must refrain from selecting \(a\) at \(s\) with the same level of certainty.

For example, if a certain treatment leads to a medical dead-end or immediate death with a probability of more than 80 percent, that treatment should be selected for administration no more than 20 percent of the time.
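Read this way, treatment security is simply a cap on how often a treatment may be selected. The check below is a minimal illustration of that cap, using the hypothetical numbers from the example above.

```python
def is_secure_selection(selection_prob, dead_end_prob):
    """Treatment security as a cap on selection probability: if a treatment leads to
    a medical dead-end (or immediate death) with probability `dead_end_prob`, the
    policy may select it with probability at most 1 - dead_end_prob."""
    return selection_prob <= 1.0 - dead_end_prob


# The example from the text: a treatment with an 80 percent chance of leading to a
# medical dead-end should be selected no more than 20 percent of the time.
assert is_secure_selection(selection_prob=0.15, dead_end_prob=0.80)       # secure
assert not is_secure_selection(selection_prob=0.35, dead_end_prob=0.80)   # not secure
```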

While treatment security is a desired property, it’s not easy to enforce directly because the required probabilities are not known a priori, nor are they directly measurable from the data. Therefore, at the core of our method, we developed a theoretical framework that enables treatment security to be learned from data by mapping it to appropriate learning problems.

DeD: Dead-end Discovery methodology

To precisely define the learning problems, we based our DeD methodology on three core ideas: 1) separating the outcomes, 2) learning the optimal value function of each outcome in isolation without discounting, and 3) proving important properties for these particular value functions, which enable treatment security.

We constructed two simple reward signals for independent learning problems:

  1. -1 in the case of a negative outcome; 0 at all other transitions
  2. +1 in the case of a positive outcome; 0 at all other transitions

Next, we learned their corresponding optimal value functions, \(Q_{D}^{*}(s, a)\) and \(Q_{R}^{*}(s, a)\), both with no discounting. It turns out that these value functions are intrinsically important. In fact, we show that:

\(-Q_{D}^{*}(s, a)\) corresponds to the minimum probability of a future negative outcome if treatment \(a\) is selected at state \(s\). Equivalently, \(1 + Q_{D}^{*}(s, a)\) corresponds to the maximum hope of a positive outcome.

Moreover, the quantity \(1 + Q_{D}^{*}(s, a)\) provides a meaningful threshold for making a policy secure. We formally show that, for treatment security, it is sufficient to abide by the maximum hope of recovery.

We further proved that if the probability of selecting each treatment is kept no higher than \(Q_{R}^{*}(s, a)\), the patient is guaranteed to remain in rescue states when possible. Finally, we also showed that such thresholds for limiting the treatment selection probabilities exist.
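To make these quantities concrete, here is a minimal sketch of how the two reward signals can be attached to recorded transitions and how the corresponding undiscounted value functions might be fit offline. The network architecture, dimensions, and update rule are illustrative assumptions and not the exact training code released with the paper.

```python
import torch
import torch.nn as nn

# Reward construction for the two independent learning problems (no discounting, gamma = 1):
def reward_d(outcome):   # negative-outcome problem: -1 on death, 0 at all other transitions
    return -1.0 if outcome == "negative" else 0.0

def reward_r(outcome):   # positive-outcome problem: +1 on recovery, 0 at all other transitions
    return 1.0 if outcome == "positive" else 0.0


class QNetwork(nn.Module):
    """Illustrative state-action value network over 25 discrete treatment options,
    taking a 44-dimensional observation vector as the state (sizes match the cohort
    described later in this post; the architecture itself is an assumption)."""

    def __init__(self, state_dim=44, n_treatments=25):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_treatments))

    def forward(self, state):
        return self.net(state)


def bellman_target(q_net, reward, next_state, terminal):
    """Undiscounted Bellman-optimality target computed purely from logged transitions.
    One network is trained with reward_d (giving Q_D) and another with reward_r (giving Q_R)."""
    if terminal:
        return torch.tensor(reward)
    return reward + q_net(next_state).max().detach()
```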

Building from these results, we defined a training and deployment pipeline, illustrated in Figure 3.

Figure 3: The DeD pipeline: section a illustrates the training process, resulting in the learned optimal value functions, and section b shows the deployment of the pipeline, which ends with providing critical information to the human decision-maker.

Applying the DeD methodology to sepsis

To demonstrate the utility of DeD in safety-critical domains and to honor the underlying healthcare motivations behind its development, we applied DeD to publicly available real-world medical data. Specifically, our data pertained to critically ill patients who had developed sepsis and were treated in an ICU.

Sepsis is a syndrome characterized by organ dysfunction due to a patient’s dysregulated response to an infection. In the United States alone, sepsis is responsible for more than 200,000 deaths each year, contributing to over 10 percent of in-hospital mortality, and accounting for over $23 billion in hospitalization costs. Globally, sepsis is a leading cause of mortality, with an estimated 11 million deaths each year, accounting for almost 20 percent of all deaths. It’s also an end-stage to many health conditions. In a recent retrospective study of hospitalized COVID-19 patients, all the fatal cases and more than 40 percent of survivors were septic.

In our study, we envisioned a way to help clinicians identify which subset of treatments could statistically cause further health deterioration so that they could eliminate those treatments when deciding on the next steps. To estimate the value functions of possible treatments, we used the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) dataset (v1.4), sourced from the Beth Israel Deaconess Medical Center in Boston, Massachusetts. MIMIC-III comprises deidentified electronic health records (EHR) of consenting patients admitted to critical care units, collected from 53,423 distinct hospital admissions between 2001 and 2012. Following standard extraction and preprocessing methods, we derived an experimental cohort of 19,611 patients who are presumed to have developed sepsis during their initial admission to the ICU, with an observed mortality rate of approximately 10 percent. We studied 72 hours of each patient’s stay in the ICU—24 hours before the presumed onset of sepsis and 48 hours afterwards. We used 44 observation variables, including various health records and demographic information, and 25 distinct treatment options (five discrete levels for IV fluid and vasopressor volumes in combination), aggregated over four-hour windows.

With this dataset, we sought to demonstrate that medical dead-ends exist in medical data and show the effect of treatment selection on the development of medical dead-ends. We also sought to identify whether alternative treatments were available that could have prevented the occurrence of a medical dead-end.

To flag potentially nonsecure treatments, we examined whether the estimated values, \(Q_{D}(s, a)\) and \(Q_{R}(s, a)\), for each treatment passed certain thresholds. To flag potential medical dead-end states, we looked at the median values of available treatments against these same thresholds. Using the median helped mitigate approximation errors due to generalization from potentially insufficient data and extrapolations made by the reinforcement learning formulation. With the specified thresholds, DeD identified increasing percentages of patients raising fatal flags, particularly among the subpopulation that died in the hospital. In Figure 4, note the distinctive difference between the trend of estimated values for surviving and non-surviving patients. Over the course of 72 hours in the ICU, surviving patients rarely raised a flag, while flags were raised at an increasing rate for patients who did not survive as they proceeded toward the final observations of their time in the ICU.
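As an illustration, the flagging rule just described can be sketched as follows. The threshold values here are hypothetical placeholders; the paper specifies the settings actually used.

```python
import numpy as np

def flag_treatments(q_d, q_r, delta_d=-0.15, delta_r=0.85):
    """Flag potentially nonsecure treatments whose estimated values cross the thresholds.
    `q_d` and `q_r` hold the estimated Q_D(s, a) and Q_R(s, a) for all treatment options
    at the current state; the thresholds are hypothetical placeholders."""
    q_d, q_r = np.asarray(q_d), np.asarray(q_r)
    return np.flatnonzero((q_d < delta_d) | (q_r < delta_r))

def flag_state(q_d, q_r, delta_d=-0.15, delta_r=0.85):
    """Flag a potential medical dead-end state by comparing the median value of the
    available treatments against the same thresholds; the median mitigates
    approximation errors from limited data."""
    return (np.median(q_d) < delta_d) or (np.median(q_r) < delta_r)

# Example usage with made-up value estimates for five treatments:
print(flag_treatments([-0.1, -0.3, -0.05, -0.2, -0.5], [0.9, 0.7, 0.95, 0.8, 0.6]))
print(flag_state([-0.1, -0.3, -0.05, -0.2, -0.5], [0.9, 0.7, 0.95, 0.8, 0.6]))
```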

Figure 4: Histograms of the flag status for both surviving and non-surviving patients, according to the rescue state and medical dead-end values. The bars are plotted according to the time prior to the recorded terminal state and measure the percentage of patients whose states did not raise any flags. There is a clear worsening trend for non-surviving patients as they approached a terminal state, beginning as early as 48 hours prior to expiration.

To further support our hypothesis that medical dead-ends exist among septic patients and may be preventable, we aligned patients according to the point in their care when a flag was first raised by our DeD framework. As shown in Figure 5, we selected all trajectories with at least 24 hours prior to and 16 hours after this flag. The DeD estimates of the \(V\) and \(Q\) values for administered treatments had similar behavior in both the surviving and non-surviving subpopulations prior to this first flag, but the values quickly diverged afterwards. We observed that the advent of this first flag also corresponded to a similar divergence among various clinical measures and vital signs, shown in Figure 5, sections a and b.

DeD identified a clear critical point in these patients’ care, where non-surviving patients experienced an irreversible negative change to their health, as shown in Figure 5, section c. Additionally, there was a significant gap in the estimated value between the treatments administered to the non-surviving patients and those treatments deemed to be more secure by DeD, shown in Figure 5, section e. There was a clear inflection in the estimated values four to eight hours before this first flag was raised, shown in Figure 5, section c.

Figure 5: Trend of measures around the first raised flag: Various measures are shown 24 hours (6 steps, 4 hours each) before the first flag is raised and 16 hours (4 steps) afterwards for non-surviving (blue) and surviving (green) patients. The shaded areas represent the standard deviation. Section a shows selected key vital measures and lab tests, section b shows established clinical measures, and section c shows DeD value estimates of health state (V) and administered treatment (Q). Section d shows the administered treatments. Finally, the last column, e, illustrates value trends for the selected treatments as well as the most secure ones.

Further analysis of our results, which we describe in detail in our paper, indicates that more than 12 percent of treatments given to non-surviving patients could be detrimental 24 hours before death. We also identified that 2.7 percent of non-surviving patients entered medical dead-end trajectories, with a sharply increasing rate up to 48 hours before death, and close to 10 percent when we slightly relaxed our thresholds for predicting medical dead-ends. While these percentages may seem small, more than 200,000 patients die of sepsis every year in US hospitals alone, and any reduction of this rate could translate into tens of thousands of individuals who might not otherwise survive. We’re excited about the possibility that DeD could help clinicians provide their patients with the best care and that many more patients could potentially survive sepsis.

Looking ahead: Further uses of DeD and offline reinforcement learning

We view DeD as a powerful tool that could magnify human expertise in healthcare by supporting clinicians with predictive models as they make critical decisions. There is significant potential for researchers to use the DeD method to expand on this research and look at other measures, such as the relationship between patient demographics and sepsis treatment, with the goal of preventing certain treatment profiles for particular subgroups of patients.

The principles of offline reinforcement learning and the DeD methodology can also be applied to other clinical conditions, as well as to safety-critical areas beyond healthcare that also rely on sequential decision-making. For example, the domain of finance entails similar core concepts as it is analogously based on sequential decision-making processes. DeD could be used to alert financial professionals when specific actions, such as buying or selling certain assets, are likely to result in unavoidable future loss, or a financial dead-end. We hope our work will inspire active research and discussion in the community. You can learn more about the research and access the code here.

Disclaimer: The research presented here, including the referenced paper, code, and models, is shared for research purposes only. It is not to be used in clinical settings, as a stand-alone tool, or as a replacement for the decisions of expert medical professionals. The algorithm and technology presented here, and any derivatives of it, should not be used to make clinical decisions, including, but not limited to, decisions about the medical treatment of patients. In addition, further testing and validation are required before the DeD framework may be used in any clinical setting, including, but not limited to, understanding how the information provided by the DeD framework affects clinician care and patient outcomes over time, neither of which have been studied here.




Advancing AI trustworthiness: Updates on responsible AI research

blue graphic with a light honeycomb pattern background featuring a lightbulb in the middle and various icons around it: handshake, eye, connected people, balanced scale, lock, and shield

Editor’s note: This year in review is a sampling of responsible AI research compiled by Aether, a Microsoft cross-company initiative on AI Ethics and Effects in Engineering and Research, as outreach from their commitment to advancing the practice of human-centered responsible AI. Although each paper includes authors who are participants in Aether, the research presented here expands beyond, encompassing work from across Microsoft, as well as with collaborators in academia and industry. 

Chief Scientific Officer Eric Horvitz: Efforts to make AI systems worthy of trust are a critical part of building valuable AI applications

Inflated expectations around the capabilities of AI technologies may lead people to believe that computers can’t be wrong. The truth is AI failures are not a matter of if but when. AI is a human endeavor that combines information about people and the physical world into mathematical constructs. Such technologies typically rely on statistical methods, with the possibility for errors throughout an AI system’s lifespan. As AI systems become more widely used across domains, especially in high-stakes scenarios where people’s safety and wellbeing can be affected, a critical question must be addressed: how trustworthy are AI systems, and how much and when should people trust AI? 

As part of their ongoing commitment to building AI responsibly, research scientists and engineers at Microsoft are pursuing methods and technologies aimed at helping builders of AI systems cultivate appropriate trust—that is, building trustworthy models with reliable behaviors and clear communication that set proper expectations. When AI builders plan for failures, work to understand the nature of the failures, and implement ways to effectively mitigate potential harms, they help engender trust that can lead to a greater realization of AI’s benefits. 

Pursuing trustworthiness across AI systems captures the intent of multiple projects on the responsible development and fielding of AI technologies. Numerous efforts at Microsoft have been nurtured by its Aether Committee, a coordinative cross-company council composed of working groups focused on technical leadership at the frontiers of innovation in responsible AI. The effort is led by researchers and engineers at Microsoft Research and from across the company and is chaired by Chief Scientific Officer Eric Horvitz. Beyond research, Aether has advised Microsoft leadership on responsible AI challenges and opportunities since the committee’s inception in 2016.


  • Explore the HAX Toolkit: The Human-AI eXperience (HAX) Toolkit helps builders of AI systems create fluid, responsible human-AI experiences.

  • Explore the Responsible AI Toolbox: Customizable dashboards that help builders of AI systems identify, diagnose, and mitigate model errors, as well as debug models and understand causal relationships in data.

The following is a sampling of research from the past year representing efforts across the Microsoft responsible AI ecosystem that highlight ways for creating appropriate trust in AI. Facilitating trustworthy measurement, improving human-AI collaboration, designing for natural language processing (NLP), advancing transparency and interpretability, and exploring the open questions around AI safety, security, and privacy are key considerations for developing AI responsibly. The goal of trustworthy AI requires a shift in perspective at every stage of the AI development and deployment life cycle. We’re actively developing a growing number of best practices and tools to help with the shift to make responsible AI more available to a broader base of users. Many open questions remain, but as innovators, we are committed to tackling these challenges with curiosity, enthusiasm, and humility. 

Facilitating trustworthy measurement

Emre Kiciman, co-chair of the Aether Security working group: Ensuring our measurements capture what we think they’re capturing

AI technologies influence the world through the connection of machine learning models—that provide classifications, diagnoses, predictions, and recommendations—with larger systems that drive displays, guide controls, and activate effectors. But when we use AI to help us understand patterns in human behavior and complex societal phenomena, we need to be vigilant. By creating models for assessing or measuring human behavior, we’re participating in the very act of shaping society. Guidelines for ethically navigating technology’s impacts on society—guidance born out of considering technologies for COVID-19—prompt us to start by weighing a project’s risk of harm against its benefits. Sometimes an important step in the practice of responsible AI may be the decision to not build a particular model or application. 

Human behavior and algorithms influence each other in feedback loops. In a recent Nature publication, Microsoft researchers and collaborators emphasize that existing methods for measuring social phenomena may not be up to the task of investigating societies where human behavior and algorithms affect each other. They offer five best practices for advancing computational social science. These include developing measurement models that are informed by social theory and that are fair, transparent, interpretable, and privacy preserving. For trustworthy measurement, it’s crucial to document and justify the model’s underlying assumptions, plus consider who is deciding what to measure and how those results will be used.

5 Best practices for measuring algorithmically infused societies
Source: Adapted from Nature

In line with these best practices, Microsoft researchers and collaborators have proposed measurement modeling as a framework for anticipating and mitigating fairness-related harms caused by AI systems. This framework can help identify mismatches between theoretical understandings of abstract concepts—for example, socioeconomic status—and how these concepts get translated into mathematics and code. Identifying mismatches helps AI practitioners to anticipate and mitigate fairness-related harms that reinforce societal biases and inequities. A study applying a measurement modeling lens to several benchmark datasets for surfacing stereotypes in NLP systems reveals considerable ambiguity and hidden assumptions, demonstrating (among other things) that datasets widely trusted for measuring the presence of stereotyping can, in fact, cause stereotyping harms.

Flaws in datasets can lead to AI systems with unfair outcomes, such as poor quality of service or denial of opportunities and resources for different groups of people. AI practitioners need to understand how their systems are performing for factors like age, race, gender, and socioeconomic status so they can mitigate potential harms. In identifying the decisions that AI practitioners must make when evaluating an AI system’s performance for different groups of people, researchers highlight the importance of rigor in the construction of evaluation datasets. 

Making sure that datasets are representative and inclusive means facilitating data collection from different groups of people, including people with disabilities. Mainstream AI systems are often non-inclusive. For example, speech recognition systems do not work for atypical speech, while input devices are not accessible for people with limited mobility. In pursuit of inclusive AI, a study proposes guidelines for designing an accessible online infrastructure for collecting data from people with disabilities, one that is built to respect, protect, and motivate those contributing data. 


Improving human-AI collaboration

Ece Kamar, Aether technical advisor and co-chair of the Aether Reliability and Safety working group: Investing in research and new techniques for effective human-AI partnership

When people and AI collaborate on solving problems, the benefits can be impressive. But current practice can be far from establishing a successful partnership between people and AI systems. A promising advance and direction of research is developing methods that learn about ideal ways to complement people with problem solving. In the approach, machine learning models are optimized to detect where people need the most help versus where people can solve problems well on their own. We can additionally train the AI systems to make decisions as to when a system should ask an individual for input and to combine the human and machine abilities to make a recommendation. In related work, studies have shown that people will too often accept an AI system’s outputs without question, relying on them even when they are wrong. Exploring how to facilitate appropriate trust in human-AI teamwork, experiments with real-world datasets for AI systems show that retraining a model with a human-centered approach can better optimize human-AI team performance. This means taking into account human accuracy, human effort, the cost of mistakes—and people’s mental models of the AI. 

In systems for healthcare and other high-stakes scenarios, a break with the user’s mental model can have severe impacts. An AI system can compromise trust when, after an update for better overall accuracy, it begins to underperform in some areas. For instance, an updated system for predicting cancerous skin moles may have an increase in accuracy overall but a significant decrease for facial moles. A physician using the system may either lose confidence in the benefits of the technology or, with more dire consequences, may not notice this drop in performance. Techniques for forcing an updated system to be compatible with a previous version produce tradeoffs in accuracy. But experiments demonstrate that personalizing objective functions can improve the performance-compatibility tradeoff for specific users by as much as 300 percent.

System updates can have grave consequences when it comes to algorithms used for prescribing recourse, such as how to fix a bad credit score to qualify for a loan. Updates can lead to people who have dutifully followed a prescribed recourse being denied their promised rights or services and damaging their trust in decision makers. Examining the impact of updates caused by changes in the data distribution, researchers expose previously unknown flaws in the current recourse-generation paradigm. This work points toward rethinking how to design these algorithms for robustness and reliability. 

Complementarity in human-AI performance, where the human-AI team performs better together by compensating for each other’s weaknesses, is a goal for AI-assisted tasks. You might think that if a system provided an explanation of its output, this could help an individual identify and correct an AI failure, producing the best of human-AI teamwork. Surprisingly, and in contrast to prior work, a large-scale study shows that explanations may not significantly increase human-AI team performance. People often over-rely on recommendations even when the AI is incorrect. This is a call to action: we need to develop methods for communicating explanations that increase users’ understanding rather than to just persuade. 


Designing for natural language processing 

Hanna Wallach, Aether technical advisor and co-chair of the Aether Fairness and Inclusiveness working group: Developing natural language processing models in a responsible manner

The allure of natural language processing’s potential, including rash claims of human parity, raises questions of how we can employ NLP technologies in ways that are truly useful, as well as fair and inclusive. To further these and other goals, Microsoft researchers and collaborators hosted the first workshop on bridging human-computer interaction and natural language processing, considering novel questions and research directions for designing NLP systems to align with people’s demonstrated needs. 

Language shapes minds and societies. Technology that wields this power requires scrutiny as to what harms may ensue. For example, does an NLP system exacerbate stereotyping? Does it exhibit the same quality of service for people who speak the same language in different ways? A survey of 146 papers analyzing “bias” in NLP observes rampant pitfalls of unstated assumptions and conceptualizations of bias. To avoid these pitfalls, the authors outline recommendations based on the recognition of relationships between language and social hierarchies as fundamentals for fairness in the context of NLP. We must be precise in how we articulate ideas about fairness if we are to identify, measure, and mitigate NLP systems’ potential for fairness-related harms. 

The open-ended nature of language—its inherent ambiguity, context-dependent meaning, and constant evolution—drives home the need to plan for failures when developing NLP systems. Planning for NLP failures with the AI Playbook introduces a new tool for AI practitioners to anticipate errors and plan human-AI interaction so that the user experience is not severely disrupted when errors inevitably occur. 


Improving transparency

Jenn Wortman Vaughan, co-chair of the Aether Transparency working group: Providing stakeholders with an appropriate understanding of how AI systems work

To build AI systems that are reliable and fair—and to assess how much to trust them—practitioners and those using these systems need insight into their behavior. If we are to meet the goal of AI transparency, the AI/ML and human-computer interaction communities need to integrate efforts to create human-centered interpretability methods that yield explanations that can be clearly understood and are actionable by people using AI systems in real-world scenarios. 

As a case in point, experiments investigating whether simple models that are thought to be interpretable achieve their intended effects rendered counterintuitive findings. When participants used an ML model considered to be interpretable to help them predict the selling prices of New York City apartments, they had difficulty detecting when the model was demonstrably wrong. Providing too many details of the model’s internals seemed to distract and cause information overload. Another recent study found that even when an explanation helps data scientists gain a more nuanced understanding of a model, they may be unwilling to make the effort to understand it if it slows down their workflow too much. As both studies show, testing with users is essential to see if people clearly understand and can use a model’s explanations to their benefit. User research is the only way to validate what is or is not interpretable by people using these systems.

Explanations that are meaningful to people using AI systems are key to the transparency and interpretability of black-box models. Introducing a weight-of-evidence approach to creating machine-generated explanations that are meaningful to people, Microsoft researchers and colleagues highlight the importance of designing explanations with people’s needs in mind and evaluating how people use interpretability tools and what their understanding is of the underlying concepts. The paper also underscores the need to provide well-designed tutorials.

Traceability and communication are also fundamental for demonstrating trustworthiness. Both AI practitioners and people using AI systems benefit from knowing the motivation and composition of datasets. Tools such as datasheets for datasets prompt AI dataset creators to carefully reflect on the process of creation, including any underlying assumptions they are making and potential risks or harms that might arise from the dataset’s use. And for dataset consumers, seeing the dataset creators’ documentation of goals and assumptions equips them to decide whether a dataset is suitable for the task they have in mind.


Advancing algorithms for interpretability

Rich Caruana, co-chair of the Aether Transparency working group: Demonstrating how interpretability shows how much trust to put in your AI models

Interpretability is vital to debugging and mitigating the potentially harmful impacts of AI processes that so often take place in seemingly impenetrable black boxes—it is difficult (and in many settings, inappropriate) to trust an AI model if you can’t understand the model and correct it when it is wrong. Advanced glass-box learning algorithms can enable AI practitioners and stakeholders to see what’s “under the hood” and better understand the behavior of AI systems. And advanced user interfaces can make it easier for people using AI systems to understand these models and then edit the models when they find mistakes or bias in them. Interpretability is also important to improve human-AI collaboration—it is difficult for users to interact and collaborate with an AI model or system if they can’t understand it. At Microsoft, we have developed glass-box learning methods that are now as accurate as previous black-box methods but yield AI models that are fully interpretable and editable. 


  • GAM Changer demo (video): Editing GAMs with interactive visualization. Machine learning interpretability techniques reveal that many accurate models learn some problematic and dangerous patterns from the training data. GAM Changer helps address these issues.

Recent advances at Microsoft include a new neural GAM (generalized additive model) for interpretable deep learning, a method for using dropout rates to reduce spurious interaction, an efficient algorithm for recovering identifiable additive models, the development of glass-box models that are differentially private, and the creation of tools that make editing glass-box models easy for those using them so they can correct errors in the models and mitigate bias. 
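As one concrete example of working with a glass-box model, the sketch below fits an explainable boosting machine with the open-source InterpretML package and inspects its learned shape functions. The dataset is synthetic and purely illustrative, and the snippet assumes the `interpret` package is installed; it is not the exact tooling described above.

```python
# A minimal, illustrative sketch using the open-source InterpretML package
# (pip install interpret); the dataset below is synthetic.
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                          # four made-up features
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.25).astype(int)   # synthetic labels

ebm = ExplainableBoostingClassifier(feature_names=["f0", "f1", "f2", "f3"])
ebm.fit(X, y)

# Every prediction decomposes into per-feature contributions, so the learned shape
# functions can be inspected (and, with tools like GAM Changer, edited).
show(ebm.explain_global())
```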


Exploring open questions for safety, security, and privacy in AI

Ben Zorn, co-chair of the Aether Reliability and Safety working group: Considering AI’s significant new challenges to reliability, security, and privacy

When considering how to shape appropriate trust in AI systems, there are many open questions about safety, security, and privacy. How do we stay a step ahead of attackers intent on subverting an AI system or harvesting its proprietary information? How can we avoid a system’s potential for inferring spurious correlations? 

With autonomous systems, it is important to acknowledge that no system operating in the real world will ever be complete. It’s impossible to train a system for the many unknowns of the real world. Unintended outcomes can range from annoying to dangerous. For example, a self-driving car may splash pedestrians on a rainy day or erratically swerve to localize itself for lane-keeping. An overview of emerging research in avoiding negative side effects due to AI systems’ incomplete knowledge points to the importance of giving users the means to avoid or mitigate the undesired effects of an AI system’s outputs as essential to how the technology will be viewed or used. 

When dealing with data about people and our physical world, privacy considerations take a vast leap in complexity. For example, it’s possible for a malicious actor to isolate and re-identify individuals from information in large, anonymized datasets or from their interactions with online apps when using personal devices. Developments in privacy-preserving techniques face challenges in usability and adoption because of the deeply theoretical nature of concepts like homomorphic encryption, secure multiparty computation, and differential privacy. Exploring the design and governance challenges of privacy-preserving computation, interviews with builders of AI systems, policymakers, and industry leaders reveal confidence that the technology is useful, but the challenge is to bridge the gap from theory to practice in real-world applications. Engaging the human-computer interaction community will be a critical component.


A call to personal action

AI is not an end-all, be-all solution; it’s a powerful, albeit fallible, set of technologies. The challenge is to maximize the benefits of AI while anticipating and minimizing potential harms.

Admittedly, the goal of appropriate trust is challenging. Developing measurement tools for assessing a world in which algorithms are shaping our behaviors, exposing how systems arrive at decisions, planning for AI failures, and engaging the people on the receiving end of AI systems are important pieces. But what we do know is change can happen today with each one of us as we pause and reflect on our work, asking: what could go wrong, and what can I do to prevent it? 




DeepSpeed: Advancing MoE inference and training to power next-generation AI scale

DeepSpeed shares findings and innovations for MoE models and systems that 1) reduce training cost by 5x, 2) reduce MoE parameter size by up to 3.7x and 3) reduce MoE inference latency by 7.3x at an unprecedented scale and offer up to 4.5x faster and 9x cheaper inference for MoE models compared to quality-equivalent dense models.

In the last three years, the largest trained dense models have increased in size by over 1,000 times, from a few hundred million parameters to over 500 billion parameters in Megatron-Turing NLG 530B (MT-NLG). Improvements in model quality with size suggest that this trend will continue, with larger model sizes bringing better model quality. However, sustaining the growth in model size is getting more difficult due to the increasing compute requirements.

There have been numerous efforts to reduce compute requirements to train large models without sacrificing model quality. To this end, architectures based on Mixture of Experts (MoE) have paved a promising path, enabling sub-linear compute requirements with respect to model parameters and allowing for improved model quality without increasing training cost.

However, MoE models have their own challenges. First, the scope of MoE models has primarily been limited to encoder-decoder models and sequence-to-sequence tasks. Second, MoE models require more parameters to achieve the same model quality as their dense counterparts, which requires more memory for training and inference even though MoE models require less compute. Lastly, a critical consideration is that MoE models’ large size makes inference difficult and costly.

To address these challenges, the DeepSpeed team, as part of Microsoft’s AI at Scale initiative, has been exploring new applications and optimizations for MoE models at scale. These can lower the training and inference cost of large models, while also enabling the ability to train and serve the next generation of models affordably on today’s hardware. Here, we are happy to share our findings and innovations for MoE models and systems that 1) reduce training cost by 5x, 2) reduce MoE parameter size by up to 3.7x, and 3) reduce MoE inference latency by 7.3x at an unprecedented scale and offer up to 4.5x faster and 9x cheaper inference for MoE models compared to quality-equivalent dense models:

  1. 5x reduction in training cost for natural language generation (NLG) models: We extend the scope of MoE models to beyond just encoder-decoder models and sequence-to-sequence tasks, demonstrating that MoE can reduce the training cost of NLG models like those in the GPT family or MT-NLG by 5x while obtaining the same model quality. Data scientists can now train models of superior quality previously only possible with 5x more hardware resources.
  2. Reduced model size and improved parameter efficiency with Pyramid-Residual-MoE (PR-MoE) Architecture and Mixture-of-Students (MoS): The training cost reduction of MoE is not free and comes at the expense of increasing the total number of parameters required to achieve the same model quality as dense models. PR-MoE is a hybrid dense and MoE model created using residual connections, applying experts only where they are most effective. PR-MoE reduces MoE model parameter size by up to 3x with no change to model quality. In addition, we leverage staged knowledge distillation to learn a Mixture-of-Students model that further leads to up to 3.7x model size reduction while retaining similar model quality.
  3. Fast and economical MoE inference at unprecedented scale: The DeepSpeed-MoE (DS-MoE) inference system enables efficient scaling of inference workloads on hundreds of GPUs, providing up to 7.3x reduction in inference latency and cost when compared with existing systems. It offers ultra-fast inference latencies (25 ms) for trillion-parameter MoE models. DS-MoE also offers up to 4.5x faster and 9x cheaper inference for MoE models compared to quality-equivalent dense models by combining both system and model optimizations.

Each of these advances is explored further in the blog post below. For more about the technical details, please read our paper.

DeepSpeed-MoE for NLG: Reducing the training cost of language models by five times

While recent works like GShard and Switch Transformers have shown that the MoE model structure can reduce large model pretraining cost for encoder-decoder model architecture, their impact on the much more compute-intensive transformer-based autoregressive NLG models has been mostly unknown.

Given the tremendous compute and energy requirements for training NLG models, we explore opportunities where MoE can reduce their training cost. We show that MoE can be applied to NLG models to significantly improve their model quality at the same training cost. Moreover, MoE can achieve the same model quality as a dense NLG model with a 5x reduction in training cost. For example, we achieved the quality of a 6.7B-parameter dense NLG model at the cost of training a 1.3B-parameter dense model. Our observation about MoE training cost savings aligns with parallel explorations from Du et al. and Artetxe et al., where they also demonstrated the savings for models with bigger sizes.

Our MoE-based NLG model architecture

To create an MoE-based NLG model, we studied a transformer-based NLG model similar to those of the GPT family. To complete training in a reasonable timeframe, the following models were selected: 350M (24 layers, 1024 hidden size, 16 attention heads), 1.3B (24 layers, 2048 hidden size, 16 attention heads), and 6.7B (32 layers, 4096 hidden size, 32 attention heads). We use “350M+MoE-128” to denote an MoE model that uses a 350M dense model as the base model and adds 128 experts on every other feedforward layer.
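To make the “350M+MoE-128” notation concrete, the sketch below shows a schematic top-1 gated expert feedforward layer of the kind that replaces every other dense feedforward layer in such a model. It is an illustrative stand-in only, not DeepSpeed’s optimized implementation, which adds expert parallelism, capacity limits, and load-balancing terms.

```python
import torch
import torch.nn as nn

class Top1MoEFeedForward(nn.Module):
    """Schematic top-1 gated mixture-of-experts feedforward layer (illustrative only,
    not DeepSpeed's optimized implementation)."""

    def __init__(self, hidden_size=1024, ffn_size=4096, num_experts=128):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.GELU(),
                          nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (tokens, hidden_size)
        scores = self.gate(x).softmax(dim=-1)    # routing probabilities per token
        top_prob, top_idx = scores.max(dim=-1)   # each token is routed to a single expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Example usage on a batch of 8 token embeddings:
layer = Top1MoEFeedForward(hidden_size=1024, ffn_size=4096, num_experts=128)
print(layer(torch.randn(8, 1024)).shape)
```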

MoE training infrastructure and dataset

We pretrained both the dense and MoE versions of the above models using DeepSpeed on 128 NVIDIA Ampere A100 GPUs (Azure ND A100 instances). These Azure instances are powered by the latest Azure HPC docker images that provide a fully optimized environment and best performing library versions of NCCL, Mellanox OFED, Sharp, and CUDA. DeepSpeed uses a combination of data-parallel and expert-parallel training to effectively scale MoE model training and is capable of training MoE models with trillions of parameters on hundreds of GPUs.

We used the same training data as described in the MT-NLG blog post. For a fair comparison, we use 300 billion tokens to train both dense and MoE models.

MoE leads to better quality for NLG models

Figure 1 shows that the validation loss for the MoE versions of the models is significantly better than that of their dense counterparts. Furthermore, the validation loss of the 350M+MoE-128 model is on par with the validation loss of the 1.3B dense model, which has a 4x larger base. The same is true for 1.3B+MoE-128 in comparison with the 6.7B dense model, which has a 5x larger base. Furthermore, the model quality is on par not only in terms of validation loss but also on the six zero-shot evaluation tasks shown in Table 1, demonstrating that these models have very similar model quality.

| Case | Model size | LAMBADA: completion prediction | PIQA: commonsense reasoning | BoolQ: reading comprehension | RACE-h: reading comprehension | TriviaQA: question answering | WebQs: question answering |
|---|---|---|---|---|---|---|---|
| Dense NLG: | | | | | | | |
| (1) 350M | 350M | 0.5203 | 0.6931 | 0.5364 | 0.3177 | 0.0321 | 0.0157 |
| (2) 1.3B | 1.3B | 0.6365 | 0.7339 | 0.6339 | 0.3560 | 0.1005 | 0.0325 |
| (3) 6.7B | 6.7B | 0.7194 | 0.7671 | 0.6703 | 0.3742 | 0.2347 | 0.0512 |
| Standard MoE NLG: | | | | | | | |
| (4) 350M+MoE-128 | 13B | 0.6270 | 0.7459 | 0.6046 | 0.3560 | 0.1658 | 0.0517 |
| (5) 1.3B+MoE-128 | 52B | 0.6984 | 0.7671 | 0.6492 | 0.3809 | 0.3129 | 0.0719 |
| PR-MoE NLG: | | | | | | | |
| (6) 350M+PR-MoE-32/64 | 4B | 0.6365 | 0.7399 | 0.5988 | 0.3569 | 0.1630 | 0.0473 |
| (7) 1.3B+PR-MoE-64/128 | 31B | 0.7060 | 0.7775 | 0.6716 | 0.3809 | 0.2886 | 0.0773 |
| PR-MoE NLG + MoS: | | | | | | | |
| (8) 350M+PR-MoE-32/64 + MoS-21L | 3.5B | 0.6346 | 0.7334 | 0.5807 | 0.3483 | 0.1369 | 0.0522 |
| (9) 1.3B+PR-MoE-64/128 + MoS-21L | 27B | 0.7017 | 0.7769 | 0.6566 | 0.3694 | 0.2905 | 0.0822 |
Table 1: Zero-shot evaluation results (last six columns) for different dense and MoE NLG models. All zero-shot evaluation results use the accuracy metric.
Figure 1: Token-wise validation loss curves for dense and MoE NLG models with different model sizes.

Same quality with 5x less training cost

As shown in the results above, adding MoE with 128 experts to the NLG model significantly improves its quality. However, these experts do not change the compute requirements of the model as each token is only processed by a single expert. Therefore, the compute requirements for a dense model and its corresponding MoE models with the same base are similar.

More concretely, training 1.3B+MoE-128 requires roughly the same amount of compute operations as a 1.3B dense model while offering much better quality. Our results show that by applying MoE, the model quality of a 6.7B-parameter dense model can be achieved at the training cost of a 1.3B-parameter dense model, resulting in an effective training compute reduction of 5x.

This compute cost reduction can directly be translated into throughput gain, training time and training cost reduction by leveraging the efficient DeepSpeed MoE training system. Table 2 shows the training throughput of 1.3B+MoE-128 compared with the 6.7B dense model on 128 NVIDIA A100 GPUs.

| Model | Training samples per sec | Throughput gain / cost reduction |
|---|---|---|
| 6.7B dense | 70 | 1x |
| 1.3B+MoE-128 | 372 | 5x |
Table 2: Training throughput (on 128 A100 GPUs) of an MoE-based model versus a dense model, where both achieve the same model quality.

PR-MoE and Mixture-of-Students: Reducing the model size and improving parameter efficiency

While MoE-based models achieve the same quality with 5x training cost reduction in the NLG example, the resulting model has roughly 8x the parameters of the corresponding dense model. For example, a 6.7B dense model has 6.7 billion parameters and 1.3B+MoE-128 has 52 billion parameters. Training such a massive MoE model requires significantly more memory; inference latency and cost could also increase since the primary inference bottleneck is often the memory bandwidth needed to read model weights.

To reduce model size and improve parameter efficiency, we’ve made innovations in the MoE model architecture that reduce the overall model size by up to 3 times without affecting model quality. We also leverage knowledge distillation to learn a Mixture-of-Students (MoS) model that has a smaller model capacity than the teacher PR-MoE while preserving the teacher model’s accuracy.

Two intuitions for improving MoE architecture

Intuition-I: The standard MoE architecture has the same number and structure of experts in all MoE layers. This relates to a fundamental question in the deep learning community, one that has been well studied in computer vision: do all the layers in a deep neural network learn the same representation? Shallow layers learn general representations and deep layers learn more objective-specific representations. This is also why transfer learning in computer vision often freezes shallow layers during fine-tuning. This phenomenon, however, has not been well explored in natural language processing (NLP), particularly for MoE.

To investigate this question, we compare the performance of two different half-MoE architectures. More specifically, we put MoE layers in the first half of the model and leave the second half’s layers identical to the dense model. We then switch the MoE layers to the second half and use dense layers in the first half. The results show that deeper layers benefit more from a large number of experts. This confirms that not all MoE layers learn the same level of representations.

Intuition-II: There are two common methods to improve the generalization performance of MoE models: 1) increasing the number of experts while keeping the capacity (that is, the number of experts each token goes through) the same; or 2) doubling the capacity at the expense of slightly more computation (33%) while keeping the same number of experts. However, method 1 increases the memory required for training due to the larger number of experts, and method 2 doubles the communication volume, which can significantly slow down training and inference. Is there a way to keep the training and inference efficiency while still gaining generalization performance?

One intuition for why larger capacity helps accuracy is that the extra experts can help correct the “representation” of the first expert. However, does this first expert need to change for every token? Or can we fix the first expert and only assign different extra experts to different tokens?

To investigate this, we compare two designs: doubling the capacity versus fixing the first expert and varying only the second expert across tokens. In the latter, a token always passes through a dense multilayer perceptron (MLP) module plus one expert from the MoE module, so we get the benefit of two experts per layer while still requiring only one all-to-all communication. We find that the generalization performance of the two designs is on par, while the training and inference speed of our new design is faster.

New MoE Architecture: Pyramid-Residual MoE

We propose a novel MoE architecture, Pyramid-Residual MoE (PR-MoE). Figure 2 (right) shows its architecture. Following Intuition-I, PR-MoE utilizes more experts in the last few layers as compared to previous layers, which gives a reverse pyramid design. Following Intuition II, we propose a Residual-MoE structure, where each token separately passes one fixed MLP layer and one chosen expert. Combining them results in the PR-MoE model, where all standard MoE layers are replaced by the new PR-MoE layer.

Figure 2: Illustration of standard MoE (left) and PR-MoE (right). Two noticeable differences: (1) PR-MoE has more experts in the last few layers, whereas standard MoE has the same number of experts in every MoE layer; (2) in PR-MoE, a token always passes through an MLP module plus one selected expert.
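To make the two ideas concrete, below is a minimal, single-GPU PyTorch sketch of a Residual-MoE layer with top-1 gating, plus an illustrative pyramid of expert counts. The class, its parameter names, and the specific expert counts are hypothetical; the production implementation lives in the DeepSpeed library and differs in detail (for example, it uses expert parallelism rather than a Python loop over experts).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualMoELayer(nn.Module):
    """Illustrative Residual-MoE layer: fixed dense MLP + one gated expert per token."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        # Fixed dense MLP that every token always passes through.
        self.shared_mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        # Pool of experts; each token is routed to exactly one of them (top-1 gating).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        gate_probs = F.softmax(self.gate(x), dim=-1)        # [tokens, num_experts]
        expert_weight, expert_id = gate_probs.max(dim=-1)   # top-1 expert per token
        expert_out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):           # loop is for clarity, not speed
            mask = expert_id == e
            if mask.any():
                expert_out[mask] = expert(x[mask])
        # Residual-MoE: dense MLP output plus the gated output of the selected expert.
        return self.shared_mlp(x) + expert_weight.unsqueeze(-1) * expert_out

# Pyramid design: more experts in deeper layers (counts are illustrative, e.g., a "32/64" layout).
experts_per_layer = [32] * 10 + [64] * 2
layers = nn.ModuleList(
    [ResidualMoELayer(d_model=1024, d_ff=4096, num_experts=n) for n in experts_per_layer]
)
```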

Same quality as standard models with up to 3x model size reduction: We evaluate PR-MoE on two model sizes, with bases of 350M and 1.3B parameters, and compare the performance with larger standard MoE architectures. The results are shown in Table 1 above. In both cases, PR-MoE uses far fewer experts but achieves accuracy comparable to the standard MoE models. In the 350M case, PR-MoE uses less than one-third of the parameters of the standard MoE; in the 1.3B case, it uses about 60 percent.

Mixture-of-Students: Distillation for even smaller model size and faster inference

Model compression and distillation present additional opportunities to improve inference performance further. While there are many approaches to model compression, such as quantization and pruning, we focus on reducing the number of layers in each expert of the MoE and using knowledge distillation so that the resulting student model achieves performance similar to the teacher MoE.

Since the MoE structure brings significant benefits by enabling sparse training and inference, our task-agnostic distilled MoE model, which we call Mixture of Students (MoS), inherits these benefits while still providing the flexibility to compress into a dense model. We note that while existing work primarily considers small transformers (a few hundred million parameters) and dense encoder-based LM models (like BERT), we focus on studying knowledge distillation for sparse MoE-based generative language models at the multi-billion parameter scale. Furthermore, given the excellent performance of PR-MoE, we combine PR-MoE with MoS to further reduce the MoE model size.

To apply knowledge distillation to MoE, we first train a teacher MoE model using the same training hyperparameters and datasets as in the previous section. The teacher models are 350M+PR-MoE-32/64 and 1.3B+PR-MoE-64/128, respectively. We reduce the depth of the teacher model to 21 layers (a 12.5% reduction) to obtain the student model, and we train the student to imitate the outputs of the teacher MoE on the training dataset.

In particular, we take the knowledge distillation loss to be a weighted sum of the cross-entropy loss between the predictions and the given hard labels and the Kullback–Leibler (KL) divergence loss between the predictions and the teacher’s soft labels. In practice, we observe that distillation may adversely affect MoS accuracy: while the knowledge distillation loss improves validation accuracy initially, it begins to hurt accuracy toward the end of training.

We hypothesize that because PR-MoE already reduces capacity compared with the standard MoE by exploiting the architecture change (for example, using fewer experts in lower layers), further reducing the depth of the model leaves the student with insufficient capacity, pushing it into the underfitting regime. Therefore, we take a staged distillation approach, gradually decaying the influence of knowledge distillation over the course of training.
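The following is a minimal sketch of such a staged distillation objective: a cross-entropy term on the hard labels plus a KL term against the teacher’s soft labels, with the KL weight decayed as training progresses. The linear decay schedule, the temperature, and the function and argument names are illustrative assumptions, not the exact recipe used in the paper.

```python
import torch
import torch.nn.functional as F

def mos_distillation_loss(student_logits, teacher_logits, labels,
                          step, total_steps,
                          kd_weight_start=1.0, kd_end_fraction=0.5, temperature=1.0):
    """Staged KD loss sketch: CE on hard labels + decayed KL against teacher soft labels."""
    ce = F.cross_entropy(student_logits, labels)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    # Staged schedule (assumed): full KD weight early, linearly decayed to zero
    # by a chosen fraction of training, so KD cannot hurt accuracy late in training.
    progress = min(step / (kd_end_fraction * total_steps), 1.0)
    kd_weight = kd_weight_start * (1.0 - progress)
    return ce + kd_weight * kl
```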

Our study shows that it is possible to reach similar performance, such as in zero-shot evaluation on many downstream tasks, with a smaller MoE model pretrained with knowledge distillation. The MoS models achieve accuracy comparable to their teacher MoE models, retaining 99.3% and 99.1% of the performance despite having 12.5% fewer layers. This enables an additional 12.5% model size reduction and, when combined with PR-MoE, leads to up to a 3.7x model size reduction.

DeepSpeed-MoE inference: Serving MoE models at unprecedented scale and speed

Optimizing MoE inference latency and cost is crucial for MoE models to be useful in practice. During inference, the batch size is generally small, so the inference latency of an MoE model depends primarily on the time it takes to load the model parameters from main memory, in contrast with the conventional belief that less compute should lead to faster inference. Inference performance therefore mainly depends on two factors: the overall model size and the overall achievable memory bandwidth.
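To make the memory-bandwidth argument concrete, here is a rough lower-bound estimate of per-token latency at small batch sizes. The fp16 weights and the approximate A100 memory bandwidth are assumptions for illustration only, not measurements from our system.

```python
# Rough latency model (an illustrative assumption, not DeepSpeed's profiler): at small
# batch sizes, generating one token requires streaming the active parameters through
# the GPU's memory system once.
def est_token_latency_ms(active_params: float,
                         bytes_per_param: float = 2.0,          # fp16 weights (assumed)
                         mem_bandwidth_gb_s: float = 1500.0):   # ~A100-40GB HBM bandwidth (approximate)
    bytes_read = active_params * bytes_per_param
    return bytes_read / (mem_bandwidth_gb_s * 1e9) * 1e3

print(est_token_latency_ms(6.7e9))  # 6.7B dense critical path: ~9 ms lower bound on one GPU
print(est_token_latency_ms(1.3e9))  # 1.3B critical path of 1.3B+MoE-128: ~1.7 ms lower bound
```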

In the previous section, we presented PR-MoE and distillation to optimize the model size. This section presents our solution to maximize the achievable memory bandwidth by creating a multi-GPU MoE inferencing system that can leverage the aggregated memory bandwidth across dozens of distributed GPUs to speed up inference. Together, DeepSpeed offers an unprecedented scale and efficiency to serve massive MoE models with 7.3x better latency and cost compared to baseline MoE systems, and up to 4.5x faster and 9x cheaper MoE inference compared to quality-equivalent dense models.

MoE inference performance is an interesting paradox

From the best-case view, each token of an MoE model only activates a single expert at each MoE layer, resulting in a critical data path that is equivalent to the base model size, orders-of-magnitude smaller than the actual model size. For example, when inferencing with a 1.3B+MoE-128 model, each input token needs just 1.3 billion parameters, even though the overall model size is 52 billion parameters.

From the worst-case view, the aggregate parameters needed to process a group of tokens can be as large as the full model size (in this example, the entire 52 billion parameters), making it challenging to achieve low latency and high throughput.

Design goals for the DS-MoE inference system

The design goal of our optimizations is to steer the performance toward the best-case view. This requires careful orchestration and partitioning of the model to group and route all tokens with the same critical data path together to reduce data access per device and achieve maximum aggregate bandwidth. An overview of how DS-MoE tackles this design goal by embracing multi-dimensional parallelism inherent in MoE models is illustrated in Figure 3.

Figure 3: DS-MoE design that embraces the complexity of multi-dimensional parallelism for different partitions (expert and non-expert) of the model.

The DS-MoE inference system is centered around three well-coordinated optimizations:

The DS-MoE inference system is designed to minimize the critical data path per device and maximize the achievable aggregate memory bandwidth across devices. It achieves this through: 1) expert parallelism and expert-slicing for expert parameters, and 2) data parallelism and tensor-slicing for non-expert parameters.

Expert parallelism and expert-slicing for expert parameters: We partition experts across devices, group all tokens using the same experts under the same critical data path, and parallelize the processing of token groups with different critical paths across devices using expert parallelism.

In the example of 1.3B+MoE-128, when expert parallelism is equal to 128, each GPU only processes a single token group corresponding to the experts on that device. This results in a sequential path that is 1.3 billion parameters per device, 5x smaller than its quality-equivalent dense model with 6.7B parameters. Therefore, in theory, an MoE-based model has the potential to run up to 5x faster than its quality-equivalent dense model using expert parallelism assuming no communication overhead, a topic we discuss in the next section.

In addition, we propose “expert-slicing,” which applies the concept of tensor-slicing to the parameters within an expert. This additional dimension of parallelism is helpful for latency-stringent scenarios where we scale to more devices than the number of experts.

Data parallelism and tensor-slicing for non-expert parameters: Within a node, we use tensor-slicing to partition the non-expert parameters, leveraging the aggregate GPU memory bandwidth of all GPUs to accelerate processing. While it is possible to perform tensor-slicing across nodes, the communication overhead of tensor-slicing along with the reduced compute granularity generally makes inter-node tensor-slicing inefficient. To scale non-expert parameters across multiple nodes, we instead use data parallelism, creating replicas of the non-expert parameters that process different batches across nodes, which incurs no communication overhead or reduction in compute granularity.

Figure 3 above shows an example scenario for distributed MoE inference highlighting different parts of the MoE model, how the model and data are partitioned, and what form of parallelism is used to deal with each piece.

Expert parallelism requires all-to-all communication among all expert-parallel devices. By default, DS-MoE uses NCCL for this communication via the torch.distributed interface, but we observe major overhead when it is used at scale. To optimize this, we develop a custom communication interface that uses Microsoft SCCL and achieves better performance than NCCL. Even with these optimizations, it is difficult to scale expert parallelism to many devices, as the latency increases linearly with the number of devices. To address this critical scaling challenge, we design two new communication optimization strategies that exploit the underlying point-to-point NCCL operations and custom CUDA kernels to perform the necessary data-layout transformations.

Hierarchical All-to-All: We implement a hierarchical all-to-all as a two-step process: a data-layout transformation followed by an intra-node all-to-all, and then a second data-layout transformation followed by an inter-node all-to-all. This reduces the communication hops from O(p) to O(G + p/G), where G is the number of GPUs per node and p is the total number of GPU devices. Figure 4 shows the design overview of this implementation. Despite the 2x increase in communication volume, this hierarchical implementation scales better for small batch sizes because communication at this message size is more latency-bound than bandwidth-bound.

Figure 4: Illustration of the proposed hierarchical all-to-all design
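To make the hop count concrete, here is a small illustrative calculation; the cluster size of 128 GPUs and the node size of 8 GPUs are assumptions for the example, not a statement about any particular evaluation setup.

```python
# Peers contacted per rank: flat all-to-all vs. the two-step hierarchical scheme.
def flat_hops(p: int) -> int:
    return p                 # O(p): every rank exchanges directly with all p ranks

def hierarchical_hops(p: int, G: int) -> int:
    return G + p // G        # O(G + p/G): intra-node step plus inter-node step

p, G = 128, 8                # assumed: 128 GPUs total, 8 GPUs per node
print(flat_hops(p), hierarchical_hops(p, G))   # 128 vs. 24 communication hops per rank
```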

Parallelism Coordinated Communication Optimization: Combining expert parallelism and tensor-slicing with data parallelism within a single model is non-trivial. Tensor-slicing splits individual operators across GPUs and requires all-reduce between them, while expert parallelism places expert operators across GPUs without splitting them and requires all-to-all between them. By design, a naïve approach to handle these communication steps will be inefficient.

Illustration of the parallelism coordinated all-to-all optimization
Figure 5: Illustration of the parallelism coordinated communication

To this end, we propose a novel design, as shown in Figure 5, that performs all-to-all only on a subset of devices that share the same tensor-slicing rank instead of all expert-parallel processes. As a result, the latency of all-to-all can be reduced to O(p/L) instead of O(p) where L is the tensor-slicing parallelism degree. This reduced latency enables us to scale inference to hundreds of GPU devices.
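The sketch below illustrates, under assumed conventions, how such groups could be constructed with torch.distributed: ranks that share the same tensor-slicing rank form one all-to-all group of size p/L, which is what reduces the all-to-all latency from O(p) to O(p/L). The rank-to-group mapping and the helper name are hypothetical; DeepSpeed’s actual process-group management differs in detail.

```python
import torch.distributed as dist

def build_expert_all_to_all_groups(world_size: int, tensor_slicing_degree: int):
    """Sketch: one all-to-all group per tensor-slicing rank (assumes an initialized process group).

    Assumes the common convention that a rank's tensor-slicing rank is rank % L,
    so ranks with the same tensor-slicing rank are strided by L across the world.
    Every process must call this function so that new_group() is invoked collectively.
    """
    groups = []
    for ts_rank in range(tensor_slicing_degree):
        ranks = list(range(ts_rank, world_size, tensor_slicing_degree))
        groups.append(dist.new_group(ranks=ranks))
    return groups

# Example: 128 GPUs with 8-way tensor slicing -> 8 groups of 16 ranks each;
# each expert-parallel all-to-all then involves only 16 peers instead of 128.
```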

The DS-MoE inference system also consists of highly optimized kernels targeting both transformer and MoE-related operations. These kernels aim to maximize bandwidth utilization by fusing operations that work in a producer-consumer fashion. In addition to the computation required for the transformer layers (explained in this blog post), MoE models require the following additional operations:

  1. a gating function that determines the assignment of tokens to experts, where the result is represented as a sparse tensor.
  2. a sparse einsum operator, between the one-hot tensor and all the tokens, which sorts the ordering of the tokens based on the assigned expert ID.
  3. a final einsum that scales and re-sorts the tokens back to their original ordering.

The gating function includes numerous operations to create token masks, select the top-k experts, and perform cumulative sums and sparse matrix multiplications, all of which are not only wasteful due to the sparse tensor representation but also extremely slow due to the many kernel invocations. Moreover, the sparse einsums have a complexity of S x E x M x c (number of tokens S, number of experts E, model dimension M, and expert capacity c, which is typically 1), but E-1 out of every E operations for each token are multiplications and additions with zeros.

We optimize these operators using dense representation and kernel-fusion. First, we fuse the gating function into a single kernel, and use a dense token-to-expert mapping table to represent the assignment from tokens to experts, greatly reducing the kernel launch overhead, as well as memory and compute overhead from the sparse representation.

Second, to optimize the remaining two sparse einsums, we implement them as data-layout transformations using the above-mentioned mapping table: tokens are first sorted by their assigned expert ID and later restored to their original ordering, without requiring any sparse einsum. This reduces the complexity of these operations from S x E x M x c to S x M x c. Combined, these optimizations result in over a 6x reduction in MoE kernel-related latency.
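The sketch below illustrates the idea in plain PyTorch: a dense token-to-expert mapping table plus sort/unsort layout transformations in place of the sparse einsums. The function names are hypothetical, and the real implementation fuses these steps into custom CUDA kernels rather than relying on framework-level ops.

```python
import torch

def route_tokens(tokens: torch.Tensor, gate_logits: torch.Tensor):
    """Top-1 routing with a dense token-to-expert mapping and a sort-based layout transform."""
    # tokens: [S, M], gate_logits: [S, E]
    gate_probs = torch.softmax(gate_logits, dim=-1)
    gate_weight, expert_id = gate_probs.max(dim=-1)   # dense mapping table: expert ID per token, shape [S]
    order = torch.argsort(expert_id)                  # sort tokens by assigned expert ID
    sorted_tokens = tokens[order]                     # O(S*M) gather instead of an S*E*M sparse einsum
    return sorted_tokens, expert_id[order], gate_weight, order

def unroute_tokens(expert_outputs: torch.Tensor, gate_weight: torch.Tensor, order: torch.Tensor):
    """Restore the original token ordering (inverse permutation) and apply the gate scaling."""
    restored = torch.empty_like(expert_outputs)
    restored[order] = expert_outputs                  # scatter back to the original positions
    return restored * gate_weight.unsqueeze(-1)
```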

Low latency and high throughput at unprecedented scale

In modern production environments, powerful DL models are often served using hundreds of GPU devices to meet traffic demands and deliver low latency. Here we demonstrate the performance of the DS-MoE inference system on 256 NVIDIA A100 GPUs with 40 GB of memory each. Table 3 shows the model configurations used for the performance comparisons in this section.

Model          Size (billions)   Number of layers   Hidden size   Model-parallel degree   Expert-parallel degree
2.4B+MoE-128   107.7             16                 3,584         1                       128
8B+MoE-128     349.0             40                 4,096         4                       128
24B+MoE-128    1,046.9           30                 8,192         8                       128
47B+MoE-128    2,024.0           58                 8,192         8                       128
Table 3: The configuration of the different MoE models used for the performance evaluation in Figure 6.

We scale MoE models from 107 billion parameters to 2 trillion parameters. To offer a strong baseline for comparison, we utilize a full-featured distributed PyTorch implementation that is capable of both tensor-slicing and expert-parallelism. Figure 6 shows the results for all these model configurations:

  • DeepSpeed MoE achieves up to 7.3x reduction in latency while achieving up to 7.3x higher throughput compared to the baseline.
  • By effectively exploiting hundreds of GPUs in parallel, DeepSpeed MoE achieves an unprecedented scale for inference at incredibly low latencies: a staggering trillion-parameter MoE model can be inferenced in under 25 ms.
Figure 6: Latency and throughput improvement offered by DeepSpeed-Inference-MoE (Optimized) over PyTorch (Baseline) for different model sizes (107 billion to 2 trillion parameters). We use 128 GPUs for all baseline configurations and 128/256 GPUs for DeepSpeed (256 GPUs for the trillion-scale models). The throughputs shown are per GPU and should be multiplied by the number of GPUs to get the aggregate throughput of the cluster.

By combining the system optimizations offered by the DS-MoE inference system and model innovations of PR-MoE and MoS, DeepSpeed MoE delivers two more benefits:

  1. Reducing the minimum number of GPUs required to perform inference on these models. Figure 7 shows a comparison of three model variants along with the baseline: 1) the standard MoE model (8B+MoE-128), 2) the PR-MoE model, and 3) the PR-MoE+MoS model. The PR-MoE+MoS model performs best, as expected. The key observation is that the PR-MoE and MoS optimizations allow us to use 16 GPUs instead of 32 GPUs to perform this inference.
  2. Further improving both the latency and throughput of various MoE model sizes (as shown in Figure 8).
Figure 7: 2x fewer resources needed for MoE inference when using PR-MoE+MoS; these optimizations allow us to use 16 GPUs instead of 32 GPUs to perform this inference.
Figure 8: Inference latency comparing standard MoE with PR-MoE and PR-MoE+MoS compression for different GPU counts and model sizes. PR-MoE+MoS achieves up to a 10x latency improvement compared to the baseline.

Better inference latency and throughput than quality-equivalent dense models

To better understand the inference performance of MoE models compared to quality-equivalent dense models, it is important to note that although MoE models are 5x faster and cheaper to train, the same may not hold for inference. Inference has different bottlenecks: its performance is primarily determined by the amount of data read from memory rather than by the amount of computation.

We show inference latency and throughput for two MoE models compared to their quality-equivalent dense models: a) 52 billion-parameter MoE (1.3B-MoE-128) model compared to a 6.7 billion-parameter dense model and b) 1.5 trillion-parameter MoE model compared to a 175 billion-parameter dense model in Figures 9 and 10, respectively.

When using PyTorch, MoE model inference is slower and more expensive than inference for its quality-equivalent dense models. This is true for both model sizes. However, the optimizations in DS-MoE reverse this trend and make MoE model inference both faster and cheaper than that of quality-equivalent dense models. This is a critical result, showing that MoE’s benefits over dense models extend beyond training to inference latency and cost, which is important for real-world deployments.

When comparing the results of Figure 9 with Figure 10, we observe that the benefits of MoE models over dense models become even larger as model size increases. While the 52 billion-parameter MoE model is 2.4x faster and cheaper than the 6.7 billion-parameter dense model, the 1.5 trillion-parameter MoE model is 4.5x faster and 9x cheaper than the 175 billion-parameter dense model. The benefits increase for larger models because DS-MoE leverages parallelism-coordinated optimization to reduce communication overhead when using tensor-slicing on the non-expert part of the model. Furthermore, we can take advantage of expert-slicing at this scale, which enables us to scale to a higher number of GPUs than the PyTorch baseline. In addition, for the larger 1.5 trillion-parameter MoE model, we observe an additional 2x improvement in throughput on top of the latency improvement, as shown in Figure 10. This is because the MoE model can run with half the tensor-slicing degree of the dense model (8-way versus 16-way) and thus a two-times-larger batch size.

Overall, DeepSpeed MoE delivers up to 4.5x faster and up to 9x cheaper MoE model inference compared to serving quality-equivalent dense models using PyTorch. With benefits that scale with model size and hardware resources, as shown in these results, we believe that MoE models will be crucial to bringing about the next generation of advances in AI scale.

Figure 9: Inference latency comparison of a 52 billion-parameter MoE model and its quality-equivalent 6.7 billion-parameter dense model; the MoE model is 2.4x faster and cheaper. We use 1 GPU for the 6.7 billion-parameter model, as it offers the lowest latency, and 128 GPUs for the 52 billion-parameter model. The quality equivalence has been verified by the experiments presented in the training section.
Figure 10: Measured inference latency comparison of a 1.5 trillion-parameter MoE model and its quality-equivalent 175 billion-parameter dense model; the MoE model is 4.5x faster and 9x cheaper. We assume the quality equivalence of these two models under the hypothesis that the scaling behavior of the smaller-scale experiments in Figure 9 holds, as well as from observations in the published literature.

Looking forward to the next generation of AI Scale

With the exponential growth of model size recently, we have arrived at the boundary of what modern supercomputing clusters can do to train and serve large models. It is no longer feasible to achieve better model quality by simply increasing the model size due to insurmountable requirements on hardware resources. The choices we have are to wait for the next generation of hardware or to innovate and improve the training and inference efficiency using current hardware.

We, along with recent literature, have demonstrated how MoE-based models can reduce the training cost of even the largest NLG models by several times compared to their quality-equivalent dense counterparts, offering the possibility to train the next scale of AI models on the current generation of hardware. However, prior to this blog post, to our knowledge there has been no existing work on how to serve MoE models (which have many more parameters) with better latency and cost than dense models. This is a challenging issue that blocks their practical use.

To enable practical and efficient inference for MoE models, we offer the novel PR-MoE model architecture and the MoS distillation technique to significantly reduce the memory requirements of these models. We also offer an MoE inference framework that achieves incredibly low latency and cost at an unprecedented model scale. Combining these innovations, we make MoE models not just feasible to serve but usable for inference at lower latency and cost than their quality-equivalent dense counterparts.

As a whole, the new innovations and infrastructures offer a promising path towards training and inference of the next generation of AI scale, without requiring an increase in compute resources. A shift from dense to sparse MoE models can open a path to new directions in the large model landscape, where deploying higher-quality models is widely possible with fewer resources and is more sustainable by reducing the environmental impact of large-scale AI.

Software: The best place to train and serve models using DeepSpeed is the Microsoft Azure AI platform. To get started with DeepSpeed on Azure, follow the tutorial and experiment with different models using our Azure ML examples. You can also measure your model’s energy consumption using the latest Azure Machine Learning resource metrics.

With this release of DeepSpeed, we are releasing a generic end-to-end framework for training and inference of MoE-based models. The MoE training support and optimizations are available in full. The MoE inference optimizations will be released in two phases: the generic, flexible parallelism framework for MoE inference is being released today, and optimizations related to computation kernels and communication will be released in the future.

  • GitHub: DeepSpeed, a deep learning optimization library that makes distributed training easy, efficient, and effective.

To enable experimentation with DeepSpeed MoE optimizations, we are also releasing two extensions of the NLG example that enable the 5x reduction in training cost for MT-NLG-like models: 1) a PR-MoE model extension that enables the 3x improvement in parameter efficiency and model size reduction, and 2) model code extensions so users can easily experiment with MoE inference at scale. Please find the code, tutorials, and documentation at the DeepSpeed GitHub and website.

About our great collaborators

This work was done in collaboration with Brandon Norick, Zhun Liu, and Xia Song from the Turing Team, Young Jin Kim, Alex Muzio, and Hany Hassan Awadalla from the Z-Code Team, and both Saeed Maleki and Madan Musuvathi from the SCCL team.

About the DeepSpeed Team

We are a group of system researchers and engineers—Samyam Rajbhandari, Ammar Ahmad Awan, Jeff Rasley, Reza Yazdani Aminabadi, Minjia Zhang, Zhewei Yao, Conglong Li, Olatunji Ruwase, Elton Zheng, Shaden Smith, Cheng Li, Du Li, Yang Li, Xiaoxia Wu, Jeffery Zhu (PM), Yuxiong He (team lead)—who are enthusiastic about performance optimization of large-scale systems. We have recently focused on deep learning systems, optimizing deep learning’s speed to train, speed to convergence, and speed to develop! If this type of work interests you, the DeepSpeed team is hiring both researchers and engineers! Please visit our careers page.

The post DeepSpeed: Advancing MoE inference and training to power next-generation AI scale appeared first on Microsoft Research.



EzPC: Increased data security in the AI model validation process

EzPC provides secure AI model validation. The diagram poses the following question: is the accuracy of the AI model on the test dataset greater than 70%? First, an AI vendor provides model weights, and a modular compiler takes as input the AI model structure, written in ONNX code for ML inference. From this, it automatically generates MPC protocol code, which is then compiled into various MPC protocols. Additionally, a suite of highly performant cryptographic protocols securely computes complex ML functions on an organization’s test dataset. The MPC protocol exchanges only random bits, keeping the data of both parties secure.

From manufacturing and logistics to agriculture and transportation, the expansion of artificial intelligence (AI) in the last decade has revolutionized a multitude of industries—examples include enhancing predictive analytics on the manufacturing floor and making microclimate predictions so that farmers can respond and save their crops in time. The adoption of AI is expected to accelerate in the coming years, underscoring the need for an efficient adoption process that preserves data privacy.

Currently, organizations that want to adopt AI into their workflow go through the process of model validation, in which they test, or validate, AI models from multiple vendors before selecting the one that best fits their needs. This is usually done with a test dataset that the organization provides. Unfortunately, the two options that are currently available for model validation are insufficient; both risk the exposure of data.

One of these options entails the AI vendor sharing their model with the organization, which can then validate the model on its test dataset. However, by doing this, the AI vendor risks exposing its intellectual property, which it undoubtedly wants to protect. The second option, equally risky, involves the organization sharing its test dataset with the AI vendor. This is problematic on two fronts. First, it risks exposing a dataset with sensitive information. Additionally, there’s the risk that the AI vendor will use the test dataset to train the AI model, thereby “over-fitting” the model to the test dataset to show credible results. To accurately assess how an AI model performs on a test dataset, it’s critical that the model not be trained on it. Currently, these concerns are addressed by complex legal agreements, often taking several months to draft and execute, creating a substantial delay in the AI adoption process.

The risk of data exposure and the need for legal agreements are compounded in the healthcare domain, where patient data, which makes up the test dataset, is incredibly sensitive, and there are strict privacy regulations with which both organizations must comply. Additionally, not only does the vendor’s AI model contain proprietary intellectual property, but it may also include sensitive patient information as part of the training data that was used to develop it. This makes for a challenging predicament. On one hand, healthcare organizations want to quickly adopt AI due to its enormous potential in such applications as understanding health risks in patients, predicting and diagnosing diseases, and developing personalized health interventions. On the other hand, there’s a fast-growing list of AI vendors in the healthcare space to choose from (currently over 200), making the cumulative legal paperwork of AI validation daunting.

EzPC: Easy Secure Multi-Party Computation

We’re very interested in accelerating the AI model validation process while also ensuring dataset and model privacy. For this reason, we built Easy Secure Multi-party Computation (EzPC). This open-source framework is the result of a collaboration among researchers with backgrounds in cryptography, programming languages, machine learning (ML), and security. At its core, EzPC is based on secure multiparty computation (MPC), a suite of cryptographic protocols that enable multiple parties to collaboratively compute a function on their private data without revealing that data to one another or to any other party. This functionality makes AI model validation an ideal use case for MPC.
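As a toy illustration of the idea behind MPC (and emphatically not EzPC’s actual protocols), the sketch below shows additive secret sharing: two parties can jointly compute the sum of their private values while each party sees only random-looking shares. Real MPC frameworks such as EzPC additionally support multiplications, comparisons, and full ML inference, which require much more sophisticated cryptographic machinery.

```python
import secrets

# Toy additive secret sharing over a large prime field. Seeing a single share
# reveals nothing about the underlying value; only the sum of all shares does.
P = 2**61 - 1  # a Mersenne prime used as the modulus

def share(x: int):
    """Split x into two additive shares: share_0 + share_1 = x (mod P)."""
    r = secrets.randbelow(P)
    return r, (x - r) % P

def add_shares(a, b):
    """Each party adds its local shares; no communication of private data is needed."""
    return tuple((ai + bi) % P for ai, bi in zip(a, b))

def reconstruct(shares):
    return sum(shares) % P

# Example: parties jointly compute x + y without revealing x or y to each other.
x_shares, y_shares = share(42), share(100)
z_shares = add_shares(x_shares, y_shares)
print(reconstruct(z_shares))  # 142
```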

However, while MPC has been around for almost four decades, it’s rarely deployed because building scalable and efficient MPC protocols requires deep cryptography expertise. Additionally, while MPC performs well when computing small or simple stand-alone functions, combining several different kinds of functions—which is fundamental to ML applications—is much harder and inefficient if done without a specialized skillset.

EzPC solves these problems, making it easy for all developers, not just cryptography experts, to use MPC as a building block in their applications while providing high computational performance. Two innovations are at the core of EzPC. First, a modular compiler called CrypTFlow takes as input TensorFlow or Open Neural Network Exchange (ONNX) code for ML inference and automatically generates C-like code, which can then be compiled into various MPC protocols. This compiler is both “MPC-aware” and optimized, ensuring that the MPC protocols are efficient and scalable. The second innovation is a suite of highly performant cryptographic protocols for securely computing complex ML functions.

The EzPC system provides usability, security, and performance. For usability, EzPC provides automatic compilation from TensorFlow or ONNX code to MPC protocols, and no cryptography expertise is required. For security, mathematical guarantees ensure that only random bits are exchanged and that sensitive data is demonstrably secured. For performance, real-world benchmarks show that EzPC can run on million-parameter networks and execute in minutes.

EzPC in practice: Multi-institution medical imaging AI validation

In a recent collaboration with researchers at Stanford University and the Centre for Advanced Research in Imaging, Neuroscience & Genomics (CARING), the EzPC team built a system using EzPC to address the need for secure and performant AI model validation. The team from Stanford University had developed a widely acclaimed 7-million parameter DenseNet-121 AI model trained on the CheXpert dataset to predict certain lung diseases from chest X-rays, while a team from CARING created a labeled test dataset of five hundred patient images. The goal was to test the accuracy of the CheXpert model on CARING’s test dataset while preserving the privacy of both the model and the test data.

With this test, EzPC enabled the first-ever secure validation of a production-grade AI model, proving that it’s not necessary to share data to accurately perform AI model validation. Additionally, the performance overheads of the secure validation were reasonable and practical for the application. In particular, it took 15 minutes to perform secure inference on a single image from the test data between two standard cloud virtual machines, about 3,000x longer than the time needed to test an image without the added security that EzPC provides. Running all the images from the test data took a total of five days at a nominal overall cost (see “Multi-institution encrypted medical imaging AI validation without data sharing”).

Looking ahead: Standardizing privacy technology and applications beyond healthcare

With EzPC, MPC technology is now practical and accessible enough to be run on complex AI workloads, making it a game-changer in data collaboration and enabling organizations in all industries, not only healthcare, to select the best AI models for their use cases while simultaneously protecting data and model confidentiality. We want to encourage the use of EzPC with the awareness that it’s possible to validate AI models without sharing data. In doing so, we can prevent the risk of data exposure and potentially overcome current barriers in data collaboration.

Moreover, this technology has the potential to impact the negotiation of complex legal agreements required for the AI model validation process. It’s our hope that these types of legal agreements as well as legislation that aims to protect sensitive and proprietary information can incorporate the understanding that—when using the latest privacy-preserving technology—it’s not necessary to share this type of information to compute functions on joint data.

In addition to AI model validation, EzPC can be applied to a number of different scenarios where it’s essential to maintain data privacy. We’ve successfully evaluated EzPC to securely compute a variety of algorithms across such domains as phishing detection, personalized radiotherapy, speech to keywords, and analytics.

EzPC is open source under MIT license on GitHub. Discover the latest developments on the EzPC research project page, where you can read our publications and watch videos to learn more.

The post EzPC: Increased data security in the AI model validation process appeared first on Microsoft Research.
