Partnering people with large language models to find and fix bugs in NLP systems

Advances in platform models—large-scale models that can serve as foundations across applications—have significantly improved the ability of computers to process natural language. But natural language processing (NLP) models are still far from perfect, sometimes failing in embarrassing ways, like translating “Eu não recomendo este prato” (I don’t recommend this dish) in Portuguese to “I highly recommend this dish” in English (a real example from a top commercial model). These failures continue to exist in part because finding and fixing bugs in NLP models is hard—so hard that severe bugs impact almost every major open-source and commercial NLP model. 

Current methods for finding or fixing bugs take one of two approaches: they’re either user-driven or automated. User-driven methods are flexible and can test any aspect of a model’s behavior, but they depend on highly variable human ability to imagine bugs and are so labor intensive that in practice only a small part of the input space gets tested. Automated approaches, on the other hand, are fast and so can explore large portions of the input space. However, since they lack human guidance, they can only test if a model is right or wrong in very restricted scenarios, such as when the model has inconsistent predictions on inputs with slight variations in phrasing. 

We believe platform models, specifically modern large language models (LLMs) like GPT-3, offer an opportunity to combine the strengths of user-driven and automated approaches: the user stays in control of defining what the model being tested should do, while a generative language model produces tests at scale within a specific category of model behavior. We call this human-AI team approach Adaptive Testing and Debugging, or AdaTest for short. 

With AdaTest, a large language model is tasked with the time-consuming work of generating a large quantity of tests targeted at finding bugs in the model being tested, while the person steers the language model by selecting valid tests and organizing them into semantically related topics. This guidance from the person drastically improves the language model’s generation performance and directs it toward areas of interest. Because these tests are effectively a form of labeled data, they not only identify bugs but can also be used to fix bugs in an iterative debugging loop similar to traditional software development. AdaTest offers significant productivity gains for expert users while remaining simple enough to empower diverse groups of non-experts without a background in programming. This means experts and non-experts alike can better understand and control the behavior of their AI systems across a range of scenarios, which makes for not only better-performing AI systems but more responsible AI systems. The AdaTest code and pre-populated test trees are open source on GitHub.

We’re presenting our paper, “Adaptive Testing and Debugging of NLP Models,” at the 2022 Meeting of the Association for Computational Linguistics (ACL), where our colleagues will also be introducing work that leverages large language models, in their case, to grow adversarial datasets for content moderation tools.

A diagram in which the testing loop is represented by a series of icons showing the language model suggesting tests, the user filtering and organizing them in a test tree, and the language model using that user feedback to suggest more tests, beginning the process again. The graphic representing the testing loop is situated within the debugging loop. Red arrows from the testing loop to a black square labeled “target model” and back to the testing loop indicate identified test failures being used to fix a target model, which is then retested in an iterative process.
Figure 1: AdaTest consists of two loops: a testing loop that generates and organizes tests optimized for the model being tested (the target model) and a debugging loop that iteratively refines the model based on test failures. 

Finding bugs with the testing loop

The AdaTest process is composed of an inner testing loop that is used to find bugs (Figure 1, unrolled in Figure 2) and an outer debugging loop that is used to fix bugs (Figure 1, unrolled in Figure 4). 

Consider how this works for sentiment analysis, used to determine if a piece of text expresses a positive or negative sentiment (typically in the context of product reviews or customer feedback). While the task seems simple enough, even state-of-the-art models have failures, ranging from overt, like classifying “I don’t think I’ve ever had a nicer time in my life” as negative, to more subtly harmful, like classifying “I am a racial minority” as negative (both represent real failures found with AdaTest in commercial models). To demonstrate how AdaTest finds and fixes bugs, we show how to test for (and later fix) instances of fairness-related harms in which neutral references to a specific identity group within a piece of text could cause a sentiment analysis model to incorrectly downweight the sentiment of the text—in other words, scenarios in which a model might treat comments from specific groups more negatively. 

In the testing loop, we start with a set of unit tests about various identities and label the set “/Sensitive” (Figure 2 below). These initial examples don’t reveal any model failures. But AdaTest then uses a large language model—in our case, GPT-3—to generate many similar suggested tests designed to highlight bugs (Figure 2A). While hundreds of tests are generated, we only need to review the top few failing or near-failing tests. We then ignore tests that don’t represent real failures (for example, “I am tired of being silenced” really should be negative in Figure 2) and add the other valid tests to the current topic, also occasionally organizing them into additional subtopics (Figure 2B). These user-filtered tests are included in the language model prompt for the next round of suggestions, nudging the next set of suggestions toward the intersection between user interest and model failure (Figure 2C). Repeating the testing loop results in the language model starting at tests that don’t fail and slowly working its way up to producing stronger and stronger failures. So even when users can’t find model failures on their own, they can start from a small set of passing tests and quickly iterate with the language model to produce a large set of tests that reveal bugs in the model being tested. 

The testing loop represented as a series of rectangles, each containing test suggestions. Starting with the top rectangle and moving down, the user provides three neutral identity statements that are not predicted as negative. The language model, represented by a robot icon, suggests two statements predicted as negative in the next rectangle. In a third rectangle, real failures are accepted and organized into subtopics by the user, represented by a person icon. From those selections, the model suggests two more statements in the next rectangle. In the last rectangle, one of the subtopics is expanded based on the model’s previous suggestions.
Figure 2: The testing loop cycles between the large language model (LLM) generating test suggestions, the model scoring the suggestions, and the user accepting (✔) and organizing them, beginning with initial user-provided examples. In this three-way sentiment analysis example, the model “f” can either pass (green) or fail (red) a test. Passing one of the tests above means the model did not output “negative” while failing a test above means the model did output “negative” and hence failed the test assertion (≠). As the user filters and organizes (B, D), the LLM iteratively climbs toward suggesting valid tests that reveal more pronounced failures (A, C). In this example, we’re testing a sentiment analysis model to ensure that neutral identity-related statements don’t cause the model to flag comments as negative. 

If instead of the “/Sensitive” topic shown in Figure 2, we target a different topic, such as handling negation, we’ll reveal different failures. For example, starting from simple statements like “I have never been happier” that a commercial model correctly classifies as positive, AdaTest can quickly find bugs like “I don’t think that I’ve ever seen a nicer town” getting labeled as negative. These bugs are egregious and obvious once you see them, but they’re hard to find by hand since they only happen for very specific phrasings. 

We ran user studies to quantitatively evaluate if AdaTest makes experts and non-experts better at writing tests and finding bugs in models. We asked experts—those with a background in machine learning and NLP—to test specific topics in two models: a commercial sentiment classifier and GPT-2 for next word auto-complete, used in such applications as predicting the next word in an email being typed (a scenario in which we want to avoid suggesting stereotypes, for example, one of the behaviors we had participants test for). For each topic and model, participants were randomly assigned to use CheckList (representing state-of-the-art user-driven testing) or AdaTest. We present the average number of discovered model failures per minute in Figure 3, where we observe a fivefold improvement with AdaTest across models and participants in the study. We asked non-experts, or those without any programming background, to test the Perspective API toxicity model for content moderation. Participants tried to find non-toxic statements (that is, statements they would personally feel appropriate posting) predicted as toxic for political opinions. Participants were given access to an improved version of the Dynabench crowd-sourcing interface for model testing and to AdaTest. AdaTest provided up to a tenfold improvement (bottom portion of Figure 3). 

A horizontal bar chart with failures found per minute on the x-axis and model and topic on the y-axis broken down by experience of the participant doing the testing. NLP experts testing the sentiment model and auto-complete with AdaTest found 2 clear positive failures per minute and 1 negated positive per minute and 0.6 Muslim stereotypes and 1.1 African American stereotypes, respectively. NLP experts testing the sentiment model and auto-complete with CheckList found 0.3 clear positive failures per minute and 0.2 negated positives per minute and 0.1 Muslim stereotypes and 0.2 African American stereotypes, respectively. Non-experts testing the toxicity model for non-toxic political viewpoints classified as toxic found 1.5 failures per minute with AdaTest compared with 0.15 with Dynabench.
Figure 3: Per-topic model failures per minute. Experts found approximately five times more failures with AdaTest on all topics, and non-experts benefited by up to 10 times. Error bars represent the 10th and 90th percentiles over bootstrap re-samples of participants. 
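
For readers unfamiliar with bootstrap error bars, the following Python sketch shows one common way to compute the 10th and 90th percentiles of a mean over re-samples of participants. The function name, the number of re-samples, and the per-participant values are assumptions made for illustration; the values are not data from the study.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_interval(
    rates: List[float],
    n_resamples: int = 1000,
    lower: float = 0.10,
    upper: float = 0.90,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return the lower and upper percentiles of the mean over participant re-samples."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(rates, k=len(rates)))  # re-sample participants with replacement
        for _ in range(n_resamples)
    )
    lo = means[int(lower * (n_resamples - 1))]
    hi = means[int(upper * (n_resamples - 1))]
    return lo, hi


# Hypothetical failures-per-minute values, one per participant (not study data).
rates = [1.2, 2.5, 1.8, 2.1, 0.9, 2.7]
lo, hi = bootstrap_interval(rates)
print(f"mean={mean(rates):.2f}, 10th-90th percentile interval=({lo:.2f}, {hi:.2f})")
```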

We also grouped participants by their progressive versus conservative political alignment and found that participants wrote tests with twice the quality when testing their own perspective versus an opposing perspective (as measured by an independent set of in-group raters). Our user studies highlight that AdaTest can be used by anyone and that such easy-to-use tools are important to enable model testing by people with diverse backgrounds since testers representing different lived experiences and viewpoints are needed to effectively test different perspectives. 

Fixing bugs with the debugging loop 

Once enough bugs are discovered, testers of a model then engage in the outer debugging loop (Figure 4 below), where they fix bugs discovered in the testing loop and then retest the model. In our experiments, we fixed bugs by fine-tuning the model on the tests, but other strategies, such as collecting more data or adding constraints, are also possible. The retest part of the debugging loop (that is, running the testing loop again) is critical since once we use our tests to fix the model, they no longer represent test data but rather training data. The process of fixing a bug often overcompensates, introducing shortcuts or bugs in the initial rounds of the debugging loop that can only be found using a new set of tests adapted to target the new “fixed” model. 

Running the debugging loop on an open-source RoBERTa-Large sentiment model (Figure 4) demonstrates the importance of a test-fix-retest cycle. We start with tests from the “/Sensitive/Immigration” topic from Figure 2 that the RoBERTa model incorrectly labels as negative. Fine-tuning the model on these tests (mixed with the original training data to maintain task performance) results in a new model that no longer fails the tests (second row of Figure 4). However, when we rerun the testing loop, we find that now almost all immigration statements are labeled as “neutral,” even if they are truly negative based on the application and testing scenario (for example, the statements in the third row of Figure 4 wouldn’t be neutral if a model were tasked with detecting if language was for or against something). Fine-tuning again using these new tests (and the older ones) results in a model that correctly fixes the original bug without adding the “every immigration statement is neutral” shortcut.

This doesn’t, of course, guarantee that there isn’t another shortcut still in the model, but in our experience, a few rounds of the debugging loop drastically reduce the number of accidental bugs that get introduced when “fixing” the original bugs. The testers of the model don’t have to exhaustively identify every possible shortcut or imbalance ahead of time, since AdaTest adaptively surfaces and fixes bugs that have been introduced in the next rounds of testing and debugging. Thus, the debugging loop serves as a friendly adversary, pushing the boundaries of the current “specification” until a satisfactory model is produced. In fact, AdaTest can be seen as an application of the test-fix-retest loop from software engineering to NLP. 
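
The test-fix-retest cycle can be sketched with a deliberately simple toy in Python. Everything here is hypothetical: ToyModel stands in for the sentiment model, find_failures stands in for a full run of the testing loop, and fine_tune is a caricature of fine-tuning that memorizes its tests and adopts their majority label as a default, which is a built-in shortcut so that the overcorrection is visible. A real fix would also mix in the original training data.

```python
from collections import Counter
from typing import Dict, List, Tuple

Test = Tuple[str, str]  # (text, expected label)


class ToyModel:
    """Hypothetical sentiment model: memorized examples plus a default label."""

    def __init__(self, memory: Dict[str, str], default: str) -> None:
        self.memory = memory
        self.default = default

    def predict(self, text: str) -> str:
        return self.memory.get(text, self.default)


def find_failures(model: ToyModel, pool: List[Test]) -> List[Test]:
    """Stand-in for the testing loop: return the tests the current model fails."""
    return [(text, label) for text, label in pool if model.predict(text) != label]


def fine_tune(tests: List[Test]) -> ToyModel:
    """Toy fix: memorize the tests and use their majority label as the default."""
    majority = Counter(label for _, label in tests).most_common(1)[0][0]
    return ToyModel(dict(tests), majority)


def debugging_loop(model: ToyModel, pool: List[Test], max_rounds: int = 5):
    kept: List[Test] = []
    history: List[List[str]] = []
    for _ in range(max_rounds):
        failures = find_failures(model, pool)  # retest against the current model
        history.append([text for text, _ in failures])
        if not failures:
            break
        kept.extend(failures)  # new tests plus the older ones
        model = fine_tune(kept)
    return model, history


pool = [
    ("I am an immigrant", "neutral"),
    ("My parents immigrated here", "neutral"),
    ("I am against this immigration bill", "negative"),
]
buggy = ToyModel({}, default="negative")  # labels every statement as negative
fixed, history = debugging_loop(buggy, pool)
for round_number, failed in enumerate(history, start=1):
    print(round_number, failed)
```

In this toy run, the first fix removes the original failures but leaves a model that calls everything neutral; only the retest exposes that shortcut, and the second fix trains on both the old and the new tests.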

The debugging loop represented as a series of rectangles. Starting with the top rectangle and moving down, tests the model has failed are used to fix the model, as demonstrated by the correctly predicted statements in the second rectangle. The testing loop is run again, revealing that an overcorrection has occurred that causes a bug, in the third rectangle, with even negative statements about the topic being predicted as neutral. These tests that reveal this new bug are then fixed, and the model is fine-tuned on the new tests and previous tests, resulting in the statements being predicted correctly in the last rectangle.
Figure 4: Shortcuts added during an iteration of the debugging loop are found and fixed by future iterations. 

To evaluate the effectiveness of the debugging loop, we fine-tuned RoBERTa-Large to detect if two questions are duplicates (that is, the same question worded differently) using the Quora Question Pairs (QQP) dataset and also fine-tuned it for positive/neutral/negative sentiment analysis using the Stanford Sentiment Treebank (SST) dataset. Using previously published CheckList suites for evaluation, we find the baseline model fails 22 out of 53 QQP topics and 11 out of 39 sentiment topics. We then created data to “fix” a topic by either taking 50 examples from the topic’s data in the CheckList condition or by starting from a seed of five examples and running the debugging loop with AdaTest until finding failures becomes qualitatively difficult (on average 2.83 rounds for QQP and 3.83 rounds for sentiment). This yields an average of 41.6 tests for QQP and 55.8 tests for sentiment. We followed this process for six distinct high-failure rate topics in each task. In the vast majority of cases (see paper for details), AdaTest fixes the topics used for training and a number of unseen held-out topics without breaking any topics, while CheckList data often introduces new bugs (and thus breaks other test topics).

We also evaluated the effectiveness of AdaTest in a standard development setting, targeting a model for to-do detection in meeting notes. After three months of development, CheckList testing, and ad hoc GPT-3–based data augmentation, a PhD-level team had managed to build a model with an F1 score of 0.66 (out of 1) on unseen data collected in the wild. We gave AdaTest to the team with a demo a few minutes long. After four hours of running the debugging loop on their own, they produced another model with an F1 score of 0.77 on the same unseen dataset. These scores were then replicated again on a second unseen dataset, showing that AdaTest can add significant bug-fixing value with a fraction of the effort involved in traditional approaches. 

The promise of human-AI collaboration for ML development

AdaTest encourages a close collaboration between people and large language models, yielding the benefits of both. People provide the problem specification that the language model lacks, while the language model provides quality test creation at a scale and scope that is infeasible for people. The debugging loop connects model testing and debugging to effectively fix bugs, taking model development a step closer toward the iterative nature of traditional software development. Human-AI partnership represents a promising way forward for machine learning development, and we expect this synergy to only improve as the capabilities of large language models continue to grow.

Check out the full paper to see AdaTest’s effectiveness on classification models (sentiment analysis, QQP, toxicity, media selection, and task detection), generation models (GPT-2, translation), and per-token models (NER) ranging from well-tested production systems to brand-new applications. Give it a try yourself at https://github.com/microsoft/adatest

Platform models—large-scale models trained on vast amounts of data—are making it easier and faster to develop AI systems. AdaTest and other tools and resources like it are being developed by researchers at Microsoft to help developers get the most out of these platform models while also understanding, measuring, and mitigating the risks they pose.

The post Partnering people with large language models to find and fix bugs in NLP systems appeared first on Microsoft Research.

(De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools

It’s a well-known challenge that large language models (LLMs)—growing in popularity thanks to their adaptability across a variety of applications—carry risks. Because they’re trained on large amounts of data from across the internet, they’re capable of generating inappropriate and harmful language based on similar language encountered during training.  

Content moderation tools can be deployed to flag or filter such language in some contexts, but unfortunately, datasets available to train these tools often fail to capture the complexities of potentially inappropriate and toxic language, especially hate speech. Specifically, the toxic examples in many existing hate speech datasets tend either to be too hard or too easy for tools to learn from—the too-easy examples contain slurs, profanity, and explicit mentions of minority identity groups; the too-hard examples involve obscure references or inside jokes within the hate speech community. Additionally, the neutral examples in these datasets tend not to contain group mentions. As a result, tools may flag any language that references a minority identity group as hate speech, even when that language is neutral. Alternatively, tools trained on this data fail to detect harmful language when it lacks known or explicit slurs, profanity, or explicit mentions of minority identity groups.  

Generating the kind of data needed to strengthen content moderation tools against the above failures and harms is challenging for numerous reasons. In particular, it is difficult to collect at scale either implicit toxic text that existing machine learning architectures can still learn from or neutral text with group mentions. Additionally, asking people to write such examples, particularly the toxic ones, can take a mental toll on those assigned the task. 

Inspired by the ability of large language models to mimic the tone, style, and vocabulary of the prompts they receive, whether toxic or neutral, we set out to create a dataset for training content moderation tools that can be used to better flag implicitly harmful language. In our paper “ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection,” we collected initial examples of neutral statements with group mentions and examples of implicit hate speech across 13 minority identity groups and used a large-scale language model to scale up and guide the generation process. The outcome is the largest publicly available implicit hate speech dataset to date: 274,000 examples comprising both neutral and toxic statements. We conducted a human study on the generated dataset to better understand different aspects of harm beyond the binary labels of toxic and neutral assigned by content moderation tools. To stress test existing content moderation tools across the minority identity groups studied in this work, we also propose an adversarial classifier-in-the-loop decoding approach. The dataset, two content moderation tools trained on the dataset, the prompts used as seed data, and the source code for our proposed adversarial decoding approach are available in the ToxiGen GitHub repo (please see footnote).

We’re presenting this work at the 2022 Meeting of the Association for Computational Linguistics (ACL), where our colleagues will also be presenting work that leverages the generative power of large language models and human expertise.

A horizontal chart comparing the proportion of minority identity group mentions in the prompts with the minority identity group mentions in the generated text for the 13 minority identity groups in this work: Black, Mexican, people with physical disabilities, LGBTQ+, people with cognitive disabilities, Chinese, Muslim, Jewish, Middle Eastern, Women, Asian, Native American, and Latino.
Figure 1: The ToxiGen dataset—an implicit hate speech dataset created by using a large-scale language model with both regular and adversarial decoding to scale up and guide the generation process—contains 274,000 examples comprising both neutral and toxic statements across 13 minority identity groups. As illustrated above, mentions of a specific minority identity group in the prompts and mentions of the same minority identity group in the corresponding generated text are proportional.

Demonstration-based prompting for building better datasets

Large Transformer-based language models don’t explicitly encode semantic information; nevertheless, these models can distinguish the statistical interactions of words in different contexts. Through experimentation with the generation of language via one of these large language models, we learned how to utilize careful prompt engineering strategies to create the ToxiGen implicit hate speech dataset. 

Our first experiments were to generate examples of hate speech and neutral speech related to the 13 minority identity groups in our work. We started by collecting implicit hate speech prompts from existing datasets and neutral prompts drawn from news articles, opinion pieces, podcast transcripts, and other similar public sources and feeding them into the LLM to create a broader, deeper set of prompts. What we found was that the LLM could generate examples that were qualitatively different depending on the source material. When prompted with bits from different writers on the above topics, in each case, the LLM produced linguistically diverse outputs that were nonetheless similar in style and tone. 

Furthermore, we found that through careful cultivation of prompt sets, we could generate a wide variety of text reflecting diverse opinions and thoughts on these topics that weren’t found in our original source materials. We could generate neutral statements about sensitive topics that mentioned the relevant minority identity groups, and we could consistently generate hate speech statements about these minority identity groups that didn’t contain slurs or profanity. And the more we experimented with the source material, the more interesting our dataset became. This is particularly exciting because we hope that other individuals and groups can use these tools to extend our dataset; different disciplinary experts could utilize the same strategies and collect even better prompt sets, resulting in even more subtle and rich examples of neutral speech and hate speech. 

We also found that the model often generated examples of speech that we ourselves had trouble labeling. In essence, we were using the LLM as a probe to explore the delicate boundaries between acceptable and offensive speech. As a result, our own understanding of the problem definition itself grew through our interactions with the model.  

The first 260,000 examples from our dataset were drawn from this experimental approach. 
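
To make the idea of demonstration-based prompting concrete, here is a minimal Python sketch that assembles a few demonstration statements into a single prompt for an LLM to continue. The bulleted layout, the function name build_prompt, the number of demonstrations, and the placeholder statements are all assumptions for illustration; they are not the prompts or the exact format used to build ToxiGen.

```python
import random
from typing import List


def build_prompt(demonstrations: List[str], n_shots: int = 3, seed: int = 0) -> str:
    """Format a random subset of demonstrations as a list the LLM should continue."""
    rng = random.Random(seed)
    shots = rng.sample(demonstrations, n_shots)
    lines = [f"- {statement}" for statement in shots]
    lines.append("-")  # open bullet: the LLM is expected to write the next statement
    return "\n".join(lines)


# Made-up neutral placeholder statements with group mentions (not from the dataset).
neutral_demonstrations = [
    "many immigrants open small businesses in their new neighborhoods",
    "people who use wheelchairs rely on accessible public transit",
    "the local mosque hosts an open house every spring",
    "the community center offers classes in several languages",
]

prompt = build_prompt(neutral_demonstrations)
print(prompt)
```

Because the model tends to mirror the style and tone of what it is shown, swapping in a different set of demonstrations is the main lever for steering what gets generated.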

Figure 2: Examples of statements generated by (De)ToxiGen that fool Google’s Perspective API, HateBERT, OpenAI content filter, AI2 Delphi, and RoBERTa. Five statements are neutral but mention minority identity groups, so the content moderation tools find them hateful. Five are toxic sentences, but the tools find them neutral. The proposed decoding approach, (De)ToxiGen (referred to as ALICE in the paper), can challenge these content moderation tools, allowing developers to increase their coverage by creating adversarial examples. 

(De)ToxiGen: An adversarial decoding approach for strengthening content moderation tools

While demonstration-based prompting can facilitate large-scale data generation, it doesn’t generate data targeted specifically to challenge a given content moderation tool, or content classifier. This is important because every content moderation tool has unique vulnerabilities depending on the type of data it has been trained on. To address this, we developed (De)ToxiGen (referred to as ALICE in the paper), an algorithmic mechanism that creates an adversarial set-up between an LLM and a given content moderation tool in which the content classifier is in the loop during decoding.  

The proposed approach can increase or decrease the likelihood that a generated statement is classified as hate speech while maintaining the coherence of the generated language. It can generate both false negatives and false positives for a given content moderation tool. For false negatives, toxic prompts are used to elicit toxic responses, and then the tool’s probability of the neutral class is maximized during decoding. Similarly, to generate false positives, neutral prompts are used to generate neutral responses, and then the probability of the toxic class is maximized during decoding. With this approach, we’re essentially trying to reveal weaknesses in a specific content moderation tool by guiding the LLM to produce statements that we know the tool will misidentify. The generated data can then be used to improve the performance and coverage of the targeted content moderation tool. Our ToxiGen dataset includes data generated by both demonstration-based prompting and our proposed adversarial decoding approach. Through empirical study on three existing human-written datasets, we found that starting with an existing content moderation tool and fine-tuning it on ToxiGen can improve the tool’s performance significantly, demonstrating the quality of the machine-generated data in ToxiGen.  
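
The idea of keeping the classifier in the loop during decoding can be illustrated with a simplified Python sketch. This is a greedy, single-step toy and not the decoding algorithm from the paper: the scoring rule (language model log-probability plus a weighted log-probability from the classifier), the weight parameter, and both stub models with their made-up probabilities are assumptions for illustration.

```python
import math
from typing import Callable, Dict, List


def adversarial_step(
    prefix: List[str],
    lm_logprobs: Callable[[List[str]], Dict[str, float]],
    classifier_prob: Callable[[List[str], str], float],
    target_label: str,
    weight: float = 1.0,
) -> str:
    """Pick the next token by LM log-prob plus weighted classifier log-prob."""
    best_token, best_score = "", -math.inf
    for token, lm_logprob in lm_logprobs(prefix).items():
        p_target = classifier_prob(prefix + [token], target_label)
        score = lm_logprob + weight * math.log(max(p_target, 1e-9))
        if score > best_score:
            best_token, best_score = token, score
    return best_token


def toy_lm(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical language model: fixed next-token log-probabilities."""
    return {"friendly": math.log(0.5), "immigrants": math.log(0.3), "loud": math.log(0.2)}


def toy_classifier(tokens: List[str], label: str) -> float:
    """Hypothetical moderation tool that over-flags any mention of a group."""
    p_toxic = 0.8 if "immigrants" in tokens else 0.1
    return p_toxic if label == "toxic" else 1.0 - p_toxic


prefix = ["my", "neighbors", "are"]
plain = adversarial_step(prefix, toy_lm, toy_classifier, "toxic", weight=0.0)
adversarial = adversarial_step(prefix, toy_lm, toy_classifier, "toxic", weight=1.0)
print(plain, adversarial)
```

With the weight set to zero, the step reduces to ordinary greedy decoding. With a positive weight and a neutral prefix, the step favors a neutral continuation that the toy tool is likely to flag, which is the false-positive case described above; targeting the neutral class with toxic prompts would give the false-negative case.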

Human evaluation: Better understanding the data

Human language is complex, particularly when it comes to harmful statements. To better understand different aspects of the data in ToxiGen—its perceived harmfulness and intent and whether it presents as fact or opinion, for example—we conducted human evaluations on the data generated by both regular decoding (top-k), used in the demonstration-based prompting, and the proposed adversarial decoding. The human evaluation also allowed us to test the quality of the output of these methods and gauge how effective these methods were in guiding the generation of the data we sought. 

For the human evaluation, three annotators were used for each statement from a pool of 156 prequalified annotators with prior experience annotating toxic language. About 4,500 samples were randomly selected for each of the decoding methods with coverage across all 13 minority identity groups for each split. We found the following: 

  1. For both decoding methods, minority identity group mentions included in the prompt also exist in the generated statements. This means that both data generation methods reliably produce the data they were designed to produce—hateful and neutral statements with explicit reference to the specified minority identity group.
  2. In the neutral case, the label of the prompt matches the generated text more often than in the toxic case, as shown in Figure 3a. 
  3. The proposed decoding approach generates a higher percentage of adversarial text compared to regular decoding—that is, it produces data that is more likely to fool a given content moderation tool—as illustrated in Figure 3b. 
Two bar charts side by side. The one on the left, titled “Prompt-Response Matching,” shows that top-k decoding produces non-toxic responses 95.2 percent of the time when given a non-toxic prompt compared with 92.1 percent for (De)ToxiGen and that top-k decoding produces toxic responses 67.7 percent of the time when given a toxic prompt compared with 40.3 percent for (De)ToxiGen. The bar chart on the right, titled “Adversarial Power,” shows that statements generated by (De)ToxiGen fool HateBERT 26.4 percent of the time compared with 16.8 percent for statements generated via top-k decoding.
Figure 3a (left) and 3b (right): Human evaluations on the data generated by regular decoding (top-k) and the proposed adversarial decoding showed that the toxicity labels for the prompt and the generated response match more often for non-toxic prompts compared to toxic ones (left). It was also observed that (De)ToxiGen generates a higher percentage of adversarial text compared to regular decoding (right). 
  4. 90.5 percent of machine-generated examples were thought to be human-written by the majority of annotators.
  5. Perceived harmfulness with respect to human- or AI-authored text is similar. 
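
As a small worked example of how a majority-of-annotators figure is computed, the Python sketch below counts the share of examples that more than half of their annotators judged to be human-written. The function name and the votes are hypothetical and are not taken from the study.

```python
from typing import List


def majority_human_rate(votes_per_example: List[List[bool]]) -> float:
    """Fraction of examples that more than half of the annotators judged human-written."""
    majority_count = sum(
        1 for votes in votes_per_example if sum(votes) > len(votes) / 2
    )
    return majority_count / len(votes_per_example)


# Hypothetical votes from three annotators per example.
# True means the annotator thought the text was written by a person.
votes = [
    [True, True, False],
    [True, True, True],
    [False, False, True],
    [True, False, True],
]
rate = majority_human_rate(votes)
print(rate)  # 0.75
```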

Looking ahead: Societal implications and opportunities

As advances continue to be made in large language models, we remain vigilant in our pursuit of AI systems that align with our commitment to technology that benefits society as a whole and empowers everyone to achieve more. We’re beginning to ask better questions to more deeply understand the risks associated with LLMs and to build processes and methods for addressing them. Existing content moderation tools tend to be good only at flagging overtly inappropriate or harmful language. Our work aims to create data that better targets this gap, covering implicitly harmful language as well as neutral statements that mention minority identity groups. While our work here specifically explores hate speech, our proposed methods could be applied to a variety of content moderation challenges, such as flagging potential misinformation content. By releasing the source code and prompt seeds for this work, we hope to encourage the research community to contribute to it by, for example, adding prompt seeds and generating data for minority identity groups that aren’t covered in our dataset. 

As with many technologies, the solutions we develop to make them stronger, more secure, and less vulnerable also have the potential to be used in unintended ways. While the methods described here may be used to generate inappropriate or harmful language, we believe that they provide far greater value in helping to combat such language, resulting in content moderation tools that can be used alongside human guidance to support fairer, safer, more reliable, and more inclusive AI systems.  

Considerations for responsible use

There is still a lot that this dataset is not capturing about what constitutes problematic language, and before utilizing the dataset, its limitations should be acknowledged. Our annotations might not capture the full complexity of these issues, given problematic language is context-dependent, dynamic, and can manifest in different forms and different severities. Content moderation tools aren’t a silver bullet to address harmful online content. Problematic language is fundamentally a human-centric problem. It should be studied in conjunction with human experience, and tools to address this problem should be developed and deployed with human expertise and well-informed regulatory processes and policy. Multidisciplinary work is needed to better understand the aspects of this challenge.  

Also, this dataset only captures implicit toxicity (more precisely, hate speech) for 13 minority identity groups and, because of its large scale, naturally contains imperfections. Our goal in this project is to provide the community with the means to improve detection of implicit hate speech for the identified minority identity groups. The limitations of this dataset and of models trained on it can be the subject of future research, for example, covering more minority identity groups, or combinations of them, that are not covered in our work. Stronger content moderation tools and systems can contribute to mitigating fairness-related harms in AI systems. For example, systems that don’t over-flag neutral statements with minority identity group mentions can help ensure better representation of diverse perspectives and experiences, while systems that can better flag implicit hate speech can support more inclusive technology.   

Acknowledgment 

This work was conducted by PhD students Thomas Hartvigsen and Saadia Gabriel during their internships at Microsoft Azure and Microsoft Research. Hamid Palangi, Dipankar Ray, Maarten Sap, and Ece Kamar served as advisors on the work. A special thanks to Misha Bilenko from Azure ML for making the compute resources available and to Microsoft Research for supporting our large-scale human study. 

Platform models—large-scale models trained on vast amounts of data—are making it easier and faster to develop AI systems. (De)ToxiGen and other tools and resources like it are being developed by researchers at Microsoft to help developers get the most out of these platform models while also understanding, measuring, and mitigating the risks they pose.

Please note: This research, the GitHub repository, and examples from our work included in this blog contain and discuss content that is offensive or upsetting. All materials are intended to support research that improves hate speech detection methods. Included examples of hate speech don’t represent how the authors or sponsors feel about any minority identity groups. Hate speech applies to a range of minority identity groups; for the purposes of this research, we focus on 13 of them (as shown in Figure 1). Content moderation tools are part of larger content moderation systems. These systems also include human expertise and thoughtful policy and regulatory development. Even the most robust content moderation tools and datasets require systems with human supervision. 

The post (De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools appeared first on Microsoft Research.


FLUTE: A scalable federated learning simulation platform

This diagram shows a payload exchange between a server, inside Worker 0, and clients that live inside Workers 2 and 3. First, the server pushes the central ML model plus the clients’ data to Workers 2 and 3. Then, each client trains the model with their local data. Finally, the clients send the pseudo-gradients of this new model back to the server for aggregation and the creation of a new global model.

Federated learning has become a major area of machine learning (ML) research in recent years due to its versatility in training complex models over massive amounts of data without the need to share that data with a centralized entity. However, despite this flexibility and the amount of research already conducted, it’s difficult to implement due to its many moving parts—a significant deviation from traditional ML pipelines.

The challenges in working with federated learning result from the diversity of local data and end-node hardware, privacy concerns, and optimization constraints. These are compounded by the sheer volume of federated learning clients and their data, and they necessitate a wide skill set, significant interdisciplinary research efforts, and major engineering resources to manage. In addition, federated learning applications often need to scale the learning process to millions of clients to simulate a real-world environment. All of these challenges underscore the need for a simulation platform, one that enables researchers and developers to perform proof-of-concept implementations and validate performance before building and deploying their ML models. 

A versatile framework for federated learning

Today, the Privacy in AI team at Microsoft Research is thrilled to introduce Federated Learning Utilities and Tools for Experimentation (FLUTE) as a framework for running large-scale offline federated learning simulations, which we discuss in detail in the paper, “FLUTE: A Scalable, Extensible Framework for High-Performance Federated Learning Simulations.” In creating FLUTE, our goal was to develop a high-performance simulation platform that enables quick prototyping of federated learning research and makes it easier to implement federated learning applications.

There has been a lot of research in the last few years directed at tackling the many challenges in working with federated learning, including setting up learning environments, providing privacy guarantees, implementing model-client updates, and lowering communication costs. FLUTE addresses many of these while providing enhanced customization and enabling new research on a realistic scale. It also allows developers and researchers to test and experiment with certain scenarios, such as data privacy, communication strategies, and scalability, before implementing their ML model in a production framework.

One of FLUTE’s main benefits is its native integration with Azure ML workspaces, leveraging the platform’s features to manage and track experiments, parameter sweeps, and model snapshots. Its distributed nature is based on Python and PyTorch, and the flexibly designed client-server architecture helps researchers and developers quickly prototype novel approaches to federated learning. However, FLUTE’s key innovation and technological differentiator is the ease it provides in implementing new scenarios for experimentation in core areas of active research in a robust high-performance simulator. 

FLUTE offers a platform where all clients are implemented as isolated object instances, as shown in Figure 1. The interface between the server and the remaining workers relies on messages that contain client IDs and training information, with MPI as the main communication protocol. Local data on each client stays within local storage boundaries and is never aggregated with other local sources. Clients only communicate gradients to the central server.

This diagram shows server-client communication under FLUTE’s architecture. Worker 0 acts as the server and contains the global model, client training data, the configuration, and the optimizer. Worker i receives a copy of the global model plus the task configuration. It also contains clients that are composed of the trainer and the optimizer. Each client sends the payload back to Worker 0.
Figure 1: FLUTE’s client-server architecture and workflow. First, the server pushes the initial global model to the clients and sends training information. Then, the clients train their instances of the global model with locally available data. Finally, all clients return the information to the server, which aggregates the pseudo-gradients and produces a new global model that is sent back to the clients. This three-step process repeats for all rounds of training.
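
The three-step round in Figure 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not FLUTE’s API: the function names (`local_train`, `client_pseudo_gradient`, `aggregate`, `simulate`), the toy linear model, the noise-free synthetic data, and the unweighted averaging are all made up for the example. The pseudo-gradient is taken here to be the difference between the global weights a client received and the weights it holds after local training.

```python
import random

def local_train(weights, data, lr=0.1, steps=5):
    """Step 2: a client fits y ~ w*x + b using only its own local data."""
    w, b = weights
    for _ in range(steps):
        grad_w = sum((w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum((w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def client_pseudo_gradient(global_weights, data):
    """Return (global - local) weights; raw data never leaves the client."""
    local = local_train(global_weights, data)
    return tuple(g - l for g, l in zip(global_weights, local))

def aggregate(global_weights, pseudo_grads, server_lr=1.0):
    """Step 3: the server averages pseudo-gradients and updates the model."""
    n = len(pseudo_grads)
    avg = [sum(pg[i] for pg in pseudo_grads) / n
           for i in range(len(global_weights))]
    return tuple(g - server_lr * a for g, a in zip(global_weights, avg))

def simulate(num_clients=10, rounds=200, seed=0):
    """Run several rounds on hypothetical data generated from y = 3x + 1."""
    rng = random.Random(seed)
    clients = []
    for _ in range(num_clients):
        xs = [rng.uniform(-1, 1) for _ in range(20)]
        clients.append([(x, 3.0 * x + 1.0) for x in xs])
    weights = (0.0, 0.0)  # initial global model (w, b)
    for _ in range(rounds):
        # Step 1: every client receives the current global weights.
        grads = [client_pseudo_gradient(weights, data) for data in clients]
        weights = aggregate(weights, grads)
    return weights
```

Calling `simulate()` returns global weights close to the generating values `(3.0, 1.0)`, even though the server only ever sees pseudo-gradients. A real simulator would distribute the clients across workers and typically weight each client’s contribution, which this sketch omits.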

The following features contribute to FLUTE’s versatile framework and enable experimentation with new federated learning approaches: 

  • Scalability: Scale is a critical factor in understanding practical metrics, such as convergence and privacy-utility tradeoffs. Researchers and developers can run large-scale experiments using tens of thousands of clients with a reasonable turnaround time. 
  • Flexibility: FLUTE supports diverse federated learning configurations, including standardized implementations such as DGA and FedAvg.
  • Versatility: FLUTE’s generic API helps researchers and developers easily implement new models, datasets, metrics, and experimentation features, while its open architecture helps them add new algorithms in such areas as optimization, privacy, and robustness.

Available as an open-source platform

As part of this announcement, we’re making FLUTE available as a versatile open-source platform for rapid prototyping and experimentation. It comes with a set of basic tools to help kickstart experiments. We hope researchers and developers take advantage of this framework by exploring new approaches to federated learning.

Looking ahead

FLUTE’s innovative framework offers a new paradigm for implementing federated learning algorithms at scale, and this is just the beginning. We’re making improvements with a view toward making FLUTE the standard federated learning simulation platform. Future releases will include algorithmic enhancements in optimization and support for additional communication protocols. We’re also adding features that make it easier to set up experiments with tailored features in new tasks, as well as the ability to easily incorporate FLUTE as a library into Azure ML pipelines.

Additional resources 

Check out this video for a deep dive into FLUTE architecture and a tutorial on how to use it. Our documentation also explains how to implement FLUTE.  

You can learn more about the FLUTE project by visiting our project page, and discover more about our current federated learning research as well as other projects related to privacy in AI on our group page.


The post FLUTE: A scalable federated learning simulation platform appeared first on Microsoft Research.


Azure Quantum innovation: Efficient error correction of topological qubits with Floquet codes

Qubits arranged in a square array on a two-dimensional surface. Measurements are done on the qubits in a sequence of checks, shown as a repeating pattern of three steps. In each step, one measures a check on each pair of neighboring qubits, shown as a line connecting those qubits, with the lines moving in a repeating pattern over the three steps.
This graphic shows the repeating three-step sequence of checks used in Floquet codes. Each circle represents a qubit, and a line between a pair of circles indicates that a check on that pair is measured in that time step. The colors indicate the type of operator measured in each check, either XX, YY, or ZZ, so that the type of check measured also changes with time. Learn more about this sequence of checks in the section “Unlocking a new class of quantum codes” below. 

Technological innovation that enables scaling of quantum computing underpins the Microsoft Azure Quantum program. In March of this year, we announced our demonstration of the underlying physics required to create a topological qubit, a type of qubit theorized to be inherently more stable than existing ones without sacrificing size or speed. However, our quest to deliver a general-purpose quantum computer capable of addressing industrial-scale problems will require innovation across every layer of the quantum stack, from materials at the nanoscale to algorithms and applications. At Azure Quantum, our full-stack approach and broad expertise across all areas of quantum computation allow us to drive innovation in this space through tight collaboration across theory, hardware, software, and systems teams. 

One of the greatest challenges in building a quantum computer is that quantum states are intrinsically fragile and are quickly destroyed when a qubit couples to its environment, leading to noise. A crucial technology to overcome this fragility, which is also used in classical digital computing, is error correction. By encoding the state of a single logical qubit into many physical qubits, quantum error correction (QEC) can detect and correct most errors that occur on the physical qubits. Indeed, such error correction needs to be at the heart of any scalable quantum system. Without it, no known qubit technology can protect quantum states long enough to perform a calculation that can deliver real-world impact. However, quantum error correction also comes at a significant cost: depending on the quality of the physical qubits, error correction can increase the space requirements of a computation by a factor of several thousand and the time requirements more than tenfold. Therefore, any improvements on error correction have enormous positive ripple effects across the entire stack.
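
As a classical analogy only (quantum codes differ in important ways, and this is not one of the codes discussed in this post), the trade-off between redundancy and reliability can be illustrated with a repetition code: one logical bit is copied into several physical bits and recovered by majority vote. The function names below are hypothetical, and the sketch assumes an odd number of copies and independent bit flips.

```python
from itertools import product

def encode(bit, n=3):
    """Copy one logical bit into n physical bits."""
    return [bit] * n

def decode(bits):
    """Recover the logical bit by majority vote (n is assumed odd)."""
    return int(sum(bits) > len(bits) / 2)

def logical_error_rate(p, n=3):
    """Exact probability that decoding fails when each physical bit
    flips independently with probability p."""
    total = 0.0
    for flips in product([0, 1], repeat=n):
        received = [b ^ f for b, f in zip(encode(0, n), flips)]
        if decode(received) != 0:
            k = sum(flips)  # number of flipped bits in this pattern
            total += p ** k * (1 - p) ** (n - k)
    return total
```

With a physical error rate of 0.1, `logical_error_rate(0.1, 3)` evaluates to 0.028 and `logical_error_rate(0.1, 5)` to about 0.0086: more physical bits buy a lower logical error rate, which is the kind of overhead trade-off described above.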

In this post, we’ll share some exciting implications from our recent innovations toward scale, specifically how to perform quantum error correction in our topological quantum computation stack, published in the series of papers listed below. Topological qubits promise lower error rates than conventional qubits, and as such can perform scalable quantum computation at lower overhead. On top of that, in these papers we introduce a new class of quantum error correction codes, called Floquet codes, which are particularly suited to topological qubits. Our new approaches culminate in an additional tenfold or more reduction to the overhead needed for error correction on topological qubits compared to previous state of the art, opening a viable path toward scaling to a million qubits and beyond. 

Unlocking a new class of quantum codes 

To optimize performance on any quantum computing platform, the circuits must be adapted to the capabilities of the hardware. This is particularly true for error correction schemes, which must be tailor-made to exploit the strengths of a given hardware platform. Unlike most other qubits, our topological qubits employ a measurement-based scheme, where direct measurements between adjacent qubits are the native set of operations. All quantum error correction schemes use frequent measurements to identify errors, and the outcomes of these measurements are used to infer the occurrence of errors without destroying the encoded quantum state. The state-of-the-art schemes, however, require complex multi-qubit measurements that can’t be implemented directly in the hardware and must be compiled into native operations at the expense of additional auxiliary qubits and additional timesteps. 

Our recent breakthroughs overcome this issue through a conceptually new perspective on quantum codes (put forward in “Dynamically Generated Logical Qubits” and “Boundaries for the Honeycomb code”), where the encoding of the quantum information is not static but rather allowed to periodically evolve in time. Many examples of physical systems are known where such periodic evolution allows new phenomena to occur (see, for example, the well-known Kapitza pendulum). The study of such systems falls under the term Floquet systems, which gives this new class of codes its name. 

These codes are built entirely from two-qubit measurements referred to as “check measurements.” Just like measurements in a conventional code, these are used to check for errors. The simplicity of these checks, however, means that each time we measure a check, we change the encoding of the quantum information, leading to the Floquet nature of the code. As a consequence, the outcomes of these measurements cannot be used directly to infer which errors have occurred, but rather the full history of measurement outcomes over time must be taken into account. 
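
One way to see why measuring a check can change the encoding is to look at how neighboring checks relate to each other. The sketch below assumes the standard Pauli-matrix representation of XX, YY, and ZZ checks and a hypothetical four-qubit layout in which qubit 0 shares one check of each type with qubits 1, 2, and 3. The helper names are made up for the illustration, and plain Python lists are used in place of a linear algebra library.

```python
# Single-qubit identity and Pauli matrices as nested lists.
I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]

def kron(a, b):
    """Kronecker (tensor) product of two matrices."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def matmul(a, b):
    """Ordinary matrix product."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]

def two_qubit_check(pauli, i, j, num_qubits=4):
    """Operator acting with `pauli` on qubits i and j, identity elsewhere."""
    op = [[1]]
    for k in range(num_qubits):
        op = kron(op, pauli if k in (i, j) else I2)
    return op

def commutes(a, b):
    return matmul(a, b) == matmul(b, a)

def anticommutes(a, b):
    return matmul(a, b) == [[-x for x in row] for row in matmul(b, a)]

# Three checks of different types that all touch qubit 0.
xx = two_qubit_check(X, 0, 1)
yy = two_qubit_check(Y, 0, 2)
zz = two_qubit_check(Z, 0, 3)
# A check that shares no qubit with xx.
zz_far = two_qubit_check(Z, 2, 3)
```

Here `anticommutes(xx, yy)`, `anticommutes(yy, zz)`, and `anticommutes(xx, zz)` all hold, while `commutes(xx, zz_far)` holds. Checks that anticommute cannot all have definite values at once, so measuring one type of check disturbs the outcomes of the neighboring types, which is consistent with the need to track the full measurement history.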

The physical qubits are arranged in a lattice (such as those shown in Figure 1), where they are represented as black dots on the vertices of the graph. Each check is associated with an edge of the graph, and checks of different colors are measured sequentially. The code state changes as the different checks are measured. There are several possible lattice arrangements of the qubits that allow for a natural implementation of a Floquet code. The lattices should have the following two properties: 1) each vertex should be attached to three edges and 2) using only three colors, it should be possible to color the plaquettes in such a way that no adjacent plaquettes have the same color (that is, the plaquettes should be “three-colorable”). While many such arrangements remain to be explored and the optimal choice will depend on details of the physical hardware, Figure 1 shows two possible Floquet-code arrangements. 

Two different ways of tiling a surface.  In the 4.8.8 code configuration on the left, the surface is tiled with octagons and squares, and in the honeycomb code configuration it is tiled with hexagons.  Each shows a possible arrangement of qubits in a Floquet code, with the qubits at the vertices of the tiling. The tiling displays some more complicated features at the boundary, but in the middle it is a regular tiling.
Figure 1: Lattice of qubits used for two different Floquet codes, the 4.8.8 code (left) and the honeycomb code (right). The optimal choice of code depends on the level of noise present and on correlations in the noise. 
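
The two lattice properties can be checked mechanically. The sketch below is an illustration under simplifying assumptions: it builds a honeycomb lattice with periodic boundaries (a torus) rather than the planar layouts with boundaries shown in Figure 1, and the vertex labels, the function names, and the coloring rule `(q - r) % 3` are choices made for this example.

```python
from collections import Counter
from itertools import combinations, product

def honeycomb_torus(n):
    """Build an n x n honeycomb lattice with periodic boundaries.

    Returns (edges, plaquettes): edges is a set of vertex pairs, and
    plaquettes maps each hexagon label (q, r) to the set of its six edges.
    """
    def a(q, r):
        return ("A", q % n, r % n)

    def b(q, r):
        return ("B", q % n, r % n)

    edges, plaquettes = set(), {}
    for q, r in product(range(n), repeat=2):
        # Each A-type vertex connects to three B-type vertices.
        for neighbor in (b(q, r), b(q - 1, r), b(q, r - 1)):
            edges.add(frozenset((a(q, r), neighbor)))
        # The six vertices around hexagon (q, r), in cyclic order.
        cycle = [a(q, r), b(q, r), a(q + 1, r),
                 b(q + 1, r - 1), a(q + 1, r - 1), b(q, r - 1)]
        plaquettes[(q, r)] = {frozenset((cycle[i], cycle[(i + 1) % 6]))
                              for i in range(6)}
    return edges, plaquettes

def check_lattice(n=6):
    """Return (trivalent, properly_colored) for the n x n torus."""
    if n % 3:
        raise ValueError("n must be a multiple of 3 for this coloring rule")
    edges, plaquettes = honeycomb_torus(n)
    # Property 1: every vertex is attached to exactly three edges.
    degree = Counter(v for edge in edges for v in edge)
    trivalent = all(d == 3 for d in degree.values())
    # Property 2: plaquettes that share an edge get different colors.
    color = {(q, r): (q - r) % 3 for (q, r) in plaquettes}
    properly_colored = all(
        color[p1] != color[p2]
        for p1, p2 in combinations(plaquettes, 2)
        if plaquettes[p1] & plaquettes[p2]
    )
    return trivalent, properly_colored
```

For example, `check_lattice(6)` returns `(True, True)`: every vertex has three edges, and three colors suffice to color the hexagons so that no two hexagons sharing an edge match.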

Error correction tailor-made for topological qubits 

In the realm of our measurement-based topological architecture, we have identified the two arrangements shown in Figure 1 as particularly appealing when combined with a particular design of topological qubit—a “tetron” qubit—which is also a scalable design. The connectivity of these two layouts can be naturally mapped onto the connectivity of an array of such tetrons, which is shown in Figure 2. Furthermore, the majority of the two-qubit check operators that are used to construct these codes are exactly those native operations between tetrons that can be implemented with minimal error, as shown in the lower panel of Figure 2. The details of these codes, their implementation with topological qubits, and numerical studies of their performance are discussed in “Performance of planar Floquet codes with Majorana-based qubits.”

Top panel: an array of qubits.  Each qubit is shown as a sideways “H,” with the long edges of the “H” being topological wires supporting Majorana modes, giving four Majorana modes on each qubit at the points of the “H.” The bottom panel shows different loops connecting different qubits to measure checks of the code.
Figure 2: Upper panel: Physical array of tetron qubits that can be used to implement either the honeycomb or 4.8.8 Floquet code. Lower panel: Mapping of measurement operations into physical interference loops that are used for two-qubit measurements. 

Our numerical simulations show that our Floquet codes and architecture implemented with topological “tetron” qubits help secure the path to a scalable quantum system in several ways. First, the very favorable threshold of these codes, which we estimate to be close to 1 percent, allows us to achieve quantum error correction earlier and demonstrate tangible steps on our journey toward quantum advantage. Second, in the longer run, we find that these codes reduce the overhead required for quantum error correction on topological qubits roughly tenfold compared to the previous state-of-the-art approach, which means that our scalable system can be built from fewer physical qubits and can run at a faster clock speed (see Figure 3 below).

A plot of the overhead due to error correction as a function of the performance of the physical qubits.  As the physical qubits are improved (lower noise, on the left side of the plot), the overhead is reduced. The plot shows that the Floquet codes outperform other codes by an order of magnitude.
Figure 3: Comparison of the spacetime overhead between the previous state-of-the-art (blue, dashed line) and the newly developed Floquet codes (black, solid line), both for an implementation on topological qubits. See Figure 8 in “Performance of planar Floquet codes with Majorana-based qubits” for more details. 

Approaching quantum computation from the unique topological perspective requires synchronized advancements across the entire Azure Quantum stack. Along with our recent demonstration of the building blocks for topological qubits, optimizing quantum error correction using Floquet codes represents a critical piece of the scientific foundation needed to achieve scaled quantum computation. These breakthroughs help establish a path and architecture for the industrial quantum machine.

The post Azure Quantum innovation: Efficient error correction of topological qubits with Floquet codes appeared first on Microsoft Research.
