DoWhy evolves to independent PyWhy model to help causal inference grow
Identifying causal effects is an integral part of scientific inquiry. It helps us understand everything from educational outcomes to the effects of social policies to risk factors for diseases. Questions of cause-and-effect are also critical for the design and data-driven evaluation of many technological systems we build today.
To help data scientists better understand and deploy causal inference, Microsoft researchers built a tool that implements the process of causal inference analysis from end to end. The ensuing DoWhy library has been doing just that since 2018 and has cultivated a community devoted to applying causal inference principles in data science. To broaden access to this critical knowledge base, DoWhy is migrating to an independent open-source governance model in a new PyWhy GitHub organization. As a first step toward this model, we are announcing a collaboration with Amazon Web Services (AWS), which is contributing new technology based on structural causal models.
What is causal inference?
The goal of conventional machine learning methods is to predict an outcome. In contrast, causal inference focuses on the effect of a decision or action—that is, the difference between the outcome if an action is completed versus not completed. For example, consider a public utility company seeking to reduce their customers’ usage of water through a marketing and rewards program. The effectiveness of a rewards program is difficult to ascertain, as any decrease in water usage by participating customers is confounded with their choice to participate in the program. If we observe that a rewards program member uses less water, how do we know whether it is the program that is incentivizing their lower water usage or if customers who were already planning to reduce water usage also chose to join the program? Given information about the drivers of customer behavior, causal methods can disentangle confounding factors and identify the effect of this rewards program.
How do we know when we have the right answer? The effect of an action like signing up for a customer loyalty program is typically not an observable value. For any given customer, we see only one of the two respective outcomes and cannot directly observe the difference the program made. This means that processes developed to validate conventional machine learning models—based on comparing predictions to observed ground truths—cannot be used. Instead, we need new processes to gain confidence in the reliability of causal inference. Most critically, we need to capture our domain knowledge, reason about our modeling choices, then validate our core assumptions when possible and analyze the sensitivity of our results to violations of assumptions when validation is not possible.
Four steps of causal inference analysis
Data scientists just beginning to explore causal inference are most challenged by the new modeling assumptions of causal methods. DoWhy can help them understand and implement the process. The library focuses on the four steps of an end-to-end causal inference analysis, which are discussed in detail in a previous paper, DoWhy: an End-to-End Library for Causal Inference, and related blog post:
- Modeling: Causal reasoning begins with the creation of a clear model of the causal assumptions being made. This involves documenting what is known about the data generating process and mechanisms. To get a valid answer to our cause-and-effect questions, we must be explicit about what we already know.
- Identification: Next, we use the model to decide whether the causal question can be answered, and we provide the required expression to be computed. Identification is the process of analyzing our model.
- Estimation: Once we have a strategy for identifying the causal effect, we can choose from several different statistical and machine learning-based estimation methods to answer our causal question. Estimation is the process of analyzing our data.
- Refutation: Once we have our answer, we must do everything we can to test our underlying assumptions. Is our model consistent with the data? How sensitive is the answer to the assumptions made? If the model missed an unobserved confounder, will that change our answer a little or a lot?
This focus on the four steps of the end-to-end causal inference process differentiates the DoWhy library from prior causal inference toolkits. DoWhy complements other libraries—which focus on individual steps—and offers users the benefits of those libraries in a seamless, unified API. For example, for estimation, DoWhy offers the ability to call out to Microsoft’s EconML library for its advanced estimation methods.
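To make the four steps concrete, here is a minimal sketch of how they map onto DoWhy’s Python API, using synthetic data loosely modeled on the water-rewards example above. The column names, the simulated data, and the choices of estimator and refuter are ours for illustration; they are not part of any analysis described in this post.

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Hypothetical synthetic data mimicking the water-rewards example:
# conservation-minded households are both more likely to enroll and to use less water.
rng = np.random.default_rng(0)
n = 5000
attitude = rng.normal(size=n)                                  # confounder: conservation attitude
enrolled = (attitude + rng.normal(size=n) > 0).astype(int)     # treatment: joined the rewards program
usage = 100 - 5 * enrolled - 10 * attitude + rng.normal(size=n)  # outcome: water usage
df = pd.DataFrame({"enrolled": enrolled, "usage": usage, "attitude": attitude})

# 1. Model: state the assumed causal structure explicitly.
model = CausalModel(data=df, treatment="enrolled", outcome="usage",
                    common_causes=["attitude"])

# 2. Identify: check whether the causal question is answerable under those assumptions.
estimand = model.identify_effect(proceed_when_unidentifiable=True)

# 3. Estimate: compute the effect with a chosen statistical method.
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print("Estimated effect of enrollment on usage:", estimate.value)

# 4. Refute: stress-test the answer, e.g., by adding a random common cause.
refutation = model.refute_estimate(estimand, estimate,
                                   method_name="random_common_cause")
print(refutation)
```

A robust analysis would go further, for example by swapping in an EconML estimator at step 3 or running additional refuters, but the shape of the workflow stays the same.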
Current DoWhy deployments
Today, DoWhy has been installed over one million times. It is widely deployed in production scenarios across industry and academia—from evaluating the effects of customer loyalty and marketing programs to identifying the controllable drivers of key business metrics. DoWhy’s rich API has enabled the creation of downstream solutions such as AutoCausality from Wise.com, which automates comparison of different methods, and ShowWhy from Microsoft, which provides a no-code GUI experience for causal inference analysis. In academia, DoWhy has been used in a range of research scenarios, including sustainable building design, environmental data analyses, and health studies. At Microsoft, we continue to use DoWhy to power causal analyses and test their validity, for example, estimating who benefits most from messages to avoid overcommunicating to large groups.
A community of more than 40 researchers and developers continually enrich the library with critical additions. Highly impactful contributions, such as customizable backdoor criterion implementation and a user-friendly Pandas integration, have come from external contributors. Instructors in courses and workshops around the world use DoWhy as a pedagogical tool to teach causal inference.
With such broad support, DoWhy continues to improve and expand. In addition to more complete implementations of identification algorithms and new sensitivity analysis methods, DoWhy has added experimental support for causal discovery and more powerful methods for testing the validity of a causal estimate. Using the four steps as a set of fundamental operations for causal analysis, DoWhy is now expanding into other tasks, such as representation learning.
Microsoft continues to expand the frontiers of causal learning through its research initiatives, with new approaches to robust learning, statistical advances for causal estimation, deep learning-based methods for end-to-end causal discovery and inference, and investigations into how causal learning can help with fairness, explainability, and interpretability of machine learning models. As each of these technologies matures, we expect to make them available to the broader causal community through open source and product offerings.
An independent organization for DoWhy and other open-source causal inference projects
Making causality a pillar of data science practice requires an even broader, collaborative effort to create a standardized foundation for our industry.
To this end, we are happy to announce that we are shifting DoWhy into an independent open-source governance model, in a new PyWhy effort.
Our first collaborator in this initiative is AWS, which is contributing new technology for causal attribution based on a structural causal model that complements DoWhy’s current functionalities.
We are looking forward to accelerating and broadening adoption of our open-source causal learning tools through this new GitHub organization. We invite data scientists, researchers, and engineers, whether you are just learning about causality, designing new algorithms, or building your own tools, to join us on the open-source journey toward building a useful causal analysis ecosystem.
We encourage you to explore DoWhy and invite you to contact us to learn more. We are excited by what lies ahead as we aim to transform data science practice to drive improved modeling and decision making.
The post DoWhy evolves to independent PyWhy model to help causal inference grow appeared first on Microsoft Research.
(De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools
It’s a well-known challenge that large language models (LLMs)—growing in popularity thanks to their adaptability across a variety of applications—carry risks. Because they’re trained on large amounts of data from across the internet, they’re capable of generating inappropriate and harmful language based on similar language encountered during training.
Content moderation tools can be deployed to flag or filter such language in some contexts, but unfortunately, datasets available to train these tools often fail to capture the complexities of potentially inappropriate and toxic language, especially hate speech. Specifically, the toxic examples in many existing hate speech datasets tend either to be too hard or too easy for tools to learn from—the too-easy examples contain slurs, profanity, and explicit mentions of minority identity groups; the too-hard examples involve obscure references or inside jokes within the hate speech community. Additionally, the neutral examples in these datasets tend not to contain group mentions. As a result, tools may flag any language that references a minority identity group as hate speech, even when that language is neutral. Alternatively, tools trained on this data fail to detect harmful language when it lacks known or explicit slurs, profanity, or explicit mentions of minority identity groups.
Generating the kind of data needed to strengthen content moderation tools against the above failures and harms is challenging for numerous reasons. In particular, implicitly toxic text that existing machine learning architectures can still learn from, as well as neutral text that mentions minority identity groups, is difficult to collect at scale. Additionally, asking people to write such examples—particularly the toxic ones—can take a mental toll on those assigned the task.
Inspired by the ability of large language models to mimic the tone, style, and vocabulary of prompts they receive—whether toxic or neutral—we set out to create a dataset for training content moderation tools that can be used to better flag implicitly harmful language. In our paper “ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection,” we collected initial examples of neutral statements with group mentions and examples of implicit hate speech across 13 minority identity groups and used a large-scale language model to scale up and guide the generation process. The outcome is the largest implicit hate speech dataset to date that is publicly available: 274,000 examples comprising both neutral and toxic statements. We conducted a human study on the generated dataset to better understand different aspects of harm beyond binary labels of toxic and neutral assigned by content moderation tools. To stress test existing content moderation tools across the minority identity groups studied in this work, we also propose an adversarial classifier-in-the-loop decoding approach. The dataset, two content moderation tools trained on the dataset, prompts used as seed data, and the source code for our proposed adversarial decoding approach are available in the ToxiGen GitHub repo (please see footnote).
We’re presenting this work at the 2022 Meeting of the Association for Computational Linguistics (ACL), where our colleagues will also be presenting work that leverages the generative power of large language models and human expertise.
Demonstration-based prompting for building better datasets
Large Transformer-based language models don’t explicitly encode semantic information; nevertheless, these models can capture the statistical relationships among words in different contexts. Through experimentation with language generation using one of these large language models, we learned how to apply careful prompt engineering strategies to create the ToxiGen implicit hate speech dataset.
Our first experiments were to generate examples of hate speech and neutral speech related to the 13 minority identity groups in our work. We started by collecting implicit hate speech prompts from existing datasets and neutral prompts drawn from news articles, opinion pieces, podcast transcripts, and other similar public sources and feeding them into the LLM to create a broader, deeper set of prompts. What we found was that the LLM could generate examples that were qualitatively different depending on the source material. When prompted with bits from different writers on the above topics, in each case, the LLM produced linguistically diverse outputs that were nonetheless similar in style and tone.
Furthermore, we found that through careful cultivation of prompt sets, we could generate a wide variety of text reflecting diverse opinions and thoughts on these topics that weren’t found in our original source materials. We could generate neutral statements about sensitive topics that mentioned the relevant minority identity groups, and we could consistently generate hate speech statements about these minority identity groups that didn’t contain slurs or profanity. And the more we experimented with the source material, the more interesting our dataset became. This is particularly exciting because we hope that other individuals and groups can use these tools to extend our dataset; different disciplinary experts could utilize the same strategies and collect even better prompt sets, resulting in even more subtle and rich examples of neutral speech and hate speech.
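As a rough illustration of what demonstration-based prompting looks like in practice, here is a minimal sketch, with important caveats: the paper used GPT-3, whereas this example uses a small open model (GPT-2) as a stand-in, and the seed statements are benign placeholders of our own rather than prompts from the ToxiGen prompt pool.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in generator; ToxiGen was built with GPT-3.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical neutral seed statements mentioning an identity group.
seed_statements = [
    "many immigrants run small businesses in their neighborhoods",
    "immigrant families often speak more than one language at home",
    "people who move to a new country frequently keep close ties with relatives abroad",
]

# Demonstration-based prompt: one statement per line, ending with a dash so the
# model continues the list in the same style and tone as the demonstrations.
prompt = "\n".join(f"- {s}" for s in seed_statements) + "\n-"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,            # regular top-k sampling
    top_k=50,
    max_new_tokens=40,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    continuation = tokenizer.decode(seq[inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)
    print(continuation.split("\n")[0].strip())  # keep only the first generated list item
```

In the actual data collection, the generated statements would then be filtered, labeled, and folded back into richer prompt sets, as described above.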
We also found that the model often generated examples of speech that we ourselves had trouble labeling. In essence, we were using the LLM as a probe to explore the delicate boundaries between acceptable and offensive speech. As a result, our own understanding of the problem definition itself grew through our interactions with the model.
The first 260,000 examples from our dataset were drawn from this experimental approach.
(De)ToxiGen: An adversarial decoding approach for strengthening content moderation tools
While demonstration-based prompting can facilitate large-scale data generation, it doesn’t generate data targeted specifically to challenge a given content moderation tool, or content classifier. This is important because every content moderation tool has unique vulnerabilities depending on the type of data it has been trained on. To address this, we developed (De)ToxiGen (referred to as ALICE in the paper), an algorithmic mechanism that creates an adversarial set-up between an LLM and a given content moderation tool in which the content classifier is in the loop during decoding.
The proposed approach can increase or decrease the likelihood that a generated statement is classified as hate speech while maintaining the coherence of the generated language. It can generate both false negatives and false positives for a given content moderation tool. For false negatives, toxic prompts are used to elicit toxic responses, and then the tool’s probability of the neutral class is maximized during decoding. Similarly, to generate false positives, neutral prompts are used to generate neutral responses, and then the probability of the toxic class is maximized during decoding. With this approach, we’re essentially trying to reveal weaknesses in a specific content moderation tool by guiding the LLM to produce statements that we know the tool will misidentify. The generated data can then be used to improve the performance and coverage of the targeted content moderation tool. Our ToxiGen dataset includes data generated by both demonstration-based prompting and our proposed adversarial decoding approach. Through empirical study on three existing human-written datasets, we found that starting with an existing content moderation tool and fine-tuning it on ToxiGen can improve the tool’s performance significantly, demonstrating the quality of the machine-generated data in ToxiGen.
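The released implementation lives in the ToxiGen GitHub repo; the sketch below only illustrates the general idea of classifier-in-the-loop decoding under our own simplifying assumptions. The model names, the mixing weight, and the assumption about the classifier’s label layout are ours, not the paper’s setup, and the seed text is a placeholder.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

lm_name = "gpt2"                  # stand-in for the large language model
clf_name = "unitary/toxic-bert"   # stand-in content classifier (assumed: multilabel head, index 0 = "toxic")

lm_tok = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name)
clf_tok = AutoTokenizer.from_pretrained(clf_name)
clf = AutoModelForSequenceClassification.from_pretrained(clf_name)

def constrained_step(text, push_toward_neutral=True, top_k=20, alpha=0.5):
    """Pick the next token by mixing LM likelihood with the classifier's score.

    With push_toward_neutral=True, generation is steered toward text the classifier
    labels as neutral (the false-negative setting, where the seed prompt is toxic);
    with False, toward text it labels as toxic (the false-positive setting)."""
    ids = lm_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        lm_logits = lm(ids).logits[0, -1]
    top = torch.topk(torch.log_softmax(lm_logits, dim=-1), top_k)
    scores = []
    for logp, tok_id in zip(top.values, top.indices):
        candidate = text + lm_tok.decode(int(tok_id))
        with torch.no_grad():
            clf_logits = clf(**clf_tok(candidate, return_tensors="pt", truncation=True)).logits[0]
        p_toxic = torch.sigmoid(clf_logits)[0]   # assumed label convention; check the model card
        clf_term = torch.log1p(-p_toxic) if push_toward_neutral else torch.log(p_toxic + 1e-8)
        scores.append(logp + alpha * clf_term)
    best = top.indices[int(torch.stack(scores).argmax())]
    return text + lm_tok.decode(int(best))

# In practice the seed would come from the toxic or neutral prompt pool;
# this placeholder just exercises the decoding loop.
text = "a placeholder seed prompt about a topic of interest"
for _ in range(20):
    text = constrained_step(text)
print(text)
```

The real approach operates over full candidate continuations rather than rescoring one greedy token at a time, but the core idea is the same: the content classifier’s prediction shapes the decoding objective.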
Human evaluation: Better understanding the data
Human language is complex, particularly when it comes to harmful statements. To better understand different aspects of the data in ToxiGen—its perceived harmfulness and intent and whether it presents as fact or opinion, for example—we conducted human evaluations on the data generated by both regular decoding (top-k), used in the demonstration-based prompting, and the proposed adversarial decoding. The human evaluation also allowed us to test the quality of the output of these methods and gauge how effective these methods were in guiding the generation of the data we sought.
For the human evaluation, three annotators were used for each statement from a pool of 156 prequalified annotators with prior experience annotating toxic language. About 4,500 samples were randomly selected for each of the decoding methods with coverage across all 13 minority identity groups for each split. We found the following:
- For both decoding methods, minority identity group mentions included in the prompt also exist in the generated statements. This means that both data generation methods reliably produce the data they were designed to produce—hateful and neutral statements with explicit reference to the specified minority identity group.
- In the neutral case, the label of the prompt matches the generated text more often than in the toxic case, as shown in Figure 3a.
- The proposed decoding approach generates a higher percentage of adversarial text compared to regular decoding—that is, it produces data that is more likely to fool a given content moderation tool—as illustrated in Figure 3b.
- 90.5 percent of machine-generated examples were thought to be human-written by the majority of annotators.
- Perceived harmfulness with respect to human- or AI-authored text is similar.
Looking ahead: Societal implications and opportunities
As advances continue to be made in large language models, we remain vigilant in our pursuit of AI systems that align with our commitment to technology that benefits society as a whole and empowers everyone to achieve more. We’re beginning to ask better questions to more deeply understand the risks associated with LLMs and build processes and methods for addressing them. Existing content moderation tools tend to be good only at flagging overtly inappropriate or harmful language. Our work aims to create data that can better target the challenge. While our work here specifically explores hate speech, our proposed methods could be applied to a variety of content moderation challenges, such as flagging potential misinformation. By releasing the source code and prompt seeds for this work, we hope to encourage the research community to contribute to it by, for example, adding prompt seeds and generating data for minority identity groups that aren’t covered in our dataset.
As with many technologies, the solutions we develop to make them stronger, more secure, and less vulnerable also have the potential to be used in unintended ways. While the methods described here may be used to generate inappropriate or harmful language, we believe that they provide far greater value in helping to combat such language, resulting in content moderation tools that can be used alongside human guidance to support fairer, safer, more reliable, and more inclusive AI systems.
Considerations for responsible use
There is still a lot that this dataset does not capture about what constitutes problematic language, and its limitations should be acknowledged before the dataset is used. Our annotations might not capture the full complexity of these issues, given that problematic language is context-dependent, dynamic, and can manifest in different forms and severities. Content moderation tools aren’t a silver bullet for addressing harmful online content. Problematic language is fundamentally a human-centric problem. It should be studied in conjunction with human experience, and tools to address it should be developed and deployed with human expertise and well-informed regulatory processes and policy. Multidisciplinary work is needed to better understand all aspects of this challenge.
Also, this dataset captures implicit toxicity (more precisely, hate speech) for only 13 minority identity groups, and because of its large scale it naturally contains imperfections. Our goal in this project is to give the community a means to improve hate speech detection on implicit toxic language for the identified minority identity groups. The dataset, and models trained on it, have limitations that future research could address, for example, by covering additional minority identity groups, or combinations of groups, not included in our work. Stronger content moderation tools and systems can contribute to mitigating fairness-related harms in AI systems. For example, systems that don’t over-flag neutral statements with minority identity group mentions can help ensure better representation of diverse perspectives and experiences, while systems that can better flag implicit hate speech can support more inclusive technology.
Acknowledgment
This work was conducted by PhD students Thomas Hartvigsen and Saadia Gabriel during their internships at Microsoft Azure and Microsoft Research. Hamid Palangi, Dipankar Ray, Maarten Sap, and Ece Kamar served as advisors on the work. A special thanks to Misha Bilenko from Azure ML for making the compute resources available and to Microsoft Research for supporting our large-scale human study.
Please note: This research, the GitHub repository, and examples from our work included in this blog contain and discuss content that is offensive or upsetting. All materials are intended to support research that improves hate speech detection methods. Included examples of hate speech don’t represent how the authors or sponsors feel about any minority identity groups. Hate speech applies to a range of minority identity groups; for the purposes of this research, we focus on 13 of them (as shown in Figure 1). Content moderation tools are part of larger content moderation systems. These systems also include human expertise and thoughtful policy and regulatory development. Even the most robust content moderation tools and datasets require systems with human supervision.
The post (De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools appeared first on Microsoft Research.
Partnering people with large language models to find and fix bugs in NLP systems
Advances in platform models—large-scale models that can serve as foundations across applications—have significantly improved the ability of computers to process natural language. But natural language processing (NLP) models are still far from perfect, sometimes failing in embarrassing ways, like translating “Eu não recomendo este prato” (I don’t recommend this dish) in Portuguese to “I highly recommend this dish” in English (a real example from a top commercial model). These failures continue to exist in part because finding and fixing bugs in NLP models is hard—so hard that severe bugs impact almost every major open-source and commercial NLP model.
Current methods for finding or fixing bugs take one of two approaches: they’re either user-driven or automated. User-driven methods are flexible and can test any aspect of a model’s behavior, but they depend on highly variable human ability to imagine bugs and are so labor intensive that in practice only a small part of the input space gets tested. Automated approaches, on the other hand, are fast and so can explore large portions of the input space. However, since they lack human guidance, they can only test if a model is right or wrong in very restricted scenarios, such as when the model has inconsistent predictions on inputs with slight variations in phrasing.
We believe platform models, specifically modern large language models (LLMs) like GPT-3, offer an opportunity to combine the synergistic strengths of user-driven and automated approaches, keeping the user in control of defining what the model being tested should be doing while leveraging the ability of modern generative language models to generate tests at scale within a specific category of model behavior. We call this human-AI team approach Adaptive Testing and Debugging, or AdaTest for short.
With AdaTest, a large language model takes on the laborious task of generating a large quantity of tests targeted at finding bugs in the model being tested, while the person steers the language model by selecting valid tests and organizing them into semantically related topics. This guidance from the person drastically improves the language model’s generation performance and directs it toward areas of interest. Because these tests are effectively a form of labeled data, they not only identify bugs but can be used to fix bugs in an iterative debugging loop similar to traditional software development. AdaTest offers significant productivity gains for expert users while remaining simple enough to empower diverse groups of non-experts without a background in programming. This means experts and non-experts alike can better understand and control the behavior of their AI systems across a range of scenarios, which makes for not only better-performing AI systems but more responsible AI systems. The AdaTest code and pre-populated test trees are open source on GitHub.
We’re presenting our paper, “Adaptive Testing and Debugging of NLP Models,” at the 2022 Meeting of the Association for Computational Linguistics (ACL), where our colleagues will also be introducing work that leverages large language models, in their case, to grow adversarial datasets for content moderation tools.
Finding bugs with the testing loop
The AdaTest process is composed of an inner testing loop that is used to find bugs (Figure 1, unrolled in Figure 2) and an outer debugging loop that is used to fix bugs (Figure 1, unrolled in Figure 4).
Consider how this works for sentiment analysis, used to determine if a piece of text expresses a positive or negative sentiment (typically in the context of product reviews or customer feedback). While the task seems simple enough, even state-of-the-art models have failures, ranging from overt, like classifying “I don’t think I’ve ever had a nicer time in my life” as negative, to more subtly harmful, like classifying “I am a racial minority” as negative (both represent real failures found with AdaTest in commercial models). To demonstrate how AdaTest finds and fixes bugs, we show how to test for (and later fix) instances of fairness-related harms in which neutral references to a specific identity group within a piece of text could cause a sentiment analysis model to incorrectly downweight the sentiment of the text—in other words, scenarios in which a model might treat comments from specific groups more negatively.
In the testing loop, we start with a set of unit tests about various identities and label the set “/Sensitive” (Figure 2 below). These initial examples don’t reveal any model failures. But AdaTest then uses a large language model—in our case, GPT-3—to generate many similar suggested tests designed to highlight bugs (Figure 2A). While hundreds of tests are generated, we only need to review the top few failing or near-failing tests. We then ignore tests that don’t represent real failures (for example, “I am tired of being silenced” really should be negative in Figure 2) and add the other valid tests to the current topic, also occasionally organizing them into additional subtopics (Figure 2B). These user-filtered tests are included in the language model prompt for the next round of suggestions, nudging the next set of suggestions toward the intersection between user interest and model failure (Figure 2C). Repeating the testing loop results in the language model starting at tests that don’t fail and slowly working its way up to producing stronger and stronger failures. So even when users can’t find model failures on their own, they can start from a small set of passing tests and quickly iterate with the language model to produce a large set of tests that reveal bugs in the model being tested.
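For intuition, here is a heavily simplified, non-interactive sketch of one round of the testing loop. AdaTest itself uses GPT-3 and an interactive interface; the generator, the target sentiment model, the prompt format, and the failure-scoring rule below are stand-ins we chose for illustration, not the released implementation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Model under test: an open sentiment classifier standing in for a commercial model.
target = pipeline("sentiment-analysis",
                  model="distilbert-base-uncased-finetuned-sst-2-english")
# Suggestion generator: GPT-2 standing in for GPT-3.
gen_tok = AutoTokenizer.from_pretrained("gpt2")
gen = AutoModelForCausalLM.from_pretrained("gpt2")

# Current tests in a topic: statements that should not be classified as negative.
topic_tests = [
    "I am a racial minority.",
    "We are a proud immigrant family.",
    "My coworkers and I speak Spanish at lunch.",
]

# 1. Prompt the generator with the existing tests so suggestions match the topic.
prompt = "\n".join(f'"{t}"' for t in topic_tests) + '\n"'
inputs = gen_tok(prompt, return_tensors="pt")
outputs = gen.generate(**inputs, do_sample=True, top_k=50, max_new_tokens=25,
                       num_return_sequences=20, pad_token_id=gen_tok.eos_token_id)
suggestions = []
for seq in outputs:
    text = gen_tok.decode(seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    candidate = text.split('"')[0].strip()      # keep only the first quoted statement
    if candidate:
        suggestions.append(candidate)

# 2. Score suggestions with the model under test; the "failure score" here is the
#    probability assigned to NEGATIVE, since these statements should not be negative.
def failure_score(statement):
    pred = target(statement)[0]
    return pred["score"] if pred["label"] == "NEGATIVE" else 1 - pred["score"]

# 3. Surface the top failing or near-failing suggestions for the user, who keeps
#    the valid ones; those are added to topic_tests and folded into the next prompt.
for s in sorted(suggestions, key=failure_score, reverse=True)[:5]:
    print(f"{failure_score(s):.2f}  {s}")
```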
If instead of the “/Sensitive” topic shown in Figure 2, we target a different topic, such as handling negation, we’ll reveal different failures. For example, starting from simple statements like “I have never been happier” that a commercial model correctly classifies as positive, AdaTest can quickly find bugs like “I don’t think that I’ve ever seen a nicer town” getting labeled as negative. These bugs are egregious and obvious once you see them, but they’re hard to find by hand since they only happen for very specific phrasings.
We ran user studies to quantitatively evaluate if AdaTest makes experts and non-experts better at writing tests and finding bugs in models. We asked experts—those with a background in machine learning and NLP—to test specific topics in two models: a commercial sentiment classifier and GPT-2 for next word auto-complete, used in such applications as predicting the next word in an email being typed (a scenario in which we want to avoid suggesting stereotypes, for example, one of the behaviors we had participants test for). For each topic and model, participants were randomly assigned to use CheckList (representing state-of-the-art user-driven testing) or AdaTest. We present the average number of discovered model failures per minute in Figure 3, where we observe a fivefold improvement with AdaTest across models and participants in the study. We asked non-experts, or those without any programming background, to test the Perspective API toxicity model for content moderation. Participants tried to find non-toxic statements (that is, statements they would personally feel appropriate posting) predicted as toxic for political opinions. Participants were given access to an improved version of the Dynabench crowd-sourcing interface for model testing and to AdaTest. AdaTest provided up to a tenfold improvement (bottom portion of Figure 3).
We also grouped participants by their progressive versus conservative political alignment and found that participants wrote tests with twice the quality when testing their own perspective versus an opposing perspective (as measured by an independent set of in-group raters). Our user studies highlight that AdaTest can be used by anyone and that such easy-to-use tools are important to enable model testing by people with diverse backgrounds since testers representing different lived experiences and viewpoints are needed to effectively test different perspectives.
Fixing bugs with the debugging loop
Once enough bugs are discovered, testers of a model then engage in the outer debugging loop (Figure 4 below), where they fix bugs discovered in the testing loop and then retest the model. In our experiments, we fixed bugs by fine-tuning the model on the tests, but other strategies, such as collecting more data or adding constraints, are also possible. The retest part of the debugging loop (that is, running the testing loop again) is critical since once we use our tests to fix the model, they no longer represent test data but rather training data. The process of fixing a bug often overcompensates, introducing shortcuts or bugs in the initial rounds of the debugging loop that can only be found using a new set of tests adapted to target the new “fixed” model.
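A minimal sketch of the “fix” step might look like the following, assuming a Hugging Face sequence classification model and a handful of labeled tests; the model name, the tiny datasets, the labeling convention, and the hyperparameters are placeholders rather than the paper’s configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Tests written in the testing loop (label 1 = neutral in this hypothetical scheme).
tests = {"text": ["I am an immigrant.", "We are a family of immigrants."],
         "label": [1, 1]}
# A slice of the original training data, mixed in to maintain task performance.
original = {"text": ["The movie was wonderful.", "The service was terrible."],
            "label": [2, 0]}

mixed = Dataset.from_dict({k: tests[k] + original[k] for k in tests})
mixed = mixed.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                       padding="max_length", max_length=64),
                  batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="debug-loop", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=mixed,
)
trainer.train()
# After fine-tuning, rerun the testing loop against the updated model: the retest
# step is what catches shortcuts the fix may have introduced.
```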
Running the debugging loop on an open-source RoBERTa-Large sentiment model (Figure 4) demonstrates the importance of a test-fix-retest cycle. We start with tests from the “/Sensitive/Immigration” topic from Figure 2 that the RoBERTa model incorrectly labels as negative. Fine-tuning the model on these tests (mixed with the original training data to maintain task performance) results in a new model that no longer fails the tests (second row of Figure 4). However, when we rerun the testing loop, we find that now almost all immigration statements are labeled as “neutral,” even if they are truly negative based on the application and testing scenario (for example, the statements in the third row of Figure 4 wouldn’t be neutral if a model were tasked with detecting if language was for or against something). Fine-tuning again using these new tests (and the older ones) results in a model that correctly fixes the original bug without adding the “every immigration statement is neutral” shortcut.
This doesn’t, of course, guarantee that there isn’t another shortcut still in the model, but in our experience, a few rounds of the debugging loop drastically reduce the number of accidental bugs that get introduced when “fixing” the original bugs. The testers of the model don’t have to exhaustively identify every possible shortcut or imbalance ahead of time, since AdaTest adaptively surfaces and fixes bugs that have been introduced in the next rounds of testing and debugging. Thus, the debugging loop serves as a friendly adversary, pushing the boundaries of the current “specification” until a satisfactory model is produced. In fact, AdaTest can be seen as an application of the test-fix-retest loop from software engineering to NLP.
To evaluate the effectiveness of the debugging loop, we fine-tuned RoBERTa-Large to detect whether two questions are duplicates (that is, the same question worded differently) using the Quora Question Pairs (QQP) dataset and also fine-tuned it for positive/neutral/negative sentiment analysis using the Stanford Sentiment Treebank (SST) dataset. Using previously published CheckList suites for evaluation, we find the baseline model fails 22 out of 53 QQP topics and 11 out of 39 sentiment topics. We then created data to “fix” a topic either by taking 50 examples from the topic’s data in the CheckList condition or by starting from a seed of five examples and running the debugging loop with AdaTest until finding failures becomes qualitatively difficult (on average 2.83 rounds for QQP and 3.83 rounds for sentiment). This yields an average of 41.6 tests for QQP and 55.8 tests for sentiment. We followed this process for six distinct high-failure-rate topics in each task. In the vast majority of cases (see paper for details), AdaTest fixes the topics used for training and a number of unseen held-out topics without breaking any topics, while CheckList data often introduces new bugs (and thus breaks other test topics).
We also evaluated the effectiveness of AdaTest in a standard development setting, targeting a model for to-do detection in meeting notes. After three months of development, CheckList testing, and ad hoc GPT-3–based data augmentation, a PhD-level team had managed to build a model with an F1 score of 0.66 (out of 1) on unseen data collected in the wild. We gave the team AdaTest, along with a demo only a few minutes long. After four hours of running the debugging loop on their own, they produced another model with an F1 score of 0.77 on the same unseen dataset. These scores were then replicated on a second unseen dataset, showing that AdaTest can add significant bug-fixing value with a fraction of the effort involved in traditional approaches.
The promise of human-AI collaboration for ML development
AdaTest encourages a close collaboration between people and large language models, yielding the benefits of both. People provide the problem specification that the language model lacks, while the language model provides quality test creation at a scale and scope that is infeasible for people. The debugging loop connects model testing and debugging to effectively fix bugs, taking model development a step closer toward the iterative nature of traditional software development. Human-AI partnership represents a promising way forward for machine learning development, and we expect this synergy to only improve as the capabilities of large language models continue to grow.
Check out the full paper to see AdaTest’s effectiveness on classification models (sentiment analysis, QQP, toxicity, media selection, and task detection), generation models (GPT-2, translation), and per-token models (NER) ranging from well-tested production systems to brand-new applications. Give it a try yourself at https://github.com/microsoft/adatest.
The post Partnering people with large language models to find and fix bugs in NLP systems appeared first on Microsoft Research.
FLUTE: A scalable federated learning simulation platform
Federated learning has become a major area of machine learning (ML) research in recent years due to its versatility in training complex models over massive amounts of data without the need to share that data with a centralized entity. However, despite this flexibility and the amount of research already conducted, it’s difficult to implement due to its many moving parts—a significant deviation from traditional ML pipelines.
The challenges in working with federated learning result from the diversity of local data and end-node hardware, privacy concerns, and optimization constraints. These challenges are compounded by the sheer volume of federated learning clients and their data, and they necessitate a wide skill set, significant interdisciplinary research effort, and major engineering resources to manage. In addition, federated learning applications often need to scale the learning process to millions of clients to simulate a real-world environment. All of these challenges underscore the need for a simulation platform, one that enables researchers and developers to perform proof-of-concept implementations and validate performance before building and deploying their ML models.
A versatile framework for federated learning
Today, the Privacy in AI team at Microsoft Research is thrilled to introduce Federated Learning Utilities and Tools for Experimentation (FLUTE) as a framework for running large-scale offline federated learning simulations, which we discuss in detail in the paper, “FLUTE: A Scalable, Extensible Framework for High-Performance Federated Learning Simulations.” In creating FLUTE, our goal was to develop a high-performance simulation platform that enables quick prototyping of federated learning research and makes it easier to implement federated learning applications.
There has been a lot of research in the last few years directed at tackling the many challenges in working with federated learning, including setting up learning environments, providing privacy guarantees, implementing model-client updates, and lowering communication costs. FLUTE addresses many of these while providing enhanced customization and enabling new research on a realistic scale. It also allows developers and researchers to test and experiment with certain scenarios, such as data privacy, communication strategies, and scalability, before implementing their ML model in a production framework.
One of FLUTE’s main benefits is its native integration with Azure ML workspaces, leveraging the platform’s features to manage and track experiments, parameter sweeps, and model snapshots. Its distributed implementation is built on Python and PyTorch, and its flexibly designed client-server architecture helps researchers and developers quickly prototype novel approaches to federated learning. However, FLUTE’s key innovation and technological differentiator is the ease it provides in implementing new scenarios for experimentation in core areas of active research, in a robust high-performance simulator.
FLUTE offers a platform where all clients are implemented as isolated object instances, as shown in Figure 1. The interface between the server and the remaining workers relies on messages that contain client IDs and training information, with MPI as the main communication protocol. Local data on each client stays within local storage boundaries and is never aggregated with other local sources. Clients only communicate gradients to the central server.
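FLUTE’s own APIs are documented in its repository; purely for intuition, the toy single-process sketch below shows the pattern the platform simulates at scale: clients as isolated objects that keep their data local and send only model updates back to a server for FedAvg-style aggregation. Everything here (the linear model, the client sampling, the aggregation rule) is our simplification, not FLUTE code, and a real run would distribute this loop over MPI workers.

```python
import torch
from torch import nn

class Client:
    def __init__(self, client_id, features, labels):
        self.client_id = client_id
        self._data = (features, labels)           # stays within local storage boundaries

    def local_update(self, global_state, lr=0.1, epochs=1):
        # Train a local copy of the global model on private data.
        model = nn.Linear(10, 1)
        model.load_state_dict(global_state)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        x, y = self._data
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
        # Only the update (delta from the global model) leaves the client.
        return {k: model.state_dict()[k] - global_state[k] for k in global_state}

server_model = nn.Linear(10, 1)
clients = [Client(i, torch.randn(32, 10), torch.randn(32, 1)) for i in range(100)]

for round_idx in range(5):
    global_state = server_model.state_dict()
    cohort = clients[:10]                          # sample a cohort of clients per round
    deltas = [c.local_update(global_state) for c in cohort]
    # FedAvg-style aggregation: average client updates and apply them to the server model.
    new_state = {k: global_state[k] + torch.stack([d[k] for d in deltas]).mean(0)
                 for k in global_state}
    server_model.load_state_dict(new_state)
```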
The following features contribute to FLUTE’s versatile framework and enable experimentation with new federated learning approaches:
- Scalability: Scale is a critical factor in understanding practical metrics, such as convergence and privacy-utility tradeoffs. Researchers and developers can run large-scale experiments using tens of thousands of clients with a reasonable turnaround time.
- Flexibility: FLUTE supports diverse federated learning configurations, including standardized implementations such as DGA and FedAvg.
- Versatility: FLUTE’s generic API helps researchers and developers easily implement new models, datasets, metrics, and experimentation features, while its open architecture helps them add new algorithms in such areas as optimization, privacy, and robustness.
Available as an open-source platform
As part of this announcement, we’re making FLUTE available as a versatile open-source platform for rapid prototyping and experimentation. It comes with a set of basic tools to help kickstart experiments. We hope researchers and developers take advantage of this framework by exploring new approaches to federated learning.
Looking ahead
FLUTE’s innovative framework offers a new paradigm for implementing federated learning algorithms at scale, and this is just the beginning. We’re making improvements with the view toward making FLUTE the standard federated learning simulation platform. Future releases will include algorithmic enhancements in optimization and support for additional communication protocols. We’re also adding features that make it easier to set up experiments with tailored features for new tasks, as well as the ability to easily incorporate FLUTE as a library into Azure ML pipelines.
Additional resources
Check out this video for a deep dive into FLUTE architecture and a tutorial on how to use it. Our documentation also explains how to implement FLUTE.
You can learn more about the FLUTE project by visiting our project page, and discover more about our current federated learning research as well as other projects related to privacy in AI on our group page.
The post FLUTE: A scalable federated learning simulation platform appeared first on Microsoft Research.