(De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools


It’s a well-known challenge that large language models (LLMs)—growing in popularity thanks to their adaptability across a variety of applications—carry risks. Because they’re trained on large amounts of data from across the internet, they’re capable of generating inappropriate and harmful language based on similar language encountered during training.  

Content moderation tools can be deployed to flag or filter such language in some contexts, but unfortunately, datasets available to train these tools often fail to capture the complexities of potentially inappropriate and toxic language, especially hate speech. Specifically, the toxic examples in many existing hate speech datasets tend either to be too hard or too easy for tools to learn from—the too-easy examples contain slurs, profanity, and explicit mentions of minority identity groups; the too-hard examples involve obscure references or inside jokes within the hate speech community. Additionally, the neutral examples in these datasets tend not to contain group mentions. As a result, tools may flag any language that references a minority identity group as hate speech, even when that language is neutral. Alternatively, tools trained on this data fail to detect harmful language when it lacks known or explicit slurs, profanity, or explicit mentions of minority identity groups.  

Generating the kind of data needed to strengthen content moderation tools against the above failures and harms is challenging for numerous reasons. In particular, implicitly toxic text that existing machine learning architectures can still learn from, as well as neutral text that mentions minority identity groups, is difficult to collect at scale. Additionally, asking people to write such examples—particularly the toxic ones—can take a mental toll on those assigned the task.

Inspired by the ability of large language models to mimic the tone, style, and vocabulary of prompts they receive—whether toxic or neutral—we set out to create a dataset for training content moderation tools that can be used to better flag implicitly harmful language. In our paper “ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection,” we collected initial examples of neutral statements with group mentions and examples of implicit hate speech across 13 minority identity groups and used a large-scale language model to scale up and guide the generation process. The outcome is the largest publicly available implicit hate speech dataset to date: 274,000 examples comprising both neutral and toxic statements. We conducted a human study on the generated dataset to better understand different aspects of harm beyond the binary labels of toxic and neutral assigned by content moderation tools. To stress test existing content moderation tools across the minority identity groups studied in this work, we also propose an adversarial classifier-in-the-loop decoding approach. The dataset, two content moderation tools trained on the dataset, prompts used as seed data, and the source code for our proposed adversarial decoding approach are available in the ToxiGen GitHub repo (please see footnote).

We’re presenting this work at the 2022 Meeting of the Association for Computational Linguistics (ACL), where our colleagues will also be presenting work that leverages the generative power of large language models and human expertise.

A horizontal chart comparing the proportion of minority identity group mentions in the prompts with the minority identity group mentions in the generated text for the 13 minority identity groups in this work: Black, Mexican, people with physical disabilities, LGBTQ+, people with cognitive disabilities, Chinese, Muslim, Jewish, Middle Eastern, Women, Asian, Native American, and Latino.
Figure 1: The ToxiGen dataset—an implicit hate speech dataset created by using a large-scale language model with both regular and adversarial decoding to scale up and guide the generation process—contains 274,000 examples comprising both neutral and toxic statements across 13 minority identity groups. As illustrated above, mentions of a specific minority identity group in the prompts and mentions of the same minority identity group in the corresponding generated text are proportional.

Demonstration-based prompting for building better datasets

Large Transformer-based language models don’t explicitly encode semantic information; nevertheless, these models can distinguish the statistical interactions of words in different contexts. Through experimentation with the generation of language via one of these large language models, we learned how to utilize careful prompt engineering strategies to create the ToxiGen implicit hate speech dataset. 

Our first experiments were to generate examples of hate speech and neutral speech related to the 13 minority identity groups in our work. We started by collecting implicit hate speech prompts from existing datasets and neutral prompts drawn from news articles, opinion pieces, podcast transcripts, and other similar public sources and feeding them into the LLM to create a broader, deeper set of prompts. What we found was that the LLM could generate examples that were qualitatively different depending on the source material. When prompted with bits from different writers on the above topics, in each case, the LLM produced linguistically diverse outputs that were nonetheless similar in style and tone. 

Furthermore, we found that through careful cultivation of prompt sets, we could generate a wide variety of text reflecting diverse opinions and thoughts on these topics that weren’t found in our original source materials. We could generate neutral statements about sensitive topics that mentioned the relevant minority identity groups, and we could consistently generate hate speech statements about these minority identity groups that didn’t contain slurs or profanity. And the more we experimented with the source material, the more interesting our dataset became. This is particularly exciting because we hope that other individuals and groups can use these tools to extend our dataset; different disciplinary experts could utilize the same strategies and collect even better prompt sets, resulting in even more subtle and rich examples of neutral speech and hate speech. 
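To make this concrete, below is a minimal sketch of demonstration-based prompting using the Hugging Face transformers library. GPT-2 is only a stand-in for the large-scale language model used in our work, and the seed statements and helper names are illustrative placeholders rather than entries from the actual ToxiGen prompt sets.

# A minimal sketch of demonstration-based prompting, assuming a small set of
# seed statements per identity group. GPT-2 stands in for the large-scale LLM;
# the seed statements below are invented placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def build_prompt(demonstrations):
    # Concatenate demonstrations so the model continues in the same style and tone.
    return "\n- " + "\n- ".join(demonstrations) + "\n-"

neutral_seeds = [
    "many immigrants run small businesses that support their local economies",
    "people with disabilities contribute to every field of work",
]

prompt = build_prompt(neutral_seeds)
outputs = generator(prompt, max_new_tokens=40, do_sample=True, top_k=50,
                    num_return_sequences=5)
for out in outputs:
    # Keep only the newly generated continuation, one candidate statement per line.
    continuation = out["generated_text"][len(prompt):]
    print(continuation.split("\n")[0].strip())

Candidates like these can then be reviewed and, if useful, added back to the prompt pool, mirroring the careful cultivation of prompt sets described above.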

We also found that the model often generated examples of speech that we ourselves had trouble labeling. In essence, we were using the LLM as a probe to explore the delicate boundaries between acceptable and offensive speech. As a result, our own understanding of the problem definition itself grew through our interactions with the model.  

The first 260,000 examples from our dataset were drawn from this experimental approach. 

Figure 2: Examples of statements generated by (De)ToxiGen that fool Google’s Perspective API, HateBERT, OpenAI content filter, AI2 Delphi, and RoBERTa. Five statements are neutral but mention minority identity groups, so the content moderation tools find them hateful. Five are toxic sentences, but the tools find them neutral. The proposed decoding approach, (De)ToxiGen (referred to as ALICE in the paper), can challenge these content moderation tools, allowing developers to increase their coverage by creating adversarial examples. 

(De)ToxiGen: An adversarial decoding approach for strengthening content moderation tools

While demonstration-based prompting can facilitate large-scale data generation, it doesn’t generate data targeted specifically to challenge a given content moderation tool, or content classifier. This is important because every content moderation tool has unique vulnerabilities depending on the type of data it has been trained on. To address this, we developed (De)ToxiGen (referred to as ALICE in the paper), an algorithmic mechanism that creates an adversarial set-up between an LLM and a given content moderation tool in which the content classifier is in the loop during decoding.  

The proposed approach can increase or decrease the likelihood that a generated statement is classified as hate speech while maintaining the coherence of the generated language. It can generate both false negatives and false positives for a given content moderation tool. For false negatives, toxic prompts are used to elicit toxic responses, and then the tool’s probability of the neutral class is maximized during decoding. Similarly, to generate false positives, neutral prompts are used to generate neutral responses, and then the probability of the toxic class is maximized during decoding. With this approach, we’re essentially trying to reveal weaknesses in a specific content moderation tool by guiding the LLM to produce statements that we know the tool will misidentify. The generated data can then be used to improve the performance and coverage of the targeted content moderation tool. Our ToxiGen dataset includes data generated by both demonstration-based prompting and our proposed adversarial decoding approach. Through empirical study on three existing human-written datasets, we found that starting with an existing content moderation tool and fine-tuning it on ToxiGen can improve the tool’s performance significantly, demonstrating the quality of the machine-generated data in ToxiGen.  
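To illustrate the idea, here is a simplified sketch of classifier-in-the-loop decoding in the spirit of (De)ToxiGen/ALICE; it is not our released implementation. At each step, the language model's top-k candidate tokens are re-scored by a content classifier, and decoding greedily follows the token that pushes the classifier toward the class opposite to the prompt's label. The model names, the placeholder classifier path, the label indices, and the weighting parameter lam are assumptions made for the sake of the example.

# Simplified classifier-in-the-loop decoding sketch (greedy, no beam search).
# CLASSIFIER_NAME is a placeholder for whichever content moderation model is
# being stress-tested; GPT-2 stands in for the large-scale LLM.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

LM_NAME = "gpt2"
CLASSIFIER_NAME = "path/to/toxicity-classifier"  # hypothetical checkpoint

lm_tok = AutoTokenizer.from_pretrained(LM_NAME)
lm = AutoModelForCausalLM.from_pretrained(LM_NAME)
clf_tok = AutoTokenizer.from_pretrained(CLASSIFIER_NAME)
clf = AutoModelForSequenceClassification.from_pretrained(CLASSIFIER_NAME)

def adversarial_decode(prompt, target_class=0, steps=30, top_k=10, lam=2.0):
    """target_class=0 (neutral) with a toxic prompt elicits false negatives;
    the toxic class index with a neutral prompt elicits false positives.
    Label indices depend on the classifier being tested."""
    ids = lm_tok(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        with torch.no_grad():
            lm_logits = lm(ids).logits[0, -1]
        lm_logprobs = torch.log_softmax(lm_logits, dim=-1)
        best_id, best_score = None, float("-inf")
        for tok_id in torch.topk(lm_logits, top_k).indices:
            new_ids = torch.cat([ids, tok_id.view(1, 1)], dim=-1)
            text = lm_tok.decode(new_ids[0])
            with torch.no_grad():
                clf_logits = clf(**clf_tok(text, return_tensors="pt")).logits
            clf_logprob = torch.log_softmax(clf_logits, dim=-1)[0, target_class]
            # Balance fluency (LM log-probability) against fooling the classifier.
            score = lm_logprobs[tok_id] + lam * clf_logprob
            if score > best_score:
                best_id, best_score = tok_id, score
        ids = torch.cat([ids, best_id.view(1, 1)], dim=-1)
    return lm_tok.decode(ids[0])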

Human evaluation: Better understanding the data

Human language is complex, particularly when it comes to harmful statements. To better understand different aspects of the data in ToxiGen—its perceived harmfulness and intent and whether it presents as fact or opinion, for example—we conducted human evaluations on the data generated by both regular decoding (top-k), used in the demonstration-based prompting, and the proposed adversarial decoding. The human evaluation also allowed us to test the quality of the output of these methods and gauge how effective these methods were in guiding the generation of the data we sought. 

For the human evaluation, three annotators were used for each statement from a pool of 156 prequalified annotators with prior experience annotating toxic language. About 4,500 samples were randomly selected for each of the decoding methods with coverage across all 13 minority identity groups for each split. We found the following: 

  1. For both decoding methods, minority identity group mentions included in the prompt also exist in the generated statements. This means that both data generation methods reliably produce the data they were designed to produce—hateful and neutral statements with explicit reference to the specified minority identity group.
  2. In the neutral case, the label of the prompt matches the generated text more often than in the toxic case, as shown in Figure 3a. 
  3. The proposed decoding approach generates a higher percentage of adversarial text compared to regular decoding—that is, it produces data that is more likely to fool a given content moderation tool—as illustrated in Figure 3b. 
Two bar charts side by side. The one on the left, titled “Prompt-Response Matching,” shows that top-k decoding produces non-toxic responses 95.2 percent of the time when given a non-toxic prompt compared with 92.1 percent for (De)ToxiGen and that top-k decoding produces toxic responses 67.7 percent of the time when given a toxic prompt compared with 40.3 percent for (De)ToxiGen. The bar chart on the right, titled “Adversarial Power,” shows that statements generated by (De)ToxiGen fool HateBERT 26.4 percent of the time compared with 16.8 percent for statements generated via top-k decoding.
Figure 3a (left) and 3b (right): Human evaluations on the data generated by regular decoding (top-k) and the proposed adversarial decoding showed that the toxicity labels for the prompt and the generated response match more often for non-toxic prompts compared to toxic ones (left). It was also observed that (De)ToxiGen generates a higher percentage of adversarial text compared to regular decoding (right). 
  4. 90.5 percent of machine-generated examples were thought to be human-written by the majority of annotators.
  5. Perceived harmfulness is similar for human-authored and AI-authored text. 

Looking ahead: Societal implications and opportunities

As advances continue to be made in large language models, we remain vigilant in our pursuit of AI systems that align with our commitment to technology that benefits society as a whole and empowers everyone to achieve more. We’re beginning to ask better questions to more deeply understand the risks associated with LLMs and build processes and methods for addressing them. Existing content moderation tools tend to be good only at flagging overtly inappropriate or harmful language. Our work aims to create data that can better target the challenge. While our work here specifically explores hate speech, our proposed methods could be applied to a variety of content moderation challenges, such as flagging potential misinformation. By releasing the source code and prompt seeds for this work, we hope to encourage the research community to contribute to it by, for example, adding prompt seeds and generating data for minority identity groups that aren’t covered in our dataset. 

As with many technologies, the solutions we develop to make them stronger, more secure, and less vulnerable also have the potential to be used in unintended ways. While the methods described here may be used to generate inappropriate or harmful language, we believe that they provide far greater value in helping to combat such language, resulting in content moderation tools that can be used alongside human guidance to support fairer, safer, more reliable, and more inclusive AI systems.  

Considerations for responsible use

There is still a lot that this dataset does not capture about what constitutes problematic language, and its limitations should be acknowledged before it is used. Our annotations might not capture the full complexity of these issues, given that problematic language is context-dependent, dynamic, and can manifest in different forms and severities. Content moderation tools aren’t a silver bullet for addressing harmful online content. Problematic language is fundamentally a human-centric problem. It should be studied in conjunction with human experience, and tools to address this problem should be developed and deployed with human expertise and well-informed regulatory processes and policy. Multidisciplinary work is needed to better understand the aspects of this challenge.  

Also, this dataset captures implicit toxicity (more precisely, hate speech) for only 13 minority identity groups, and because of its large scale it naturally has imperfections. Our goal in this project is to provide the community with the means to improve hate speech detection on implicit toxic language for the identified minority identity groups. The dataset and models trained on it have limitations that could be addressed in future research, for example, covering more minority identity groups, or combinations of them, that aren’t included in our work. Stronger content moderation tools and systems can contribute to mitigating fairness-related harms in AI systems. For example, systems that don’t over-flag neutral statements with minority identity group mentions can help ensure better representation of diverse perspectives and experiences, while systems that can better flag implicit hate speech can support more inclusive technology.   

Acknowledgment 

This work was conducted by PhD students Thomas Hartvigsen and Saadia Gabriel during their internships at Microsoft Azure and Microsoft Research. Hamid Palangi, Dipankar Ray, Maarten Sap, and Ece Kamar served as advisors on the work. A special thanks to Misha Bilenko from Azure ML for making the compute resources available and to Microsoft Research for supporting our large-scale human study. 

Platform models—large-scale models trained on vast amounts of data—are making it easier and faster to develop AI systems. (De)ToxiGen and other tools and resources like it are being developed by researchers at Microsoft to help developers get the most out of these platform models while also understanding, measuring, and mitigating the risks they pose.

Please note: This research, the GitHub repository, and examples from our work included in this blog contain and discuss content that is offensive or upsetting. All materials are intended to support research that improves hate speech detection methods. Included examples of hate speech don’t represent how the authors or sponsors feel about any minority identity groups. Hate speech applies to a range of minority identity groups; for the purposes of this research, we focus on 13 of them (as shown in Figure 1). Content moderation tools are part of larger content moderation systems. These systems also include human expertise and thoughtful policy and regulatory development. Even the most robust content moderation tools and datasets require systems with human supervision. 

The post (De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools appeared first on Microsoft Research.

Read More

Partnering people with large language models to find and fix bugs in NLP systems

Advances in platform models—large-scale models that can serve as foundations across applications—have significantly improved the ability of computers to process natural language. But natural language processing (NLP) models are still far from perfect, sometimes failing in embarrassing ways, like translating “Eu não recomendo este prato” (I don’t recommend this dish) in Portuguese to “I highly recommend this dish” in English (a real example from a top commercial model). These failures continue to exist in part because finding and fixing bugs in NLP models is hard—so hard that severe bugs impact almost every major open-source and commercial NLP model. 

Current methods for finding or fixing bugs take one of two approaches: they’re either user-driven or automated. User-driven methods are flexible and can test any aspect of a model’s behavior, but they depend on highly variable human ability to imagine bugs and are so labor intensive that in practice only a small part of the input space gets tested. Automated approaches, on the other hand, are fast and so can explore large portions of the input space. However, since they lack human guidance, they can only test if a model is right or wrong in very restricted scenarios, such as when the model has inconsistent predictions on inputs with slight variations in phrasing. 

We believe platform models, specifically modern large language models (LLMs) like GPT-3, offer an opportunity to combine the synergistic strengths of user-driven and automated approaches, keeping the user in control of defining what the model being tested should be doing while leveraging the abilities of modern generative language models to generate tests at scale within a specific category of model behavior. We call this human-AI team approach Adaptive Testing and Debugging, or AdaTest for short. 

With AdaTest, a large language model is tasked with the slow burden of generating a large quantity of tests targeted at finding bugs in the model being tested, while the person steers the language model by selecting valid tests and organizing them into semantically related topics. This guidance from the person drastically improves the language model’s generation performance and directs it toward areas of interest. Because these tests are effectively a form of labeled data, they not only identify bugs but can be used to fix bugs in an iterative debugging loop similar to traditional software development. AdaTest offers significant productivity gains for expert users while remaining simple enough to empower diverse groups of non-experts without a background in programming. This means experts and non-experts alike can better understand and control the behavior of their AI systems across a range of scenarios, which makes for not only better-performing AI systems but more responsible AI systems. The AdaTest code and pre-populated test trees are open source on GitHub.

We’re presenting our paper, “Adaptive Testing and Debugging of NLP Models,” at the 2022 Meeting of the Association for Computational Linguistics (ACL), where our colleagues will also be introducing work that leverages large language models, in their case, to grow adversarial datasets for content moderation tools.

A diagram in which the testing loop is represented by a series of icons showing the language model suggesting tests, the user filtering and organizing them in a test tree, and the language model using that user feedback to suggest more tests, beginning the process again. The graphic representing the testing loop is situated within the debugging loop. Red arrows from the testing loop to a black square labeled “target model” and back to the testing loop indicate identified test failures being used to fix a target model, which is then retested in an iterative process.
Figure 1: AdaTest consists of two loops: a testing loop that generates and organizes tests optimized for the model being tested (the target model) and a debugging loop that iteratively refines the model based on test failures. 

Finding bugs with the testing loop

The AdaTest process is composed of an inner testing loop that is used to find bugs (Figure 1, unrolled in Figure 2) and an outer debugging loop that is used to fix bugs (Figure 1, unrolled in Figure 4). 

Consider how this works for sentiment analysis, used to determine if a piece of text expresses a positive or negative sentiment (typically in the context of product reviews or customer feedback). While the task seems simple enough, even state-of-the-art models have failures, ranging from overt, like classifying “I don’t think I’ve ever had a nicer time in my life” as negative, to more subtly harmful, like classifying “I am a racial minority” as negative (both represent real failures found with AdaTest in commercial models). To demonstrate how AdaTest finds and fixes bugs, we show how to test for (and later fix) instances of fairness-related harms in which neutral references to a specific identity group within a piece of text could cause a sentiment analysis model to incorrectly downweight the sentiment of the text—in other words, scenarios in which a model might treat comments from specific groups more negatively. 

In the testing loop, we start with a set of unit tests about various identities and label the set “/Sensitive” (Figure 2 below). These initial examples don’t reveal any model failures. But AdaTest then uses a large language model—in our case, GPT-3—to generate many similar suggested tests designed to highlight bugs (Figure 2A). While hundreds of tests are generated, we only need to review the top few failing or near-failing tests. We then ignore tests that don’t represent real failures (for example, “I am tired of being silenced” really should be negative in Figure 2) and add the other valid tests to the current topic, also occasionally organizing them into additional subtopics (Figure 2B). These user-filtered tests are included in the language model prompt for the next round of suggestions, nudging the next set of suggestions toward the intersection between user interest and model failure (Figure 2C). Repeating the testing loop results in the language model starting at tests that don’t fail and slowly working its way up to producing stronger and stronger failures. So even when users can’t find model failures on their own, they can start from a small set of passing tests and quickly iterate with the language model to produce a large set of tests that reveal bugs in the model being tested. 

The testing loop represented as a series of rectangles, each containing test suggestions. Starting with the top rectangle and moving down, the user provides three neutral identity statements that are not predicted as negative. The language model, represented by a robot icon, suggests two statements predicted as negative in the next rectangle. In a third rectangle, real failures are accepted and organized into subtopics by the user, represented by a person icon. From those selections, the model suggests two more statements in the next rectangle. In the last rectangle, one of the subtopics is expanded based on the model’s previous suggestions.
Figure 2: The testing loop cycles between the large language model (LLM) generating test suggestions, the model scoring the suggestions, and the user accepting (✔) and organizing them, beginning with initial user-provided examples. In this three-way sentiment analysis example, the model “f” can either pass (green) or fail (red) a test. Passing one of the tests above means the model did not output “negative” while failing a test above means the model did output “negative” and hence failed the test assertion (≠). As the user filters and organizes (B, D), the LLM iteratively climbs toward suggesting valid tests that reveal more pronounced failures (A, C). In this example, we’re testing a sentiment analysis model to ensure that neutral identity-related statements don’t cause the model to flag comments as negative. 
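A rough sketch of one round of this testing loop appears below. It assumes suggest_tests wraps the LLM (prompted with the current topic's accepted tests, as described above) and that sentiment_model is the model under test, returning the probability of the "negative" label; both are hypothetical stand-ins rather than the open-source AdaTest implementation.

# One round of a testing loop in the AdaTest style (illustrative sketch).
def testing_loop_round(accepted_tests, suggest_tests, sentiment_model, n=100):
    # The LLM proposes many candidate tests in the style of the accepted ones.
    suggestions = suggest_tests(accepted_tests, n=n)
    # Score each suggestion with the model under test; for the "/Sensitive"
    # topic, a high P(negative) on a neutral identity statement is a failure.
    scored = [(text, sentiment_model(text)) for text in suggestions]
    # Surface the top failing or near-failing candidates for human review; the
    # user keeps only valid tests, organizes them into (sub)topics, and the
    # accepted tests seed the prompt for the next round.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:10]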

If instead of the “/Sensitive” topic shown in Figure 2, we target a different topic, such as handling negation, we’ll reveal different failures. For example, starting from simple statements like “I have never been happier” that a commercial model correctly classifies as positive, AdaTest can quickly find bugs like “I don’t think that I’ve ever seen a nicer town” getting labeled as negative. These bugs are egregious and obvious once you see them, but they’re hard to find by hand since they only happen for very specific phrasings. 

We ran user studies to quantitatively evaluate if AdaTest makes experts and non-experts better at writing tests and finding bugs in models. We asked experts—those with a background in machine learning and NLP—to test specific topics in two models: a commercial sentiment classifier and GPT-2 for next word auto-complete, used in such applications as predicting the next word in an email being typed (a scenario in which we want to avoid suggesting stereotypes, for example, one of the behaviors we had participants test for). For each topic and model, participants were randomly assigned to use CheckList (representing state-of-the-art user-driven testing) or AdaTest. We present the average number of discovered model failures per minute in Figure 3, where we observe a fivefold improvement with AdaTest across models and participants in the study. We asked non-experts, or those without any programming background, to test the Perspective API toxicity model for content moderation. Participants tried to find non-toxic statements (that is, statements they would personally feel appropriate posting) predicted as toxic for political opinions. Participants were given access to an improved version of the Dynabench crowd-sourcing interface for model testing and to AdaTest. AdaTest provided up to a tenfold improvement (bottom portion of Figure 3). 

A horizontal bar chart with failures found per minute on the x-axis and model and topic on the y-axis broken down by experience of the participant doing the testing. NLP experts testing the sentiment model and auto-complete with AdaTest found 2 clear positive failures per minute and 1 negated positive per minute and 0.6 Muslim stereotypes and 1.1 African American stereotypes, respectively. NLP experts testing the sentiment model and auto-complete with CheckList found 0.3 clear positive failures per minute and 0.2 negated positives per minute and 0.1 Muslim stereotypes and 0.2 African American stereotypes, respectively. Non-experts testing the toxicity model for non-toxic political viewpoints classified as toxic found 1.5 failures per minute with AdaTest compared with 0.15 with Dynabench.
Figure 3: Per-topic model failures per minute. Experts found approximately five times more failures with AdaTest on all topics, and non-experts benefited by up to 10 times. Error bars represent the 10th and 90th percentiles over bootstrap re-samples of participants. 

We also grouped participants by their progressive versus conservative political alignment and found that participants wrote tests with twice the quality when testing their own perspective versus an opposing perspective (as measured by an independent set of in-group raters). Our user studies highlight that AdaTest can be used by anyone and that such easy-to-use tools are important to enable model testing by people with diverse backgrounds since testers representing different lived experiences and viewpoints are needed to effectively test different perspectives. 

Fixing bugs with the debugging loop 

Once enough bugs are discovered, testers of a model then engage in the outer debugging loop (Figure 4 below), where they fix bugs discovered in the testing loop and then retest the model. In our experiments, we fixed bugs by fine-tuning the model on the tests, but other strategies, such as collecting more data or adding constraints, are also possible. The retest part of the debugging loop (that is, running the testing loop again) is critical since once we use our tests to fix the model, they no longer represent test data but rather training data. The process of fixing a bug often overcompensates, introducing shortcuts or bugs in the initial rounds of the debugging loop that can only be found using a new set of tests adapted to target the new “fixed” model. 
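The sketch below outlines this outer loop under the assumption of a Hugging Face datasets-style workflow; run_testing_loop and fine_tune are placeholders for the inner human-plus-LLM testing loop and whatever training routine is used, and mixing the tests with the original training data follows the description in the next paragraph. It is an illustration, not our exact training recipe.

# Outer debugging loop: fix bugs found by the testing loop, then retest.
from datasets import Dataset, concatenate_datasets

def debugging_loop(model, train_data, run_testing_loop, fine_tune, rounds=3):
    for _ in range(rounds):
        # Inner loop (human + LLM) returns labeled tests the model currently fails.
        failing_tests = run_testing_loop(model)
        if not failing_tests:
            break  # no new bugs surfaced; stop early
        test_ds = Dataset.from_list(
            [{"text": text, "label": label} for text, label in failing_tests]
        )
        # Mix new tests with the original training data to maintain task performance.
        mixed = concatenate_datasets([train_data, test_ds]).shuffle(seed=0)
        model = fine_tune(model, mixed)  # the "fixed" model is retested next round
    return model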

Running the debugging loop on an open-source RoBERTa-Large sentiment model (Figure 4) demonstrates the importance of a test-fix-retest cycle. We start with tests from the “/Sensitive/Immigration” topic from Figure 2 that the RoBERTa model incorrectly labels as negative. Fine-tuning the model on these tests (mixed with the original training data to maintain task performance) results in a new model that no longer fails the tests (second row of Figure 4). However, when we rerun the testing loop, we find that now almost all immigration statements are labeled as “neutral,” even if they are truly negative based on the application and testing scenario (for example, the statements in the third row of Figure 4 wouldn’t be neutral if a model were tasked with detecting if language was for or against something). Fine-tuning again using these new tests (and the older ones) results in a model that correctly fixes the original bug without adding the “every immigration statement is neutral” shortcut.

This doesn’t, of course, guarantee that there isn’t another shortcut still in the model, but in our experience, a few rounds of the debugging loop drastically reduce the number of accidental bugs that get introduced when “fixing” the original bugs. The testers of the model don’t have to exhaustively identify every possible shortcut or imbalance ahead of time, since AdaTest adaptively surfaces and fixes bugs that have been introduced in the next rounds of testing and debugging. Thus, the debugging loop serves as a friendly adversary, pushing the boundaries of the current “specification” until a satisfactory model is produced. In fact, AdaTest can be seen as an application of the test-fix-retest loop from software engineering to NLP. 

The debugging loop represented as a series of rectangles. Starting with the top rectangle and moving down, tests the model has failed are used to fix the model, as demonstrated by the correctly predicted statements in the second rectangle. The testing loop is run again, revealing that an overcorrection has occurred that causes a bug, in the third rectangle, with even negative statements about the topic being predicted as neutral. These tests that reveal this new bug are then fixed, and the model is fine-tuned on the new tests and previous tests, resulting in the statements being predicted correctly in the last rectangle.
Figure 4: Shortcuts added during an iteration of the debugging loop are found and fixed by future iterations. 

To evaluate the effectiveness of the debugging loop, we fine-tuned RoBERTa-Large to detect if two questions are duplicates (that is, the same question worded differently) using the Quora Question Pairs (QQP) dataset and also fine-tuned it for positive/neutral/negative sentiment analysis using the Stanford Sentiment Treebank (SST) dataset. Using previously published CheckList suites for evaluation, we find the baseline model fails 22 out of 53 QQP topics and 11 out of 39 sentiment topics. We then created data to “fix” a topic by either taking 50 examples from the topic’s data in the CheckList condition or by starting from a seed of five examples and running the debugging loop with AdaTest until finding failures becomes qualitatively difficult (on average 2.83 rounds for QQP and 3.83 rounds for sentiment). This yields an average of 41.6 tests for QQP and 55.8 tests for sentiment. We followed this process for six distinct high-failure rate topics in each task. In the vast majority of cases (see paper for details), AdaTest fixes the topics used for training and a number of unseen held-out topics without breaking any topics, while CheckList data often introduces new bugs (and thus breaks other test topics).

We also evaluated the effectiveness of AdaTest in a standard development setting, targeting a model for to-do detection in meeting notes. After three months of development, CheckList testing, and ad hoc GPT-3–based data augmentation, a PhD-level team had managed to build a model with an F1 score of 0.66 (out of 1) on unseen data collected in the wild. We gave AdaTest to the team with a demo a few minutes long. After four hours of running the debugging loop on their own, they produced another model with an F1 score of 0.77 on the same unseen dataset. These scores were then replicated again on a second unseen dataset, showing that AdaTest can add significant bug-fixing value with a fraction of the effort involved in traditional approaches. 

The promise of human-AI collaboration for ML development

AdaTest encourages a close collaboration between people and large language models, yielding the benefits of both. People provide the problem specification that the language model lacks, while the language model provides quality test creation at a scale and scope that is infeasible for people. The debugging loop connects model testing and debugging to effectively fix bugs, taking model development a step closer toward the iterative nature of traditional software development. Human-AI partnership represents a promising way forward for machine learning development, and we expect this synergy to only improve as the capabilities of large language models continue to grow.

Check out the full paper to see AdaTest’s effectiveness on classification models (sentiment analysis, QQP, toxicity, media selection, and task detection), generation models (GPT-2, translation), and per-token models (NER) ranging from well-tested production systems to brand-new applications. Give it a try yourself at https://github.com/microsoft/adatest.

Platform models—large-scale models trained on vast amounts of data—are making it easier and faster to develop AI systems. AdaTest and other tools and resources like it are being developed by researchers at Microsoft to help developers get the most out of these platform models while also understanding, measuring, and mitigating the risks they pose.

The post Partnering people with large language models to find and fix bugs in NLP systems appeared first on Microsoft Research.

Read More

Real-time SKU detection in the browser using TensorFlow.js

Posted by Hugo Zanini, Data Product Manager

Last year, I published an article on how to train custom object detection in the browser using TensorFlow.js. It received lots of interest from developers from all over the world who tried to apply the solution to their personal or business projects. While answering readers’ questions on my first article, I noticed a few difficulties in adapting the solution to large datasets, and in deploying the resulting model in production using the new version of TensorFlow.js.

Therefore, the goal of this article is to share a solution for a well-known problem in the consumer packaged goods (CPG) industry: real-time and offline SKU detection using TensorFlow.js.

Offline SKU detection running in real time on a smartphone using TensorFlow.js

The problem

Items consumed frequently by consumers (foods, beverages, household products, etc.) require an extensive routine of replenishment and placement at their points of sale (supermarkets, convenience stores, etc.).

Over the past few years, researchers have shown repeatedly that about two-thirds of purchase decisions are made after customers enter the store. One of the biggest challenges for consumer goods companies is to guarantee the availability and correct placement of their products in stores.

In stores, teams organize the shelves based on marketing strategies and manage product stock levels. The people working on these activities may count the number of SKUs of each brand in a store to estimate product stocks and market share, and help to shape marketing strategies.

These estimations, though, are very time-consuming. Taking a photo and using an algorithm to count the SKUs on the shelves to calculate a brand’s market share could be a good solution.

To use an approach like that, the detection should run in real time, so that as soon as you point a phone camera at the shelf, the algorithm recognizes the brands and calculates the market shares. And, as internet access inside stores is generally limited, the detection should also work offline.

Example workflow

This post is going to show how to implement the real-time and offline image recognition solution to identify generic SKUs using the SKU110K dataset and the MobileNetV2 network.

Due to the lack of a public dataset with labeled SKUs of different brands, we’re going to create a generic algorithm, but all the instructions can be applied in a multiclass problem.

As with every machine learning flow, the project will be divided into four steps, as follows:

Object Detection Model Production Pipeline

Preparing the data

The first step to training a good model is to gather good data. As mentioned before, this solution is going to use a dataset of SKUs in different scenarios. The purpose of SKU110K was to create a benchmark for models capable of recognizing objects in densely packed scenes.

The dataset is provided in the Pascal VOC format and has to be converted to tf.record. The script to do the conversion is available here and the tf.record version of the dataset is also available in my project repository. As mentioned before, SKU110K is a large and very challenging dataset to work with. It contains many objects, often looking similar or even identical, positioned in close proximity.

Dataset characteristics (Gist link)
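Since the conversion script itself isn't reproduced in this post, here is a condensed sketch of what the conversion does for a single image, assuming the VOC XML has already been parsed into pixel-coordinate boxes and that a single generic "sku" class is used; the field names follow the tf.train.Example format expected by the TensorFlow Object Detection API, but refer to the linked script for the version actually used in this project.

# Condensed sketch: one VOC-annotated image to a tf.train.Example for tf.record.
import tensorflow as tf

def to_tf_example(image_path, width, height, boxes, label="sku", label_id=1):
    # boxes: list of (xmin, ymin, xmax, ymax) in pixels, parsed from the VOC XML.
    with tf.io.gfile.GFile(image_path, "rb") as f:
        encoded_jpg = f.read()
    def bytes_list(values):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
    def float_list(values):
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))
    def int64_list(values):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
    feature = {
        "image/encoded": bytes_list([encoded_jpg]),
        "image/format": bytes_list([b"jpeg"]),
        "image/width": int64_list([width]),
        "image/height": int64_list([height]),
        # Box coordinates are stored normalized to [0, 1].
        "image/object/bbox/xmin": float_list([b[0] / width for b in boxes]),
        "image/object/bbox/ymin": float_list([b[1] / height for b in boxes]),
        "image/object/bbox/xmax": float_list([b[2] / width for b in boxes]),
        "image/object/bbox/ymax": float_list([b[3] / height for b in boxes]),
        "image/object/class/text": bytes_list([label.encode()] * len(boxes)),
        "image/object/class/label": int64_list([label_id] * len(boxes)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Usage: write all examples to a single record file, e.g.
# with tf.io.TFRecordWriter("sku110k_train.record") as writer:
#     writer.write(to_tf_example(...).SerializeToString())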

To work with this dataset, the neural network chosen has to be very effective in recognizing patterns and be small enough to run in real-time in TensorFlow.js.

Choosing the model

There are a variety of neural networks capable of solving the SKU detection problem. But, the architectures that easily achieve a high level of precision are very dense and don’t have reasonable inference times when converted to TensorFlow.js to run in real-time.

Because of that, the approach here is going to be to focus on optimizing a mid-level neural network to achieve reasonable precision on densely packed scenes while running inference in real time. Analyzing the TensorFlow 2.0 Detection Model Zoo, the challenge will be to try to solve the problem using the lightest single-shot model available: SSD MobileNet v2 320×320, which seems to fit the required criteria. The architecture is proven to be able to recognize up to 90 classes and can be trained to identify different SKUs.

Training the model

With a good dataset and the model selected, it’s time to think about the training process. TensorFlow 2.0 provides an Object Detection API that makes it easy to construct, train, and deploy object detection models. In this project, we’re going to use this API and train the model using a Google Colaboratory Notebook. The remainder of this section explains how to set up the environment, the model selection, and training. If you want to jump straight to the Colab Notebook, click here.

Setting up the environment

Create a new Google Colab notebook and select a GPU as the hardware accelerator:

Runtime > Change runtime type > Hardware accelerator: GPU

Clone, install, and test the TensorFlow Object Detection API:

Gist link

Next, download and extract the dataset using the following commands:

Gist link

Setting up the training pipeline

We’re ready to configure the training pipeline. TensorFlow 2.0 provides pre-trained weights for the SSD Mobilenet v2 320×320 on the COCO 2017 Dataset, and they are going to be downloaded using the following commands:

Gist link

The downloaded weights were pre-trained on the COCO 2017 Dataset, but the focus here is to train the model to recognize one class, so these weights are going to be used only to initialize the network — this technique is known as transfer learning, and it’s commonly used to speed up the learning process.

The last step is to set up the hyperparameters on the configuration file that is going to be used during the training. Choosing the best hyperparameters is a task that requires some experimentation and, consequently, computational resources.

I took a standard configuration of MobileNetV2 parameters from the TensorFlow Models Config Repository and performed a sequence of experiments (thanks Google Developers for the free resources) to optimize the model to work with densely packed scenes on the SKU110K dataset. Download the configuration and check the parameters using the code below.

Gist link

Gist link
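Since the gists aren't embedded here, the snippet below is a rough sketch of how the pipeline config can be edited programmatically with the Object Detection API's protos; the checkpoint and file paths are placeholders, and the exact hyperparameter values used in this project come from the experiments described above rather than from this sketch.

# Rough sketch: load, edit, and re-save the pipeline.config (paths are placeholders).
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

config = pipeline_pb2.TrainEvalPipelineConfig()
with open("pipeline.config") as f:
    text_format.Merge(f.read(), config)

config.model.ssd.num_classes = 1                     # single generic "sku" class
config.train_config.batch_size = 32                  # illustrative value
config.train_config.fine_tune_checkpoint = "checkpoint/ckpt-0"
config.train_config.fine_tune_checkpoint_type = "detection"  # transfer learning
config.train_input_reader.label_map_path = "label_map.pbtxt"
config.train_input_reader.tf_record_input_reader.input_path[:] = ["train.record"]
config.eval_input_reader[0].label_map_path = "label_map.pbtxt"
config.eval_input_reader[0].tf_record_input_reader.input_path[:] = ["test.record"]

with open("pipeline.config", "w") as f:
    f.write(text_format.MessageToString(config))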

With the parameters set, start the training by executing the following command:

Gist link

To identify how well the training is going, we use the loss value. Loss is a number indicating how bad the model’s prediction was on the training samples. If the model’s prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples (Descending into ML: Training and Loss | Machine Learning Crash Course).

The training process was monitored through TensorBoard and took around 22 hours to finish on a 60GB machine using an NVIDIA Tesla P4. The final losses can be checked below.

Total training loss

Validating the model

Now let’s evaluate the trained model using the test data:

Gist link

The evaluation was done across 2,740 images and provides three metrics based on the COCO detection evaluation metrics: precision, recall, and loss (Classification: Precision and Recall | Machine Learning Crash Course). The same metrics are available via TensorBoard and can be analyzed more easily there.

%load_ext tensorboard
%tensorboard --logdir '/content/training/'

You can then explore all training and evaluation metrics.

Main evaluation metrics

Exporting the model

Now that the training is validated, it’s time to export the model. We’re going to convert the training checkpoints to a protobuf (pb) file. This file is going to have the graph definition and the weights of the model.

Gist link

As we’re going to deploy the model using TensorFlow.js and Google Colab has a maximum lifetime limit of 12 hours, let’s download the trained weights and save them locally. When running the command files.download("/content/saved_model.zip"), the Colab will prompt the file download automatically.

Gist link

Deploying the model

The model is going to be deployed in a way that anyone can open a PC or mobile camera and perform inference in real-time through a web browser. To do that, we’re going to convert the saved model to the TensorFlow.js layers format, load the model in a JavaScript application and make everything available on CodeSandbox.

Converting the model

At this point, you should have something similar to this structure saved locally:

├── inference-graph
│   └── saved_model
│       ├── assets
│       ├── saved_model.pb
│       └── variables
│           ├── variables.data-00000-of-00001
│           └── variables.index
Before we start, let’s create an isolated Python environment to work in an empty workspace and avoid any library conflict. Install virtualenv and then open a terminal in the inference-graph folder and create and activate a new virtual environment:

virtualenv -p python3 venv
source venv/bin/activate

Install the TensorFlow.js converter:

pip install tensorflowjs[wizard]

Start the conversion wizard:

tensorflowjs_wizard

Now, the tool will guide you through the conversion, providing explanations for each choice you need to make. The image below shows all the choices that were made to convert the model. Most of them are the standard ones, but options like the shard sizes and compression can be changed according to your needs.

To enable the browser to cache the weights automatically, it’s recommended to split them into shard files of around 4MB. To guarantee that the conversion is going to work, don’t skip the op validation either; not all TensorFlow operations are supported, so some models can be incompatible with TensorFlow.js. See this list for which ops are currently supported on the various backends that TensorFlow.js executes on, such as WebGL, WebAssembly, or plain JavaScript.

Model conversion using TensorFlow.js Converter (Full resolution image here)

If everything works well, you’re going to have the model converted to the TensorFlow.js layers format in the web_model directory. The folder contains a model.json file and a set of sharded weight files in a binary format. The model.json file has both the model topology (aka “architecture” or “graph”: a description of the layers and how they are connected) and a manifest of the weight files (Lin, Tsung-Yi, et al). The web_model folder currently contains the files shown below:

└ web_model
├── group1-shard1of5.bin
├── group1-shard2of5.bin
├── group1-shard3of5.bin
├── group1-shard4of5.bin
├── group1-shard5of5.bin
└── model.json

Configuring the application

The model is ready to be loaded in JavaScript. I’ve created an application to perform inference directly from the browser. Let’s clone the repository to figure out how to use the converted model in real-time. This is the project structure:

├── models
│ ├── group1-shard1of5.bin
│ ├── group1-shard2of5.bin
│ ├── group1-shard3of5.bin
│ ├── group1-shard4of5.bin
│ ├── group1-shard5of5.bin
│ └── model.json
├── package.json
├── package-lock.json
├── public
│ └── index.html
├── README.MD
└── src
├── index.js
└── styles.css

For the sake of simplicity, I have already provided a converted SKU-detector model in the models folder. However, let’s put the web_model generated in the previous section in the models folder and test it.

Next, install the http-server:

npm install http-server -g

Go to the models folder and run the command below to make the model available at http://127.0.0.1:8080. This is a good choice when you want to keep the model weights in a safe place and control who can request inferences from it. The -c1 parameter is added to disable caching, and the --cors flag enables cross-origin resource sharing, allowing the hosted files to be used by the client-side JavaScript for a given domain.

http-server -c1 --cors .

Alternatively, you can upload the model files somewhere else – even on a different domain if needed. In my case, I chose my own GitHub repo and referenced the URL of the folder containing model.json in the load_model function as shown below:

async function load_model() {
// It's possible to load the model locally or from a repo.
// Load from localhost locally:
const model = await loadGraphModel("http://127.0.0.1:8080/model.json");
// Or Load from another domain using a folder that contains model.json.
// const model = await loadGraphModel("https://github.com/hugozanini/realtime-sku-detection/tree/web");
return model;
}

This is a good option because it gives more flexibility to the application and makes it easier to run on public web servers.

Pick one of the methods to load the model files in the function load_model (lines 10–15 in the file src>index.js).

When loading the model, TensorFlow.js will perform the following requests:

GET /model.json
GET /group1-shard1of5.bin
GET /group1-shard2of5.bin
GET /group1-shard3of5.bin
GET /group1-shard4of5.bin
GET /group1-shard5of5.bin

Publishing in CodeSandbox

CodeSandbox is a simple tool for creating web apps where we can upload the code and make the application available for everyone on the web. By uploading the model files in a GitHub repo and referencing them in the load_model function, we can simply log into CodeSandbox, click on New project > Import from Github, and select the app repository.

Wait a few minutes for the packages to install and your app will be available at a public URL that you can share with others. Click on Show > In a new window and a tab will open with a live preview. Copy this URL and paste it in any web browser (PC or mobile) and your object detection will be ready to run. A ready-to-use project can be found here as well, if you prefer.

Conclusion

Besides the precision, an interesting part of these experiments is the inference time: everything runs in real time in the browser via JavaScript. SKU detection models that run in the browser, even offline and with few computational resources, are a must for many applications in the consumer packaged goods industry, as well as in other industries.

Enabling a machine learning solution to run on the client side is a key step to guaranteeing that models are used effectively at the point of interaction, with minimal latency, and that problems are solved where they happen: right in the user’s hands.

Deep learning should not be costly or limited to research; it should be used for real-world use cases, and JavaScript is great for production deployments. I hope this article serves as a basis for new projects involving computer vision and TensorFlow, and creates an easier flow between Python and JavaScript.

If you have any questions or suggestions you can reach me on Twitter.

Thanks for reading!

Acknowledgments

I’d like to thank the Google Developers Group, for providing all the computational resources for training the models, and the authors of the SKU 110K Dataset, for creating and open-sourcing the dataset used in this project.

Read More

Energy Grids Plug into AI for a Brighter, Cleaner Future

Electric utilities are taking a course in machine learning to create smarter grids for tough challenges ahead.

The winter 2021 megastorm in Texas left millions without power. Grid failures the past two summers sparked devastating wildfires amid California’s record drought.

“Extreme weather events of 2021 highlighted the risks climate change is introducing, and the importance of investing in more resilient electricity grids,” said a May 2021 report from the International Energy Agency, a group with members from more than 30 countries. It called for a net-zero carbon grid by 2050, fueled by hundreds more gigawatts in renewable sources.

The goal demands a transformation. Yesterday’s hundred-year-old grid — a one-way system from a few big power plants to many users — must morph into a two-way, flexible, distributed network connected to homes and buildings that sport solar panels, batteries and electric vehicles.

Given the changes ahead, experts say the grid must expand autonomous control systems that gather data at every node and use it to respond in real time.

An Essential Ingredient

“AI will play a crucial role maintaining stability for an electric grid that’s becoming exponentially more complex with large numbers of low-capacity, variable generation sources like wind and solar coming online and two-way power flowing into and out of houses,” said Jeremy Renshaw, a senior program manager at the Electric Power Research Institute (EPRI), an independent, non-profit that collaborates with more than 450 companies in 45 countries on energy R&D.

“AI can support grid operators already stretched to their limits by automating repetitive or time-consuming tasks,” said Renshaw, who manages EPRI’s AI initiative.

Rick Perez, a principal at Deloitte Consulting LLP with more than 16 years working with utilities and data analytics, agrees.

“The future energy grid will be distributed and fueled by thousands of intermittent power sources including wind farms and various storage technologies. Managing it requires advanced AI methods and high performance computing,” he said.

Real Projects, Real Results

Work is already underway at power plants and substations, on distribution lines and inside homes and businesses.

“Some of the largest utilities in the U.S. are taking the first steps of creating a data engineering platform and an edge-computing practice, using sensor arrays and real-time analysis,” said Perez.

For example, a utility in a large U.S. city recently got traction with AI on NVIDIA GPUs, determining in less than 30 minutes the best truck routes for responding to a storm. Past efforts on CPU-based systems took up to 36 hours, too long to be useful.

To show utilities what’s possible, Deloitte runs jobs on NVIDIA DGX A100 systems in its Center for AI Computing. One effort combines data on the state of the electric grid with local weather conditions to identify — in time to dispatch a repair crew — distribution lines caked with ice and in danger of failing.

“Because it’s an open system, we could use our existing IT staff and, with NVIDIA’s support, do supercomputing-class work for our client,” Perez said.

Building AI Models, Datasets

At EPRI, Renshaw reports progress on several fronts.

For example, more than 300 organizations have joined its L2RPN challenge to build AI models with reinforcement learning. Some are capable of controlling as many as five tasks at once to prevent an outage.

“We want to automate 80 percent of the mundane tasks for operators, so they can do a better job focusing on the 20 percent of the most complex challenges,” said Renshaw.

A 2021 report on how AI can address climate change cited the L2RPN work as an important use case; the challenge is expanding this year to include more complex models.

Separately, EPRI is curating 10 sets of anonymous data that utilities can use to train AI models for their most critical jobs. One is a database that already sports 150,000 drone images of aging equipment on power lines.

EPRI also leads a startup incubator where utilities can collaborate with AI startups like Noteworthy AI, a member of NVIDIA Inception, to work on innovative projects. To keep shared data private, it can use NVIDIA FLARE software to train AI models.

Power Plants Get Digital Twins

Both EPRI and Deloitte are helping create industrial digital twins to optimize operations and training at power plants. For example, a power plant in one southern U.S. state is acting as a demo facility in an EPRI project that’s gathered broad interest.

Separately, Deloitte plans to use NVIDIA Omniverse Enterprise to develop a physically accurate digital twin of a nuclear power plant for worker training scenarios.

“Regulators are providing multiple grants for building digital twins of power plants to increase safety and reduce the high costs of shutting systems down for tests,” Perez said.

Truly Smart Meters Debut This Year

Similarly, both EPRI and Deloitte are helping define the next generation of smart meters.

“We call today’s systems smart meters, but in reality they send maybe one data point every 15 minutes, which is very slow by today’s standards,” said Renshaw.

By contrast, software-defined smart grid chips and meters in development by Utilidata (a member of NVIDIA Inception, a free program for cutting-edge startups) and Anuranet use the next generation of the NVIDIA Jetson edge AI platform to process more than 30,000 data points per second. They seek insights that save energy and cost while increasing the grid’s resilience.

“If we can get sub-second data, it opens up a wealth of opportunities — we’ve identified 81 use cases for data from the next generation of smart meters,” he said.

AI using data from one of these new meters could have predicted that Renshaw’s home HVAC system needed repair before it failed last year, costing him more than $1,000.

An Inflection Point

In addition, EPRI has pilot programs in two office buildings using AI to reduce energy waste by as much as 30 percent. And it’s starting a collaboration on ways machine learning could enhance cybersecurity, a rising concern in the wake of last year’s ransomware attack on an energy pipeline.

The to-do list goes on. The good news, said Perez, is that significant funding is on the way to create a smarter, cleaner and more secure grid, with initiatives around the globe, including the U.S. Infrastructure Investment and Jobs Act.

“We’re at an inflection point, and there simply is no viable plan for the grid’s future without AI and high performance computing,” he said.

Watch a GTC talk (viewable on-demand with registration) to see how utilities can use edge AI and high performance computing to modernize grid operations. And learn more about NVIDIA’s work with utilities and NVIDIA Inception.

The post Energy Grids Plug into AI for a Brighter, Cleaner Future appeared first on NVIDIA Blog.

Open-sourcing MuJoCo

In October 2021, we announced that we acquired the MuJoCo physics simulator, and made it freely available for everyone to support research everywhere. We also committed to developing and maintaining MuJoCo as a free, open-source, community-driven project with best-in-class capabilities. Today, we’re thrilled to report that open sourcing is complete and the entire codebase is on GitHub! Here, we explain why MuJoCo is a great platform for open-source collaboration and share a preview of our roadmap going forward.

What is Extended Reality?

Advances in extended reality have already changed the way we work, live and play, and it’s just getting started.

Extended reality, or XR, is an umbrella category that covers a spectrum of newer, immersive technologies, including virtual reality, augmented reality and mixed reality.

From gaming to virtual production to product design, XR has enabled people to create, collaborate and explore in computer-generated environments like never before.

What Is Extended Reality?

Virtual, augmented and mixed reality are all elements of XR technology.

Virtual reality puts users inside a virtual environment. VR users typically wear a headset that transports them into a virtual world — one moment they’re standing in a physical room, and the next they’re immersed in a simulated environment.

The latest VR technologies push these boundaries, making these environments look and behave more like the real world. They’re also adding support for additional senses, including touch, sound and smell.

With VR, gamers can become fully immersed in a video game, designers and customers can review building projects to finalize details prior to construction, and retailers can test virtual displays before committing to a physical one.

Augmented reality is when a rendered image is overlaid onto the real world. The mobile game Pokémon GO famously brought AR to the mainstream by showing computer-rendered monsters standing on lawns and sidewalks as players roam their neighborhoods.

AR graphics are visible through cell phones, tablets and other devices, bringing a new kind of interactive experience to users. Navigating directions, for example, can be improved with AR. Rather than following a 2D map, a windshield can superimpose directions over one’s view of the road, with simulated arrows directing the driver exactly where to turn.
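
A toy version of that overlay idea, using OpenCV to composite a rendered turn arrow onto a camera frame, might look like the sketch below. The blank frame and hard-coded arrow position are placeholders; a real head-up display would anchor the graphic with pose tracking and map data.

```python
import cv2
import numpy as np

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a live camera frame
overlay = frame.copy()

# draw a "turn right" arrow roughly where the upcoming intersection would appear
cv2.arrowedLine(overlay, (320, 420), (430, 300), (0, 255, 0), 12, tipLength=0.3)

# blend the rendered graphic with the original view so the road stays visible
augmented = cv2.addWeighted(overlay, 0.6, frame, 0.4, 0)
cv2.imwrite("augmented_frame.png", augmented)
```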

Mixed reality is a seamless integration of the real world and rendered graphics, which creates an environment in which users can directly interact with the digital and physical worlds together.

With MR, real and virtual objects blend, and are presented together within a single display. Users can experience MR environments through a headset, phone or tablet, and can interact with digital objects by moving them around or placing them in the physical world.

There are two types of MR:

  • Mixing virtual objects into the real world — for instance, where a user sees the real world through cameras in a VR headset with virtual objects seamlessly mixed into the view. See this example video.
  • Mixing real-world objects into virtual worlds — for example, a camera view of a VR participant mixed into the virtual world, like watching a VR gamer playing in a virtual world.

The History of XR

To understand how far XR has come, consider its origins in VR.

VR began in the federal sector, where it was used to train people in flight simulators. The energy and automotive design industries were also early adopters. These simulation and visualization VR use cases required large supercomputers. They also needed dedicated spaces, including powerwalls, which are ultra-high-resolution displays, and VR CAVEs, which are empty rooms that have the VR environment projected on each surface, from the walls to the ceiling.

For decades, VR remained unaffordable for most users, and the small VR ecosystem was mainly composed of large institutions and academic researchers.

But early in the previous decade, several key component technologies reached a tipping point, which precipitated the launch of the HTC Vive and Oculus Rift head-mounted displays (HMDs), along with the SteamVR runtime.

Individuals could now purchase personal HMDs to experience great immersive content. And they could drive those HMDs and experiences from an individual PC or workstation with a powerful GPU.

Suddenly, VR was accessible to millions of individuals, and a large ecosystem quickly sprang up, filled with innovation and enthusiasm.

In recent years, a new wave of VR innovation started with the launch of all-in-one (AIO) headsets. Previously, fully immersive VR experiences required a physical connection to a powerful PC. The HMD couldn’t operate as a self-contained device, as it had no operating system and no ability to compute the image.

But with AIO headsets, users gained access to a dedicated device with a simple setup that could deliver fully tracked VR anywhere, anytime. Coupled with the innovation of VR streaming technology, users could now experience powerful VR environments, even while on the go.

Latest Trends in XR

High-quality XR is becoming increasingly accessible. Consumers worldwide are purchasing AIOs to experience XR, from immersive gaming to remote learning to virtual training. Large enterprises are adding XR into their workflows and design processes, and pairing XR with a digital twin drastically improves design implementation.

Image courtesy of Innoactive.

And one of today’s biggest trends is streaming XR experiences through 5G from the cloud. This removes the need to be tethered to workstations or limit experiences to a single space.

By streaming over 5G from the cloud, people can use XR devices and get the computational power to run XR experiences from a data center, regardless of location and time. Advanced solutions like NVIDIA CloudXR are making immersive streaming more accessible, so more XR users can experience high-fidelity environments from anywhere.

AR is also becoming more common. After Pokémon GO became a household name, AR emerged in a number of additional consumer-focused areas. Many social media platforms added filters that users could overlay on their faces. Organizations in retail incorporated AR to showcase photorealistic rendered 3D products, enabling customers to place these products in a room and visualize them in any space.

Plus, enterprises in various industries like architecture, manufacturing, healthcare and more are using the technology to vastly improve workflows and create unique, interactive experiences. For example, architects and design teams are integrating AR for construction project monitoring, so they can see onsite progress and compare it to digital designs.

And though it’s still fairly new, MR is developing in the XR space, as shown by the emergence of many new headsets built for MR, including the Varjo XR-3. With MR headsets, professionals in engineering, design, simulation and research can develop and interact with their 3D models in real life.

Varjo XR-3 headset. Image courtesy of Varjo.

The Future of XR

As XR technology advances, another technology is propelling users into a new era: artificial intelligence.

AI will play a major role in the XR space, from virtual assistants helping designers in VR to intelligent AR overlays that can walk individuals through do-it-yourself projects.

For example, imagine wearing a headset and telling the content what to do through natural speech and gestures. With hands-free and speech-driven virtual agents at the ready, even non-experts will be able to create amazing designs, complete exceedingly complex projects and harness the capabilities of powerful applications.

Platforms like NVIDIA Omniverse have already changed how users create 3D simulations and virtual worlds. Omniverse allows users from across the globe to develop and operate digital twin simulations. The platform provides users with the flexibility to portal into the physically accurate, fully ray-traced virtual world through 2D monitors, or their preferred XR experience, so they can experience vast virtual worlds immersively.

Entering the next evolution of XR, the possibilities are virtually limitless.

Learn more and see how organizations can integrate XR with NVIDIA technologies.

Featured blog image includes KPF and Lenovo.

The post What is Extended Reality? appeared first on NVIDIA Blog.

From Cloud to Car: How NIO Develops Intelligent Vehicles on NVIDIA HGX

Building next-generation intelligent vehicles requires an AI infrastructure that pushes the cutting edge.

Electric vehicle maker NIO is using NVIDIA HGX to build a comprehensive data center infrastructure for developing AI-powered, software-defined vehicles. With high-performance compute, the automaker can continuously iterate on sophisticated deep learning models, creating robust autonomous driving algorithms in a closed-loop environment.

“The complex scenarios faced by mass-produced cars and the massive amount of data these fleets generate are the cornerstones of NIO’s autonomous driving capabilities,” said Bai Yuli, head of AI Platforms at NIO. “By using NVIDIA high-performance compute solutions, NIO can accelerate the path to autonomous driving.”

NIO has already launched intelligent vehicles developed on this infrastructure, such as its fully electric, intelligent flagship sedan, the ET7. Its mid-size performance sedan, the ET5, is scheduled to debut in September.

In addition to high-performance data center development, both models are built on the Adam supercomputer, powered by four NVIDIA DRIVE Orin systems-on-a-chip. These vehicles feature autonomous driving and intelligent cockpit capabilities that are continuously iterated upon and improved in the data center for a revolutionary customer experience.

Building a High-Performance AI Infrastructure With NVIDIA GPUs and Networking

The role of the data center is to ingest, curate and label massive amounts of data for AI model training at scale.

Data collection fleets generate hundreds of petabytes of data and billions of images each year. This data is then used to optimize the deep neural networks (DNNs) that will run in the vehicles.

NIO’s scalable AI infrastructure is powered by NVIDIA HGX with eight A100 Tensor Core GPUs and NVIDIA ConnectX-6 InfiniBand adapters. The supercomputer cluster consists of NVMe SSD servers interconnected through the high-speed NVIDIA Quantum InfiniBand network platform.

This powerful infrastructure allows large amounts of deep learning training data to be transferred into supercomputer memory or NVIDIA A100 GPU memory at speeds of up to 200 Gb/s.
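
As a rough illustration of what training on such a node involves, the sketch below runs standard PyTorch data-parallel training across the eight GPUs of a single HGX server (launched with `torchrun --nproc_per_node=8 train.py`). The tiny model and random tensors are placeholders, not NIO’s driving networks or data pipeline.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                    # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = f"cuda:{local_rank}"

model = DDP(torch.nn.Linear(512, 10).to(device), device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    x = torch.randn(64, 512, device=device)        # stand-in for a batch of sensor features
    y = torch.randint(0, 10, (64,), device=device)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()                                # gradients sync across GPUs via NCCL
    opt.step()

dist.destroy_process_group()
```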

NVIDIA HGX A100 is a high-performance server platform designed for AI scenarios, including big datasets and complicated models like those that power autonomous vehicles. It incorporates a fully optimized NVIDIA AI software stack in NGC.

The platform sets a new compute-density benchmark, condensing 5 petaflops of AI performance and replacing siloed infrastructures with a single platform for a wide range of complex AI applications.

With the HGX A100, NIO is able to flexibly develop and deploy scalable AI systems. It also enables the company to increase model development efficiency by up to 20x, allowing it to launch autonomous vehicles sooner and evolve to newer, faster architectures.

Forging Ahead

NIO is already off to the races with its software-defined lineup, announcing plans to double the capacity of its plant in Hefei, China, to 240,000 vehicles per year, with a facility capable of producing up to 300,000.

As NIO scales production capabilities and continues its expansion into global markets, NVIDIA HGX is scaling with them, enabling the deployment of one of the most advanced AI platforms in the automotive industry.

The post From Cloud to Car: How NIO Develops Intelligent Vehicles on NVIDIA HGX appeared first on NVIDIA Blog.

The Berkeley Crossword Solver

We recently published the Berkeley Crossword Solver (BCS), the current state of the art for solving American-style crossword puzzles. The BCS combines neural question answering and probabilistic inference to achieve near-perfect performance on most American-style crossword puzzles, like the one shown below:

Figure 1: Example American-style crossword puzzle

An earlier version of the BCS, in conjunction with Dr.Fill, was the first computer program to outscore all human competitors in the world’s top crossword tournament. The most recent version is the current top-performing system on crossword puzzles from The New York Times, achieving 99.7% letter accuracy (see the technical paper, web demo, and code release).
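
A toy sketch of that two-stage idea: a question-answering model proposes scored candidates for each clue, and inference over the grid keeps only fills whose crossing letters agree. The clues, candidates and scores below are invented for illustration and are not drawn from the BCS’s actual models.

```python
from itertools import product

# hypothetical candidate answers with QA-model confidences
candidates = {
    "1-Across": [("APPLE", 0.6), ("GRAPE", 0.4)],
    "1-Down":   [("ARROW", 0.7), ("GREEN", 0.3)],
}

def consistent(across, down):
    # crossing constraint: the first letters of 1-Across and 1-Down must match
    return across[0] == down[0]

best = max(
    (pair for pair in product(candidates["1-Across"], candidates["1-Down"])
     if consistent(pair[0][0], pair[1][0])),
    key=lambda pair: pair[0][1] * pair[1][1],
)
print("fill:", best[0][0], "/", best[1][0])   # -> APPLE / ARROW
```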