VDTTS: Visually-Driven Text-To-Speech

Recent years have seen a tremendous increase in the creation and serving of video content to users across the world in a variety of languages and over numerous platforms. The process of creating high quality content can include several stages from video capturing and captioning to video and audio editing. In some cases dialogue is re-recorded (referred to as dialog replacement, post-sync or dubbing) in a studio in order to achieve high quality and replace original audio that might have been recorded in noisy conditions. However, the dialog replacement process can be difficult and tedious because the newly recorded audio needs to be well synced with the video, requiring several edits to match the exact timing of mouth movements.

In “More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech”, we present a proof-of-concept visually-driven text-to-speech model, called VDTTS, that automates the dialog replacement process. Given a text and the original video frames of the speaker, VDTTS is trained to generate the corresponding speech. As opposed to standard visual speech recognition models, which focus on the mouth region, we detect and crop full faces using MediaPipe to avoid potentially excluding information pertinent to the speaker’s delivery. This gives the VDTTS model enough information to generate speech that matches the video while also recovering aspects of prosody, such as timing and emotion. Despite not being explicitly trained to generate speech that is synchronized to the input video, the learned model still does so.

Given a text and video frames of a speaker, VDTTS generates speech with prosody that matches the video signal.

VDTTS Model
The VDTTS model resembles Tacotron at its core and has four main components: (1) text and video encoders that process the inputs; (2) a multi-source attention mechanism that connects encoders to a decoder; (3) a spectrogram decoder that incorporates the speaker embedding (similarly to VoiceFilter), and produces mel-spectrograms (which are a form of compressed representation in the frequency domain); and (4) a frozen, pretrained neural vocoder that produces waveforms from the mel-spectrograms.

The overall architecture of VDTTS. Text and video encoders process the inputs and then a multisource attention mechanism connects these to a decoder that produces mel-spectrograms. A vocoder then produces waveforms from the mel-spectrograms to generate speech as an output.
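To make the data flow concrete, below is a minimal PyTorch-style sketch of the four components described above. The dimensions, layer choices, and the fixed-length (non-autoregressive) decoding loop are illustrative assumptions rather than the published configuration, and the frozen neural vocoder is omitted.

```python
import torch
import torch.nn as nn

class VDTTSSketch(nn.Module):
    """Minimal sketch of a VDTTS-style model (illustrative, not the published config)."""

    def __init__(self, vocab_size=128, video_feat_dim=512, spk_dim=256, d=256, n_mels=80):
        super().__init__()
        # (1) Text and video encoders.
        self.text_emb = nn.Embedding(vocab_size, d)
        self.text_enc = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        self.video_proj = nn.Linear(video_feat_dim, d)   # assumes per-frame face features
        self.video_enc = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        # (2) Multi-source attention: one attention module per input source.
        self.attn_text = nn.MultiheadAttention(2 * d, num_heads=4, batch_first=True)
        self.attn_video = nn.MultiheadAttention(2 * d, num_heads=4, batch_first=True)
        # (3) Spectrogram decoder conditioned on a speaker embedding (VoiceFilter-style).
        self.spk_proj = nn.Linear(spk_dim, 2 * d)
        self.decoder = nn.LSTM(2 * d, 2 * d, batch_first=True)
        self.mel_out = nn.Linear(2 * d, n_mels)
        # (4) A frozen, pretrained neural vocoder would map mel frames to a waveform (omitted).

    def forward(self, text_ids, video_feats, spk_emb, n_dec_steps):
        txt, _ = self.text_enc(self.text_emb(text_ids))          # (B, T_text, 2d)
        vid, _ = self.video_enc(self.video_proj(video_feats))    # (B, T_video, 2d)
        # Decoder queries: the speaker embedding broadcast over decoder steps
        # (the real model decodes autoregressively, Tacotron-style).
        q = self.spk_proj(spk_emb).unsqueeze(1).expand(-1, n_dec_steps, -1)
        ctx_text, _ = self.attn_text(q, txt, txt)
        ctx_video, _ = self.attn_video(q, vid, vid)
        dec, _ = self.decoder(q + ctx_text + ctx_video)
        return self.mel_out(dec)                                  # (B, n_dec_steps, n_mels)

# Example shapes: a batch of 2 utterances, 40 text tokens, 75 video frames.
model = VDTTSSketch()
mels = model(torch.randint(0, 128, (2, 40)),   # text token ids
             torch.randn(2, 75, 512),          # per-frame face features
             torch.randn(2, 256),              # speaker embedding
             n_dec_steps=200)
print(mels.shape)  # torch.Size([2, 200, 80])
```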

We train VDTTS using video and text pairs from LSVSR in which the text corresponds to the exact words spoken by a person in a video. Throughout our testing, we have determined that VDTTS cannot generate arbitrary text, which makes it less susceptible to misuse (e.g., the generation of fake content).

Quality
To showcase the unique strength of VDTTS in this post, we have selected two inference examples from the VoxCeleb2 test dataset and compare the performance of VDTTS to a standard text-to-speech (TTS) model. In both examples, the video frames provide prosody and word timing clues, visual information that is not available to the TTS model.

In the first example, the speaker talks at a particular pace that can be seen as periodic gaps in the ground-truth mel-spectrogram (shown below). VDTTS preserves this characteristic and generates audio that is much closer to the ground-truth than the audio generated by standard TTS without access to the video.

Similarly, in the second example, the speaker takes long pauses between some of the words. These pauses are captured by VDTTS and are reflected in the video below, whereas the TTS does not capture this aspect of the speaker’s rhythm.

We also plot fundamental frequency (F0) charts to compare the pitch generated by each model to the ground-truth pitch. In both examples, the F0 curve of VDTTS fits the ground-truth much better than the TTS curve, both in the alignment of speech and silence, and also in how the pitch changes over time. See more original videos and VDTTS generated videos.

We present two examples, (a) and (b), from the VoxCeleb2 test set. From top to bottom: input face images, ground-truth (GT) mel-spectrogram, mel-spectrogram output of VDTTS, mel-spectrogram output of a standard TTS model, and two plots showing the normalized F0 (normalized by mean non-zero pitch, i.e., mean is only over voiced periods) of VDTTS and TTS compared to the ground-truth signal.
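For reference, the normalization used in these plots can be reproduced roughly as follows; this is a minimal sketch that assumes the F0 track marks unvoiced frames with a value of 0.

```python
import numpy as np

def normalized_f0(f0):
    """Normalize an F0 track by its mean non-zero pitch (mean over voiced frames only)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0                       # unvoiced frames are assumed to carry F0 = 0
    return f0 / f0[voiced].mean() if voiced.any() else f0
```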

Video Samples

Original VDTTS VDTTS video-only TTS
Original displays the original video clip. VDTTS displays the audio predicted using both the video frames and the text as input. VDTTS video-only displays audio predictions using video frames only. TTS displays audio predictions using text only. Top transcript: “of space for people to make their own judgments and to come to their own”. Bottom transcript: “absolutely love dancing I have no dance experience whatsoever but as that”.

Model Performance
We measured the VDTTS model’s performance on the VoxCeleb2 dataset and compared it to two baselines: standard TTS and TTS with length hint (a TTS model that receives the scene length). VDTTS outperforms both models by large margins in most of the aspects we measured: better sync-to-video quality (lower SyncNet distance), better speech quality (lower mel cepstral distance, MCD), and lower Gross Pitch Error (GPE), the percentage of frames where the pitch differs from the reference by more than 20%, computed only over frames where voice is present in both the predicted and reference audio.

SyncNet distance comparison between VDTTS, TTS, and TTS with length hint (lower is better).
Mel cepstral distance comparison between VDTTS, TTS, and TTS with length hint (lower is better).
Gross Pitch Error comparison between VDTTS, TTS, and TTS with length hint (lower is better).
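As a reference for how GPE is typically computed, here is a minimal sketch; it assumes the predicted and reference F0 tracks are frame-aligned and mark unvoiced frames with 0.

```python
import numpy as np

def gross_pitch_error(f0_pred, f0_ref, tolerance=0.2):
    """Fraction of mutually voiced frames where the predicted pitch deviates
    from the reference by more than `tolerance` (20% by default)."""
    f0_pred = np.asarray(f0_pred, dtype=float)
    f0_ref = np.asarray(f0_ref, dtype=float)
    voiced = (f0_pred > 0) & (f0_ref > 0)   # voice present in both signals
    if not voiced.any():
        return 0.0
    relative_error = np.abs(f0_pred[voiced] - f0_ref[voiced]) / f0_ref[voiced]
    return float(np.mean(relative_error > tolerance))
```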

Discussion and Future Work
One thing to note is that, intriguingly, VDTTS can produce video-synchronized speech without any explicit losses or constraints to promote this, suggesting that complexities such as synchronization losses or explicit modeling are unnecessary.

While this is a proof-of-concept demonstration, we believe that in the future VDTTS could be extended for use in scenarios where the input text differs from the original video signal. Such a model would be a valuable tool for tasks such as translation dubbing.

Acknowledgements
We would like to thank the co-authors of this research: Michelle Tadmor Ramanovich, Ye Jia, Brendan Shillingford, and Miaosen Wang. We are also grateful for the valued contributions, discussions, and feedback from Nadav Bar, Jay Tenenbaum, Zach Gleicher, Paul McCartney, Marco Tagliasacchi, and Yoni Tzafir.

Read More

How AI and imagery build a self-updating map

Building a map is complex, and keeping it up-to-date is even more challenging. Think about how often your city, town or neighborhood changes on a day-to-day basis. Businesses and shops open and close, stretches of highway are added, and roadways change. In today’s Maps 101 installment, we’ll dive into two ways Google Maps uses advancements in AI and imagery to help you see the latest information about the world around you every single day.

Automatically updating business hours

Over the past few years, businesses have experienced a lot of change — including constantly updating operating hours based on changing pandemic-related restrictions. To keep up with this pace of change, we developed a machine learning model that automatically identifies if business hours are likely wrong, then instantly updates them with AI-generated predictions.

Let’s look at Liam’s Lemonade Shop as an example. To start, our systems consider multiple factors — such as when Liam last updated their business profile, what we know about other shops’ hours, and the Popular Times information for the shop, which uses location trends to show when it tends to be busiest. Since it appears that Liam’s business profile hasn’t been updated in over a year and its busiest hours are typically Thursday afternoons — even though Google Maps says that it’s closed at that time — Liam’s business hours are likely out of date.

Still images of a business' hours and Popular Times information on Google Maps

To see if business hours need updating, we check a store’s Popular Times information and when its business profile was last updated.
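As a purely illustrative sketch (not Google's actual model), the two signals above can be combined into a simple staleness check; the inputs and the one-year threshold are assumptions made for the example.

```python
from datetime import date

def hours_likely_stale(last_profile_update, listed_closed_hours, popular_hours,
                       max_age_days=365):
    """Flag a business profile when it hasn't been updated in a long time and
    its Popular Times peaks fall inside hours it is listed as closed."""
    stale_profile = (date.today() - last_profile_update).days > max_age_days
    conflicting_peaks = bool(set(popular_hours) & set(listed_closed_hours))
    return stale_profile and conflicting_peaks

# Liam's profile: untouched for years, busiest Thursday afternoons while listed as closed.
print(hours_likely_stale(
    last_profile_update=date(2020, 6, 1),
    listed_closed_hours={("Thu", 14), ("Thu", 15), ("Thu", 16)},
    popular_hours={("Thu", 15), ("Thu", 16), ("Sat", 11)},
))  # True
```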

So what’s next? Our algorithms analyze the business hours of other nearby lemonade shops, information from Liam’s website, and Street View images of Liam’s storefront (looking specifically for business hour signs) to determine the most accurate business hour prediction. At the same time, we enlist the help of the Google Maps community — including Local Guides and even the business owners themselves through their Google Business Profile — to verify the information we predicted. In Argentina, Australia, Chile, France, Japan, Mexico, New Zealand, Peru, and the United States, we also use Duplex conversational technology to call businesses just like Liam’s and ask for their hours directly. With this new AI-first approach, we’re on track to update the hours for over 20 million businesses around the globe in the next six months, helping you know exactly when your favorite store, restaurant or cafe is open for business.

Road information that reflects the real world

We’re also experimenting with ways we can use imagery to make updates to other helpful information. For instance, starting in the U.S., we’re launching a third-party imagery pilot to let you see the most up-to-date speed limit information in your town, which can help keep you safe while driving. Here’s how it works:

Say our systems think that the speed limit information on a particular highway needs to be updated. With the help of third-party imagery partners that already gather roadway imagery to improve delivery routes, we can request a photo of the specific stretch of road that also includes a speed limit sign. If the partner has this photo available, we then use a combination of AI and help from our operations team to identify the sign in the image, extract the new speed limit information, and update Google Maps.

Picture of an intersection that has a speed limit sign

Representative imagery featuring a speed limit sign, with license plates blurred
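To illustrate just the sign-reading step with open-source tools (not the models Google uses), here is a rough sketch that runs OCR over an already-detected, cropped speed limit sign and extracts the number; pytesseract and the cropped-image assumption are ours.

```python
import re
from PIL import Image
import pytesseract

def extract_speed_limit(sign_crop_path):
    """Read the numeric speed limit from a cropped sign image via OCR.
    Assumes detection/cropping of the sign has already happened upstream."""
    text = pytesseract.image_to_string(
        Image.open(sign_crop_path),
        config="--psm 6 -c tessedit_char_whitelist=0123456789",
    )
    match = re.search(r"\b(\d{2,3})\b", text)
    return int(match.group(1)) if match else None  # None -> route to human review

# Example (hypothetical path): print(extract_speed_limit("sign_crop.png"))
```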

Over time, this technology will bring more details to the map that can help make your drives safer and more efficient — like where potholes and school zones are or where new construction is happening. And as with all Google Maps features, we designed this pilot with privacy top of mind. For instance, we only reference images taken on public roads, and partners are required to blur information (like faces and license plates) to avoid potentially identifying someone. For an extra layer of privacy, we blur the photo again when we receive it and delete the photo after we use it to update the map.

AI, imagery and Duplex technology will continue to play a critical role in helping make Google Maps the most comprehensive and useful map possible. For more behind-the-scenes looks at the technology that powers Google Maps, check out the rest of our Maps 101 blog series.

Read More

Try This Out: GFN Thursday Delivers Instant-Play Game Demos on GeForce NOW

GeForce NOW is about bringing new experiences to gamers.

This GFN Thursday introduces game demos to GeForce NOW. Members can now try out some of the hit games streaming on the service before purchasing the full PC version — including some finalists from the 2021 Epic MegaJam.

Plus, look for six games ready to stream from the GeForce NOW library starting today.

In addition, the 2.0.39 app update is rolling out for PC and Mac with a few fixes to improve the experience.

Dive In to Cloud Gaming With Demos

GeForce NOW supports new ways to play and is now offering free game demos to help gamers discover titles to play on the cloud — easy to find in the “Instant Play Free Demos” row.

Gamers can stream these demos before purchasing the full PC versions from popular stores like Steam, Epic Games Store, Ubisoft Connect, GOG and more. The demos are hosted on GeForce NOW, allowing members to check them out instantly — just click to play!

The first wave of demos, with more to come, includes: Chorus, Ghostrunner, Inscryption, Diplomacy Is Not an Option and The RiftBreaker Prologue.

Members can even get a taste of the full GeForce NOW experience with fantastic Priority and RTX 3080 membership features like RTX in Ghostrunner and DLSS in Chorus.

On top of these great titles, demos of some finalists from the 2021 Epic MegaJam will be brought straight from Unreal Engine to the cloud.

Zoom and nyoom to help BotiBoi gather as many files as possible and upload them to the server before the inevitable system crash in Boti Boi by the Purple Team. Or assist a user by keeping files organized for fast access while seeking beeBots in Microwasp Seekers by Partly Atomic.

Keep an eye out for updates on demos coming to the cloud on GFN Thursdays and in the GeForce NOW app.

Get Your Game On 

TUNIC on GeForce NOW
Play as a small fox on a big adventure in TUNIC, now streaming through both Steam and Epic Games Store. 

Ready to jump into a weekend full of gaming?

GFN Thursday always comes with a new batch of games joining the GeForce NOW library. Check out these six titles ready to stream this week:

Finally, last week GFN Thursday announced that Star Control: Origins would be coming to the cloud later in April. The game is already available to stream on GeForce NOW.

With all these great games available to try out, we’ve got a question for you this week. Let us know on Twitter or in the comments below.

The post Try This Out: GFN Thursday Delivers Instant-Play Game Demos on GeForce NOW appeared first on NVIDIA Blog.

Read More

Discovering the systematic errors made by machine learning models

Discovering systematic errors with cross-modal embeddings

In this blog post, we introduce Domino, a new approach for discovering systematic errors made by machine learning models. We also discuss a framework for quantitatively evaluating methods like Domino.

Links:
📄 Paper (ICLR 2022)
🌍 Longer Walkthrough
💻 GitHub
📘 Docs
📒 Google Colab

Machine learning models that achieve high overall accuracy often make systematic errors on coherent slices of validation data.

What is a slice? A slice is a set of data samples that share a common characteristic. As an example, in large image datasets, photos of vintage cars comprise a slice (i.e. all images in the slice share a common subject). The term slice has a number of synonyms that you might be more familiar with (e.g. subgroup, subpopulation, stratum). These terms are largely interchangeable, but we’ll stick with “slice” throughout this post. We say that a model underperforms on a slice if performance on the data samples in the slice is significantly worse than its overall performance.
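To make “underperforms on a slice” concrete, here is a small sketch that compares per-slice accuracy against overall accuracy; the 10-point margin is an arbitrary threshold chosen for the example, not a value from the post.

```python
import numpy as np

def underperforming_slices(correct, slice_ids, margin=0.10):
    """Return slices whose accuracy falls `margin` or more below overall accuracy.
    `correct` is a boolean array over validation examples; `slice_ids` assigns
    each example to a slice (e.g. "vintage car photos")."""
    correct = np.asarray(correct, dtype=float)
    slice_ids = np.asarray(slice_ids)
    overall = correct.mean()
    flagged = {}
    for s in np.unique(slice_ids):
        mask = slice_ids == s
        acc = correct[mask].mean()
        if acc <= overall - margin:
            flagged[s] = {"accuracy": float(acc), "size": int(mask.sum())}
    return overall, flagged
```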

The search for underperforming slices is a critical, but often overlooked, part of model evaluation. When practitioners are aware of the slices on which their models underperform, they can make more informed decisions around model deployment. This is particularly important in safety-critical settings like medicine: a diagnostic model that underperforms on younger patients should likely not be deployed at a pediatric hospital. Slice awareness can also help practitioners debug and improve models: after an underperforming slice is identified, we can improve model robustness by either updating the training dataset or using robust optimization techniques (e.g. Sohoni et al., 2020; Sagawa et al., 2020).

Deploying models that underperform on critical data slices may have significant safety or fairness consequences. For example, models trained to detect collapsed lungs in chest X-rays have been shown to make predictions based on the presence of chest drains, a device typically used during treatment (Oakden-Rayner, 2019). As a result, these models often fail to detect collapsed lung in images without chest drains, a critical data slice where false negative predictions could be life-threatening.

However, in practice, some underperforming slices are hard to find. The examples in these “hidden” data slices are tied together by a concept not annotated in metadata or easily extracted from unstructured inputs (e.g. images, video, time-series data). Returning to our example from earlier, many chest X-ray datasets do not provide metadata indicating which patients’ images show chest tubes, making it difficult to evaluate performance on the slice. This raises the following question: How can we automatically identify data slices on which our model underperforms?

In this blog post, we discuss our recent exploration of this question. We introduce Domino, a novel method for identifying and describing underperforming slices. We also discuss an evaluation framework for rigorously evaluating our method across diverse slice types, tasks, and datasets.

What is slice discovery?

Slice discovery is the task of mining unstructured input data (e.g. images, videos, audio) for semantically meaningful subgroups on which a model performs poorly. We refer to automated techniques that mine input data for semantically meaningful slices as slice discovery methods (SDM). Given a labeled validation dataset and a trained classifier, an SDM computes a set of slicing functions that partition the dataset into slices. This process is illustrated below.

In order to be broadly useful across diverse settings, an ideal SDM should satisfy the following desiderata:

  1. Identified slices should contain examples on which the model underperforms, or has a high error rate.
  2. Identified slices should contain examples that are coherent, or align closely with a human-understandable concept.

This second desideratum is particularly hard to achieve: existing evaluations have shown that discovered slices often do not align with concepts understandable to a domain expert. Further, even if slices do align well with concepts, it may be difficult for humans to identify the commonality.

Domino: Slice discovery with cross-modal embeddings

In our work, we introduce Domino, a slice discovery method designed to identify coherent, underperforming data slices (i.e. groups of similar validation data points on which the model makes errors). It leverages a powerful class of recently-developed cross-modal representation learning approaches, which yield semantically-meaningful representations by embedding images and text in the same latent space. We demonstrate that using cross-modal representations both improves slice coherence and enables Domino to generate natural language descriptions for identified slices!

Domino follows a three-step procedure, illustrated in the figure above (a simplified code sketch follows the list):

  1. Embed: Domino encodes the validation images alongside text in a shared embedding space using a cross-modal encoder. In many domains, such encoders are publicly available (e.g. CLIP for natural images, VideoCLIP for natural videos, ConVIRT for medical images, and CLASP for amino acid sequences).
  2. Slice: Using an error-aware mixture model, Domino identifies regions in the embedding space with a high concentration of errors.
  3. Describe: Finally, to help practitioners understand the commonalities among the examples in each slice, Domino generates natural language descriptions of the slices. To do so, it leverages the cross-modal embeddings computed in Step 1, surfacing the text nearest to the slice in embedding space.
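The sketch below is a heavily simplified stand-in for these three steps: it assumes the cross-modal image and text embeddings have already been computed (e.g., with CLIP) and substitutes an off-the-shelf Gaussian mixture for Domino's error-aware mixture model, so it illustrates the idea rather than reproducing the published method.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def discover_slices_sketch(img_emb, txt_emb, candidate_text, labels, pred_probs,
                           n_slices=10, top_k=5):
    """img_emb: (N, d) image embeddings; txt_emb: (M, d) embeddings of M candidate
    phrases in the same space; labels, pred_probs: (N,) ground truth and predictions."""
    labels = np.asarray(labels, dtype=float)
    pred_probs = np.asarray(pred_probs, dtype=float)

    # Step 2 (slice): cluster in a space that also sees labels and predictions,
    # so components can concentrate around regions with many errors.
    feats = np.hstack([img_emb, labels[:, None], pred_probs[:, None]])
    assignments = GaussianMixture(n_components=n_slices, covariance_type="diag",
                                  random_state=0).fit_predict(feats)

    errors = (pred_probs > 0.5) != (labels > 0.5)
    slices = []
    for s in range(n_slices):
        mask = assignments == s
        if not mask.any():
            continue
        # Step 3 (describe): surface the candidate text nearest to the slice centroid.
        centroid = img_emb[mask].mean(axis=0)
        sims = txt_emb @ centroid / (
            np.linalg.norm(txt_emb, axis=1) * np.linalg.norm(centroid) + 1e-8)
        slices.append({
            "slice": int(s),
            "size": int(mask.sum()),
            "error_rate": float(errors[mask].mean()),
            "descriptions": [candidate_text[i] for i in np.argsort(-sims)[:top_k]],
        })
    return sorted(slices, key=lambda d: -d["error_rate"])
```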

We now use Domino to audit a popular off-the-shelf classifier: a ResNet18 pretrained on ImageNet. Specifically, we interrogate the model’s ability to detect cars, exploring whether there are any interesting slices on which the model underperforms. In the figure below we show a couple of the slices that Domino discovered. The gray boxes show the natural language descriptions of the two slices produced by Domino, the $X$ row shows the top six images predicted by Domino to be in the slice, the $Y$ row shows the ground-truth label assigned to the image, and the $\hat{Y}$ row shows the ResNet18’s predicted probability for the “car” class. Note that although we only include six images here, the complete slice includes dozens of images.

From these slices, we might hypothesize that the model struggles to recognize photos of cars taken from the inside and photos of racecars. Both of these slices describe rare subclasses of the target class. Depending on the intended use case for the model, we may want to add more training examples to boost performance in these slices. For example, Waymo (an autonomous vehicle company) may not care much whether the model misses photos of car interiors, but ESPN (a broadcaster with the television rights for Formula 1) would care a lot if the model couldn’t recognize race cars! Clearly, it’s important to practitioners that discovered slices map onto coherent concepts.

Evaluating slice discovery methods

In designing Domino, we were inspired by a number of really exciting slice discovery methods that were recently proposed. These include The Spotlight (D’Eon et al. 2022), GEORGE (Sohoni et al. 2020), and MultiAccuracy Boost (Kim et al. 2018). These methods all have (1) an embed step and (2) a slice step, like Domino, but use different embeddings and slicing algorithms. In our experiments, we evaluate SDMs along these two axes, ablating both the choice of embedding and the slicing algorithm. Notably, these methods do not include a (3) describe step, and generally require users to manually inspect examples and identify common attributes.

SDMs like Domino have traditionally been evaluated qualitatively, due to a lack of a simple quantitative approach. Typically, in these evaluations, the SDM is applied to a few different models and identified slices are visualized. Practitioners can then inspect the slices and judge whether the slices “make sense.” However, these qualitative evaluations are subjective and do not scale beyond more than a few settings. Moreover, they cannot tell us if the SDM has missed an important, coherent slice.

Ideally, we’d like to estimate the failure rate of an SDM: how often it fails to identify a coherent slice on which the model underperforms. Estimating this failure rate is very challenging because we don’t typically know the full set of slices on which a model underperforms. How could we possibly know if the SDM is missing any?

To solve this problem, we trained 1,235 deep classifiers that were specifically constrained to underperform on pre-defined slices. We did this across three domains: natural images, medical images and medical time-series. Our approach involved (1) obtaining a dataset with some annotated slices (e.g. a dataset with interesting annotated attributes, like CelebA or MIMIC-CXR), and (2) manipulating the dataset such that, with high probability, a model trained on it would exhibit poor performance on one or more of the annotated slices (e.g. by subsampling the dataset to induce a spurious correlation between the label and a metadata field).
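As a sketch of the dataset-manipulation step, the snippet below subsamples a labeled dataset so that a binary metadata attribute co-occurs with the positive label most of the time; the column names and correlation strength are placeholders rather than the paper's actual settings.

```python
import numpy as np
import pandas as pd

def induce_spurious_correlation(df, label_col, attr_col, corr=0.9, seed=0):
    """Keep label/attribute-aligned rows with probability `corr` and misaligned
    rows with probability 1 - corr.  A model trained on the result tends to lean
    on the attribute, and therefore underperforms on the minority slices
    (label=1, attr=0) and (label=0, attr=1)."""
    rng = np.random.default_rng(seed)
    aligned = (df[label_col] == 1) == (df[attr_col] == 1)
    keep_prob = np.where(aligned, corr, 1.0 - corr)
    keep = rng.random(len(df)) < keep_prob
    return df[keep].reset_index(drop=True)
```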

Using this evaluation framework, we were able to evaluate Domino quantitatively. We find that Domino accurately identifies 36% of the 1,235 slices in our framework. Further, the natural language description of the generated slice exactly matches the name of the slice in 35% of settings.

We were also able to compare SDMs and run ablation studies evaluating specific SDM design choices. Two key findings emerged from these experiments:

  1. Cross-modal embeddings improve SDM performance. We found that the choice of representation matters – a lot! Slice discovery methods based on cross-modal embeddings outperform those based on a single modality by at least 9 percentage points in precision-at-10. When compared with using the activations of the trained model, the gap grows to 15 percentage points. This finding is of particular interest given that classifier activations are a popular embedding choice in existing SDMs.
  2. Modeling both the prediction and class label enables accurate slice discovery. Good embeddings alone do not suffice – the choice of algorithm for actually extracting the underperforming slices from the embedding space is significant as well. We find that a simple mixture model that jointly models the embeddings, labels and predictions enables a 105% improvement over the next best slicing algorithm. We hypothesize that this is because this algorithm is unique in modeling the class labels and the model’s predictions as separate variables, which leads to slices that are “pure” in their error type (false positive vs. false negative).

However, there’s still a long way to go: slice discovery is a challenging task, and Domino, the best performing method in our experiments, still fails to recover over 60% of coherent slices. We see a number of exciting avenues for future work that could begin to close this gap.

  • We suspect that improvements in the embeddings that power slice discovery will be driven by large cross-modal datasets, so work in dataset curation and management could help move the needle.

  • In this blog post, we described slice discovery as a fully automated process; in the future, however, we expect that effective slice discovery systems will be highly interactive: practitioners will be able to quickly explore slices and provide feedback. Forager, a system for rapid data exploration, is an exciting step in this direction.

We are really excited to continue working on this important problem and to collaborate with others as we seek to develop more reliable slice discovery methods. To facilitate this process, we are releasing 984 models and their associated slices as part of dcbench, a suite of data-centric benchmarks. This will allow others to reproduce our results and develop new slice discovery methods. We are also releasing domino, a Python package containing implementations of popular slice discovery methods. If you’ve developed a new slice discovery method and would like us to add it to the library, please reach out!

Read More

Fast and Luxurious: The Intelligent NIO ET7 EV Built on NVIDIA DRIVE Orin Arrives

Meet the electric vehicle that’s quick-witted and fully outfitted.

Last week, NIO began deliveries of its highly anticipated ET7 fully electric vehicle in Hefei, China. The full-size luxury sedan is the first production vehicle built on the NIO Adam supercomputer, powered by four NVIDIA DRIVE Orin systems-on-a-chip (SoCs).

The production launch of its flagship sedan follows a blockbuster year for NIO. In 2021, the EV maker delivered 91,429 vehicles, more than quadrupling sales from 2019.

The software-defined ET7 bounds past current model capabilities, boasting more than 620 miles of battery range and an impressive 0-to-60 mph in under 4 seconds.

With the DRIVE Orin-powered Adam, the ET7’s centralized, high-performance compute architecture powers advanced AI features and allows continuous over-the-air upgrades. As a result, the intelligent vehicle redefines the customer experience, with an AI-enhanced cockpit and point-to-point autonomous driving capabilities.

Sensors on the bottom of the sleek ET7 detect the road surface in real time so the vehicle can automatically adjust the suspension, creating a smoother, more luxurious ride.

The opulent interior and immersive augmented reality digital cockpit inside the sedan interact with the driver through voice recognition and driver monitoring. The sedan comes standard with over 100 configurations for comfort, safety and smart technologies.

Peak Performance

The ET7 excels in both drive quality and AI compute.

The NIO Adam supercomputer is one of the most powerful platforms to run in a vehicle, achieving more than 1,000 trillion operations per second (TOPS) of performance.

At its core is DRIVE Orin, the world’s most advanced autonomous vehicle processor. It delivers up to 254 TOPS to simultaneously run a high number of deep neural networks and applications while achieving systematic safety standards such as ISO 26262 ASIL-D.

By integrating multiple DRIVE Orin SoCs, Adam achieves the diversity and redundancy necessary for safe autonomous operation.

On the Horizon

Following the start of ET7 deliveries, NIO is slated to launch a mid-sized performance sedan, the ET5 — also built on the Adam supercomputer — in September.

NIO plans to enter global markets with the ET7 in Germany, Denmark, Sweden and the Netherlands later this year. With a goal of bringing one of the most advanced AI platforms to more customers, NIO intends to have vehicle offerings in 25 countries and regions by 2025.

With the ET7 now entering the market, customers can enjoy a software-defined experience that’s as fast as it is luxurious.

The post Fast and Luxurious: The Intelligent NIO ET7 EV Built on NVIDIA DRIVE Orin Arrives appeared first on NVIDIA Blog.

Read More

NVIDIA Orin Leaps Ahead in Edge AI, Boosting Leadership in MLPerf Tests

In its debut in the industry MLPerf benchmarks, NVIDIA Orin, a low-power system-on-chip based on the NVIDIA Ampere architecture, set new records in AI inference, raising the bar in per-accelerator performance at the edge.

Overall, NVIDIA with its partners continued to show the highest performance and broadest ecosystem for running all machine-learning workloads and scenarios in this fifth round of the industry metric for production AI.

In edge AI, a pre-production version of our NVIDIA Orin led in five of six performance tests. It ran up to 5x faster than our previous generation Jetson AGX Xavier, while delivering an average of 2x better energy efficiency.

NVIDIA Orin is available today in the NVIDIA Jetson AGX Orin developer kit for robotics and autonomous systems. More than 6,000 customers including Amazon Web Services, John Deere, Komatsu, Medtronic and Microsoft Azure use the NVIDIA Jetson platform for AI inference or other tasks.

It’s also a key component of our NVIDIA DRIVE Hyperion platform for autonomous vehicles. China’s largest EV maker, BYD, is the latest automaker to announce it will use the Orin-based DRIVE Hyperion architecture for its next-generation automated EV fleets.

Orin is also a key ingredient in NVIDIA Clara Holoscan for medical devices, a platform that system makers and researchers are using to develop next-generation AI instruments.

Small Module, Big Stack

Servers and devices with NVIDIA GPUs including Jetson AGX Orin were the only edge accelerators to run all six MLPerf benchmarks.

With its JetPack SDK, Orin runs the full NVIDIA AI platform, a software stack already proven in the data center and the cloud. And it’s backed by a million developers using the NVIDIA Jetson platform.

NVIDIA leads in MLPerf inference April 2022
NVIDIA leads across the board in per-accelerator inference performance and is the only company to submit on all workloads.
     Footnote: MLPerf v2.0 Inference Closed; Per-accelerator performance derived from the best MLPerf results for respective submissions using reported accelerator count in Data Center Offline and Server. Qualcomm AI 100: 2.0-130, Intel Xeon 8380 from MLPerf v.1.1 submission: 1.1-023 and 1.1-024, Intel Xeon 8380H 1.1-026, NVIDIA A30: 2.0-090, NVIDIA A100 (Arm): 2.0-077, NVIDIA A100 (X86): 2.0-094. 
     MLPerf name and logo are trademarks. See www.mlcommons.org for more information.​

NVIDIA and partners continue to show leading performance across all tests and scenarios in the latest MLPerf inference round.

The MLPerf benchmarks enjoy broad backing from organizations including Amazon, Arm, Baidu, Dell Technologies, Facebook, Google, Harvard, Intel, Lenovo, Microsoft, Stanford and the University of Toronto.

Most Partners, Submissions

The NVIDIA AI platform again attracted the largest number of MLPerf submissions from the broadest ecosystem of partners.

Azure followed up its solid December debut on MLPerf training tests with strong results in this round on AI inference, both using NVIDIA A100 Tensor Core GPUs. Azure’s ND96amsr_A100_v4 instance matched our highest performing eight-GPU submissions in nearly every inference test, demonstrating the power that’s readily available from the public cloud.

System makers ASUS and H3C made their MLPerf debut in this round with submissions using the NVIDIA AI platform. They joined returning system makers Dell Technologies, Fujitsu, GIGABYTE, Inspur, Lenovo, Nettrix, and Supermicro that submitted results on more than two dozen NVIDIA-Certified Systems.

Why MLPerf Matters

Our partners participate in MLPerf because they know it’s a valuable tool for customers evaluating AI platforms and vendors.

MLPerf’s diverse tests cover today’s most popular AI workloads and scenarios. That gives users confidence the benchmarks will reflect performance they can expect across the spectrum of their jobs.

Software Makes It Shine

All the software we used for our tests is available from the MLPerf repository.

Two key components that enabled our inference results — NVIDIA TensorRT for optimizing AI models and NVIDIA Triton Inference Server for deploying them efficiently — are available free on NGC, our catalog of GPU-optimized software.

Organizations around the world are embracing Triton, including cloud service providers such as Amazon and Microsoft.

We continuously fold all our optimizations into containers available on NGC. That way every user can get started putting AI into production with leading performance.

The post NVIDIA Orin Leaps Ahead in Edge AI, Boosting Leadership in MLPerf Tests appeared first on NVIDIA Blog.

Read More

Receive notifications for image analysis with Amazon Rekognition Custom Labels and analyze predictions

Amazon Rekognition Custom Labels is a fully managed computer vision service that allows developers to build custom models to classify and identify objects in images that are specific and unique to your business.

Rekognition Custom Labels doesn’t require you to have any prior computer vision expertise. You can get started by simply uploading tens of images instead of thousands. If the images are already labeled, you can begin training a model in just a few clicks. If not, you can label them directly within the Rekognition Custom Labels console, or use Amazon SageMaker Ground Truth to label them. Rekognition Custom Labels uses transfer learning to automatically inspect the training data, select the right model framework and algorithm, optimize the hyperparameters, and train the model. When you’re satisfied with the model accuracy, you can start hosting the trained model with just one click.

However, if you’re a business user looking to solve a computer vision problem, visualize inference results of the custom model, and receive notifications when such inference results are available, you have to rely on your engineering team to build such an application. For example, an agricultural operations manager can be notified when a crop is found to have a disease, a winemaker can be notified when the grapes are ripe for harvesting, or a store manager can be notified when it’s time to restock inventories such as soft drinks in a vertical refrigerator.

In this post, we walk you through the process of building a solution that allows you to visualize the inference result and send notifications to subscribed users when specific labels are identified in images that are processed using models built by Rekognition Custom Labels.

Solution overview

The following diagram illustrates our solution architecture.

Architecture Diagram

This solution uses the following AWS services to implement a scalable and cost-effective architecture:

  • Amazon Athena – A serverless interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
  • AWS Lambda – A serverless compute service that lets you run code in response to triggers such as changes in data, shifts in system state, or user actions. Because Amazon S3 can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
  • Amazon QuickSight – A very fast, easy-to-use, cloud-powered business analytics service that makes it easy to build visualizations, perform ad hoc analysis, and quickly get business insights from the data.
  • Amazon Rekognition Custom Labels – Allows you to train a custom computer vision model to identify the objects and scenes in images that are specific to your business needs.
  • Amazon Simple Notification Service – Amazon SNS is a fully managed messaging service for both application-to-application (A2A) and application-to-person (A2P) communication.
  • Amazon Simple Queue Service – Amazon SQS is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications.
  • Amazon Simple Storage Service – Amazon S3 serves as an object store for your documents and allows for central management with fine-tuned access controls.

The solution utilizes a serverless workflow that gets triggered when an image is uploaded to the input S3 bucket. An SQS queue receives an event notification for object creation. The solution also creates dead-letter queues (DLQs) to set aside and isolate messages that can’t be processed correctly. A Lambda function reads from the SQS queue and calls the DetectCustomLabels API to detect all labels in the image. To scale this solution and keep the design loosely coupled, the Lambda function sends the prediction results to another SQS queue. This SQS queue triggers another Lambda function, which analyzes all the labels found in the predictions. Based on the user preference (configured during solution deployment), the function publishes a message to an SNS topic. The SNS topic is configured to deliver email notifications to the user. You can configure the Lambda function to add a URL to the message sent to Amazon SNS to access the image (using an Amazon S3 presigned URL). Finally, the Lambda function uploads the prediction result and image metadata to an S3 bucket. You can then use Athena and QuickSight to analyze and visualize the results from the S3 bucket.
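For illustration, here is a condensed boto3 sketch of the label-checking logic. The solution above splits the work across two Lambda functions and two SQS queues, while this sketch folds it into one handler, and the environment variable names are assumptions rather than the names used in the CloudFormation template.

```python
import json
import os
from urllib.parse import unquote_plus

import boto3

rekognition = boto3.client("rekognition")
sns = boto3.client("sns")

def handler(event, context):
    """Triggered by the SQS queue that receives S3 object-created notifications."""
    labels_of_interest = set(os.environ["LABELS_OF_INTEREST"].split(","))
    min_confidence = float(os.environ.get("MIN_CONFIDENCE", "80"))

    for record in event["Records"]:
        body = json.loads(record["body"])            # S3 event wrapped in the SQS message
        for s3_record in body.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = unquote_plus(s3_record["s3"]["object"]["key"])

            response = rekognition.detect_custom_labels(
                ProjectVersionArn=os.environ["CUSTOM_MODEL_ARN"],
                Image={"S3Object": {"Bucket": bucket, "Name": key}},
                MinConfidence=min_confidence,
            )
            hits = [l for l in response["CustomLabels"]
                    if l["Name"] in labels_of_interest]
            if hits:
                sns.publish(
                    TopicArn=os.environ["SNS_TOPIC_ARN"],
                    Subject="Labels of interest detected",
                    Message=json.dumps({"image": f"s3://{bucket}/{key}", "labels": hits}),
                )
```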

Prerequisites

You need to have a model trained and running with Rekognition Custom Labels.

Rekognition Custom Labels lets you manage the machine learning model training process on the Amazon Rekognition console, which simplifies the end-to-end model development process. For this post, we use a classification model trained to detect plant leaf disease.

Deploy the solution

You deploy an AWS CloudFormation template to provision the necessary resources, including S3 buckets, SQS queues, an SNS topic, Lambda functions, and AWS Identity and Access Management (IAM) roles. The template creates the stack in the us-east-1 Region, but you can use it to create your stack in any Region where these AWS services are available.

  1. Launch the following CloudFormation template in the Region and AWS account where you deployed the Rekognition Custom Labels model:

  2. For Stack name, enter a stack name, such as rekognition-customlabels-analytics-and-notification.
  3. For CustomModelARN, enter the ARN of the Amazon Rekognition Custom Labels model that you want to use.

The Rekognition Custom Labels model needs to be deployed in the same AWS account.

  4. For EmailNotification, enter an email address where you want to receive notifications.
  5. For InputBucketName, enter a unique name for the S3 bucket the stack creates; for example, plant-leaf-disease-data-input.

This is where the incoming plant leaf images are stored.

  6. For LabelsofInterest, you can enter up to 10 different labels you want to be notified of, in comma-separated format. For our plant disease example, enter bacterial-leaf-blight,leaf-smut.
  7. For MinConfidence, enter the minimum confidence threshold to receive notification. Labels detected with a confidence below the value of MinConfidence aren’t returned in the response and will not generate a notification.
  8. For OutputBucketName, enter a unique name for the S3 bucket the stack creates; for example, plant-leaf-disease-data-output.

The output bucket contains JSON files with image metadata (labels found and confidence score).

  9. Choose Next.

  10. On the Configure stack options page, set any additional parameters for the stack, including tags.
  11. Choose Next.
  12. In the Capabilities and transforms section, select the check box to acknowledge that AWS CloudFormation might create IAM resources.
  13. Choose Create stack.

The stack details page should show the status of the stack as CREATE_IN_PROGRESS. It can take up to 5 minutes for the status to change to CREATE_COMPLETE.

Amazon SNS will send a subscription confirmation message to the email address. You need to confirm the subscription.

Test the solution

Now that we have deployed the resources, we’re ready to test the solution. Make sure you start the model.
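If you prefer to start the model programmatically instead of from the console, a minimal boto3 call looks like the following; the ARN is a placeholder for your model version.

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# Starting the model takes several minutes and incurs charges while it runs;
# check its status with describe_project_versions (or on the console) before testing.
rekognition.start_project_version(
    ProjectVersionArn="<REKOGNITION-CUSTOM-LABELS-MODEL-VERSION-ARN>",
    MinInferenceUnits=1,
)
```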

  1. On the Amazon S3 console, choose Buckets.
  2. Choose the input S3 bucket.

  3. Upload test images to the bucket.

In production, you can set up automated processes to deliver images to this bucket.

These images trigger the workflow. If the label confidence exceeds the specified threshold, you receive an email notification like the following.

You can also configure the SNS topic to deliver these notifications to any destinations supported by the service.

Analyze the prediction results

After you test the solution, you can extend the solution to create a visual analysis for the predictions of processed images. For this purpose, we use Athena, an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL, and QuickSight to visualize the data.

Configure Athena

If you are not familiar with Amazon Athena, see this tutorial. On the Athena console, create a table in the Athena data catalog with the following code:

CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`rekognition_customlabels_analytics` (
`Image` string,
`Label` string,
`Confidence` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://<<OUTPUT BUCKET NAME>>/'
TBLPROPERTIES ('has_encrypted_data'='false');

Populate the Location field in the preceding query with your output bucket name, such as plant-leaf-disease-data-output.

This code tells Athena how to interpret each row of the text in the S3 bucket.

You can now query the data:

SELECT * FROM "default"."rekognition_customlabels_analytics" limit 10;

Configure QuickSight

To configure QuickSight, complete the following steps:

  1. Open the QuickSight console.
  2. If you’re not signed up for QuickSight, you’re prompted with the option to sign up. Follow the steps to sign up to use QuickSight.
  3. After you log in to QuickSight, choose Manage QuickSight under your account.

  4. In the navigation pane, choose Security & permissions.
  5. Under QuickSight access to AWS services, choose Add or remove.

A page appears for enabling QuickSight access to AWS services.

  6. Select Amazon Athena.

  7. In the pop-up window, choose Next.

  8. On the S3 tab, select the necessary S3 buckets. For this post, we select the bucket that stores the Athena query results.
  9. For each bucket, also select Write permission for Athena Workgroup.
  10. Choose Finish.
  11. Choose Update.
  12. On the QuickSight console, choose New analysis.
  13. Choose New dataset.
  14. For Datasets, choose Athena.
  15. For Data source name, enter Athena-CustomLabels-analysis.
  16. For Athena workgroup, choose primary.
  17. Choose Create data source.

  18. For Database, choose default on the drop-down menu.
  19. For Tables, select the table rekognition_customlabels_analytics.
  20. Choose Select.

  21. Choose Visualize.

  22. On the Visualize page, under the Fields list, choose label and select the pie chart from Visual types.

You can add more visualizations in the dashboard. When your analysis is ready, you can choose Share to create a dashboard and share it within your organization.

Summary

In this post, we showed how you can create a solution to receive notifications for specific labels (such as bacterial leaf blight or leaf smut) found in processed images using Rekognition Custom Labels. In addition, we showed how you can create dashboards to visualize the results using Athena and QuickSight.

You can now easily share such visualization dashboards with business users and allow them to subscribe to notifications instead of having to rely on your engineering teams to build such an application.


About the Authors

Jay Rao is a Principal Solutions Architect at AWS. He enjoys providing technical and strategic guidance to customers and helping them design and implement solutions on AWS.

Pashmeen Mistry is the Senior Product Manager for Amazon Rekognition Custom Labels. Outside of work, Pashmeen enjoys adventurous hikes, photography, and spending time with his family.

Read More