Swede-sational: Linköping University to Build Country’s Fastest AI Supercomputer

The land famed for its midsummer festivities and everyone’s favorite flatpack furniture store is about to add another jewel to its crown.

Linköping University, home to 36,000 staff and students, has announced its plans to build Sweden’s fastest AI supercomputer, based on the NVIDIA DGX SuperPOD computing infrastructure.

Carrying the name of renowned Swedish scientist Jacob Berzelius — considered to be one of the founders of modern chemistry — the new BerzeLiUs supercomputer will deliver 300 petaflops of AI performance to power state-of-the-art AI research and deep learning models.

The effort is spearheaded by a 300 million Swedish Krona ($33.6 million) donation from the Knut and Alice Wallenberg Foundation to accelerate Swedish AI research across academia and industry. The foundation heads the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) network — the country’s largest private research initiative focused on AI innovation.

“I am extremely happy and proud that Linköping University will, through the National Supercomputer Centre, be host for this infrastructure”, says Jan-Ingvar Jönsson, vice-chancellor of Linköping University. “This gives us confidence that Sweden is not simply maintaining its international position, but also strengthening it.”

A Powerful New AI Resource

Hosting world-class supercomputers is nothing new for the team at Linköping University.

The Swedish National Supercomputer Centre (NSC) already houses six traditional supercomputers on campus, with a combined total of 6 petaflops of performance. Included among these is Tetralith, which held the title of the most powerful supercomputer in the Nordics after its installation in 2018.

But with BerzeLiUs the team is making a huge leap.

“BerzeLiUs will be more than twice as fast as Tetralith,” confirmed Niclas Andersson, technical director at NSC. “This is a super-fast AI resource — the fastest computing cluster we have ever installed.”

The powerful new AI resource will boost collaboration between academia and leading Swedish industrial companies, primarily those financed by the Knut and Alice Wallenberg Foundation, such as the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) as well as other life science and quantum technology initiatives.

Full Speed Ahead

Building a leading AI supercomputer can take years of planning and development. But by building BerzeLiUs with NVIDIA DGX SuperPOD technology, Linköping will be able to deploy the fully integrated system and start running complex AI models as the new year begins.

The system will be built and installed by Atos. Initially, the supercomputer will consist of 60 NVIDIA DGX A100 systems interconnected across an NVIDIA Mellanox InfiniBand fabric and 1.5 petabytes of high-performance storage from DDN. BerzeLiUs will also feature the Atos Codex AI Suite, enabling researchers to speed up processing times on their complex data.

“This new supercomputer will supercharge AI research in Sweden,” said Jaap Zuiderveld, vice president for EMEA at NVIDIA. “It will position Sweden as a leader in academic research, and it will give Swedish businesses a competitive edge in telecommunications, design, drug development, manufacturing and more industries.”

Join Linköping University at GTC

Dive deeper into the cutting-edge research performed at Linköping University. Join Anders Eklund, associate professor at Linköping University, and Joel Hedlund, data director at AIDA, to explore how AI is powering innovation in radiology and pathology imaging.

It’s not too late to get access to hundreds of live and on-demand talks at GTC. Register for GTC now through Oct. 9 using promo code CMB4KN to get 20 percent off. Academics, students, government, and nonprofit attendees join free when registering with their organization’s email address.


Read More

Collaborating with AI to create Bach-like compositions in AWS DeepComposer

AWS DeepComposer provides a creative and hands-on experience for learning generative AI and machine learning (ML). We recently launched the Edit melody feature, which allows you to add, remove, or edit specific notes, giving you full control of the pitch, length, and timing for each note. In this post, you can learn to use the Edit melody feature to collaborate with the autoregressive convolutional neural network (AR-CNN) algorithm and create interesting Bach-style compositions.

Through human-AI collaboration, we can surpass what humans and AI systems can create independently. For example, you can seek inspiration from AI to create art or music outside their area of expertise or offload the more routine tasks, like creating variations on a melody, and focus on the more interesting and creative tasks. Alternatively, you can assist the AI by correcting mistakes or removing artifacts it creates. You can also influence the output generated by the AI system by controlling the various training and inference parameters.

You can co-create music in the AWS DeepComposer Music Studio by collaborating with the AI (AR-CNN) model using the Edit melody feature. The AR-CNN Bach model modifies a melody note by note to guide the track towards sounding more Bach-like. You can modify four advanced parameters when you perform inference to influence how the input melody is modified:

  • Maximum notes to add – Changes the maximum number of notes added to your original melody
  • Maximum notes to remove – Changes the maximum number of notes removed from your original melody
  • Sampling iterations – Changes the exact number of times you add or remove a note based on note-likelihood distributions inferred by the model
  • Creative risk – Allows the AI model to deviate from creating Bach-like harmonies

The values you choose directly impact the composition created by the model by nudging the model in one way or another. For more information about these parameters, see AWS DeepComposer Learning Capsule on using the AR-CNN model.

Although the advanced parameters allow you to guide the output the AR-CNN model creates, they don’t provide note-level control over the music produced. For example, the AR-CNN model allows you to control the number of notes to add or remove during inference, but you don’t have control over the exact notes the model adds or removes.

The Edit melody feature bridges this gap by providing an interactive view of the generated melody so you can add missing notes, remove out-of-tune notes, or even change a note’s pitch and length. This granular level of editing facilitates better human-AI collaboration. It enables you to correct mistakes the model makes and harmonize the output to your liking, giving you more ownership of the creation process.

For this post, we explore the use case of co-creating Bach-like background music to match the following video.

Collaborating with AI using the AWS DeepComposer Music Studio

To start composing your melody, complete the following steps:

  1. Open the AWS DeepComposer Music Studio console.
  2. Choose an Input melody.

You can record a custom melody, import a melody, or choose a sample melody on the console.  For this post, we experimented with two melodies: the New World sample melody and a custom melody we created using the MIDI keyboard.

New World melody:

Custom melody:

  3. Choose the Autoregressive generative AI technique.
  4. Choose the Autoregressive CNN Bach model.

There are several considerations when choosing the advanced parameters. First, we wanted the original input melody to be recognizable. After some iterating, we found that setting the Maximum notes to add to 60 and Maximum notes to remove to 40 created a desirable outcome. For Creative risk, we wanted the model to create something interesting and adventurous. At the same time, we realized that a very high Creative risk value would deviate too much from the Bach style, so we took a moderate approach and chose a Creative risk of 2.

  5. You can repeat these steps a few times to iteratively create music.

Editing your input melody

After the AR-CNN model has generated a composition to your satisfaction, you can use the Edit melody feature to modify the melody and try to match the video’s transitions as much as possible.

  1. Choose the right arrow to open the input melody section.
  2. Choose Edit melody.
  3. On the Edit melody page, edit your track in any of the following ways:
    • Choose a cell (double-click) to add or remove a note at that pitch or time.
    • Drag a cell up or down to change a note’s pitch.
    • Drag the edge of a cell left or right to change a note’s length.
  4. When finished, choose Apply changes.

We drew inspiration from the AI-generated notes in different ways. For the New World melody, we noticed the model added short and bouncy notes (the circles with solid lines in the following screenshot), which made the composition sound similar to an American folk song. To match that style, we added a few notes in the second half of the composition (the dotted-lined circles).

For our custom melody, we noticed the model changed the chords slightly earlier than expected (see the following screenshot). This created lingering and overlapping sounds that we liked for the mountain road scenes.

On the other hand, we noticed the AI model needed our help to remove some notes that sounded out of place. After we listened to the track a few times, we decided to change some pitches manually to nudge the track towards something that sounded a bit more harmonious.

Generating accompaniments using the GAN generative AI technique

After using the AR-CNN Bach model to explore options for our melody track, we decided to try using a different generative AI model (GAN) to create musical accompaniments.

  1. Under Model parameters, for Generative AI technique, choose Generative adversarial network.
  2. Feed the edited compositions to the GAN model to generate accompaniments.

We chose the MuseGAN generative algorithm and the Symphony model because we wanted to create accompaniments to match the serene and somber setting in the video.

  3. You can optionally export your compositions into a music-editing tool of your choice to change the instrument set and perform post-processing.

Let’s watch the videos containing our AI-inspired creations in the background.

The first video uses the New World melody.

The following video uses our custom melody.

Conclusion

In this post, we demonstrated how to use the Edit melody feature in the AWS DeepComposer Music Studio to collaborate with generative AI models and create interesting Bach-style compositions. You can modify a melody to your liking by adding, removing, and editing specific notes. This gives you full control of the pitch, length, and timing for each note to produce an original melody.


About the Authors

 Rahul Suresh is an Engineering Manager with the AWS AI org, where he works on AI-based products that make machine learning accessible to all developers. Prior to joining AWS, Rahul was a Senior Software Developer at Amazon Devices and helped launch highly successful smart home products. Rahul is passionate about building machine learning systems at scale and is always looking to get these advanced technologies into the hands of customers. In addition to his professional career, Rahul is an avid reader and a history buff.

 

 

Enoch Chen is a Senior Technical Program Manager for AWS AI Devices. He is a big fan of machine learning and loves to explore innovative AI applications. Recently he helped bring DeepComposer to thousands of developers. Outside of work, Enoch enjoys playing piano and listening to classical music.

 

 

 

Carlos Daccarett is a Front-End Engineer at AWS. He loves bringing design mocks to life. In his spare time, he enjoys hiking, golfing, and snowboarding.

 

 

 

 

Dylan Jackson is a Senior ML Engineer and AI Researcher at AWS. He works to build experiences which facilitate the exploration of AI/ML, making new and exciting techniques accessible to all developers. Before AWS, Dylan was a Senior Software Developer at Goodreads where he leveraged both a full-stack engineering and machine learning skillset to protect millions of readers from spam, high-volume robotic traffic, and scaling bottlenecks. Dylan is passionate about exploring both the theoretical underpinnings and the real-world impact of AI/ML systems. In addition to his professional career, he enjoys reading, cooking, and working on small crafts projects.

Read More

Evaluating an automatic speech recognition service

Over the past few years, many automatic speech recognition (ASR) services have entered the market, offering a variety of different features. When deciding whether to use a service, you may want to evaluate its performance and compare it to another service. This evaluation process often analyzes a service along multiple vectors such as feature coverage, customization options, security, performance and latency, and integration with other cloud services.

Depending on your needs, you’ll want to check for features such as speaker labeling, content filtering, and automatic language identification. Basic transcription accuracy is often a key consideration during these service evaluations. In this post, we show how to measure the basic transcription accuracy of an ASR service in six easy steps, provide best practices, and discuss common mistakes to avoid.

Illustration showing a table of contents: The evaluation basics, six steps for performing an evaluation, and best practices and common mistakes to avoid.

The evaluation basics

Defining your use case and performance metric

Before starting an ASR performance evaluation, you first need to consider your transcription use case and decide how to measure a good or bad performance. Literal transcription accuracy is often critical. For example, how many word errors are in the transcripts? This question is especially important if you pay annotators to review the transcripts and manually correct the ASR errors, and you want to minimize how much of the transcript needs to be re-typed.

The most common metric for speech recognition accuracy is called word error rate (WER), which is recommended by the US National Institute of Standards and Technology for evaluating the performance of ASR systems. WER is the proportion of transcription errors that the ASR system makes relative to the number of words that were actually said. The lower the WER, the more accurate the system. Consider this example:

Reference transcript (what the speaker said): well they went to the store to get sugar

Hypothesis transcript (what the ASR service transcribed): they went to this tour kept shook or

In this example, the ASR service doesn’t appear to be accurate, but how many errors did it make? To quantify WER, there are three categories of errors:

  • Substitutions – When the system transcribes one word in place of another. Transcribing the fifth word as this instead of the is an example of a substitution error.
  • Deletions – When the system misses a word entirely. In the example, the system deleted the first word well.
  • Insertions – When the system adds a word into the transcript that the speaker didn’t say, such as or inserted at the end of the example.

Of course, counting errors in terms of substitutions, deletions, and insertions isn’t always straightforward. If the speaker says “to get sugar” and the system transcribes kept shook or, one person might count that as a deletion (to), two substitutions (kept instead of get and shook instead of sugar), and an insertion (or). A second person might count that as three substitutions (kept instead of to, shook instead of get, and or instead of sugar). Which is the correct approach?

WER gives the system the benefit of the doubt, and counts the minimum number of possible errors. In this example, the minimum number of errors is six. The following aligned text shows how to count errors to minimize the total number of substitutions, deletions, and insertions:

REF: WELL they went to THE  STORE TO   GET   SUGAR
HYP: **** they went to THIS TOUR  KEPT SHOOK OR
     D                 S    S     S    S     S

Many ASR evaluation tools use this format. The first line shows the reference transcript, labeled REF, and the second line shows the hypothesis transcript, labeled HYP. The words in each transcript are aligned, with errors shown in uppercase. If a word was deleted from the reference or inserted into the hypothesis, asterisks are shown in place of the word that was deleted or inserted. The last line shows D for the word that was deleted by the ASR service, and S for words that were substituted.

Don’t worry if these aren’t the actual errors that the system made. With the standard WER metric, the goal is to find the minimum number of words that you need to correct. For example, the ASR service probably didn’t really confuse “get” and “shook,” which sound nothing alike. The system probably misheard “sugar” as “shook or,” which do sound very similar. If you take that into account (and there are variants of WER that do), you might end up counting seven or eight word errors. However, for the simple case here, all that matters is counting how many words you need to correct without needing to identify the exact mistakes that the ASR service made.

You might recognize this as the Levenshtein edit distance between the reference and the hypothesis. WER is defined as the normalized Levenshtein edit distance:

WER = (substitutions + deletions + insertions) / number of words in the reference

In other words, it’s the minimum number of words that need to be corrected to change the hypothesis transcript into the reference transcript, divided by the number of words that the speaker originally said. Our example would have the following WER calculation:

WER = (5 substitutions + 1 deletion + 0 insertions) / 9 reference words = 6 / 9 ≈ 0.67

WER is often multiplied by 100, so the WER in this example might be reported as 0.67, 67%, or 67. This means the service made errors for 67% of the reference words. Not great! The best achievable WER score is 0, which means that every word is transcribed correctly with no inserted words. On the other hand, there is no worst WER score—it can even go above 1 (above 100%) if the system made a lot of insertion errors. In that case, the system is actually making more errors than there are words in the reference—not only does it get all the words wrong, but it also manages to add new wrong words to the transcript.
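
If you want to see how that minimum number of edits is found, the following Python sketch implements the word-level Levenshtein calculation described above and applies it to the example. The function and variable names are our own, for illustration only, and don’t come from any particular ASR toolkit.

# Word-level Levenshtein distance divided by the reference length (illustrative sketch).
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] holds the minimum edits needed to turn the first j hypothesis words
    # into the first i reference words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + substitution)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("well they went to the store to get sugar",
                      "they went to this tour kept shook or"))  # 0.67 (6 errors / 9 reference words)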

For other performance metrics besides WER, see the section Adapting the performance metric to your use case later in this post.

Normalizing and preprocessing your transcripts

When calculating WER and many other metrics, keep in mind that the problem of text normalization can drastically affect the calculation. Consider this example:

Reference: They will tell you again: our ballpark estimate is $450.

ASR hypothesis: They’ll tell you again our ball park estimate is four hundred fifty dollars.

The following code shows how most tools would count the word errors if you just leave the transcripts as-is:

REF: THEY WILL    tell you AGAIN: our **** BALLPARK estimate is **** ******* ***** $450.   
HYP: **** THEY'LL tell you AGAIN  our BALL PARK     estimate is FOUR HUNDRED FIFTY DOLLARS.
     D    S                S          I    S                    I    I       I     S

The word error rate would therefore be:

WER = (4 substitutions + 1 deletion + 4 insertions) / 10 reference words = 9 / 10 = 0.90

According to this calculation, there were errors for 90% of the reference words. That doesn’t seem right. The ASR hypothesis is basically correct, with only small differences:

  • The words they will are contracted to they’ll
  • The colon after again is omitted
  • The term ballpark is spelled as a single compound word in the reference, but as two words in the hypothesis
  • $450 is spelled with numerals and a currency symbol in the reference, but the ASR system spells it using the alphabet as four hundred fifty dollars

The problem is that you can write down the original spoken words in more than one way. The reference transcript spells them one way and the ASR service spells them in a different way. Depending on your use case, you may or may not want to count these written differences as errors that are equivalent to missing a word entirely.

If you don’t want to count these kinds of differences as errors, you should normalize both the reference and the hypothesis transcripts before you calculate WER. Normalizing involves changes such as:

  • Lowercasing all words
  • Removing punctuation (except apostrophes)
  • Contracting words that can be contracted
  • Expanding written abbreviations to their full forms (such as Dr. to doctor)
  • Spelling all compound words with spaces (such as blackboard to black board or part-time to part time)
  • Converting numerals to words (or vice-versa)

If there are other differences that you don’t want to count as errors, you might consider additional normalizations. For example, some languages have multiple spellings for some words (such as favorite and favourite) or optional diacritics (such as naïve vs. naive), and you may want to convert these to a single spelling before calculating WER. We also recommend removing filled pauses like uh and um, which are irrelevant for most uses of ASR and therefore shouldn’t be included in the WER calculation.
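
As a starting point, here is a minimal Python normalization sketch that applies a few of the rules above (lowercasing, stripping punctuation except apostrophes, and removing filled pauses). The rules and the FILLED_PAUSES set are illustrative assumptions; contraction handling, compound splitting, and number conversion usually need custom rules or lookup tables for your own data.

import re

# Illustrative filled-pause list; extend it for your own audio.
FILLED_PAUSES = {"uh", "um", "er", "mm", "hmm"}

def normalize(transcript: str) -> str:
    text = transcript.lower()
    # Drop punctuation but keep apostrophes so contractions survive.
    text = re.sub(r"[^\w\s']", " ", text)
    words = [w for w in text.split() if w not in FILLED_PAUSES]
    return " ".join(words)

print(normalize("They'll tell you again: our ballpark estimate is, uh, $450."))
# they'll tell you again our ballpark estimate is 450   (numbers still need a custom rule)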

A second, related issue is that WER by definition counts the number of whole word errors. Many tools define words as strings separated by spaces for this calculation, but not all writing systems use spaces to separate words. In this case, you may need to tokenize the text before calculating WER. Alternatively, for writing systems where a single character often represents a word (such as Chinese), you can calculate a character error rate instead of a word error rate, using the same procedure.

Six steps for performing an ASR evaluation

To evaluate an ASR service using WER, complete the following steps:

  1. Choose a small sample of recorded speech.
  2. Transcribe it carefully by hand to create reference transcripts.
  3. Run the audio sample through the ASR service.
  4. Create normalized ASR hypothesis transcripts.
  5. Calculate WER using an open-source tool.
  6. Make an assessment using the resulting measurement.

Choosing a test sample

Choosing a good sample of speech to evaluate is critical, and you should do this before you create any ASR transcripts in order to avoid biasing the results. You should think about the sample in terms of utterances. An utterance is a short, uninterrupted stretch of speech that one speaker produces without any silent pauses. The following are three example utterances:

An utterance is sometimes one complete sentence, but people don’t always talk in complete sentences—they hesitate, start over, or jump between multiple thoughts within the same utterance. Utterances are often only one or two words long and are rarely more than 50 words. For the test sample, we recommend selecting utterances that are 25–50 words long. However, this is flexible and can be adjusted if your audio contains mostly short utterances, or if short utterances are especially important for your application.

Your test sample should include at least 800 spoken utterances. Ideally, each utterance should be spoken by a different person, unless you plan to transcribe speech from only a few individuals. Choose utterances from representative portions of your audio. For example, if there is typically background traffic noise in half of your audio, then half of the utterances in your test sample should include traffic noise as well. If you need to extract utterances from long audio files, you can use a tool like Audacity.

Creating reference transcripts

The next step is to create reference transcripts by listening to each utterance in your test sample and writing down what the speaker said, word for word. Creating these reference transcripts by hand can be time-consuming, but it’s necessary for performing the evaluation. Write the transcript for each utterance on its own line in a plain text file named reference.txt, as shown below.

hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still under warranty so i wanted to see if someone could come look at it
no i checked everywhere the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
i tried to update my address on the on your web site but it just says error code 402 disabled account id after i filled out the form

The reference transcripts are extremely literal, including when the speaker hesitates and restarts in the third utterance (on the on your). If the transcripts are in English, write them using all lowercase with no punctuation except for apostrophes, and in general be sure to pay attention to the text normalization issues that we discussed earlier. In this example, besides lowercasing and removing punctuation from the text, compound words have been normalized by spelling them as two words (ice maker, web site), the initialism I.D. has been spelled as a single lowercase word id, and the number 402 is spelled using numerals rather than the alphabet. By applying these same strategies to both the reference and the hypothesis transcripts, you can ensure that different spelling choices aren’t counted as word errors.

Running the sample through the ASR service

Now you’re ready to run the test sample through the ASR service. For instructions on doing this on the Amazon Transcribe console, see Create an Audio Transcript. If you’re running a large number of individual audio files, you may prefer to use the Amazon Transcribe developer API.

Creating ASR hypothesis transcripts

Take the hypothesis transcripts generated by the ASR service and paste them into a plain text file with one utterance per line. The order of the utterances must correspond exactly to the order in the reference transcript file that you created: if line 3 of your reference transcripts file has the reference for the utterance pat went to the store, then line 3 of your hypothesis transcripts file should have the ASR output for that same utterance.

The following is the ASR output for the three utterances:

Hi I'm calling about a refrigerator I bought from you The ice maker stopped working and it's still in the warranty so I wanted to see if someone could come look at it
No I checked everywhere in the mailbox The package room I asked my neighbor who sometimes gets my packages but it hasn't shown up yet
I tried to update my address on the on your website but it just says error code 40 to Disabled Accounts idea after I filled out the form

These transcripts aren’t ready to use yet—you need to normalize them first using the same normalization conventions that you used for the reference transcripts. First, lowercase the text and remove punctuation except apostrophes, because differences in case or punctuation aren’t considered as errors for this evaluation. The word website should be normalized to web site to match the reference transcript. The number is already spelled with numerals, and it looks like the initialism I.D. was transcribed incorrectly, so no need to do anything there.

After the ASR outputs have been normalized, the final hypothesis transcripts look like the following:

hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still in the warranty so i wanted to see if someone could come look at it
no i checked everywhere in the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
i tried to update my address on the on your web site but it just says error code 40 to disabled accounts idea after i filled out the form

Save these transcripts to a plain text file named hypothesis.txt.

Calculating WER

Now you’re ready to calculate WER by comparing the reference and hypothesis transcripts. This post uses the open-source asr-evaluation evaluation tool to calculate WER, but other tools such as SCTK or JiWER are also available.

Install the asr-evaluation tool (if you’re using it) with pip install asr-evaluation, which makes the wer script available on the command line. Use the following command to compare the reference and hypothesis text files that you created:

wer -i reference.txt hypothesis.txt

The script prints something like the following:

REF: hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still ** UNDER warranty so i wanted to see if someone could come look at it
HYP: hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still IN THE   warranty so i wanted to see if someone could come look at it
SENTENCE 1
Correct          =  96.9%   31   (    32)
Errors           =   6.2%    2   (    32)
REF: no i checked everywhere ** the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
HYP: no i checked everywhere IN the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
SENTENCE 2
Correct          = 100.0%   24   (    24)
Errors           =   4.2%    1   (    24)
REF: i tried to update my address on the on your web site but it just says error code ** 402 disabled ACCOUNT  ID   after i filled out the form
HYP: i tried to update my address on the on your web site but it just says error code 40 TO  disabled ACCOUNTS IDEA after i filled out the form
SENTENCE 3
Correct          =  89.3%   25   (    28)
Errors           =  14.3%    4   (    28)
Sentence count: 3
WER:     8.333% (         7 /         84)
WRR:    95.238% (        80 /         84)
SER:   100.000% (         3 /          3)

If you want to calculate WER manually instead of using a tool, you can do so by calculating the Levenshtein edit distance between the reference and hypothesis transcript pairs divided by the total number of words in the reference transcripts. When you’re calculating the Levenshtein edit distance between the reference and hypothesis, be sure to calculate word-level edits, rather than character-level edits, unless you’re evaluating a written language where every character is a word.

In the evaluation output above, you can see the alignment between each reference transcript REF and hypothesis transcript HYP. Errors are printed in uppercase, or using asterisks if a word was deleted or inserted. This output is useful if you want to re-count the number of errors and recalculate WER manually to exclude certain types of words and errors from your calculation. It’s also useful to verify that the WER tool is counting errors correctly.

At the end of the output, you can see the overall WER: 8.333%. Before you go further, skim through the transcript alignments that the wer script printed out. Check whether the references correspond to the correct hypotheses. Do the error alignments look reasonable? Are there any text normalization differences that are being counted as errors that shouldn’t be?

Making an assessment

What should the WER be if you want good transcripts? The lower the WER, the more accurate the system. However, the WER threshold that determines whether an ASR system is suitable for your application ultimately depends on your needs, budget, and resources. You’re now equipped to make an objective assessment using the best practices we shared, but only you can decide what error rate is acceptable.

You may want to compare two ASR services to determine if one is significantly better than the other. If so, you should repeat the previous three steps for each service, using exactly the same test sample. Then, count how many utterances have a lower WER for the first service compared to the second service. If you’re using asr-evaluation, the WER for each individual utterance is shown as the percentage of Errors below each utterance.

If one service has a lower WER than the other for at least 429 of the 800 test utterances, you can conclude that this service provides better transcriptions of your audio. 429 represents a conventional threshold for statistical significance when using a sign test for this particular sample size. If your sample doesn’t have exactly 800 utterances, you can manually calculate the sign test to decide if one service has a significantly lower WER than the other. This test assumes that you followed good practices and chose a representative sample of utterances.
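
If you’d rather script the comparison, the sketch below uses the JiWER package mentioned above to compute a per-utterance WER for two services and then applies a two-sided binomial sign test. The hypothesis file names are hypothetical, and all transcripts are assumed to be normalized already, one utterance per line in the same order as reference.txt.

from math import comb
from jiwer import wer  # pip install jiwer

def read_lines(path):
    with open(path) as f:
        return [line.strip() for line in f]

refs = read_lines("reference.txt")
hyps_a = read_lines("hypothesis_service_a.txt")  # hypothetical file name
hyps_b = read_lines("hypothesis_service_b.txt")  # hypothetical file name

a_wins = b_wins = 0
for ref, a, b in zip(refs, hyps_a, hyps_b):
    wer_a, wer_b = wer(ref, a), wer(ref, b)
    if wer_a < wer_b:
        a_wins += 1
    elif wer_b < wer_a:
        b_wins += 1
# Tied utterances are dropped, as is standard for a sign test.

n = a_wins + b_wins
k = max(a_wins, b_wins)
# Probability of a split at least this lopsided if both services were equally good (p = 0.5).
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n)
print(f"Service A wins: {a_wins}, Service B wins: {b_wins}, sign-test p-value: {p_value:.4f}")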

Adapting the performance metric to your use case

Although this post uses the standard WER metric, the most important consideration when evaluating ASR services is to choose a performance metric that reflects your use case. WER is a great metric if the hypothesis transcripts will be corrected, and you want to minimize the number of words to correct. If this isn’t your goal, you should carefully consider other metrics.

For example, if your use case is keyword extraction and your goal is to see how often a specific set of target keywords occur in your audio, you might prefer to evaluate ASR transcripts using metrics such as precision, recall, or F1 score for your keyword list, rather than WER.
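
For example, a minimal keyword-level scoring sketch might look like the following. The keyword list is hypothetical, and both transcripts are assumed to be normalized with the same conventions described earlier.

from collections import Counter

KEYWORDS = {"refrigerator", "warranty", "mailbox"}  # hypothetical keyword list

def keyword_scores(reference: str, hypothesis: str):
    ref_counts = Counter(w for w in reference.split() if w in KEYWORDS)
    hyp_counts = Counter(w for w in hypothesis.split() if w in KEYWORDS)
    true_positives = sum(min(ref_counts[k], hyp_counts[k]) for k in KEYWORDS)
    precision = true_positives / max(sum(hyp_counts.values()), 1)
    recall = true_positives / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

print(keyword_scores(
    "the refrigerator is still under warranty",
    "the refrigerator is still in the warranty"))
# (1.0, 1.0, 1.0): both keywords were recovered despite the other word errors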

If you’re creating automatic captions that won’t be corrected, you might prefer to evaluate ASR systems in terms of how useful the captions are to viewers, rather than the minimum number of word errors. With this in mind, you can roughly divide English words into two categories:

  • Content words – Verbs like “run”, “write”, and “find”; nouns like “cloud”, “building”, and “idea”; and modifiers like “tall”, “careful”, and “quickly”
  • Function words – Pronouns like “it” and “they”; determiners like “the” and “this”; conjunctions like “and”, “but”, and “or”; prepositions like “of”, “in”, and “over”; and several other kinds of words

For creating uncorrected captions and extracting keywords, it’s more important to transcribe content words correctly than function words. For these use cases, we recommend ignoring function words and any errors that don’t involve content words in your calculation of WER. There is no definite list of function words, but this file provides one possible list for North American English.
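
One way to approximate this is to strip function words from both transcripts before scoring them. The short FUNCTION_WORDS set below is only an illustrative subset; in practice you would use a fuller list such as the file mentioned above. The word_error_rate() function is the sketch from earlier in this post.

# Illustrative subset of English function words; use a fuller list in practice.
FUNCTION_WORDS = {
    "a", "an", "the", "this", "that", "it", "they",
    "and", "but", "or", "of", "in", "on", "to", "over",
    "is", "are", "was", "were", "be",
}

def strip_function_words(transcript: str) -> str:
    return " ".join(w for w in transcript.split() if w not in FUNCTION_WORDS)

content_wer = word_error_rate(
    strip_function_words("well they went to the store to get sugar"),
    strip_function_words("they went to this tour kept shook or"),
)
print(content_wer)  # word errors counted over content words only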

Common mistakes to avoid

If you’re comparing two ASR services, it’s important to evaluate the ASR hypothesis transcript produced by each service using a true reference transcript that you create by hand, rather than comparing the two ASR transcripts to each other. Comparing ASR transcripts to each other lets you see how different the systems are, but won’t give you any sense of which service is more accurate.

We emphasized the importance of text normalization for calculating WER. When you’re comparing two different ASR services, the services may offer different features, such as true-casing, punctuation, and number normalization. Therefore, the ASR output for two systems may be different even if both systems correctly recognized exactly the same words. This needs to be accounted for in your WER calculation, so you may need to apply different text normalization rules for each service to compare them fairly.

Avoid informally eyeballing ASR transcripts to evaluate their quality. Your evaluation should be tailored to your needs, such as minimizing the number of corrections, maximizing caption usability, or counting keywords. An informal visual evaluation is sensitive to features that stand out from the text, like capitalization, punctuation, proper names, and numerals. However, if these features are less important than word accuracy for your use case—such as if the transcripts will be used for automatic keyword extraction and never seen by actual people—then an informal visual evaluation won’t help you make the best decision.

Useful resources

The following are tools and open-source software that you may find useful:

  • asr-evaluation – a command-line tool for calculating WER, used in this post
  • SCTK – the NIST scoring toolkit for speech recognition
  • JiWER – a Python package for calculating WER
  • Audacity – an open-source audio editor, useful for extracting utterances from long recordings

Conclusion

This post discusses a few of the key elements needed to evaluate the performance aspect of an ASR service in terms of word accuracy. However, word accuracy is only one of the many dimensions that you need to evaluate when choosing a particular ASR service. It’s critical that you include other parameters such as the ASR service’s total feature set, ease of use, existing integrations, privacy and security, customization options, scalability implications, customer service, and pricing.


About the Authors

Scott Seyfarth is a Data Scientist at AWS AI. He works on improving the Amazon Transcribe and Transcribe Medical services. Scott is also a phonetician and a linguist who has done research on Armenian, Javanese, and American English.

 

 

 

Paul Zhao is Product Manager at AWS AI. He manages Amazon Transcribe and Amazon Transcribe Medical. In his past life, Paul was a serial entrepreneur, having launched and operated two startups with successful exits.

Read More

Make your everyday smarter with Jacquard

Technology is most helpful when it’s frictionless. We believe computing should power experiences through the everyday things around you—an idea we call “ambient computing.” That’s why we developed the Jacquard platform to deliver ambient computing in a familiar, natural way: by building it into things you wear, love and use every day.

The heart of Jacquard is the Jacquard Tag, a tiny computer built to make everyday items more helpful. We first used this on the sleeve of a jacket so that it could recognize the gestures of the person wearing it, and we built that same technology into the Cit-E backpack with Saint Laurent. Then, we collaborated with Adidas and EA on our GMR shoe insert, enabling its wearers to combine real-life play with the EA SPORTS FIFA mobile game. 

Whether it’s touch or movement-based, the tag can interpret different inputs customized for the garments and gear we’ve collaborated with brands to create. And now we’re sharing that two new backpacks, developed with Samsonite, will integrate Jacquard technology. A fine addition to our collection, the Konnect-I Backpack comes in two styles: Slim ($199) and Standard ($219).

While they might look like regular backpacks, the left strap unlocks tons of capabilities. Using your Jacquard app, you can customize what gestures control which actions—for instance, you can program Jacquard to deliver call and text notifications, trigger a selfie, control your music or prompt Google Assistant to share the latest news. For an added level of interaction, the LED light on your left strap will light up according to the alerts you’ve set.

This is only the beginning for the Jacquard platform, and thanks to updates, you can expect your Jacquard Tag gear to get better over time. Just like Google wants to make the world’s information universally accessible and useful, we at Jacquard want to help people access information through everyday items and natural movements.

Read More

NVIDIA CEO Outlines Vision for ‘Age of AI’ in News-Packed GTC Kitchen Keynote

Outlining a sweeping vision for the “age of AI,” NVIDIA CEO Jensen Huang Monday kicked off this week’s GPU Technology Conference.

Huang made major announcements in data centers, edge AI, collaboration tools and healthcare in a talk simultaneously released in nine episodes, each under 10 minutes.

“AI requires a whole reinvention of computing – full-stack rethinking – from chips, to systems, algorithms, tools, the ecosystem,” Huang said, standing in front of the stove of his Silicon Valley home.

Behind a series of announcements touching on everything from healthcare to robotics to videoconferencing, Huang’s underlying story was simple: AI is changing everything, which has put NVIDIA at the intersection of changes that touch every facet of modern life.

More and more of those changes can be seen, first, in Huang’s kitchen, with its playful bouquet of colorful spatulas, which has served as the increasingly familiar backdrop for announcements throughout the COVID-19 pandemic.

“NVIDIA is a full stack computing company – we love working on extremely hard computing problems that have great impact on the world – this is right in our wheelhouse,” Huang said. “We are all-in, to advance and democratize this new form of computing – for the age of AI.”

This week’s GTC is one of the biggest yet. It features more than 1,000 sessions—400 more than the last GTC—in 40 topic areas. And it’s the first to run across the world’s time zones, with sessions in English, Chinese, Korean, Japanese, and Hebrew.

Accelerated Data Center 

Modern data centers, Huang explained, are software-defined, making them more flexible and adaptable.

That creates an enormous load. Running a data center’s infrastructure can consume 20-30 percent of its CPU cores. And as east-west traffic (traffic within a data center) and microservices increase, this load will increase dramatically.

“A new kind of processor is needed,” Huang explained: “We call it the data processing unit.”

The DPU consists of accelerators for networking, storage, security and programmable Arm CPUs to offload the hypervisor, Huang said.

The new NVIDIA BlueField-2 DPU is a programmable processor with powerful Arm cores and acceleration engines for at-line-speed processing for networking, storage and security. It’s the latest fruit of NVIDIA’s acquisition of high-speed interconnect provider Mellanox Technologies, which closed in April.

Data Center — DOCA — A Programmable Data Center Infrastructure Processor

NVIDIA also announced DOCA, its programmable data-center-infrastructure-on-a-chip architecture.

“DOCA SDKs let developers write infrastructure apps for software-defined networking, software-defined storage, cybersecurity, telemetry and in-network computing applications yet to be invented,” Huang said.

Huang also touched on a partnership with VMware, announced last week, to port VMware onto BlueField. VMware “runs the world’s enterprises — they are the OS platform in 70 percent of the world’s companies,” Huang explained.

Data Center — DPU Roadmap in ‘Full Throttle’

Further out, Huang said NVIDIA’s DPU roadmap shows advancements coming fast.

BlueField-2 is sampling now, BlueField-3 is finishing and BlueField-4 is in high gear, Huang reported.

“We are going to bring a ton of technology to networking,” Huang said. “In just a couple of years, we’ll span nearly 1,000 times in compute throughput” on the DPU.

BlueField-4, arriving in 2023, will add support for the CUDA parallel programming platform and NVIDIA AI — “turbocharging the in-network computing vision.”

You can get those capabilities now, Huang announced, with the new BlueField-2X. It adds an NVIDIA Ampere GPU to BlueField-2 for in-networking computing with CUDA and NVIDIA AI.

“BlueField-2X is like having a BlueField-4, today,” Huang said.

Data Center — GPU Inference Momentum

Consumer internet companies are also turning to NVIDIA technology to deliver AI services.

Inference — which puts fully-trained AI models to work — is key to a new generation of AI-powered consumer services.

In aggregate, NVIDIA GPU inference compute in the cloud already exceeds all cloud CPUs, Huang said.

Huang announced that Microsoft is adopting NVIDIA AI on Azure to power smart experiences on Microsoft Office, including smart grammar correction and text prediction.

Microsoft Office joins Square, Twitter, eBay, GE Healthcare and Zoox, among other companies, in a broad array of industries using NVIDIA GPUs for inference.

Data Center — Cloudera and VMware 

The ability to put vast quantities of data to work, fast, is key to modern AI and data science.

NVIDIA RAPIDS is the fastest extract, transform, load (ETL) engine on the planet, and it supports multi-GPU and multi-node operation.

NVIDIA modeled its API after hugely popular data science frameworks — pandas, XGBoost and scikit-learn — so RAPIDS is easy to pick up.

On the industry-standard data processing benchmark, which runs 30 complex database queries on a 10TB dataset, a 16-node NVIDIA DGX cluster ran 20x faster than the fastest CPU server.

Yet it’s one-seventh the cost and uses one-third the power.

Huang announced that Cloudera, whose hybrid-cloud data platform lets you manage, secure, analyze and learn predictive models from data, will accelerate the Cloudera Data Platform with NVIDIA RAPIDS, NVIDIA AI and NVIDIA-accelerated Spark.

NVIDIA and VMware also announced a second partnership, Huang said.

The companies will create a data center platform that supports GPU acceleration for all three major computing domains today: virtualized, distributed scale-out and composable microservices.

“Enterprises running VMware will be able to enjoy NVIDIA GPU and AI computing in any computing mode,” Huang said.

(Cutting) Edge AI 

Someday, Huang said, trillions of AI devices and machines will populate the Earth – in homes, office buildings, warehouses, stores, farms, factories, hospitals, airports.

The NVIDIA EGX AI platform makes it easy for the world’s enterprises to stand up a state-of-the-art edge-AI server quickly, Huang said. It can control factories of robots, perform automatic checkout at retail or help nurses monitor patients, Huang explained.

Huang announced the EGX platform is expanding to combine the NVIDIA Ampere GPU and BlueField-2 DPU on a single PCIe card. The updates give enterprises a common platform to build secure, accelerated data centers.

Huang also announced an early access program for a new service called NVIDIA Fleet Command. This new application makes it easy to deploy and manage updates across IoT devices, combining the security and real-time processing capabilities of edge computing with the remote management and ease of software-as-a-service.

Among the first companies provided early access to Fleet Command is KION Group, a leader in global supply chain solutions, which is using the NVIDIA EGX AI platform to develop AI applications for its intelligent warehouse systems.

Additionally, Northwestern Memorial Hospital, the No. 1 hospital in Illinois and one of the top 10 in the nation, is working with Whiteboard Coordinator to use Fleet Command for its IoT sensor platform.

“This is the iPhone moment for the world’s industries — NVIDIA EGX will make it easy to create, deploy and operate industrial AI services,” Huang said.

Edge AI — Democratizing Robotics

Soon, Huang added, everything that moves will be autonomous. AI software is the big breakthrough that will make robots smarter and more adaptable. But it’s the NVIDIA Jetson AI computer that will democratize robotics.

Jetson is an Arm-based SoC designed from the ground up for robotics. That’s thanks to the sensor processors, the CUDA GPU and Tensor Cores, and, most importantly, the richness of AI software that runs on it, Huang explained.

The latest addition to the Jetson family, the Jetson Nano 2GB, will be $59, Huang announced. That’s roughly half the cost of the $99 Jetson Nano Developer Kit announced last year.

“NVIDIA Jetson is mighty, yet tiny, energy-efficient and affordable,” Huang said.

Collaboration Tools

The shared, online world of the “metaverse” imagined in Neal Stephenson’s 1992 cyberpunk classic, “Snow Crash,” is already becoming real, in shared virtual worlds like Minecraft and Fortnite, Huang said.

First introduced in March 2019, NVIDIA Omniverse — a platform for simultaneous, real-time simulation and collaboration across a broad array of existing industry tools — is now in open beta.

“Omniverse allows designers, artists, creators and even AIs using different tools, in different worlds, to connect in a common world—to collaborate, to create a world together,” Huang said.

Another tool NVIDIA pioneered, NVIDIA Jarvis conversational AI, is also now in open beta, Huang announced. Using the new SpeedSquad benchmark, Huang showed it’s twice as responsive and more natural sounding when running on NVIDIA GPUs.

It also runs for a third of the cost, Huang said.

“What did I tell you?” Huang said, referring to a catch phrase he’s used in keynotes over the years. “The more you buy, the more you save.”

Collaboration Tools — Introducing NVIDIA Maxine

Video calls have moved from a curiosity to a necessity.

For work, social, school, virtual events, doctor visits — video conferencing is now the most critical application for many people. More than 30 million web meetings take place every day.

To improve this experience, Huang announced NVIDIA Maxine, a cloud-native streaming video AI platform for applications like video calls.

Using AI, Maxine can reduce the bandwidth consumed by video calls by a factor of 10. “AI can do magic for video calls,” Huang said.

“With Jarvis and Maxine, we have the opportunity to revolutionize video conferencing of today and invent the virtual presence of tomorrow,” Huang said.

Healthcare 

When it comes to drug discovery amidst the global COVID-19 pandemic, lives are on the line.

Yet for years the costs of new drug discovery for the $1.5 trillion pharmaceutical industry have risen. New drugs take over a decade to develop, cost over $2.5 billion in research and development — doubling every nine years — and 90 percent of efforts fail.

New tools are needed. “COVID-19 hits home this urgency,” Huang said.

Using breakthroughs in computer science, we can begin to use simulation and in-silico methods to understand the biological machinery of the proteins that affect disease and search for new drug candidates, Huang explained.

To accelerate this, Huang announced NVIDIA Clara Discovery — a state-of-the-art suite of tools for scientists to discover life-saving drugs.

“Where there are popular industry tools, our computer scientists accelerate them,” Huang said. “Where no tools exist, we develop them — like NVIDIA Parabricks, Clara Imaging, BioMegatron, BioBERT, NVIDIA RAPIDS.”

Huang also outlined an effort to build the U.K.’s fastest supercomputer, Cambridge-1, bringing state-of-the-art computing infrastructure to “an epicenter of healthcare research.”

Cambridge-1 will boast 400 petaflops of AI performance, making it among the world’s top 30 fastest supercomputers. It will host NVIDIA’s U.K. AI and healthcare collaborations with academia, industry and startups.

NVIDIA’s first partners are AstraZeneca, GSK, King’s College London, the Guy’s and St Thomas’ NHS Foundation Trust and startup Oxford Nanopore.

NVIDIA also announced a partnership with GSK to build the world’s first AI drug discovery lab.

Arm

Huang wrapped up his keynote with an update on NVIDIA’s partnership with Arm, whose power-efficient designs run the world’s smart devices.

NVIDIA agreed to acquire the U.K. semiconductor designer last month for $40 billion.

“Arm is the most popular CPU in the world,” Huang said. “Together, we will offer NVIDIA accelerated and AI computing technologies to the Arm ecosystem.”

Last year, Huang said, NVIDIA announced it would port CUDA and its scientific computing stack to Arm. Today, Huang announced a major initiative to advance the Arm platform, with investments across three dimensions:

  • First, NVIDIA will complement Arm partners with GPU, networking, storage and security technologies to create complete accelerated platforms.
  • Second, NVIDIA is working with Arm partners to create platforms for HPC, cloud, edge and PC — this requires chips, systems and system software.
  • And third, NVIDIA is porting the NVIDIA AI and NVIDIA RTX engines to Arm.

“Today, these capabilities are available only on x86,” Huang said, “With this initiative, Arm platforms will also be leading-edge at accelerated and AI computing.”

 


Read More

NVIDIA AI on Microsoft Azure Machine Learning to Power Grammar Suggestions in Microsoft Editor for Word

It’s been said that good writing comes from editing. Fortunately for discerning readers everywhere, Microsoft is putting an AI-powered grammar editor at the fingertips of millions of people.

Like any good editor, it’s quick and knowledgeable. That’s because Microsoft Editor’s grammar refinements in Microsoft Word for the web can now tap into NVIDIA Triton Inference Server, ONNX Runtime and Microsoft Azure Machine Learning, which is part of Azure AI, to deliver this smart experience.

Speaking at the digital GPU Technology Conference, NVIDIA CEO Jensen Huang announced the news during the keynote presentation on October 5.

Everyday AI in Office

Microsoft is on a mission to wow users of Office productivity apps with the magic of AI. New, time-saving experiences will include real-time grammar suggestions, question-answering within documents — think Bing search for documents beyond “exact match” — and predictive text to help complete sentences.

Such productivity-boosting experiences are only possible with deep learning and neural networks. For example, unlike services built on traditional rules-based logic, when it comes to correcting grammar, Editor in Word for the web is able to understand the context of a sentence and suggest the appropriate word choices.

 

And these deep learning models, which can involve hundreds of millions of parameters, must be scalable and provide real-time inference for an optimal user experience. Microsoft Editor’s AI model for grammar checking in Word on the web alone is expected to handle more than 500 billion queries a year.

Deployment at this scale could blow up deep learning budgets. Thankfully, NVIDIA Triton’s dynamic batching and concurrent model execution features, accessible through Azure Machine Learning, slashed the cost by about 70 percent and achieved a throughput of 450 queries per second on a single NVIDIA V100 Tensor Core GPU, with less than 200-millisecond response time. Azure Machine Learning provided the required scale and capabilities to manage the model lifecycle such as versioning and monitoring.

High Performance Inference with Triton on Azure Machine Learning

Machine learning models have expanded in size, and GPUs have become necessary during model training and deployment. For AI deployment in production, organizations are looking for scalable inference serving solutions, support for multiple framework backends, optimal GPU and CPU utilization and machine learning lifecycle management.

The NVIDIA Triton and ONNX Runtime stack in Azure Machine Learning delivers scalable, high-performance inferencing. Azure Machine Learning customers can take advantage of Triton’s support for multiple frameworks; real-time, batch and streaming inferencing; dynamic batching; and concurrent execution.
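
As a rough illustration of what calling such a Triton endpoint looks like from Python, here is a minimal sketch using the open-source Triton HTTP client. The server address, model name and tensor names (grammar_model, input_ids, logits) and shapes are hypothetical placeholders, not the actual Microsoft Editor deployment, and a real grammar model would also require tokenizing the input text first.

import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

# Connect to a Triton server (address is a placeholder).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical input tensor for a transformer-style grammar model.
token_ids = np.zeros((1, 128), dtype=np.int64)
infer_input = httpclient.InferInput("input_ids", list(token_ids.shape), "INT64")
infer_input.set_data_from_numpy(token_ids)

# Triton batches concurrent requests like this one behind the scenes
# when dynamic batching is enabled in the model configuration.
response = client.infer(model_name="grammar_model", inputs=[infer_input])
print(response.as_numpy("logits").shape)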

Writing with AI in Word

Author and poet Robert Graves was quoted as saying, “There is no good writing, only good rewriting.”  In other words, write, and then edit and improve.

Editor in Word for the web lets you do both simultaneously. And while Editor is the first feature in Word to gain the speed and breadth of advances enabled by Triton and ONNX Runtime, it is likely just the start of more to come.

 

It’s not too late to get access to hundreds of live and on-demand talks at GTC. Register now through Oct. 9 using promo code CMB4KN to get 20 percent off.

 


Read More

To 3D and Beyond: Pixar’s USD Coming to an Industry Near You

It was the kind of career moment developers dream of but rarely experience. To whoops and cheers from the crowd at SIGGRAPH 2016, Dirk Van Gelder of Pixar Animation Studios launched Universal Scene Description.

USD would become the open-source glue filmmakers used to bind their favorite tools together so they could collaborate with colleagues around the world, radically simplifying the job of creating animated movies. At its birth, it had backing from three seminal partners—Autodesk, Foundry and SideFX.

Today, more than a dozen companies from Apple to Unity support USD. The standard is on the cusp of becoming the solder that fuses all sorts of virtual and physical worlds into environments where everything from skyscrapers to sports cars and smart cities will be designed and tested in simulation.

What’s more, it’s helping spawn machinima, an emerging form of digital storytelling based on game content.

How USD Found an Audience

The 2016 debut “was pretty exciting” for Van Gelder, who spent more than 20 years developing Pixar’s tools.

“We had talked to people about USD, but we weren’t sure they’d embrace it,” he said. “I did a live demo on a laptop of a scene from Finding Dory so they could see USD’s scalability and performance and what we at Pixar could do with it, and they really got the message.”

One of those in the crowd was Rev Lebaredian, vice president of simulation technology at NVIDIA.

“Dirk’s presentation of USD live and in real time inspired us. It triggered a series of ideas and events that led to what is NVIDIA Omniverse today, with USD as its soul. So, it was fate that Dirk would end up on the Omniverse team,” said Lebaredian of the 3D graphics platform, now in open beta, that aims to carry the USD vision forward.

Developers Layer Effects on 3D Graphics

Adobe’s developers were among many others who welcomed USD and now support it in their products.

“USD has a whole world of features that are incredibly powerful,” said Davide Pesare, who worked on USD at Pixar and is now a senior R&D manager at Adobe.

“For example, with USD layering, artists can work in the same scene without stepping on each other’s toes. Each artist has his or her own layer, so you can let the modeler work while someone else is building the shading,” he said.
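
For readers curious what layering looks like in practice, here is a minimal sketch using USD’s Python API (the pxr module): a modeler authors geometry in one layer, a shading artist overrides color in another, and a shot-level stage composes both. The file names and prim paths are made up for illustration.

```python
# Sketch: two artists work in separate USD layers that a shot stage composes.
from pxr import Usd, UsdGeom, Sdf

# The modeler authors geometry in modeling.usda.
model_stage = Usd.Stage.CreateNew("modeling.usda")
UsdGeom.Sphere.Define(model_stage, "/World/Ball")
model_stage.GetRootLayer().Save()

# The shading artist overrides the same prim in shading.usda,
# without ever editing the modeling layer.
shade_layer = Sdf.Layer.CreateNew("shading.usda")
shade_stage = Usd.Stage.Open(shade_layer)
ball = shade_stage.OverridePrim("/World/Ball")
color = ball.CreateAttribute("primvars:displayColor", Sdf.ValueTypeNames.Color3fArray)
color.Set([(1.0, 0.0, 0.0)])
shade_layer.Save()

# The shot stage sublayers both; stronger layers win where opinions overlap.
shot = Usd.Stage.CreateNew("shot.usda")
shot.GetRootLayer().subLayerPaths = ["shading.usda", "modeling.usda"]
shot.GetRootLayer().Save()
```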

“Today USD has spread beyond the film industry where it is pervasive in animation and special effects. Game developers are looking at it, Apple’s products can read it, we have partners in architecture using it and the number of products compatible with USD is only going to grow,” Pesare said.

Thinking on a grand scale: NVIDIA and partner Esri, a specialist in mapping software whose CityEngine uses USD, are both building virtual worlds with the standard.

Building a Virtual 3D Home for Architects

Although it got its start in the movies, USD can play many roles.

Millions of architects, engineers and designers need a way to quickly review progress on construction projects with owners and real-estate developers. Each stakeholder wants different programs, often running on different computers, tablets or even handsets. It’s a script for an IT horror film where USD can write a happy ending.

Companies such as Autodesk, Bentley Systems, McNeel & Associates and Trimble Inc. are already exploring what USD can do for this community. NVIDIA used Omniverse to create a video showing some of the possibilities, such as previewing how the sun will play on the glassy interior of a skyscraper through the day.

Product Design Comes Alive with USD

It’s a similar story with a change of scene in the manufacturing industry. Here, companies have a cast of thousands of complex products they want to quickly design and test, ranging from voice-controlled gadgets to autonomous trucks.

The process requires iterations using programs in the hands of many kinds of specialists who demand photorealistic 3D models. Beyond de rigueur design reviews, they dream of the possibilities like putting visualizations in the hands of online customers.

Showing the shape of things to come, the Omniverse team produced a video for the debut of the NVIDIA DGX A100 system with exploding views of how its 30,000 components snap into a million drill holes. More recently, it generated a video of NVIDIA’s GeForce RTX 30 Series graphics card (below), complete with a virtual tour of its new cooling subsystem, thanks to USD in Omniverse.

“These days my team spends a lot of time working on real-time physics and other extensions of USD for autonomous vehicles and robotics for the NVIDIA Isaac and DRIVE platforms,” Van Gelder said.

To show what’s possible today, engineers used USD to import into Omniverse an accurately modeled luxury car and details of a 17-mile stretch of highway around NVIDIA’s Silicon Valley headquarters. The simulation, to be shown this week at GTC, demonstrates the potential for environments detailed enough to test both vehicles and their automated driving capabilities.

Another team imported Kaya, a robotic car for consumers, so users could program the digital model and test its behavior in an Omniverse simulation before building or buying a physical robot.

The simulation was accurate despite the fact that “the wheels are insanely complex because they can drive forward, backward or sideways,” said Mike Skolones, manager of the team behind NVIDIA Isaac Sim.

Lights! Camera! USD!

In gaming, Epic’s Unreal Engine supports USD, and Unity and Blender are working to support it as well. Their work is accelerating the rise of machinima, a movie-like spinoff from gaming demonstrated in a video for NVIDIA Omniverse Machinima.

Meanwhile, back in Hollywood, studios are well along in adopting USD.

Pixar produced Finding Dory using USD. DreamWorks Animation described its process of adopting USD to create the 2019 feature How to Train Your Dragon: The Hidden World. Walt Disney Animation Studios has blended USD into its pipeline for animated features, too.

Steering USD into the Omniverse

NVIDIA and partners hope to take USD into all these fields and more with Omniverse, an environment one team member describes as “like Google Docs for 3D graphics.”

Omniverse plugs the power of NVIDIA RTX real-time ray-tracing graphics into USD’s collaborative, layered editing. The recent “Marbles at Night” video (below) showcased that blend, created by a dozen artists scattered across the U.S., Australia, Poland, Russia and the U.K.

That’s getting developers like Pesare of Adobe excited.

“All industries are going to want to author everything with real-time texturing, modeling, shading and animation,” said Pesare.

That will pave the way for a revolution in people consuming real-time media with AR and VR glasses linked on 5G networks for immersive, interactive experiences anywhere, he added.

He’s one of more than 400 developers who’ve gotten hands-on with Omniverse so far. Others come from companies like Ericsson, Foster & Partners and Industrial Light & Magic.

USD Gives Lunar Explorers a Hand

The Frontier Development Lab (FDL), a NASA partner, recently approached NVIDIA for help simulating light on the surface of the moon.

Using data from a lunar satellite, the Omniverse team generated images FDL used to create a video for a public talk, explaining its search for water ice on the moon and a landing site for a lunar rover.

Back on Earth, challenges ahead include using USD’s Hydra rendering framework to deliver content at 30 frames per second, potentially blending images from a dozen sources for a filmmaker, an architect or a product designer.

“It’s a Herculean effort to get this in the hands of the first customers for production work,” said Richard Kerris, general manager of NVIDIA’s media and entertainment group and former chief technologist at Lucasfilm. “We’re effectively building an operating system for creatives across multiple markets, so support for USD is incredibly important,” he said.

Kerris called on anyone with an RTX-enabled system to get their hands on the open beta of Omniverse and drive the promise of USD forward.

“We can’t wait to see what you will build,” he said.

It’s not too late to get access to hundreds of live and on-demand talks at GTC. Register now through Oct. 9 using promo code CMB4KN to get 20 percent off.

The post To 3D and Beyond: Pixar’s USD Coming to an Industry Near You appeared first on The Official NVIDIA Blog.

Read More

NVIDIA Jarvis and Merlin Announced in Open Beta, Enabling Conversational AI and Democratizing Recommenders

NVIDIA Jarvis and Merlin Announced in Open Beta, Enabling Conversational AI and Democratizing Recommenders

We’ve all been there: on a road trip and hungry. Wouldn’t it be amazing to ask your car’s driving assistant for recommendations on nearby food, personalized to your taste?

Now, it’s possible for any business to build and deploy such experiences and many more with NVIDIA GPU systems and software libraries. That’s because NVIDIA Jarvis for conversational AI services and NVIDIA Merlin for recommender systems have entered open beta. Speaking today at the GPU Technology Conference, NVIDIA CEO Jensen Huang announced the news.

While AI for voice services and recommender systems has never been more needed in our digital worlds, development tools have lagged. And the need for better voice AI services is rising sharply.

More people are working from home and remotely learning, shopping, visiting doctors and more, putting strains on services and revealing shortcomings in user experiences. Some call centers report a 34 percent increase in hold times and a 68 percent increase in call escalations, according to a report from Harvard Business Review.

Meanwhile, current recommenders personalize the internet but often come up short. Retail recommenders suggest items recently purchased or continue pursuing people with annoying promos. Media and entertainment recommendations are often more of the same and not diverse. These systems are often fairly crude because they rely only on past purchases or simple similarity signals.

NVIDIA Jarvis and NVIDIA Merlin allow companies to explore larger deep learning models, and develop more nuanced and intelligent recommendation systems. Conversational AI services built on Jarvis and recommender systems built on Merlin offer the fast track forward to better services from businesses.

Early Access Jarvis Adopter Advances

Some companies in the NVIDIA Developer program have already begun work on conversational AI services with NVIDIA Jarvis. Early adopters included Voca, an AI agent for call center support; Kensho, for automatic voice transcriptions for finance and business; and Square, offering a virtual assistant for scheduling appointments.

London-based Intelligent Voice, which offers high-performance speech recognition services, is always looking for more, said its CTO, Nigel Cannings.

“Jarvis takes a multimodal approach that fuses key elements of automatic speech recognition with entity and intent matching to address new use cases where high-throughput and low latency are required,” he said. “The Jarvis API is very easy to use, integrate and customize to our customers’ workflows for optimized performance.”

It has allowed Intelligent Voice to pivot quickly during the COVID crisis and bring to market, in record time, a completely new product, Myna, which enables accurate and useful meeting recall.

Better Conversational AI Needed

In the U.S., call center assistants handle 200 million calls per day, and telemedicine services enable 2.4 million daily physician visits, demanding transcriptions with high accuracy.

Traditional voice systems leave room for improvement. With processing constrained by CPUs, their lower-quality models result in lag-filled, robotic voice products. Jarvis includes Megatron-BERT models, among the largest language models available today, to offer high accuracy and low latency.

Deploying real-time conversational AI for natural interactions requires model computations in under 300 milliseconds — versus 600 milliseconds on CPU-powered models.

Jarvis provides more natural interactions through sensor fusion — the integration of video cameras and microphones. Its ability to handle multiple data streams in real time enables the delivery of improved services.

Complex Model Pipelines, Easier Solutions

Model pipelines in conversational AI can be complex and require coordination across multiple services.

Microservices for automatic speech recognition, natural language understanding, text-to-speech and domain-specific apps must run at scale. Accelerated with GPU parallel processing, these specialized tasks gain a 3x cost advantage over a comparable CPU-only server.

NVIDIA Jarvis is a comprehensive framework, offering software libraries for building conversational AI applications and including GPU-optimized services for ASR, NLU, TTS and computer vision that use the latest deep learning models.

Developers can meld these multiple skills within their applications, and quickly help our hungry vacationer find just the right place.
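
As a purely illustrative sketch of that flow, the snippet below chains ASR, NLU and TTS stages into one request handler. The transcribe, understand and synthesize functions are hypothetical stand-ins, not the Jarvis API; in a real application each stage would call a GPU-backed Jarvis service.

```python
# Hypothetical pipeline sketch: speech in, intent understood, speech out.
from dataclasses import dataclass


@dataclass
class Intent:
    name: str
    slots: dict


def transcribe(audio: bytes) -> str:
    # ASR stage (stub): would return the recognized text.
    return "find tacos near me"


def understand(text: str) -> Intent:
    # NLU stage (stub): would extract the caller's intent and slots.
    return Intent(name="find_food", slots={"cuisine": "tacos", "place": "nearby"})


def synthesize(reply: str) -> bytes:
    # TTS stage (stub): would return generated audio for the reply.
    return reply.encode("utf-8")


def handle_request(audio: bytes) -> bytes:
    text = transcribe(audio)
    intent = understand(text)
    reply = f"Here are some {intent.slots['cuisine']} spots {intent.slots['place']}."
    return synthesize(reply)


print(handle_request(b"raw caller audio"))
```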

Merlin Creates a More Relevant Internet

Recommender systems are the engine of the personalized internet and they’re everywhere online. They suggest food you might like, offer items related to your purchases and can capture your interest in the moment with retargeted advertising for product offers as you bounce from site to site.

But when recommenders don’t do their best, people may walk away empty-handed and businesses leave money on the table.

On some of the world’s largest online commerce sites, recommender systems account for as much as 30 percent of revenue. Just a 1 percent improvement in the relevance of recommendations can translate into billions of dollars in revenue.

Recommenders at Scale on GPUs

At Tencent, recommender systems support videos, news, music and apps. Using NVIDIA Merlin, the company reduced its recommender training time from 20 hours to three.

“With the use of the Merlin HugeCTR advertising recommendation acceleration framework, our advertising business model can be trained faster and more accurately, which is expected to improve the effect of online advertising,” said Ivan Kong, AI technical leader at Tencent TEG.

Merlin Democratizes Access to Recommenders

Now everyone has access to the NVIDIA Merlin application framework, which allows businesses of all kinds to build recommenders accelerated by NVIDIA GPUs.

Merlin’s collection of libraries includes tools for building deep learning-based systems that provide better predictions than traditional methods and increase click-through rates. Each stage of the pipeline is optimized to support hundreds of terabytes of data, all accessible through easy-to-use APIs.

Merlin is used at one of the world’s largest media companies and is in testing with hundreds of companies worldwide. Social media giants in the U.S. are experimenting with its ability to share related news. Streaming media services are testing it for suggestions on next views and listens. And major retailers are looking at it for suggestions on next items to purchase.

Those who are interested can learn more about the technology advances behind Merlin since its initial launch, including NVTabular, HugeCTR, multi-GPU support and integration with NVIDIA Triton Inference Server.
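
For a feel of the workflow, here is a minimal sketch of GPU-accelerated feature preprocessing with NVTabular, one of the Merlin libraries, using its Python API. The column names and file paths are assumptions for illustration.

```python
# Sketch: preprocess tabular training data on the GPU with NVTabular.
import nvtabular as nvt
from nvtabular import ops

# Categorical columns get integer-encoded; continuous columns get normalized.
cat_features = ["user_id", "item_id"] >> ops.Categorify()
cont_features = ["price", "age"] >> ops.Normalize()

workflow = nvt.Workflow(cat_features + cont_features)
train_ds = nvt.Dataset("train.parquet")

workflow.fit(train_ds)  # compute the encoding and normalization statistics
workflow.transform(train_ds).to_parquet(output_path="processed/")
```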

Businesses can sign up for the NVIDIA Jarvis beta for access to the latest developments in conversational AI, and get started with the NVIDIA Merlin beta for the fastest way to upload terabytes of training data and deploy recommenders at scale.

It’s not too late to get access to hundreds of live and on-demand talks at GTC. Register now through Oct. 9 using promo code CMB4KN to get 20 percent off.

The post NVIDIA Jarvis and Merlin Announced in Open Beta, Enabling Conversational AI and Democratizing Recommenders appeared first on The Official NVIDIA Blog.

Read More

AI Can See Clearly Now: GANs Take the Jitters Out of Video Calls

AI Can See Clearly Now: GANs Take the Jitters Out of Video Calls

Ming-Yu Liu and Arun Mallya were on a video call when one of them started to break up, then freeze.

It’s an irksome reality of life in the pandemic that most of us have shared. But unlike most of us, Liu and Mallya could do something about it.

They are AI researchers at NVIDIA and specialists in computer vision. Working with colleague Ting-Chun Wang, they realized they could use a neural network in place of the software called a video codec typically used to compress and decompress video for transmission over the net.

Their work enables a video call with one-tenth the network bandwidth users typically need. It promises to reduce bandwidth consumption by orders of magnitude in the future.

“We want to provide a better experience for video communications with AI so even people who only have access to extremely low bandwidth can still upgrade from voice to video calls,” said Mallya.

Better Connections Thanks to GANs

The technique works even when callers are wearing a hat, glasses, headphones or a mask. And just for fun, they spiced up their demo with a couple bells and whistles so users can change their hair styles or clothes digitally or create an avatar.

A more serious feature in the works (shown at top) uses the neural network to align the position of users’ faces for a more natural experience. Callers watch their video feeds, but they appear to be looking directly at their cameras, enhancing the feeling of a face-to-face connection.

“With computer vision techniques, we can locate a person’s head over a wide range of angles, and we think this will help people have more natural conversations,” said Wang.

Say hello to the latest way AI is making virtual life more real.

How AI-Assisted Video Calls Work

The mechanism behind AI-assisted video calls is simple.

A sender first transmits a reference image of the caller, just as today’s systems send compressed video frames. Then, rather than sending a fat stream of pixel-packed images, it sends only the locations of a few key points around the user’s eyes, nose and mouth.

A generative adversarial network on the receiver’s side uses the initial image and the facial key points to reconstruct subsequent images on a local GPU. As a result, much less data is sent over the network.
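
A back-of-the-envelope comparison shows why key points are so much lighter than pixels. The numbers below are illustrative assumptions, not measurements from the research system; the end-to-end savings reported today (about 10x) are smaller because the reference image and other overhead still travel over the network.

```python
# Rough, illustrative comparison of per-frame payload sizes.
compressed_frame_bytes = 30_000          # assumed size of one compressed 720p frame
num_keypoints = 10                       # assumed facial key points per frame
keypoint_bytes = num_keypoints * 2 * 4   # (x, y) pairs stored as 4-byte floats

print(f"key-point payload: {keypoint_bytes} bytes per frame")
print(f"pixel-stream payload: {compressed_frame_bytes} bytes per frame")
print(f"reduction on frame data alone: {compressed_frame_bytes / keypoint_bytes:.0f}x")
```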

Liu’s work in GANs hit the spotlight last year with GauGAN, an AI tool that turns anyone’s doodles into photorealistic works of art. GauGAN has already been used to create more than a million images and is available at the AI Playground.

“The pandemic motivated us because everyone is doing video conferencing now, so we explored how we can ease the bandwidth bottlenecks so providers can serve more people at the same time,” said Liu.

GPUs Bust Bandwidth Bottlenecks

The approach is part of an industry trend of shifting network bottlenecks into computational tasks that can be more easily tackled with local or cloud resources.

“These days lots of companies want to turn bandwidth problems into compute problems because it’s often hard to add more bandwidth and easier to add more compute,” said Andrew Page, a director of advanced products in NVIDIA’s media group.

NVIDIA Maxine bundles a suite of tools for video conferencing and streaming services.

AI Instruments Tune Video Services

GAN video compression is one of several capabilities coming to NVIDIA Maxine, a cloud-AI video-streaming platform to enhance video conferencing and calls. It packs audio, video and conversational AI features in a single toolkit that supports a broad range of devices.

Announced this week at GTC, Maxine lets service providers deliver video at super resolution with real-time translation, background noise removal and context-aware closed captioning. Users can enjoy features such as face alignment, support for virtual assistants and realistic animation of avatars.

“Video conferencing is going through a renaissance,” said Page. “Through the pandemic, we’ve all lived through its warts, but video is here to stay now as a part of our lives going forward because we are visual creatures.”

Maxine harnesses the power of NVIDIA GPUs with Tensor Cores running software such as NVIDIA Jarvis, an SDK for conversational AI that delivers a suite of speech and text capabilities. Together, they deliver AI capabilities that are useful today and serve as building blocks for tomorrow’s video products and services.

Learn more about NVIDIA Research.

It’s not too late to get access to hundreds of live and on-demand talks at GTC. Register now through Oct. 9 using promo code CMB4KN to get 20 percent off.

The post AI Can See Clearly Now: GANs Take the Jitters Out of Video Calls appeared first on The Official NVIDIA Blog.

Read More