Siemens Gamesa Taps NVIDIA Digital Twin Platform for Scientific Computing to Accelerate Clean Energy Transition

Siemens Gamesa Renewable Energy is working with NVIDIA to create physics-informed digital twins of wind farms — groups of wind turbines used to produce electricity.

The company has thousands of turbines around the globe that light up schools, homes, hospitals and factories with clean energy. In total they generate over 100 gigawatts of wind power, enough to power nearly 87 million households annually.

Virtual representations of Siemens Gamesa’s wind farms will be built using NVIDIA Omniverse and Modulus, which together comprise NVIDIA’s digital twin platform for scientific computing.

The platform will help Siemens Gamesa achieve quicker calculations to optimize wind farm layouts, which is expected to lead to farms capable of producing up to 20 percent more power than previous designs.

With the global level of annual wind power installations likely to quadruple between 2020 and 2025, it’s more important than ever to maximize the power produced by each turbine.

The global trillion-dollar renewable energy industry is turning to digital twins, like those of Siemens Gamesa’s wind farms — and one of Earth itself — to further climate research and accelerate the clean energy transition.

And the world’s rapid clean energy technology improvements mean that a dollar spent on wind and solar conversion systems today results in 4x more electricity than a dollar spent on the same systems a decade ago. This has tremendous bottom-line implications for the transition towards a greener Earth.

With NVIDIA Modulus, an AI framework for developing physics-informed machine learning models, and Omniverse, a 3D design collaboration and world simulation platform, researchers can now simulate computational fluid dynamics up to 4,000x faster than traditional methods — and view the simulations at high fidelity.

“The collaboration between Siemens Gamesa and NVIDIA has meant a great step forward in accelerating the computational speed and the deployment speed of our latest algorithms development in such a complex field as computational fluid dynamics,” said Sergio Dominguez, onshore digital portfolio manager at Siemens Gamesa.

Maximizing Wind Power

Adding a turbine next to another on a farm can change the wind flow and create wake effects — that is, decreases in downstream wind speed — which lead to a reduction in the farm’s production of electricity.

Omniverse digital twins of wind farms will help Siemens Gamesa to accurately simulate the effect that a turbine might have on another when placed in close proximity.

Using NVIDIA Modulus and physics-ML models running on GPUs, researchers can now run computational fluid dynamics simulations orders of magnitude faster than with traditional methods, like those based on Reynolds-averaged Navier-Stokes equations or large eddy simulations, which can take over a month to run, even on a 100-CPU cluster.

This up to 4,000x speedup allows the rapid and accurate simulation of wake effects.

Analyzing and minimizing potential wake effects in real time, while simultaneously optimizing wind farms for a variety of other wind and weather scenarios, requires hundreds or thousands of simulation runs, which were traditionally prohibitively slow and costly.

NVIDIA Omniverse and Modulus enable accurate simulations of the complex interactions between the turbines, using high-fidelity and high-resolution models that are based on low-resolution inputs.
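
Under the hood, physics-informed approaches train a neural network whose loss penalizes violations of the governing equations. The sketch below is a minimal, generic PyTorch illustration on a toy 1D Poisson problem rather than the fluid equations, and it does not use the Modulus API; it only shows the pattern of combining a PDE residual loss with boundary conditions.

```python
# Minimal physics-informed neural network (PINN) sketch in PyTorch.
# Illustrative only: it solves a toy 1D Poisson problem, u''(x) = -pi^2 sin(pi x)
# with u(0) = u(1) = 0, not the Navier-Stokes equations Modulus handles,
# and it does not use the Modulus API.
import torch

torch.manual_seed(0)

net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def pde_residual(x):
    """Residual of u''(x) + pi^2 sin(pi x) = 0, computed with autograd."""
    x = x.requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]
    return d2u + (torch.pi ** 2) * torch.sin(torch.pi * x)

for step in range(2000):
    x_int = torch.rand(128, 1)               # collocation points inside the domain
    x_bc = torch.tensor([[0.0], [1.0]])      # boundary points
    loss = pde_residual(x_int).pow(2).mean() + net(x_bc).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, net(x) approximates sin(pi x); the same idea scales to flow
# fields, where the residuals come from the governing fluid equations.
```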

Learn more about NVIDIA Omniverse and Modulus at GTC, running through March 24.

Watch NVIDIA founder and CEO Jensen Huang’s GTC keynote address.

NVIDIA Unveils Onramp to Hybrid Quantum Computing

We’re working with leaders in quantum computing to build the tools developers will need to program tomorrow’s ultrahigh performance systems.

Today’s high-performance computers can simulate quantum computing jobs at a scale and speed far beyond what’s possible on current quantum systems, which remain small and error-prone. In this way, classical HPC systems are helping quantum researchers chart the right path forward.

As quantum computers improve, researchers share a vision of a hybrid computing model where quantum and classical computers work together, each addressing the challenges they’re best suited to. To be broadly useful, these systems will need a unified programming environment that’s efficient and easy to use.

We’re building this onramp to the future of computing today. Starting with commercially available tools, like NVIDIA cuQuantum, we’re collaborating with IBM, Oak Ridge National Laboratory, Pasqal and many others.

A Common Software Layer

As a first step, we’re developing a new quantum compiler. Called nvq++, it targets the Quantum Intermediate Representation (QIR), a specification of a low-level machine language that quantum and classical computers can use to talk to each other.

Researchers at Oak Ridge National Laboratory, Quantinuum, Quantum Circuits Inc., and others have embraced the QIR Alliance, led by the Linux Foundation. It enables an agnostic programming approach that will deliver the best from both quantum and classical computers.

Researchers at the Oak Ridge National Laboratory will be among the first to use this new software.

Ultimately, we believe the HPC community will embrace this unified programming model for hybrid systems.

Ready-to-Use Quantum Tools

You don’t have to wait for hybrid quantum systems. Any developer can start world-class quantum research today using accelerated computing and our tools.

NVIDIA cuQuantum is now in general release. It runs complex quantum circuit simulations with libraries for tensor networks and state vectors.
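
As a rough picture of what a state vector simulator computes, the toy NumPy example below prepares a two-qubit Bell state by multiplying gate matrices into the state vector. It is purely illustrative and does not use the cuQuantum API; cuQuantum’s state vector library performs this kind of linear algebra on GPUs for circuits far too large for a laptop.

```python
# Toy state-vector simulation in NumPy, shown only to illustrate the kind of
# linear algebra cuQuantum's state vector library accelerates on GPUs.
# This is not the cuQuantum API.
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)          # Hadamard gate
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)          # control = qubit 0

# Two-qubit register initialized to |00>.
state = np.zeros(4)
state[0] = 1.0

# Apply H to qubit 0 (tensored with identity on qubit 1), then CNOT.
state = np.kron(H, np.eye(2)) @ state
state = CNOT @ state

print(np.round(state, 3))   # ~[0.707, 0, 0, 0.707]: the Bell state (|00> + |11>)/sqrt(2)
```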

And our cuQuantum DGX Appliance, a container with all the components needed to run cuQuantum jobs optimized for NVIDIA DGX A100 systems, is available in beta release.

Researchers are already using these products to tackle real-world challenges.

For example, QC Ware is running quantum chemistry and quantum machine learning algorithms using cuQuantum on the Perlmutter supercomputer at the Lawrence Berkeley National Laboratory. The work aims to advance drug discovery and climate science.

An Expanding Quantum Ecosystem

Our quantum products are supported by an expanding ecosystem of companies.

For example, Xanadu has integrated cuQuantum into PennyLane, an open-source framework for quantum machine learning and quantum chemistry. The Oak Ridge National Lab is using cuQuantum in TNQVM, a framework for tensor network quantum circuit simulations.

In addition, other companies now support cuQuantum in their commercially available quantum simulators and frameworks, such as the Classiq Quantum Algorithm Design platform from Classiq, and Orquestra from Zapata Computing.

They join existing collaborators, including Google Quantum AI, IBM, IonQ and Pasqal, which announced support for our software in November.

Learn More at GTC

Register free for this week’s GTC to hear QC Ware discuss its research on quantum chemistry.

It’s among at least ten sessions on quantum computing at GTC. And to get the big picture, watch NVIDIA CEO Jensen Huang’s GTC keynote here.

Speed Dialer: How AT&T Rings Up New Opportunities With Data Science

AT&T’s wireless network connects more than 100 million subscribers from the Aleutian Islands to the Florida Keys, spawning a big data sea.

Abhay Dabholkar runs a research group that acts like a lighthouse on the lookout for the best tools to navigate it.

“It’s fun, we get to play with new tools that can make a difference for AT&T’s day-to-day work, and when we give staff the latest and greatest tools it adds to their job satisfaction,” said Dabholkar, a distinguished AI architect who’s been with the company more than a decade.

Recently, the team tested the NVIDIA RAPIDS Accelerator for Apache Spark, software that spreads work across the nodes of a cluster, on GPU-powered servers.

It processed a month’s worth of mobile data — 2.8 trillion rows of information — in just five hours. That’s 3.3x faster at 60 percent lower cost than any prior test.

A Wow Moment

“It was a wow moment because on CPU clusters it takes more than 48 hours to process just seven days of data — in the past, we had the data but couldn’t use it because it took such a long time to process it,” he said.

Specifically, the test benchmarked what’s called ETL, the extract, transform and load process that cleans up data before it can be used to train the AI models that uncover fresh insights.

“Now we’re thinking GPUs can be used for ETL and all sorts of batch-processing workloads we do in Spark, so we’re exploring other RAPIDS libraries to extend work from feature engineering to ETL and machine learning,” he said.

Today, AT&T runs ETL on CPU servers, then moves data to GPU servers for training. Doing everything in one GPU pipeline can save time and cost, he added.
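
As a sketch of what that single pipeline can look like, the PySpark snippet below enables the RAPIDS Accelerator plugin so that ordinary DataFrame code runs on GPUs. The plugin class and configuration keys follow the RAPIDS Accelerator documentation, while the paths, resource amounts and sample query are placeholders rather than AT&T’s actual workload.

```python
# Illustrative PySpark setup for the RAPIDS Accelerator for Apache Spark.
# Plugin class and config names follow the RAPIDS Accelerator docs; paths,
# resource amounts and the sample ETL query are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-etl-sketch")
    # Load the RAPIDS Accelerator plugin so supported SQL operators run on GPUs.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # Tell Spark each executor owns one GPU (cluster-specific in practice).
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)

# The same DataFrame/SQL code that runs on CPUs runs on GPUs once the plugin
# is active; unsupported operators transparently fall back to the CPU.
df = spark.read.parquet("/data/mobility/2022-02/")        # placeholder path
daily = df.groupBy("cell_id", "event_date").count()       # placeholder columns
daily.write.mode("overwrite").parquet("/data/output/daily_counts/")
```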

Pleasing Customers, Speeding Network Design

The savings could show up across a wide variety of use cases.

For example, users could find out more quickly where they get optimal connections, improving customer satisfaction and reducing churn. “We could decide parameters for our 5G towers and antennas more quickly, too,” he said.

Identifying where in the AT&T fiber footprint to roll out a support truck can require time-consuming geospatial calculations, something RAPIDS and GPUs could accelerate, said Chris Vo, a senior member of the team who supervised the RAPIDS tests.

“We probably get 300-400 terabytes of fresh data a day, so this technology can have incredible impact — reports we generate over two or three weeks could be done in a few hours,” Dabholkar said.

Three Use Cases and Counting

The researchers are sharing their results with members of AT&T’s data platform team.

“We recommend that if a job is taking too long and you have a lot of data, turn on GPUs — with Spark, the same code that runs on CPUs runs on GPUs,” he said.

So far, separate teams have found their own gains across three different use cases; other teams have plans to run tests on their workloads, too.

Dabholkar is optimistic business units will take their test results to production systems.

“We are a telecom company with all sorts of datasets processing petabytes of data daily, and this can significantly improve our savings,” he said.

Other users, including the U.S. Internal Revenue Service, are on a similar journey. It’s a path many will take, given that Apache Spark is used by more than 13,000 companies, including 80 percent of the Fortune 500.

Register free for GTC to hear AT&T’s Chris Vo talk about his work, learn more about data science at these sessions and hear NVIDIA CEO Jensen Huang’s keynote.

NVIDIA Hopper GPU Architecture Accelerates Dynamic Programming Up to 40x Using New DPX Instructions

The NVIDIA Hopper GPU architecture unveiled today at GTC will accelerate dynamic programming — a problem-solving technique used in algorithms for genomics, quantum computing, route optimization and more — by up to 40x with new DPX instructions.

An instruction set built into NVIDIA H100 GPUs, DPX will help developers write code to achieve speedups on dynamic programming algorithms in multiple industries, boosting workflows for disease diagnosis, quantum simulation, graph analytics and routing optimizations.

What Is Dynamic Programming? 

Developed in the 1950s, dynamic programming is a popular technique for solving complex problems by combining two key methods: recursion and memoization.

Recursion involves breaking a problem down into simpler sub-problems, saving time and computational effort. Memoization stores the answers to these sub-problems, which are reused several times when solving the main problem, so they don’t need to be recomputed each time they come up again.
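
A tiny example makes the pattern concrete. The sketch below computes Fibonacci numbers with recursion plus memoization; it is generic Python, not DPX-accelerated code, but it shows the reuse of cached sub-problems that dynamic programming depends on.

```python
# Minimal illustration of recursion plus memoization: Fibonacci numbers.
# Each sub-problem fib(k) is computed once, cached and reused, which is
# exactly the pattern dynamic programming algorithms exploit.
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    if n < 2:                       # base cases: fib(0) = 0, fib(1) = 1
        return n
    return fib(n - 1) + fib(n - 2)  # recursion over simpler sub-problems

print(fib(90))   # instant with memoization; naive recursion would need exponentially many calls
```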

DPX instructions accelerate dynamic programming algorithms by up to 7x on an NVIDIA H100 GPU, compared with NVIDIA Ampere architecture-based GPUs. In a node with four NVIDIA H100 GPUs, that acceleration can be boosted even further.

Use Cases Span Healthcare, Robotics, Quantum Computing, Data Science

Dynamic programming is commonly used in many optimization, data processing and omics algorithms. To date, most developers have run these kinds of algorithms on CPUs or FPGAs — but they can unlock dramatic speedups using DPX instructions on NVIDIA Hopper GPUs.

Omics 

Omics covers a range of biological fields including genomics (focused on DNA), proteomics (focused on proteins) and transcriptomics (focused on RNA). These fields, which inform the critical work of disease research and drug discovery, all rely on algorithmic analyses that can be sped up with DPX instructions.

For example, the Smith-Waterman and Needleman-Wunsch dynamic programming algorithms are used for DNA sequence alignment, protein classification and protein folding. Both use a scoring method to measure how well genetic sequences from different samples align.
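
For illustration, here is a bare-bones version of the Needleman-Wunsch scoring recurrence in Python. The scoring values are arbitrary, and real genomics pipelines use far more sophisticated, GPU-parallel implementations, but the max-of-sums update in the inner loop is exactly the kind of operation DPX instructions accelerate.

```python
# Sketch of Needleman-Wunsch global alignment scoring (illustrative scoring
# values; production genomics codes use richer scoring and GPU parallelism).
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    n, m = len(a), len(b)
    # score[i][j] = best alignment score of a[:i] against b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,                    # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))
```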

Smith-Waterman produces highly accurate results, but takes more compute resources and time than other alignment methods. By using DPX instructions on a node with four NVIDIA H100 GPUs, scientists can speed this process 35x to achieve real-time processing, where the work of base calling and alignment takes place at the same rate as DNA sequencing.

This acceleration will help democratize genomic analysis in hospitals worldwide, bringing scientists closer to providing patients with personalized medicine.

Route Optimization

Finding the optimal route for multiple moving pieces is essential for autonomous robots moving through a dynamic warehouse, or even a sender transferring data to multiple receivers in a computer network.

To tackle this optimization problem, developers rely on Floyd-Warshall, a dynamic programming algorithm used to find the shortest distances between all pairs of destinations in a map or graph. In a server with four NVIDIA H100 GPUs, Floyd-Warshall acceleration is boosted 40x compared to a traditional dual-socket CPU-only server.
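
A minimal reference version of Floyd-Warshall looks like this. It is a plain-Python sketch on a four-node graph, not the GPU implementation, but it shows the relax-through-intermediate-node recurrence that gets parallelized on H100.

```python
# Floyd-Warshall all-pairs shortest paths, the dynamic program cited above.
# Pure-Python sketch for a tiny graph; a GPU version parallelizes the two
# inner loops for each intermediate node k.
INF = float("inf")

def floyd_warshall(dist):
    """dist[i][j] is the edge weight from i to j (INF if absent); updated in place."""
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Keep the shorter of the current path and the path through k.
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

graph = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
print(floyd_warshall(graph))   # e.g. shortest 0 -> 2 becomes 3 + 2 = 5
```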

Paired with the NVIDIA cuOpt AI logistics software, this speedup in routing optimization could be used for real-time applications in factories, autonomous vehicles, or mapping and routing algorithms in abstract graphs.

Quantum Simulation

Countless other dynamic programming algorithms could be accelerated on NVIDIA H100 GPUs with DPX instructions. One promising field is quantum computing, where dynamic programming is used in tensor optimization algorithms for quantum simulation. DPX instructions could help developers accelerate the process of identifying the right tensor contraction order.

SQL Query Optimization

Another potential application is in data science. Data scientists working with the SQL programming language often need to perform several “join” operations on a set of tables. Dynamic programming helps find an optimal order for these joins, often saving orders of magnitude in execution time and thus speeding up SQL queries.
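
The sketch below shows the idea with a deliberately simplified cost model: it enumerates subsets of tables, memoizes the cheapest way to join each subset and builds larger plans from smaller ones. The table sizes, selectivity and cost formula are illustrative placeholders, not how any particular query optimizer is implemented.

```python
# Deliberately simplified sketch of dynamic programming for join ordering.
# Real query optimizers use much richer cost and cardinality models; here the
# cost of a join is the product of its input row counts, and every join
# filters rows by a fixed selectivity.
from itertools import combinations

tables = {"orders": 1_000_000, "customers": 50_000, "items": 200_000, "regions": 50}
SELECTIVITY = 0.001

names = list(tables)
# best[frozenset of tables] = (cheapest cost so far, estimated rows of the result)
best = {frozenset([t]): (0.0, float(tables[t])) for t in names}

for size in range(2, len(names) + 1):
    for subset in combinations(names, size):
        s = frozenset(subset)
        candidates = []
        # Split the subset into two non-empty halves and join their best plans.
        for k in range(1, size):
            for left in combinations(subset, k):
                l = frozenset(left)
                r = s - l
                lcost, lrows = best[l]
                rcost, rrows = best[r]
                cost = lcost + rcost + lrows * rrows
                rows = lrows * rrows * SELECTIVITY
                candidates.append((cost, rows))
        best[s] = min(candidates)   # memoize the cheapest plan for this subset

print(f"estimated cost of best full join order: {best[frozenset(names)][0]:,.0f}")
```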

Learn more about the NVIDIA Hopper GPU architecture. Register free for GTC, running online through March 24. And watch the replay of NVIDIA founder and CEO Jensen Huang’s keynote address.

H100 Transformer Engine Supercharges AI Training, Delivering Up to 6x Higher Performance Without Losing Accuracy

The largest AI models can require months to train on today’s computing platforms. That’s too slow for businesses.

AI, high performance computing and data analytics are growing in complexity with some models, like large language ones, reaching trillions of parameters.

The NVIDIA Hopper architecture is built from the ground up to accelerate these next-generation AI workloads with massive compute power and fast memory to handle growing networks and datasets.

Transformer Engine, part of the new Hopper architecture, will significantly speed up AI performance and capabilities, and help train large models within days or hours.

Training AI Models With Transformer Engine

Transformer models are the backbone of language models used widely today, such as BERT and GPT-3. Though initially developed for natural language processing use cases, they are versatile enough that they’re increasingly being applied to computer vision, drug discovery and more.

However, model size continues to increase exponentially, now reaching trillions of parameters. This is causing training times to stretch into months due to huge amounts of computation, which is impractical for business needs.

Transformer Engine uses 16-bit floating-point precision and a newly added 8-bit floating-point data format combined with advanced software algorithms that will further speed up AI performance and capabilities.

AI training relies on floating-point numbers, which have fractional components, like 3.14. Introduced with the NVIDIA Ampere architecture, the TensorFloat32 (TF32) floating-point format is now the default 32-bit format in the TensorFlow and PyTorch frameworks.

Most AI floating-point math is done using 16-bit “half” precision (FP16), 32-bit “single” precision (FP32) and, for specialized operations, 64-bit “double” precision (FP64). By reducing the math to just eight bits, Transformer Engine makes it possible to train larger networks faster.

When coupled with other new features in the Hopper architecture — like the NVLink Switch system, which provides a direct high-speed interconnect between nodes — H100-accelerated server clusters will be able to train enormous networks that were nearly impossible to train at the speed necessary for enterprises.

Diving Deeper Into Transformer Engine

Transformer Engine uses software and custom NVIDIA Hopper Tensor Core technology designed to accelerate training for models built from the prevalent AI model building block, the transformer. These Tensor Cores can apply mixed FP8 and FP16 formats to dramatically accelerate AI calculations for transformers. Tensor Core operations in FP8 have twice the throughput of 16-bit operations.

The challenge for models is to intelligently manage the precision to maintain accuracy while gaining the performance of smaller, faster numerical formats. Transformer Engine enables this with custom, NVIDIA-tuned heuristics that dynamically choose between FP8 and FP16 calculations and automatically handle re-casting and scaling between these precisions in each layer.

Transformer Engine uses per-layer statistical analysis to determine the optimal precision (FP16 or FP8) for each layer of a model, achieving the best performance while preserving model accuracy.
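
The core idea that makes 8-bit floating point workable is per-tensor scaling: pick a scale factor from the tensor’s maximum magnitude so its values occupy the FP8 range, then rescale after the operation. The snippet below simulates that idea in plain PyTorch as a conceptual sketch; it is not the Transformer Engine implementation, and the rounding is only a crude stand-in for true E4M3 arithmetic.

```python
# Conceptual sketch of per-tensor scaling for 8-bit floating point (FP8 E4M3,
# max representable magnitude ~448). Plain PyTorch simulation only; the real
# Transformer Engine applies this per layer with NVIDIA-tuned heuristics.
import torch

E4M3_MAX = 448.0

def simulate_fp8_e4m3(x: torch.Tensor) -> torch.Tensor:
    scale = E4M3_MAX / x.abs().max().clamp(min=1e-12)    # amax-based scaling factor
    x_scaled = (x * scale).clamp(-E4M3_MAX, E4M3_MAX)
    # Crude stand-in for FP8 rounding: keep roughly 3 bits of mantissa.
    mantissa_bits = 3
    exp = torch.floor(torch.log2(x_scaled.abs().clamp(min=1e-12)))
    step = torch.pow(2.0, exp - mantissa_bits)
    x_quant = torch.round(x_scaled / step) * step
    return x_quant / scale                               # de-scale back to model range

activations = torch.randn(4, 8) * 0.01                   # small-magnitude activations
recovered = simulate_fp8_e4m3(activations)
print((activations - recovered).abs().max())             # error is a small fraction of the values
```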

The NVIDIA Hopper architecture also introduces fourth-generation Tensor Cores, which triple the floating-point operations per second compared with the prior generation across TF32, FP64, FP16 and INT8 precisions. Combined with Transformer Engine and fourth-generation NVLink, Hopper Tensor Cores enable an order-of-magnitude speedup for HPC and AI workloads.

Revving Up Transformer Engine

Much of the cutting-edge work in AI revolves around large language models like Megatron 530B. The chart below shows the growth of model size in recent years, a trend that is widely expected to continue. Many researchers are already working on trillion-plus parameter models for natural language understanding and other applications, showing an unrelenting appetite for AI compute power.

Growth in natural language understanding models continues at a vigorous pace. Source: Microsoft.

Meeting the demand of these growing models requires a combination of computational power and a ton of high-speed memory. The NVIDIA H100 Tensor Core GPU delivers on both fronts, with the speedups made possible by Transformer Engine to take AI training to the next level.

When combined, these innovations deliver higher throughput and a 9x reduction in time to train, from seven days to just 20 hours:

The NVIDIA H100 Tensor Core GPU delivers up to 9x more training throughput than the previous generation, making it possible to train large models in a reasonable amount of time.

Transformer Engine can also be used for inference without any data format conversions. Previously, INT8 was the go-to precision for optimal inference performance. However, it requires that the trained networks be converted to INT8 as part of the optimization process, something the NVIDIA TensorRT inference optimizer makes easy.

Using models trained with FP8 will allow developers to skip this conversion step altogether and do inference operations using that same precision. And like INT8-formatted networks, deployments using Transformer Engine can run in a much smaller memory footprint.

On Megatron 530B, NVIDIA H100 inference per-GPU throughput is up to 30x higher than NVIDIA A100, with a 1-second response latency, showcasing it as the optimal platform for AI deployments:

Transformer Engine will also increase inference throughput by as much as 30x for low-latency applications.

To learn more about the NVIDIA H100 GPU and the Hopper architecture, watch the GTC 2022 keynote from Jensen Huang. Register for GTC 2022 for free to attend sessions with NVIDIA and industry leaders.

NVIDIA Maxine Reinvents Real-Time Communication With AI

Everyone wants to be heard. And with more people than ever in video calls or live streaming from their home offices, rich audio free from echo hiccups and background noises like barking dogs is key to better sounding online experiences.

NVIDIA Maxine offers GPU-accelerated, AI-enabled software development kits to help developers build scalable, low-latency audio and video effects pipelines that improve call quality and user experience.

Today, NVIDIA announced at GTC that Maxine is adding acoustic echo cancellation and AI-based upsampling for better sound quality.

Acoustic Echo Cancellation eliminates acoustic echo from the audio stream in real time, preserving speech quality even during double-talk. Using AI, Maxine cancels echo more effectively than traditional digital signal processing algorithms.

Audio Super Resolution uses AI to improve the quality of a low-bandwidth audio signal by restoring the energy lost in higher frequency bands. Maxine Audio Super Resolution supports upsampling audio from 8 kHz (narrowband) to 16 kHz (wideband), from 16 kHz to 48 kHz (ultra-wideband) and from 8 kHz to 48 kHz. Lower sampling rates such as 8 kHz often result in muffled voices, emphasize artifacts such as sibilance and make speech difficult to understand.
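
For context, conventional signal processing can raise the sample rate but cannot recreate frequency content that was never captured; that missing energy is what the AI model restores. The short SciPy example below shows plain polyphase resampling from 8 kHz to 48 kHz as a non-AI baseline (file names are placeholders), which leaves everything above 4 kHz empty.

```python
# Conventional (non-AI) upsampling for comparison: polyphase resampling takes
# 8 kHz narrowband audio to 48 kHz, but it only interpolates; content above
# 4 kHz is still missing, which is the gap AI-based super resolution fills.
# File names are placeholders; assumes a mono recording.
import soundfile as sf
from scipy.signal import resample_poly

audio, rate = sf.read("call_8khz.wav")           # narrowband input
assert rate == 8000
upsampled = resample_poly(audio, up=6, down=1)   # 8 kHz * 6 = 48 kHz
sf.write("call_48khz_interpolated.wav", upsampled, 48000)
```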

Modern film and television studios often record audio at a 48 kHz (or higher) sampling rate to maintain the fidelity and clarity of the original signal. Audio Super Resolution can help restore the fidelity of old audio recordings derived from magnetic tape or other low-bandwidth media.

Bridging the Sound Gap 

Most modern telecommunication takes place using wideband or ultra-wideband audio. Since NVIDIA Audio Super Resolution can upsample and restore the narrowband audio in real-time, the technology can effectively be used to bridge the quality gap between traditional copper wire phone lines and modern VoIP-based wideband communication systems.

Real-time communication — whether for conference calls, call centers or live streaming of all kinds — is taking a big leap forward with Maxine.

Since its initial release, Maxine has been adopted by many of the world’s leading providers for video communications, content creation and live streaming.

The worldwide market for video conferencing is expected to increase to nearly $13 billion in 2028, up from about $6.3 billion in 2021, according to Fortune Business Insights.

WFH: A Way of Life 

The move to work from home, or WFH, has become an accepted norm across companies, and organizations are adapting to the new expectations.

Analyst firm Gartner estimates that only a quarter of meetings for enterprises will be in person in 2024, a decline from 60 percent pre-pandemic.

Virtual collaboration in the U.S. has played an important role as people have taken on hybrid and remote positions in the past two years amid the pandemic.

But as organizations seek to maintain company culture and workplace experience, the stakes have risen for higher-quality media interactivity.

Solving the Cocktail Party Problem    

But sometimes work and home life collide. As a result, meetings are often filled with background noises from kids, construction work outside or emergency vehicle sirens, causing brief interruptions in the flow of conference calls.

Maxine helps solve an age-old audio issue known as the cocktail party problem. With AI, it can filter out unwanted background noises, allowing users to be better heard, whether they’re in a home office or on the road.

The Maxine GPU-accelerated platform provides an end-to-end deep learning pipeline that integrates with customizable state-of-the-art models, enabling high-quality features with a standard microphone and camera.

Sound Like Your Best Self

In addition to being impacted by background noise, audio quality in virtual activities can sometimes sound thin, missing low- and mid-level frequencies, or even be barely audible.

Maxine enables upsampling of audio in real time so that voices sound fuller, deeper and more audible.

Logitech: Better Audio for Headsets and Blue Yeti Microphones

Logitech, a leading maker of peripherals, is implementing Maxine for better interactions with its popular headsets and microphones.

Tapping into AI libraries, Logitech has integrated Maxine directly inside G Hub audio drivers to enhance communications with its devices without the need for additional software. Maxine takes advantage of the powerful Tensor Cores in NVIDIA RTX GPUs so consumers can enjoy real-time processing of their mic signal.

Logitech is now utilizing Maxine’s state-of-the-art denoising in its G Hub software. That has allowed it to remove echoes and background noises — such as fans, as well as keyboard and mouse clicks — that can distract from video conferences or live-streaming sessions.

“NVIDIA Maxine makes it fast and easy for Logitech G gamers to clean up their mic signal and eliminate unwanted background noises in a single click,” said Ujesh Desai, GM of Logitech G. “You can even use G HUB to test your mic signal to make sure you have your Maxine settings dialed in.”

Tencent Cloud Boosts Content Creators

Tencent Cloud is helping content creators with their productions by offering technology from NVIDIA Maxine that makes it quick and easy to add creative backgrounds.

NVIDIA Maxine’s AI Green Screen feature enables users to create a more immersive presence with high-quality foreground and background separation — without the need for a traditional green screen. Once the real background is separated, it can easily be replaced with a virtual background, or blurred to create a depth-of-field effect. Tencent Cloud is offering this new capability as a software-as-a-service package for content creators.

“NVIDIA Maxine’s AI Green Screen technology helps content creators with their productions by enabling more immersive, high-quality experiences, without the need for specialized equipment and lighting,” said Vulture Li, director of the product center at Tencent Cloud’s audio and video platform.

Making Virtual Experiences Better

NVIDIA Maxine provides state-of-the-art real-time AI audio, video and augmented reality features that can be built into customizable, end-to-end deep learning pipelines.

The AI-powered SDKs from Maxine help developers to create applications that include audio and image denoising, super resolution, gaze correction, 3D body pose estimation and translation features.

Maxine also enables real-time voice-to-text translation for a growing number of languages. At GTC, NVIDIA demonstrated Maxine translating between English, French, German and Spanish.

These effects will allow millions of people to enjoy high-quality and engaging live-streaming video across any device.

Join us at GTC this week to learn more about Maxine.

Getting People Talking: Microsoft Improves AI Quality and Efficiency of Translator Using NVIDIA Triton

When your software can evoke tears of joy, you spread the cheer.

So, Translator, a Microsoft Azure Cognitive Service, is applying some of the world’s largest AI models to help more people communicate.

“There are so many cool stories,” said Vishal Chowdhary, development manager for Translator.

Like the five-day sprint to add Haitian Creole to power apps that helped aid workers after Haiti suffered a 7.0 earthquake in 2010. Or the grandparents who choked up in their first session using the software to speak live with remote grandkids who spoke a language they did not understand.

An Ambitious Goal

“Our vision is to eliminate barriers in all languages and modalities with this same API that’s already being used by thousands of developers,” said Chowdhary.

With some 7,000 languages spoken worldwide, it’s an ambitious goal.

So, the team turned to a powerful, and complex, tool — a mixture of experts (MoE) AI approach.

It’s a state-of-the-art member of the class of transformer models driving rapid advances in natural language processing. And with 5 billion parameters, it’s 80x larger than the biggest model the team has in production for natural-language processing.

MoE models are so compute-intensive, it’s hard to find anyone who’s put them into production. In an initial test, CPU-based servers couldn’t meet the team’s requirement to use them to translate a document in one second.

A 27x Speedup

Then the team ran the test on accelerated systems with NVIDIA Triton Inference Server, part of the NVIDIA AI Enterprise 2.0 platform announced this week at GTC.

“Using NVIDIA GPUs and Triton we could do it, and do it efficiently,” said Chowdhary.

In fact, the team was able to achieve up to a 27x speedup over non-optimized GPU runtimes.

“We were able to build one model to perform different language understanding tasks — like summarizing, text generation and translation — instead of having to develop separate models for each task,” said Hanny Hassan Awadalla, a principal researcher at Microsoft who supervised the tests.

How Triton Helped

Microsoft’s models break down a big job like translating a stack of documents into many small tasks of translating hundreds of sentences. Triton’s dynamic batching feature pools these many requests to make best use of a GPU’s muscle.
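
Dynamic batching is enabled in a model’s Triton configuration file. The sketch below writes a minimal config.pbtxt with illustrative values; the model name, batch sizes and queue delay are placeholders, and a complete config also declares the backend and the input and output tensors.

```python
# Minimal sketch of enabling Triton Inference Server's dynamic batching.
# The config syntax follows Triton's model-configuration format; the model
# name, batch sizes and queue delay below are placeholders, and a complete
# config would also declare the backend and input/output tensors.
from pathlib import Path

config = """
name: "translator_moe"
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [ 16, 32 ]
  max_queue_delay_microseconds: 100
}
"""

# Each Triton model lives in <model_repository>/<model_name>/config.pbtxt.
Path("model_repository/translator_moe").mkdir(parents=True, exist_ok=True)
Path("model_repository/translator_moe/config.pbtxt").write_text(config.strip() + "\n")
```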

The team praised Triton’s ability to run any model in any mode using CPUs, GPUs or other accelerators.

“It seems very well thought out with all the features I wanted for my scenario, like something I would have developed for myself,” said Chowdhary, whose team has been developing large-scale distributed systems for more than a decade.

Under the hood, two software components were key to Triton’s success. NVIDIA extended FasterTransformer — a software layer that handles inference computations — to support MoE models. CUTLASS, an NVIDIA math library, helped implement the models efficiently.

Proven Prototype in Four Weeks

Though the tests were complex, the team worked with NVIDIA engineers to get an end-to-end prototype with Triton up and running in less than a month.

“That’s a really impressive timeline to make a shippable product — I really appreciate that,” said Awadalla.

And though it was the team’s first experience with Triton, “we used it to ship the MoE models by rearchitecting our runtime environment without a lot of effort, and now I hope it becomes part of our long-term host system,” Chowdhary added.

Taking the Next Steps

The accelerated service will arrive in judicious steps, initially for document translation in a few major languages.

“Eventually, we want our customers to get the goodness of these new models transparently in all our scenarios,” said Chowdhary.

The work is part of a broad initiative at Microsoft. It aims to fuel advances across a wide sweep of its products such as Office and Teams, as well as those of its developers and customers from small one-app companies to Fortune 500 enterprises.

Paving the way, Awadalla’s team published research in September on training MoE models with up to 200 billion parameters on NVIDIA A100 Tensor Core GPUs. Since then, the team’s accelerated that work another 8x by using 80G versions of the A100 GPUs on models with more than 300 billion parameters.

“The models will need to get larger and larger to better represent more languages, especially for ones where we don’t have a lot of data,” Awadalla said.
