Do the Math: New RTX AI PC Hardware Delivers More AI, Faster

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for RTX PC users.

At the IFA Berlin consumer electronics and home appliances trade show this week, new RTX AI PCs will be announced. They pair RTX GPUs, which power advanced AI for gaming, content creation, development and academic work, with a neural processing unit (NPU) for offloading lightweight AI tasks.

RTX GPUs, built with specialized AI hardware called Tensor Cores, provide the compute performance needed to run the latest and most demanding AI models. They now accelerate more than 600 AI-enabled games and applications, with more than 100 million GeForce RTX and NVIDIA RTX GPUs in users’ hands worldwide.

Since the launch of NVIDIA DLSS — the first widely deployed PC AI technology — more than five years ago, on-device AI has expanded beyond gaming to livestreaming, content creation, software development, productivity and STEM use cases.

Accelerating AI 

AI boils down to massive matrix multiplication — in other words, incredibly complex math. CPUs can do math, but, as serial processors, they can only perform one operation per CPU core at a time. This makes them far too slow for practical use with AI.

GPUs, on the other hand, are parallel processors, performing multiple operations at once. With hundreds of Tensor Cores each and being optimized for AI, RTX GPUs can accelerate incredibly complex mathematical operations.
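For a concrete sense of the math involved, the minimal sketch below (assuming PyTorch and a CUDA-capable RTX GPU are available) times the same large matrix multiplication on the CPU and on the GPU; the sizes are arbitrary and chosen only for illustration.

```python
import time
import torch

# Two large matrices -- the kind of workload at the heart of AI inference.
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

# CPU path: limited parallelism per core.
start = time.time()
c_cpu = a @ b
cpu_time = time.time() - start

# GPU path: thousands of cores (and Tensor Cores) compute the product in parallel.
if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()
    start = time.time()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()  # wait for the asynchronous GPU kernel to finish
    gpu_time = time.time() - start
    print(f"CPU: {cpu_time:.3f}s, GPU: {gpu_time:.3f}s")
```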

RTX-powered systems give users a powerful GPU accelerator for demanding AI workloads in gaming, content creation, software development and STEM subjects. Some also include an NPU, a lightweight accelerator for offloading select low-power workloads.

Local accelerators keep AI capabilities available even without an internet connection, deliver the low latency needed for high responsiveness and increase privacy, since users don’t have to upload sensitive materials to an online service before an AI model can work with them.

Advanced Processing Power

NVIDIA powers much of the world’s AI, from the data center to the edge to an installed base of more than 100 million PCs worldwide.

The GeForce RTX and NVIDIA RTX GPUs found in laptops, desktops and workstations share the same architecture as those in cloud servers and provide up to 686 trillion AI operations per second (TOPS) across the GeForce RTX 40 Series Laptop GPU lineup.

RTX GPUs unlock top-tier performance and power a wider range of AI and generative AI than systems with just an integrated system-on-a-chip (SoC).

“Many projects, especially within Windows, are built for and expect to run on NVIDIA cards. In addition to the wide software support base, NVIDIA GPUs also have an advantage in terms of raw performance.” — Jon Allman, industry analyst at Puget Systems

Gamers can use DLSS for AI-enhanced performance and can look forward to NVIDIA ACE digital human technology for next-generation in-game experiences. Creators can use AI-accelerated video and photo editing tools, asset generators, AI denoisers and more. Everyday users can tap RTX Video Super Resolution and RTX Video HDR for improved video quality, and NVIDIA ChatRTX and NVIDIA Broadcast for productivity improvements. And developers can use RTX-powered coding and debugging tools, as well as the NVIDIA RTX AI Toolkit to build and deploy AI-enabled apps for RTX.

Large language models — like Google’s Gemma, Meta’s Llama and Microsoft’s Phi — all run faster on RTX AI PCs, as systems with GPUs load LLMs into VRAM. Add in NVIDIA TensorRT-LLM acceleration and RTX GPUs can run LLMs 10-100x faster than on CPUs.
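As a simple, hedged illustration of the underlying idea of loading an LLM’s weights into GPU VRAM for local inference (this uses the Hugging Face Transformers API rather than the TensorRT-LLM path described above, and the model identifier is just an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model identifier; any local LLM works

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="cuda" places the weights in GPU VRAM so generation runs on the RTX GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", torch_dtype="auto")

inputs = tokenizer("Explain what Tensor Cores do.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```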

New RTX AI PCs Available Now

New systems from ASUS and MSI are now shipping with up to GeForce RTX 4070 Laptop GPUs — delivering up to 321 AI TOPS of performance — and power-efficient SoCs with Windows 11 AI PC capabilities. Windows 11 AI PCs will receive a free update to Copilot+ PC experiences when available.

ASUS’ Zephyrus G16 comes with up to a GeForce RTX 4070 Laptop GPU to supercharge photo and video editing, image generation and coding, while game-enhancing features like DLSS create additional high-quality frames and improve image quality. The 321 TOPS of local AI processing power available from the GeForce RTX 4070 Laptop GPU enables multiple AI applications to run simultaneously, changing the way gamers, creators and engineers work and play.

The ASUS ProArt P16 is the first AI PC built for advanced AI workflows across creativity, gaming, productivity and more. Its GeForce RTX 4070 Laptop GPU provides creatives with RTX AI acceleration in top 2D, 3D, video editing and streaming apps. The ASUS ProArt P13 also comes with state-of-the-art graphics and an OLED touchscreen for ease of creation. Both laptops are also NVIDIA Studio-validated to enable and accelerate creative work.

The MSI Stealth A16 AI+ features the latest GeForce RTX 40 Series Laptop GPUs, delivering up to 321 AI TOPS with a GeForce RTX 4070 Laptop GPU. This fast and intelligent AI-powered PC is designed to excel in gaming, creation and productivity, offering access to next-level technology.

These laptops join hundreds of RTX AI PCs available today from top manufacturers, with support for the 600+ AI applications and games accelerated by RTX.

Generative AI is transforming graphics and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Read More

From RAG to Richness: Startup Uplevels Retrieval-Augmented Generation for Enterprises

Well before OpenAI upended the technology industry with its release of ChatGPT in the fall of 2022, Douwe Kiela already understood why large language models, on their own, could only offer partial solutions for key enterprise use cases.

The young Dutch CEO of Contextual AI had been deeply influenced by two seminal papers from Google and OpenAI, which together outlined the recipe for creating fast, efficient transformer-based generative AI models and LLMs.

Soon after those papers were published in 2017 and 2018, Kiela and his team of AI researchers at Facebook, where he worked at that time, realized LLMs would face profound data freshness issues.

They knew that when foundation models like LLMs were trained on massive datasets, the training not only imbued the model with a metaphorical “brain” for “reasoning” across data. The training data also represented the entirety of a model’s knowledge that it could draw on to generate answers to users’ questions.

Kiela’s team realized that, unless an LLM could access relevant real-time data in an efficient, cost-effective way, even the smartest LLM wouldn’t be very useful for many enterprises’ needs.

So, in the spring of 2020, Kiela and his team published a seminal paper of their own, which introduced the world to retrieval-augmented generation. RAG, as it’s commonly called, is a method for continuously and cost-effectively updating foundation models with new, relevant information, including from a user’s own files and from the internet. With RAG, an LLM’s knowledge is no longer confined to its training data, which makes models far more accurate, impactful and relevant to enterprise users.

Today, Kiela and Amanpreet Singh, a former colleague at Facebook, are the CEO and CTO of Contextual AI, a Silicon Valley-based startup, which recently closed an $80 million Series A round, which included NVIDIA’s investment arm, NVentures. Contextual AI is also a member of NVIDIA Inception, a program designed to nurture startups. With roughly 50 employees, the company says it plans to double in size by the end of the year.

The platform Contextual AI offers is called RAG 2.0. In many ways, it’s an advanced, productized version of the RAG architecture Kiela and Singh first described in their 2020 paper.

RAG 2.0 can achieve roughly 10x better parameter-for-parameter accuracy and performance than competing offerings, Kiela says.

That means, for example, that a 70-billion-parameter model that would typically require significant compute resources could instead run on far smaller infrastructure, one built to handle only 7 billion parameters without sacrificing accuracy. This type of optimization opens up edge use cases with smaller computers that can perform at significantly higher-than-expected levels.

“When ChatGPT happened, we saw this enormous frustration where everybody recognized the potential of LLMs, but also realized the technology wasn’t quite there yet,” explained Kiela. “We knew that RAG was the solution to many of the problems. And we also knew that we could do much better than what we outlined in the original RAG paper in 2020.”

Integrated Retrievers and Language Models Offer Big Performance Gains 

The key to Contextual AI’s solutions is its close integration of its retriever architecture, the “R” in RAG, with an LLM’s architecture, which is the generator, or “G,” in the term. The way RAG works is that a retriever interprets a user’s query, checks various sources to identify relevant documents or data and then brings that information back to an LLM, which reasons across this new information to generate a response.
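A minimal sketch of that retrieve-then-generate flow follows; the word-overlap retriever and prompt template are toy stand-ins, not Contextual AI’s implementation.

```python
documents = [
    "Q2 revenue grew 14% year over year.",
    "The warranty covers parts and labor for 24 months.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # The retriever ("R"): score each source against the query and keep the top k.
    # Real systems use learned embeddings; word overlap keeps the sketch self-contained.
    q_words = set(query.lower().split())
    scores = [len(q_words & set(doc.lower().split())) for doc in documents]
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in ranked[:k]]

def build_prompt(query: str) -> str:
    # The generator ("G") receives the retrieved context and reasons over it
    # to produce an answer grounded in that data.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How long is the warranty?"))
```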

Since around 2020, RAG has become the dominant approach for enterprises that deploy LLM-powered chatbots. As a result, a vibrant ecosystem of RAG-focused startups has formed.

One of the ways Contextual AI differentiates itself from competitors is by how it refines and improves its retrievers through backpropagation, the process of adjusting the parameters (the weights and biases) underlying its neural network architecture.

And, instead of separately training and adjusting two distinct neural networks (the retriever and the LLM), Contextual AI offers a unified, state-of-the-art platform that aligns the retriever and language model, then tunes them both through backpropagation.

Synchronizing and adjusting weights and biases across distinct neural networks is difficult, but the result, Kiela says, leads to tremendous gains in precision, response quality and optimization. And because the retriever and generator are so closely aligned, the responses they create are grounded in common data, which means their answers are far less likely than other RAG architectures to include made up or “hallucinated” data, which a model might offer when it doesn’t “know” an answer.

“Our approach is technically very challenging, but it leads to much stronger coupling between the retriever and the generator, which makes our system far more accurate and much more efficient,” said Kiela.

Tackling Difficult Use Cases With State-of-the-Art Innovations

RAG 2.0 is essentially LLM-agnostic, which means it works across different open-source language models, like Mistral or Llama, and can accommodate customers’ model preferences. The startup’s retrievers were developed using NVIDIA’s Megatron LM on a mix of NVIDIA H100 and A100 Tensor Core GPUs hosted in Google Cloud.

One of the significant challenges every RAG solution faces is how to identify the most relevant information to answer a user’s query when that information may be stored in a variety of formats, such as text, video or PDF.

Contextual AI overcomes this challenge through a “mixture of retrievers” approach, which aligns different retrievers’ sub-specialties with the different formats data is stored in.

Contextual AI deploys a combination of RAG types, plus a neural reranking algorithm, to identify information stored in different formats which, together, are optimally responsive to the user’s query.

For example, if some information relevant to a query is stored in a video file format, then one of the RAGs deployed to identify relevant data would likely be a Graph RAG, which is very good at understanding temporal relationships in unstructured data like video. If other data were stored in a text or PDF format, then a vector-based RAG would simultaneously be deployed.

The neural reranker would then help organize the retrieved data and the prioritized information would then be fed to the LLM to generate an answer to the initial query.
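A toy sketch of that mixture-of-retrievers pattern, with stand-in retrievers and a stand-in reranker in place of the specialized components described above:

```python
def vector_retriever(query: str) -> list[str]:
    # Stand-in for a vector-based RAG over text and PDF sources.
    return ["[pdf] 2023 maintenance manual, section 4: pump seals"]

def graph_retriever(query: str) -> list[str]:
    # Stand-in for a Graph RAG over unstructured sources such as video.
    return ["[video] 12:40-13:05 technician replaces the pump seal"]

def rerank(query: str, candidates: list[str]) -> list[str]:
    # Stand-in neural reranker: order the pooled results by relevance to the query.
    q_words = set(query.lower().split())
    return sorted(candidates,
                  key=lambda c: len(q_words & set(c.lower().split())),
                  reverse=True)

def mixture_of_retrievers(query: str) -> list[str]:
    # Run the specialized retrievers, pool their results,
    # then let the reranker decide what the LLM actually sees.
    pooled = vector_retriever(query) + graph_retriever(query)
    return rerank(query, pooled)

print(mixture_of_retrievers("how do I replace the pump seal"))
```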

“To maximize performance, we almost never use a single retrieval approach — it’s usually a hybrid because they have different and complementary strengths,” Kiela said. “The exact right mixture depends on the use case, the underlying data and the user’s query.”

By essentially fusing the RAG and LLM architectures, and offering many routes for finding relevant information, Contextual AI offers customers significantly improved performance. In addition to greater accuracy, its offering lowers latency thanks to fewer API calls between the RAG’s and LLM’s neural networks.

Because of its highly optimized architecture and lower compute demands, RAG 2.0 can run in the cloud, on premises or fully disconnected. And that makes it relevant to a wide array of industries, from fintech and manufacturing to medical devices and robotics.

“The use cases we’re focusing on are the really hard ones,” Kiela said. “Beyond reading a transcript, answering basic questions or summarization, we’re focused on the very high-value, knowledge-intensive roles that will save companies a lot of money or make them much more productive.”

Read More

Crystal-Clear Gaming: ‘Visions of Mana’ Sharpens on GeForce NOW

It’s time to mana-fest the spirit of adventure with Square Enix’s highly anticipated action role-playing game, Visions of Mana, launching today in the cloud.

Members can also head to a galaxy far, far away, from the comfort of their homes, with the power of the cloud and Ubisoft’s Star Wars Outlaws, with early access available on GeForce NOW.

Plus, be among the first to get early access to the Call of Duty: Black Ops 6 Open Beta on GeForce NOW without having to wait around for downloads — early access runs Aug. 30-Sept. 4 for those who preorder the game. Call of Duty: Black Ops 6 will join the cloud when the full game is released Oct. 25.

These triple-A titles are part of 26 titles joining the GeForce NOW library of over 2,000 games this week.

Cloudy With a Chance of Mana

Visions of Mana details a whimsical journey through the enchanting world of Mana. Step into the shoes of Val, a curious young man on an epic quest to escort his childhood friend Hina to the sacred Tree of Mana. Along the way, encounter a colorful cast of characters and face off against endearing yet formidable enemies.

Visions of Mana on GeForce NOW
Get ready for mana madness in the cloud.

The game’s combat system blends action and strategy for real-time battles where party members can be switched on the fly. Different party members offer unique skills, such as Val’s powerful sword strikes or magical spells cast by Careena’s dragon companion Ramcoh. Plus, traverse the game’s expansive, semi-open world on Pikuls — adorable, rideable creatures that can also ram through enemies.

Stream the game on an Ultimate or Priority membership for a seamless magical adventure. Experience the lush landscapes and dynamic combat in stunning detail — at up to 4K resolution and 120 frames per second for Ultimate members. GeForce NOW makes it easy to stay connected to the world of Mana whether at home or on the go, ensuring the journey to the Tree of Mana is always within reach.

Join the Galaxy’s Most Wanted

Star Wars Outlaws on GeForce NOW
Make your own destiny in the cloud.

In Star Wars Outlaws — the highly anticipated single-player action-adventure game from Ubisoft — explore the depths of the galaxy’s underworld as part of the beloved Star Wars franchise’s first-ever open-world game, set between the events of The Empire Strikes Back and Return of the Jedi.

Step into the shoes of Kay Vess, a daring scoundrel seeking freedom and adventure. Navigate distinct locations across the galaxy — both iconic and new — and encounter bustling cities, cantinas and sprawling outdoor landscapes. Fight, steal and outwit others while joining the ranks of the galaxy’s most wanted. Plus, play alongside Nix, a loyal companion who helps turn any situation to Kay’s advantage through blaster combat, stealth and clever distractions.

Get ready for high-stakes missions, space dogfights and an ever-changing reputation based on player choices. Unlock the power of the Force with a GeForce NOW Ultimate membership and stream from GeForce RTX 4080 SuperPODs at up to 4K resolution at 120 fps, without the need for upgraded hardware. The AI-powered graphics of NVIDIA DLSS 3.5 with Ray Reconstruction enhance the game for maximum performance, offering the clarity of a Jedi’s vision, and NVIDIA Reflex technology enables unbeatable responsiveness.

Answering the Call

Get ready for the anticipated addition to the Call of Duty franchise — Call of Duty: Black Ops 6. GeForce NOW will support the title’s PC Game Pass, Battle.net and Steam versions in the cloud.

Explore a range of new features, including innovative mechanics such as Omnimovement, which allows players to sprint, slide and dive in any direction to enhance the fluidity of combat. The Black Ops 6 Campaign features a spy action thriller set in the early ’90s, a period of transition and upheaval in global politics characterized by the end of the Cold War and the rise of the United States. With a mind-bending narrative, unbound by the rules of engagement, the title embodies the signature Black Ops experience.

Experience the new gameplay mechanics and features before the title’s official launch with early access to the preorder beta from Aug. 30-Sept. 4 — available for those who preorder the game or have an active PC Game Pass subscription. The Open Beta follows soon after on Sept. 6-9 and will be available for all gamers to hop into the action, even without preordering the game.

GeForce NOW Ultimate members can gain an advantage on the field with ultra low-latency gaming, streaming from GeForce RTX 4080 SuperPODs in the cloud.

WOW, New Games

WoW The World Within on GeForce NOW
Join the underground party from the cloud.

The newest World of Warcraft expansion, The War Within, is available to play from the cloud today. Head to the subterranean realm of Khaz Algar, featuring four new zones, including the Isle of Dorn — home to the Earthen, a newly playable allied race. Experience game additions such as Delves, bite-sized world instances for solo or small-group play, and Warbands, which allow players to manage and share achievements across multiple characters.

In addition, GeForce NOW recently added support for over 25 of the top AddOns from CurseForge, a leading platform for WoW customization, enabling members to explore new adventures under the surface from the cloud.

In addition, members can look for the following:

  • Endzone 2 (New release on Steam, Aug. 26)
  • Age of Mythology: Retold (New release on Steam, Xbox, available on PC Game Pass, Advanced Access on Aug. 27)
  • Core Keeper (New release on Xbox, available on PC Game Pass, Aug. 27)
  • Star Wars Outlaws (New release on Ubisoft Connect, early access Aug. 27)
  • Akimbot (New release on Steam, Aug. 29)
  • Gori: Cuddly Carnage (New release on Steam, Aug. 29)
  • MEMORIAPOLIS (New release on Steam, Aug. 29)
  • Visions of Mana (New release on Steam, Aug. 29)
  • Avatar: Frontiers of Pandora (Steam)
  • Cat Quest III (Steam)
  • Cooking Simulator (Xbox, available on PC Game Pass)
  • Crown Trick (Xbox, available on Microsoft Store)
  • Darksiders Genesis (Xbox, available on Microsoft Store)
  • Expeditions: Rome (Xbox, available on Microsoft Store)
  • Heading Out (Steam)
  • Into the Breach (Xbox, available on Microsoft Store)
  • Iron Harvest (Xbox, available on Microsoft Store)
  • The Knight Witch (Xbox, available on Microsoft Store)
  • Lightyear Frontier (Xbox, available on PC Game Pass)
  • Metro Exodus Enhanced Edition (Xbox, available on Microsoft Store)
  • Outlast 2 (Xbox, available on Microsoft Store)
  • Saturnalia (Steam)
  • SteamWorld Heist II (Steam, Xbox, available on Microsoft Store)
  • This War of Mine (Xbox, available on PC Game Pass)
  • We Were Here Too (Steam)
  • Yoku’s Island Express (Xbox, available on Microsoft Store)

What are you planning to play this weekend? Let us know on X or in the comments below.

 

Read More

NVIDIA Blackwell Sets New Standard for Generative AI in MLPerf Inference Debut

As enterprises race to adopt generative AI and bring new services to market, the demands on data center infrastructure have never been greater. Training large language models is one challenge, but delivering LLM-powered real-time services is another.

In the latest round of MLPerf industry benchmarks, Inference v4.1, NVIDIA platforms delivered leading performance across all data center tests. The first-ever submission of the upcoming NVIDIA Blackwell platform revealed up to 4x more performance than the NVIDIA H100 Tensor Core GPU on MLPerf’s biggest LLM workload, Llama 2 70B, thanks to its use of a second-generation Transformer Engine and FP4 Tensor Cores.

The NVIDIA H200 Tensor Core GPU delivered outstanding results on every benchmark in the data center category — including the latest addition to the benchmark, the Mixtral 8x7B mixture of experts (MoE) LLM, which features a total of 46.7 billion parameters, with 12.9 billion parameters active per token.

MoE models have gained popularity as a way to bring more versatility to LLM deployments, as they’re capable of answering a wide variety of questions and performing more diverse tasks in a single deployment. They’re also more efficient since they only activate a few experts per inference — meaning they deliver results much faster than dense models of a similar size.
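A toy sketch of the sparse expert routing that makes this possible; the dimensions and expert count below are illustrative, not Mixtral’s actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, d_model = 8, 2, 16  # Mixtral-style: 8 experts, 2 active per token
experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts))

def moe_layer(token: np.ndarray) -> np.ndarray:
    # The router scores every expert, but only the top-k experts actually run --
    # that sparsity is why an MoE answers faster than a dense model of similar size.
    logits = token @ router
    chosen = np.argsort(logits)[-top_k:]
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # softmax over chosen experts
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

out = moe_layer(rng.standard_normal(d_model))
print(out.shape)  # (16,)
```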

The continued growth of LLMs is driving the need for more compute to process inference requests. To meet real-time latency requirements for serving today’s LLMs, and to do so for as many users as possible, multi-GPU compute is a must. NVIDIA NVLink and NVSwitch provide high-bandwidth communication between GPUs based on the NVIDIA Hopper architecture and provide significant benefits for real-time, cost-effective large model inference. The Blackwell platform will further extend NVLink Switch’s capabilities with larger NVLink domains with 72 GPUs.

In addition to the NVIDIA submissions, 10 NVIDIA partners — ASUSTek, Cisco, Dell Technologies, Fujitsu, Giga Computing, Hewlett Packard Enterprise (HPE), Juniper Networks, Lenovo, Quanta Cloud Technology and Supermicro — all made solid MLPerf Inference submissions, underscoring the wide availability of NVIDIA platforms.

Relentless Software Innovation

NVIDIA platforms undergo continuous software development, racking up performance and feature improvements on a monthly basis.

In the latest inference round, NVIDIA offerings, including the NVIDIA Hopper architecture, NVIDIA Jetson platform and NVIDIA Triton Inference Server, saw leaps and bounds in performance gains.

The NVIDIA H200 GPU delivered up to 27% more generative AI inference performance over the previous round, underscoring the added value customers get over time from their investment in the NVIDIA platform.

Triton Inference Server, part of the NVIDIA AI platform and available with NVIDIA AI Enterprise software, is a fully featured open-source inference server that helps organizations consolidate framework-specific inference servers into a single, unified platform. This helps lower the total cost of ownership of serving AI models in production and cuts model deployment times from months to minutes.
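For context on how an application typically talks to a Triton deployment, here is a minimal HTTP client sketch using the tritonclient Python package; the server address, model name and tensor names are placeholders that depend on the deployed model’s configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a running Triton Inference Server (default HTTP port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# "my_model", "INPUT0" and "OUTPUT0" are placeholders for the deployed model's config.
inp = httpclient.InferInput("INPUT0", [1, 4], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 4).astype(np.float32))
out = httpclient.InferRequestedOutput("OUTPUT0")

result = client.infer(model_name="my_model", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT0"))
```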

In this round of MLPerf, Triton Inference Server delivered near-equal performance to NVIDIA’s bare-metal submissions, showing that organizations no longer have to choose between using a feature-rich production-grade AI inference server and achieving peak throughput performance.

Going to the Edge

Deployed at the edge, generative AI models can transform sensor data, such as images and videos, into real-time, actionable insights with strong contextual awareness. The NVIDIA Jetson platform for edge AI and robotics is uniquely capable of running any kind of model locally, including LLMs, vision transformers and Stable Diffusion.

In this round of MLPerf benchmarks, NVIDIA Jetson AGX Orin system-on-modules achieved more than a 6.2x throughput improvement and 2.4x latency improvement over the previous round on the GPT-J LLM workload. Rather than developing for a specific use case, developers can now use this general-purpose 6-billion-parameter model to seamlessly interface with human language, transforming generative AI at the edge.

Performance Leadership All Around

This round of MLPerf Inference showed the versatility and leading performance of NVIDIA platforms — extending from the data center to the edge — on all of the benchmark’s workloads, supercharging the most innovative AI-powered applications and services. To learn more about these results, see our technical blog.

H200 GPU-powered systems are available today from CoreWeave, the first cloud service provider to announce general availability, and from server makers ASUS, Dell Technologies, HPE, QCT and Supermicro.

See notice regarding software product information.

Read More

More Than Fine: Multi-LoRA Support Now Available in NVIDIA RTX AI Toolkit

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for RTX PC users.

Large language models are driving some of the most exciting developments in AI with their ability to quickly understand, summarize and generate text-based content.

These capabilities power a variety of use cases, including productivity tools, digital assistants, non-playable characters in video games and more. But they’re not a one-size-fits-all solution, and developers often must fine-tune LLMs to fit the needs of their applications.

The NVIDIA RTX AI Toolkit makes it easy to fine-tune and deploy AI models on RTX AI PCs and workstations through a technique called low-rank adaptation, or LoRA. A new update, available today, enables support for using multiple LoRA adapters simultaneously within the NVIDIA TensorRT-LLM AI acceleration library, improving the performance of fine-tuned models by up to 6x.

Fine-Tuned for Performance

LLMs must be carefully customized to achieve higher performance and meet growing user demands.

These foundational models are trained on huge amounts of data but often lack the context needed for a developer’s specific use case. For example, a generic LLM can generate video game dialogue, but it will likely miss the nuance and subtlety needed to write in the style of a woodland elf with a dark past and a barely concealed disdain for authority.

To achieve more tailored outputs, developers can fine-tune the model with information related to the app’s use case.

Take the example of developing an app to generate in-game dialogue using an LLM. Fine-tuning starts from the weights of a pretrained model, which already capture general knowledge of what a character might plausibly say in the game. To get the dialogue in the right style, a developer can then tune the model on a smaller dataset of examples, such as dialogue written in a spookier or more villainous tone.

In some cases, developers may want to run all of these different fine-tuning processes simultaneously. For example, they may want to generate marketing copy written in different voices for various content channels. At the same time, they may want to summarize a document and make stylistic suggestions — as well as draft a video game scene description and imagery prompt for a text-to-image generator.

It’s not practical to run multiple models simultaneously, as they won’t all fit in GPU memory at the same time. Even if they did, their inference time would be impacted by memory bandwidth — how fast data can be read from memory into GPUs.

Lo(RA) and Behold

A popular way to address these issues is to use fine-tuning techniques such as low-rank adaptation. A simple way to think of a LoRA adapter is as a patch file containing the customizations produced by the fine-tuning process.

Once trained, customized LoRA adapters can integrate seamlessly with the foundation model during inference, adding minimal overhead. Developers can attach the adapters to a single model to serve multiple use cases. This keeps the memory footprint low while still providing the additional details needed for each specific use case.
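Numerically, a LoRA adapter is just a pair of small matrices whose product is added to a frozen weight matrix; here is a minimal NumPy sketch with illustrative dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 4096, 4096, 16                 # full weight is d x k; the adapter has rank r
W = rng.standard_normal((d, k))          # frozen base-model weight (never modified)
A = rng.standard_normal((r, k)) * 0.01   # LoRA "down" matrix, learned during fine-tuning
B = np.zeros((d, r))                     # LoRA "up" matrix, learned during fine-tuning
scale = 1.0                              # alpha / r in the LoRA formulation

x = rng.standard_normal(k)
y = W @ x + scale * (B @ (A @ x))        # base output plus the low-rank "patch"

# The adapter stores d*r + r*k values instead of d*k -- a tiny fraction of the base weights.
print((d * r + r * k) / (d * k))         # ~0.0078 for these dimensions
```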

Architecture overview of supporting multiple clients and use-cases with a single foundation model using multi-LoRA capabilities

In practice, this means that an app can keep just one copy of the base model in memory, alongside many customizations using multiple LoRA adapters.

This process is called multi-LoRA serving. When multiple calls are made to the model, the GPU can process all of the calls in parallel, maximizing the use of its Tensor Cores and minimizing the demands of memory and bandwidth so developers can efficiently use AI models in their workflows. Fine-tuned models using multi-LoRA adapters perform up to 6x faster.
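The TensorRT-LLM multi-LoRA implementation is detailed in the technical blog referenced later in this post; as a conceptual illustration of one base model serving several switchable adapters, here is a sketch using the Hugging Face PEFT library (not the RTX AI Toolkit path itself, and the adapter directories and names are hypothetical).

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# One copy of the base model stays in GPU memory.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct",
                                            device_map="cuda", torch_dtype="auto")

# Attach several LoRA adapters to the same base weights (directories are hypothetical).
model = PeftModel.from_pretrained(base, "./adapters/story_writer", adapter_name="story")
model.load_adapter("./adapters/sd_prompter", adapter_name="sd_prompt")

# Switch adapters per request instead of loading a separate fine-tuned model each time.
model.set_adapter("story")      # narrative generation
model.set_adapter("sd_prompt")  # Stable Diffusion prompt generation
```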

LLM inference performance on GeForce RTX 4090 Desktop GPU for Llama 3B int4 with LoRA adapters applied at runtime. Input sequence length is 1,000 tokens and output sequence length is 100 tokens. LoRA adapter max rank is 64.

In the example of the in-game dialogue application described earlier, the app’s scope could be expanded, using multi-LoRA serving, to generate both story elements and illustrations — driven by a single prompt.

The user could input a basic story idea, and the LLM would flesh out the concept, expanding on the idea to provide a detailed foundation. The application could then use the same model, enhanced with two distinct LoRA adapters, to refine the story and generate corresponding imagery. One LoRA adapter generates a Stable Diffusion prompt to create visuals using a locally deployed Stable Diffusion XL model. Meanwhile, the other LoRA adapter, fine-tuned for story writing, could craft a well-structured and engaging narrative.

In this case, the same model is used for both inference passes, ensuring that the space required for the process doesn’t significantly increase. The second pass, which involves both text and image generation, is performed using batched inference, making the process exceptionally fast and efficient on NVIDIA GPUs. This allows users to rapidly iterate through different versions of their stories, refining the narrative and the illustrations with ease.

This process is outlined in more detail in a recent technical blog.

LLMs are becoming one of the most important components of modern AI. As adoption and integration grows, demand for powerful, fast LLMs with application-specific customizations will only increase. The multi-LoRA support added today to the RTX AI Toolkit gives developers a powerful new way to accelerate these capabilities.

Read More

Better Molecules, Faster: NVIDIA NIM Agent Blueprint Redefines Hit Identification With Generative AI-Based Virtual Screening

Aiming to make the drug discovery process faster and smarter, NVIDIA on Wednesday released the NIM Agent Blueprint for generative AI-based virtual screening.

This innovative approach will reduce the time and cost of developing life-saving drugs, enabling quicker access to critical treatments for patients.

This NIM Agent Blueprint introduces a paradigm shift in the drug discovery process, particularly in the crucial “hit-to-lead” transition, by moving from traditional fixed database screening to generative AI-driven molecule design and pre-optimization, enabling researchers to design better molecules faster.

What’s a NIM? What’s a NIM Agent Blueprint?

NVIDIA NIM microservices are modular, cloud-native components that accelerate AI model deployment and execution. These microservices allow researchers to integrate and scale advanced AI models within their workflows, enabling faster and more efficient processing of complex data.

The NIM Agent Blueprint, a comprehensive guide, shows how these microservices can optimize key stages of drug discovery, such as hit identification and lead optimization.

How Are They Used?

Drug discovery is a complex process with three critical stages: target identification, hit identification and lead optimization. Target identification involves choosing the right biology to modify to treat the disease; hit identification is identifying potential molecules that will bind to that target; and lead optimization is improving the design of those molecules to be safer and more effective.

This NVIDIA NIM Agent Blueprint, called generative virtual screening for accelerated drug discovery, identifies and improves virtual hits in a smarter and more efficient way.

At its core are three essential AI models, now including the recently integrated AlphaFold2 as part of NVIDIA’s NIM microservices.

  • AlphaFold2, renowned for its groundbreaking impact on protein structure prediction, is now available as an NVIDIA NIM.
  • MolMIM is a novel model developed by NVIDIA that generates molecules while simultaneously optimizing for multiple properties, such as high solubility and low toxicity.
  • DiffDock is an advanced tool for quickly modeling the binding of small molecules to their protein targets.

These models work in concert to improve the hit-to-lead process, making it more efficient and faster.

Each of these AI models is packaged within NVIDIA NIM microservices — portable containers designed to accelerate the performance, shorten time-to-market and simplify the deployment of generative AI models anywhere.

The NIM Agent Blueprint integrates these microservices into a flexible, scalable, generative AI workflow that can help transform drug discovery.

Leading computational drug discovery and biotechnology software providers that are using NIM microservices now, such as Schrödinger, Benchling, Dotmatics, Terray, TetraScience and Cadence Molecular Sciences (OpenEye), are using NIM Agent Blueprints in their computer-aided drug discovery platforms.

These integrations aim to make the hit-to-lead process faster and more intelligent, leading to the identification of more viable drug candidates in less time and at lower cost.

Global professional services company Accenture is poised to tailor the NIM Agent Blueprint to the specific needs of drug development programs by optimizing the molecule generation step with input from pharmaceutical partners to inform the MolMIM NIM.

In addition, the NIM microservices that comprise the NIM Agent Blueprint will soon be available on AWS HealthOmics, a purpose-built service that helps customers orchestrate biological analyses. This includes streamlining the integration of AI into existing drug discovery workflows.

Revolutionizing Drug Development With AI

The stakes in drug discovery are high.

Developing a new drug typically costs around $2.6 billion and can take 10-15 years, with a success rate of less than 10%.

By making molecular design smarter with NVIDIA’s AI-powered NIM Agent Blueprint, pharmaceutical companies can reduce these costs and shorten development timelines in the $1.5 trillion global pharmaceutical market.

This NIM Agent Blueprint represents a significant shift from traditional drug discovery methods, offering a generative AI approach that pre-optimizes molecules for desired therapeutic properties.

For example, MolMIM, the generative model for molecules within this NIM Agent Blueprint, uses advanced functions to steer the generation of molecules with optimized pharmacokinetic properties — such as absorption rate, protein binding, half-life and other properties — a marked advancement over previous methods.

This smarter approach to small molecule design enhances the potential for successful lead optimization, accelerating the overall drug discovery process.

This leap in technology could lead to faster, more targeted treatments, addressing growing challenges in healthcare, from rising costs to an aging population.

NVIDIA’s commitment to supporting researchers with the latest advancements in accelerated computing underscores its role in solving the most complex problems in drug discovery.

Visit build.nvidia.com to download the NIM Agent Blueprint for generative AI-based virtual screening and take the first step toward faster, more efficient drug development.

See notice regarding software product information.

Read More

From Prototype to Prompt: NVIDIA NIM Agent Blueprints Fast-Forward Next Wave of Enterprise Generative AI

The initial wave of generative AI was driven by its use in internet services that showed incredible new possibilities with tools that could help people write, research and imagine faster than ever.

The second wave of generative AI is now here, powered by the availability of advanced open-source foundation models, as well as advancements in agentic AI that are improving efficiency and autonomy of AI workflows. Enterprises across industries can use models like Google Gemma, Llama 3.1 405B, Microsoft Phi, Mixtral and Nemotron to develop their own AI applications to support business growth and boost productivity.

To accelerate business transformation, enterprises need blueprints for canonical generative AI workflows like digital human customer service chatbots, retrieval-augmented generation and drug discovery. While NVIDIA NIM microservices help make these models efficient and accessible for enterprise use, building enterprise generative AI applications is a complex, multistep process.

Launched today, NVIDIA NIM Agent Blueprints include everything an enterprise developer needs to build and deploy customized generative AI applications that make a transformative impact on business objectives.

Blueprints for Data-Driven Enterprise Flywheels

NIM Agent Blueprints are reference AI workflows tailored for specific use cases. They include sample applications built with NVIDIA NIM and partner microservices, reference code, customization documentation and a Helm chart for deployment.

With NIM Agent Blueprints, developers can gain a head start on creating their own applications using NVIDIA’s advanced AI tools and end-to-end development experience for each use case. The blueprints are designed to be modified and enhanced, and allow developers to leverage both information retrieval and agent-based workflows capable of performing complex tasks.

NIM Agent Blueprints also help developers improve their applications throughout the AI lifecycle. As users interact with AI applications, new data is generated. This data can be used to refine and enhance the models in a continuous learning cycle, creating a data-driven generative AI flywheel.

NIM Agent Blueprints help enterprises build their own generative AI flywheels with applications that link models with their data. NVIDIA NeMo facilitates this process, while NVIDIA AI Foundry serves as the production environment for running the flywheel.

The first NIM Agent Blueprints available are:

  • digital human for customer service
  • generative virtual screening for accelerated drug discovery
  • multimodal PDF data extraction for enterprise RAG

ServiceNow is a leader in enterprise AI that has integrated advanced generative AI capabilities into its digital workflow platform. It’s already bringing NIM microservices into its Now Assist AI solutions and working with the technologies featured in the digital human for customer service NIM Agent Blueprint.

“AI is not just a tool, it’s the foundation of a fundamental shift in how companies can better equip employees and serve customers,” said Jon Sigler, senior vice president, Platform and AI at ServiceNow. “Through our collaboration with NVIDIA, we’ve built new generative AI products and features that are driving growth and powering AI transformation for ServiceNow customers.”

More NIM Agent Blueprints are in development for creating generative AI applications for customer service, content generation, software engineering, retail shopping advisors and R&D. NVIDIA plans to introduce new NIM Agent Blueprints monthly.

NVIDIA Ecosystem Supercharges Enterprise Adoption

NVIDIA’s partner ecosystem, including global systems integrators and service delivery partners Accenture, Deloitte, SoftServe, Quantiphi and World Wide Technology, is bringing NIM Agent Blueprints to the world’s enterprises.

NIM Agent Blueprints can be optimized using customer interaction data with tools from NVIDIA’s ecosystem of partners: Dataiku and DataRobot for model fine-tuning, governance and monitoring; deepset, LlamaIndex and LangChain for building workflows; Weights & Biases for generative AI application evaluations; and CrowdStrike, Datadog, Fiddler AI, New Relic and Trend Micro for additional safeguarding. Infrastructure platform providers, including Nutanix, Red Hat and Broadcom, will support NIM Agent Blueprints on their enterprise solutions.

Customers can build and deploy NIM Agent Blueprints on NVIDIA-Certified Systems from manufacturers such as Cisco, Dell Technologies, Hewlett Packard Enterprise and Lenovo, as well as on NVIDIA-accelerated cloud instances from Amazon Web Services, Google Cloud, Microsoft Azure and Oracle Cloud Infrastructure.

To help enterprises put their data to work in generative AI applications, NIM Agent Blueprints can integrate with data and storage platforms from NVIDIA partners, such as Cohesity, Datastax, Dropbox, NetApp and VAST Data.

A Collaborative Future for Developers and Data Scientists

Generative AI is now fostering collaboration between developers and data scientists. Developers use NIM Agent Blueprints as a foundation to build their applications, while data scientists implement the data flywheel to continually improve their custom NIM microservices. As a NIM improves, so do the related applications, creating a cycle of continuous enhancement and data generation.

With NIM Agent Blueprints — and support from NVIDIA’s partners — virtually every enterprise can seamlessly integrate generative AI into their applications to drive efficiency and innovation across industries.

Enterprises can experience NVIDIA NIM Agent Blueprints today.

See notice regarding software product information.

Read More

NVIDIA Launches NIM Microservices for Generative AI in Japan, Taiwan

Nations around the world are pursuing sovereign AI to produce artificial intelligence using their own computing infrastructure, data, workforce and business networks to ensure AI systems align with local values, laws and interests.

In support of these efforts, NVIDIA today announced the availability of four new NVIDIA NIM microservices that enable developers to more easily build and deploy high-performing generative AI applications.

The microservices support popular community models tailored to meet regional needs. They enhance user interactions through accurate understanding and improved responses based on local languages and cultural heritage.

In the Asia-Pacific region alone, generative AI software revenue is expected to reach $48 billion by 2030 — up from $5 billion this year, according to ABI Research.

Llama-3-Swallow-70B, trained on Japanese data, and Llama-3-Taiwan-70B, trained on Mandarin data, are regional language models that provide a deeper understanding of local laws, regulations and other customs.

The RakutenAI 7B family of models, built on Mistral-7B, was trained on English and Japanese datasets and is available as two different NIM microservices for Chat and Instruct. Rakuten’s foundation and instruct models have achieved leading scores among open Japanese large language models, landing the top average score in the LM Evaluation Harness benchmark carried out from January to March 2024.

Training a large language model (LLM) on regional languages enhances the effectiveness of its outputs by ensuring more accurate and nuanced communication, as it better understands and reflects cultural and linguistic subtleties.

The models offer leading performance for Japanese and Mandarin language understanding, regional legal tasks, question-answering, and language translation and summarization compared with base LLMs like Llama 3.

Nations worldwide — from Singapore, the United Arab Emirates, South Korea and Sweden to France, Italy and India — are investing in sovereign AI infrastructure.

The new NIM microservices allow businesses, government agencies and universities to host native LLMs in their own environments, enabling developers to build advanced copilots, chatbots and AI assistants.

Developing Applications With Sovereign AI NIM Microservices

Developers can easily deploy the sovereign AI models, packaged as NIM microservices, into production while achieving improved performance.

The microservices, available with NVIDIA AI Enterprise, are optimized for inference with the NVIDIA TensorRT-LLM open-source library.

NIM microservices for Llama 3 70B, which was used as the base model for the new Llama-3-Swallow-70B and Llama-3-Taiwan-70B NIM microservices, can provide up to 5x higher throughput. This lowers the total cost of running the models in production and provides better user experiences by decreasing latency.

The new NIM microservices are available today as hosted application programming interfaces (APIs).
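The hosted endpoints follow an OpenAI-compatible chat completions pattern; the sketch below is a hedged example in which the base URL, environment variable and model identifier are placeholders to confirm against the NVIDIA API catalog.

```python
import os
from openai import OpenAI

# Base URL and model ID are placeholders -- confirm both in the NVIDIA API catalog.
client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
                api_key=os.environ["NVIDIA_API_KEY"])

response = client.chat.completions.create(
    model="tokyotech-llm/llama-3-swallow-70b-instruct",  # placeholder model identifier
    messages=[{"role": "user", "content": "自己紹介をしてください。"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```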

Tapping NVIDIA NIM for Faster, More Accurate Generative AI Outcomes

The NIM microservices accelerate deployments, enhance overall performance and provide the necessary security for organizations across global industries, including healthcare, finance, manufacturing, education and legal.

The Tokyo Institute of Technology fine-tuned Llama-3-Swallow-70B using Japanese-language data.

“LLMs are not mechanical tools that provide the same benefit for everyone. They are rather intellectual tools that interact with human culture and creativity. The influence is mutual where not only are the models affected by the data we train on, but also our culture and the data we generate will be influenced by LLMs,” said Rio Yokota, professor at the Global Scientific Information and Computing Center at the Tokyo Institute of Technology. “Therefore, it is of paramount importance to develop sovereign AI models that adhere to our cultural norms. The availability of Llama-3-Swallow as an NVIDIA NIM microservice will allow developers to easily access and deploy the model for Japanese applications across various industries.”

For instance, a Japanese AI company, Preferred Networks, uses the model to develop a healthcare-specific model trained on a unique corpus of Japanese medical data, called Llama3-Preferred-MedSwallow-70B, which achieves top scores on the Japan National Examination for Physicians.

Chang Gung Memorial Hospital (CGMH), one of the leading hospitals in Taiwan, is building a custom-made AI Inference Service (AIIS) to centralize all LLM applications within the hospital system. Using Llama-3-Taiwan-70B, it is improving the efficiency of frontline medical staff with more nuanced medical language that patients can understand.

“By providing instant, context-appropriate guidance, AI applications built with local-language LLMs streamline workflows and serve as a continuous learning tool to support staff development and improve the quality of patient care,” said Dr. Changfu Kuo, director of the Center for Artificial Intelligence in Medicine at CGMH, Linko Branch. “NVIDIA NIM is simplifying the development of these applications, allowing for easy access and deployment of models trained on regional languages with minimal engineering expertise.”

Taiwan-based Pegatron, a maker of electronic devices, will adopt the Llama-3-Taiwan-70B NIM microservice for internal- and external-facing applications. The company has integrated the microservice with its PEGAAi Agentic AI System to automate processes, boosting efficiency in manufacturing and operations.

The Llama-3-Taiwan-70B NIM microservice is also being used by global petrochemical manufacturer Chang Chun Group, world-leading printed circuit board company Unimicron, technology-focused media company TechOrange, online contract service company LegalSign.ai and generative AI startup APMIC. These companies are also collaborating on the open model.

Creating Custom Enterprise Models With NVIDIA AI Foundry

While regional AI models can provide culturally nuanced and localized responses, enterprises still need to fine-tune them for their business processes and domain expertise.

NVIDIA AI Foundry is a platform and service that includes popular foundation models, NVIDIA NeMo for fine-tuning, and dedicated capacity on NVIDIA DGX Cloud to provide developers a full-stack solution for creating a customized foundation model packaged as a NIM microservice.

Additionally, developers using NVIDIA AI Foundry have access to the NVIDIA AI Enterprise software platform, which provides security, stability and support for production deployments.

NVIDIA AI Foundry gives developers the necessary tools to more quickly and easily build and deploy their own custom, regional language NIM microservices to power AI applications, ensuring culturally and linguistically appropriate results for their users.

Read More

NVIDIA Launches Array of New CUDA Libraries to Expand Accelerated Computing and Deliver Order-of-Magnitude Speedup to Science and Industrial Applications

News summary: New libraries in accelerated computing deliver order-of-magnitude speedups and reduce energy consumption and costs in data processing, generative AI, recommender systems, AI data curation, 6G research, AI-physics and more. They include:

  • LLM applications: NeMo Curator, for creating custom datasets, adds image curation, and Nemotron-4 340B enables high-quality synthetic data generation
  • Data processing: cuVS for vector search builds indexes in minutes instead of days, and a new Polars GPU Engine enters open beta
  • Physical AI: For physics simulation, Warp accelerates computations with a new Tile API. For wireless network simulation, Aerial adds more map formats for ray tracing and simulation. And for link-level wireless simulation, Sionna adds a new toolchain for real-time inference

Companies around the world are increasingly turning to NVIDIA accelerated computing to speed up applications they first ran on CPUs only. This has enabled them to achieve extreme speedups and benefit from incredible energy savings.

In Houston, CPFD makes computational fluid dynamics simulation software for industrial applications, like its Barracuda Virtual Reactor software that helps design next-generation recycling facilities. Plastic recycling facilities run CPFD software in cloud instances powered by NVIDIA accelerated computing. With a CUDA GPU-accelerated virtual machine, they can efficiently scale and run simulations 400x faster and 140x more energy efficiently than using a CPU-based workstation.

Bottles being loaded into a plastics recycling facility. AI-generated image.

A popular video conferencing application captions several hundred thousand virtual meetings an hour. When using CPUs to create live captions, the app could query a transformer-powered speech recognition AI model three times a second. After migrating to GPUs in the cloud, the application’s throughput increased to 200 queries per second — a 66x speedup and 25x energy-efficiency improvement.

In homes across the globe, an e-commerce website connects hundreds of millions of shoppers a day to the products they need using an advanced recommendation system powered by a deep learning model, running on its NVIDIA accelerated cloud computing system. After switching from CPUs to GPUs in the cloud, it achieved significantly lower latency with a 33x speedup and nearly 12x energy-efficiency improvement.

With the exponential growth of data, accelerated computing in the cloud is set to enable even more innovative use cases.

NVIDIA Accelerated Computing on CUDA GPUs Is Sustainable Computing

NVIDIA estimates that if all AI, HPC and data analytics workloads that are still running on CPU servers were CUDA GPU-accelerated, data centers would save 40 terawatt-hours of energy annually. That’s the equivalent energy consumption of 5 million U.S. homes per year.

Accelerated computing uses the parallel processing capabilities of CUDA GPUs to complete jobs orders of magnitude faster than CPUs, improving productivity while dramatically reducing cost and energy consumption.

Although adding GPUs to a CPU-only server increases peak power, GPU acceleration finishes tasks quickly and then enters a low-power state. The total energy consumed with GPU-accelerated computing is significantly lower than with general-purpose CPUs, while yielding superior performance.

GPUs achieve 20x greater energy efficiency than CPU-only servers for on-premises, cloud-based and hybrid workloads because they deliver greater performance per watt, completing more tasks in less time.

In the past decade, NVIDIA AI computing has achieved approximately 100,000x more energy efficiency when processing large language models. To put that into perspective, if the efficiency of cars improved as much as NVIDIA has advanced the efficiency of AI on its accelerated computing platform, they’d get 500,000 miles per gallon. That’s enough to drive to the moon, and back, on less than a gallon of gasoline.

In addition to these dramatic boosts in efficiency on AI workloads, GPU computing can achieve incredible speedups over CPUs. Customers of the NVIDIA accelerated computing platform running workloads on cloud service providers saw speedups of 10-180x across a gamut of real-world tasks, from data processing to computer vision, as the chart below shows.

Speedups of 10-180x achieved in real-world implementations by cloud customers across workloads, including data processing, scientific computing, speech AI, recommender systems, search and computer vision, with the NVIDIA accelerated computing platform.

As workloads continue to demand exponentially more computing power, CPUs have struggled to provide the necessary performance, creating a growing performance gap and driving “compute inflation.” The chart below illustrates a multiyear trend of how data growth has far outpaced the growth in compute performance per watt of CPUs.

The widening gap between data growth and the lagging compute performance per watt of CPUs, a trend known as compute inflation.

The energy savings of GPU acceleration frees up what would otherwise have been wasted cost and energy.

With its massive energy-efficiency savings, accelerated computing is sustainable computing.

The Right Tools for Every Job 

GPUs cannot accelerate software written for general-purpose CPUs. Specialized software libraries are needed to accelerate specific workloads. Just as a mechanic keeps an entire toolbox, from screwdrivers to wrenches, for different tasks, NVIDIA provides a diverse set of libraries to perform low-level functions like parsing and executing calculations on data.

Each NVIDIA CUDA library is optimized to harness hardware features specific to NVIDIA GPUs. Combined, they encompass the power of the NVIDIA platform.

New updates continue to be added on the CUDA platform roadmap, expanding across diverse use cases:

LLM Applications

NeMo Curator gives developers the flexibility to quickly create custom datasets for large language model (LLM) use cases. NVIDIA recently announced capabilities beyond text, expanding to multimodal support, including image curation.

SDG (synthetic data generation) augments existing datasets with high-quality, synthetically generated data to customize and fine-tune models and LLM applications. NVIDIA announced Nemotron-4 340B, a new suite of models built specifically for SDG that lets businesses and developers use its outputs to build custom models.
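As a rough illustration of what an SDG call might look like, the snippet below queries a hosted Nemotron-4 340B Instruct endpoint through the OpenAI-compatible Python client to draft synthetic question-answer pairs. The endpoint URL, model identifier and API-key environment variable are assumptions for illustration, not details confirmed in this post.

```python
# Hypothetical synthetic data generation sketch via an OpenAI-compatible client.
# The base_url, model name and NVIDIA_API_KEY variable are assumptions.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",   # assumed hosted endpoint
    api_key=os.environ["NVIDIA_API_KEY"],              # assumed credential
)

response = client.chat.completions.create(
    model="nvidia/nemotron-4-340b-instruct",           # assumed model ID
    messages=[{
        "role": "user",
        "content": "Write three question-answer pairs about GPU energy efficiency "
                   "suitable for fine-tuning a customer-support assistant.",
    }],
    temperature=0.7,
)
print(response.choices[0].message.content)
```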

Data Processing Applications

cuVS is an open-source library for GPU-accelerated vector search and clustering that delivers incredible speed and efficiency across LLMs and semantic search. The latest cuVS release allows large indexes to be built in minutes instead of hours or even days, and to be searched at scale.
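Below is a minimal sketch of GPU-accelerated vector search, assuming cuVS's CAGRA Python bindings (cuvs.neighbors.cagra) and CuPy device arrays; parameter and function names may differ between releases, so treat this as a sketch rather than the definitive API.

```python
# Sketch of building and searching a CAGRA index with cuVS (API assumed from the
# cuvs.neighbors.cagra Python bindings; check current docs for exact signatures).
import cupy as cp
from cuvs.neighbors import cagra

dataset = cp.random.random((100_000, 128)).astype(cp.float32)  # vectors on the GPU
queries = cp.random.random((1_000, 128)).astype(cp.float32)

index = cagra.build(cagra.IndexParams(), dataset)              # minutes, not hours
distances, neighbors = cagra.search(cagra.SearchParams(), index, queries, 10)
```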

Polars is an open-source library that makes use of query optimizations and other techniques to process hundreds of millions of rows of data efficiently on a single machine. A new Polars GPU engine powered by NVIDIA’s cuDF library will be available in open beta. It delivers up to a 10x performance boost compared to CPU, bringing the energy savings of accelerated computing to data practitioners and their applications.
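For instance, the GPU engine is selected when a lazy query is collected; a minimal sketch, assuming the open-beta GPU engine is installed and using a placeholder CSV path:

```python
# Sketch of running a lazy Polars query on the GPU engine (open beta, powered by cuDF).
# The CSV file path and column names are placeholders.
import polars as pl

lazy = (
    pl.scan_csv("transactions.csv")
      .filter(pl.col("amount") > 100)
      .group_by("customer_id")
      .agg(pl.col("amount").sum().alias("total_spent"))
)

result = lazy.collect(engine="gpu")   # runs on the GPU; unsupported queries fall back to CPU
print(result.head())
```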

Physical AI

Warp, for high-performance GPU simulation and graphics, helps accelerate spatial computing by making it easier to write differentiable programs for physics simulation, perception, robotics and geometry processing. The next release will have support for a new Tile API that allows developers to use Tensor Cores inside GPUs for matrix and Fourier computations.
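To illustrate the programming model, here is a minimal Warp sketch that defines a kernel in Python and launches it on the GPU (illustrative only; it does not use the upcoming Tile API mentioned above):

```python
# Minimal NVIDIA Warp example: a Python-defined kernel launched on the GPU.
import warp as wp

wp.init()

@wp.kernel
def scale(values: wp.array(dtype=float), factor: float):
    i = wp.tid()                      # one thread per array element
    values[i] = values[i] * factor

n = 1024
values = wp.array([1.0] * n, dtype=float, device="cuda")
wp.launch(scale, dim=n, inputs=[values, 2.0])
print(values.numpy()[:4])             # copy results back to the host
```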

Aerial is a suite of accelerated computing platforms that includes Aerial CUDA-Accelerated RAN and Aerial Omniverse Digital Twin for designing, simulating and operating wireless networks for commercial applications and industry research. The next release will expand Aerial with more map formats for ray tracing and higher-accuracy simulations.

Sionna is a GPU-accelerated open-source library for link-level simulations of wireless and optical communication systems. With GPUs, Sionna achieves orders-of-magnitude faster simulation, enabling interactive exploration of these systems and paving the way for next-generation physical layer research. The next release will include the entire toolchain required to design, train and evaluate neural network-based receivers, including support for real-time inference of such neural receivers using NVIDIA TensorRT.

NVIDIA provides over 400 libraries. Some, like CV-CUDA, excel at pre- and post-processing of computer vision tasks common in user-generated video, recommender systems, mapping and video conferencing. Others, like cuDF, accelerate data frames and tables central to SQL databases and pandas in data science.
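On the data frame side, cuDF can be used directly or as a drop-in accelerator for existing pandas code; a minimal sketch with synthetic data:

```python
# Sketch of GPU data frames with cuDF; the columns and values are made up for illustration.
import cudf

gdf = cudf.DataFrame({
    "user_id": [1, 2, 1, 3, 2],
    "clicks":  [5, 3, 7, 1, 4],
})
clicks_per_user = gdf.groupby("user_id")["clicks"].sum()   # runs on the GPU
print(clicks_per_user)

# Alternatively, cudf.pandas accelerates existing pandas scripts without code changes,
# e.g. by running:  python -m cudf.pandas my_script.py
```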

NVIDIA's library ecosystem spans domains including computer-aided design (CAD), computer-aided engineering (CAE) and electronic design automation (EDA).

Many of these libraries are versatile — for example, cuBLAS for linear algebra acceleration — and can be used across multiple workloads, while others are highly specialized to focus on a specific use case, like cuLitho for silicon computational lithography.
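As an example of the versatile end of that spectrum, a Python library such as CuPy routes standard dense linear algebra, including matrix multiplication, to cuBLAS on the GPU; a minimal sketch:

```python
# Sketch: dense matrix multiplication with CuPy, which dispatches GEMM to cuBLAS.
import cupy as cp

a = cp.random.random((4096, 4096)).astype(cp.float32)
b = cp.random.random((4096, 4096)).astype(cp.float32)

c = a @ b                            # GEMM executed on the GPU via cuBLAS
cp.cuda.Stream.null.synchronize()    # wait for the asynchronous GPU work to finish
print(c.shape)
```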

For researchers who don’t want to build their own pipelines with NVIDIA CUDA-X libraries, NVIDIA NIM provides a streamlined path to production deployment by packaging multiple libraries and AI models into optimized containers. The containerized microservices deliver improved throughput out of the box.
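Because NIM microservices expose an OpenAI-compatible API, existing client code only needs to point at the running container. A minimal sketch, assuming a locally deployed container serving on port 8000 and a hypothetical model name:

```python
# Sketch of querying a locally running NIM container through its OpenAI-compatible API.
# The localhost port and model name are assumptions for illustration.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta/llama3-8b-instruct",   # assumed model served by the container
        "messages": [
            {"role": "user", "content": "Summarize why GPU acceleration saves energy."}
        ],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```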

Augmenting these libraries' performance is an expanding number of hardware-based acceleration features that deliver speedups with the highest energy efficiency. The NVIDIA Blackwell platform, for example, includes a decompression engine that unpacks compressed data files inline up to 18x faster than CPUs. This dramatically accelerates data processing applications, such as SQL databases, Apache Spark and pandas, that frequently access compressed files in storage and decompress them for runtime computation.

The integration of NVIDIA's specialized CUDA GPU-accelerated libraries into cloud computing platforms delivers remarkable speed and energy efficiency across a wide range of workloads. This combination drives significant cost savings for businesses and plays a crucial role in advancing sustainable computing, helping the billions of users who rely on cloud-based workloads benefit from a more sustainable and cost-effective digital ecosystem.

Learn more about NVIDIA’s sustainable computing efforts and check out the Energy Efficiency Calculator to discover potential energy and emissions savings.

See notice regarding software product information.


NVIDIA to Present Innovations at Hot Chips That Boost Data Center Performance and Energy Efficiency


A deep technology conference for processor and system architects from industry and academia has become a key forum for the trillion-dollar data center computing market.

At Hot Chips 2024 next week, senior NVIDIA engineers will present the latest advancements powering the NVIDIA Blackwell platform, plus research on liquid cooling for data centers and AI agents for chip design.

They’ll share how:

  • NVIDIA Blackwell brings together multiple chips, systems and NVIDIA CUDA software to power the next generation of AI across use cases, industries and countries.
  • NVIDIA GB200 NVL72 — a multi-node, liquid-cooled, rack-scale solution that connects 72 Blackwell GPUs and 36 Grace CPUs — raises the bar for AI system design.
  • NVLink interconnect technology provides all-to-all GPU communication, enabling record high throughput and low-latency inference for generative AI.
  • The NVIDIA Quasar Quantization System pushes the limits of physics to accelerate AI computing.
  • NVIDIA researchers are building AI models that help build processors for AI.

An NVIDIA Blackwell talk, taking place Monday, Aug. 26, will also spotlight new architectural details and examples of generative AI models running on Blackwell silicon.

It’s preceded by three tutorials on Sunday, Aug. 25, that will cover how hybrid liquid-cooling solutions can help data centers transition to more energy-efficient infrastructure and how AI models, including large language model (LLM)-powered agents, can help engineers design the next generation of processors.

Together, these presentations showcase the ways NVIDIA engineers are innovating across every area of data center computing and design to deliver unprecedented performance, efficiency and optimization.

Be Ready for Blackwell

NVIDIA Blackwell is the ultimate full-stack computing challenge. It comprises multiple NVIDIA chips, including the Blackwell GPU, Grace CPU, BlueField data processing unit, ConnectX network interface card, NVLink Switch, Spectrum Ethernet switch and Quantum InfiniBand switch.

Ajay Tirumala and Raymond Wong, directors of architecture at NVIDIA, will provide a first look at the platform and explain how these technologies work together to deliver a new standard for AI and accelerated computing performance while advancing energy efficiency.

The multi-node NVIDIA GB200 NVL72 solution is a perfect example. LLM inference requires low-latency, high-throughput token generation. GB200 NVL72 acts as a unified system to deliver up to 30x faster inference for LLM workloads, unlocking the ability to run trillion-parameter models in real time.

Tirumala and Wong will also discuss how the NVIDIA Quasar Quantization System — which brings together algorithmic innovations, NVIDIA software libraries and tools, and Blackwell’s second-generation Transformer Engine — supports high accuracy on low-precision models, highlighting examples using LLMs and visual generative AI.

Keeping Data Centers Cool

The traditional hum of air-cooled data centers may become a relic of the past as researchers develop more efficient and sustainable solutions that use hybrid cooling, a combination of air and liquid cooling.

Liquid-cooling techniques move heat away from systems more efficiently than air, making it easier for computing systems to stay cool even while processing large workloads. The equipment for liquid cooling also takes up less space and consumes less power than air-cooling systems, allowing data centers to add more server racks — and therefore more compute power — in their facilities.

Ali Heydari, director of data center cooling and infrastructure at NVIDIA, will present several designs for hybrid-cooled data centers.

Some designs retrofit existing air-cooled data centers with liquid-cooling units, offering a quick and easy solution to add liquid-cooling capabilities to existing racks. Other designs require the installation of piping for direct-to-chip liquid cooling using cooling distribution units or by entirely submerging servers in immersion cooling tanks. Although these options demand a larger upfront investment, they lead to substantial savings in both energy consumption and operational costs.

Heydari will also share his team’s work as part of COOLERCHIPS, a U.S. Department of Energy program to develop advanced data center cooling technologies. As part of the project, the team is using the NVIDIA Omniverse platform to create physics-informed digital twins that will help them model energy consumption and cooling efficiency to optimize their data center designs.

AI Agents Chip In for Processor Design

Semiconductor design is a mammoth challenge at microscopic scale. Engineers developing cutting-edge processors work to fit as much computing power as they can onto a piece of silicon a few inches across, testing the limits of what’s physically possible.

AI models are supporting their work by improving design quality and productivity, boosting the efficiency of manual processes and automating some time-consuming tasks. The models include prediction and optimization tools to help engineers rapidly analyze and improve designs, as well as LLMs that can assist engineers with answering questions, generating code, debugging design problems and more.

Mark Ren, director of design automation research at NVIDIA, will provide an overview of these models and their uses in a tutorial. In a second session, he’ll focus on agent-based AI systems for chip design.

AI agents powered by LLMs can be directed to complete tasks autonomously, unlocking broad applications across industries. In microprocessor design, NVIDIA researchers are developing agent-based systems that can reason and take action using customized circuit design tools, interact with experienced designers, and learn from a database of human and agent experiences.

NVIDIA experts aren’t just building this technology — they’re using it. Ren will share examples of how engineers can use AI agents for timing report analysis, cell cluster optimization processes and code generation. The cell cluster optimization work recently won best paper at the first IEEE International Workshop on LLM-Aided Design.

Register for Hot Chips, taking place Aug. 25-27, at Stanford University and online.
