Businesses across every industry are rolling out AI services this year. For Microsoft, Oracle, Perplexity, Snap and hundreds of other leading companies, the NVIDIA AI inference platform — a full stack comprising world-class silicon, systems and software — is the key to delivering high-throughput, low-latency inference that enables great user experiences while lowering cost.
NVIDIA’s advancements in inference software optimization and the NVIDIA Hopper platform are helping industries serve the latest generative AI models, delivering excellent user experiences while optimizing total cost of ownership. The Hopper platform also helps deliver up to 15x more energy efficiency for inference workloads compared to previous generations.
AI inference is notoriously difficult, as it requires many steps to strike the right balance between throughput and user experience.
But the underlying goal is simple: generate more tokens at a lower cost. Tokens are the units of text a large language model (LLM) processes and generates, roughly corresponding to words or word fragments — and with AI inference services typically charging per million tokens generated, this goal offers the most visible return on AI investment and on the energy used per task.
Full-stack software optimization offers the key to improving AI inference performance and achieving this goal.
Cost-Effective User Throughput
Businesses are often challenged with balancing the performance and costs of inference workloads. While some customers or use cases may work with an out-of-the-box or hosted model, others may require customization. NVIDIA technologies simplify model deployment while optimizing cost and performance for AI inference workloads. In addition, customers can experience flexibility and customizability with the models they choose to deploy.
NVIDIA NIM microservices, NVIDIA Triton Inference Server and the NVIDIA TensorRT library are among the inference solutions NVIDIA offers to suit users’ needs:
- NVIDIA NIM inference microservices are prepackaged and performance-optimized for rapidly deploying AI foundation models on any infrastructure — cloud, data centers, edge or workstations (a minimal client sketch follows this list).
- NVIDIA Triton Inference Server, one of the company’s most popular open-source projects, allows users to package and serve any model regardless of the AI framework it was trained on.
- NVIDIA TensorRT is a high-performance deep learning inference library that includes runtime and model optimizations to deliver low-latency and high-throughput inference for production applications.
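For a sense of how these components are used in practice, here is a minimal sketch of calling a deployed NIM LLM microservice through its OpenAI-compatible endpoint. It assumes a NIM container is already running locally on port 8000 and that the model identifier shown is the one the container was started with; both are placeholders to adjust for your environment.

```python
# Minimal sketch: query a running NVIDIA NIM LLM microservice through its
# OpenAI-compatible API. Assumes the NIM container is serving on localhost:8000
# and "meta/llama-3.1-8b-instruct" is the model it was launched with (placeholder).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # NIM's OpenAI-compatible endpoint
    api_key="not-used",                   # local NIM deployments ignore the key
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Summarize AI inference in one sentence."}],
    max_tokens=64,
)

print(response.choices[0].message.content)
```

Because NIM speaks the same API as hosted LLM services, applications written against it can move between cloud, data center and workstation deployments without code changes.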
Available in all major cloud marketplaces, the NVIDIA AI Enterprise software platform includes all these solutions and provides enterprise-grade support, stability, manageability and security.
With the framework-agnostic NVIDIA AI inference platform, companies save on development, infrastructure and setup costs while improving productivity. Using NVIDIA technologies can also boost business revenue by helping companies avoid downtime and fraudulent transactions, increase e-commerce shopping conversion rates and generate new, AI-powered revenue streams.
Cloud-Based LLM Inference
To ease LLM deployment, NVIDIA has collaborated closely with every major cloud service provider to ensure that the NVIDIA inference platform can be seamlessly deployed in the cloud with minimal or no code required. NVIDIA NIM is integrated with cloud-native services such as:
- Amazon SageMaker AI, Amazon Bedrock Marketplace, Amazon Elastic Kubernetes Service
- Google Cloud’s Vertex AI, Google Kubernetes Engine
- Microsoft Azure AI Foundry (coming soon), Azure Kubernetes Service
- Oracle Cloud Infrastructure’s data science tools, Oracle Cloud Infrastructure Kubernetes Engine
Plus, for customized inference deployments, NVIDIA Triton Inference Server is deeply integrated into all major cloud service providers.
For example, using the OCI Data Science platform, deploying NVIDIA Triton is as simple as turning on a switch in the command line arguments during model deployment, which instantly launches an NVIDIA Triton inference endpoint.
Similarly, with Azure Machine Learning, users can deploy NVIDIA Triton either with no-code deployment through the Azure Machine Learning Studio or full-code deployment with the Azure Machine Learning CLI. AWS provides one-click deployment for NVIDIA NIM from SageMaker Marketplace and offers NVIDIA Triton in its AWS Deep Learning Containers, while Google Cloud provides a one-click deployment option on Google Kubernetes Engine (GKE).
The NVIDIA AI inference platform also supports standard communication protocols, such as HTTP and gRPC, for delivering AI predictions, automatically adjusting to accommodate the growing and changing needs of users within a cloud-based infrastructure.
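To make that serving flow concrete, the sketch below sends a single inference request to a Triton Inference Server over HTTP using the `tritonclient` Python package. The server address, model name and tensor names are placeholders for illustration; in a real deployment they must match the model's configuration.

```python
# Minimal sketch: send one inference request to a Triton Inference Server over HTTP.
# The URL, model name and tensor names are placeholders; align them with your
# deployment's config.pbtxt. Requires: pip install "tritonclient[http]" numpy
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare a single FP32 input tensor of shape [1, 16]
input_tensor = httpclient.InferInput("INPUT0", [1, 16], "FP32")
input_tensor.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

result = client.infer(
    model_name="my_model",
    inputs=[input_tensor],
    outputs=[httpclient.InferRequestedOutput("OUTPUT0")],
)

print(result.as_numpy("OUTPUT0"))
```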
From accelerating LLMs to enhancing creative workflows and transforming agreement management, NVIDIA’s AI inference platform is driving real-world impact across industries. Learn how collaboration and innovation are enabling the organizations below to achieve new levels of efficiency and scalability.
Serving 400 Million Search Queries Monthly With Perplexity AI
Perplexity AI, an AI-powered search engine, handles over 435 million monthly queries. Each query represents multiple AI inference requests. To meet this demand, the Perplexity AI team turned to NVIDIA H100 GPUs, Triton Inference Server and TensorRT-LLM.
Supporting over 20 AI models, including Llama 3 variations like 8B and 70B, Perplexity processes diverse tasks such as search, summarization and question-answering. By using smaller classifier models to route tasks to GPU pods, managed by NVIDIA Triton, the company delivers cost-efficient, responsive service under strict service level agreements.
Through model parallelism, which splits LLMs across GPUs, Perplexity achieved a threefold cost reduction while maintaining low latency and high accuracy. This best-practice framework demonstrates how IT teams can meet growing AI demands, optimize total cost of ownership and scale seamlessly with NVIDIA accelerated computing.
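As a rough illustration of what splitting an LLM across GPUs looks like in code, the sketch below uses TensorRT-LLM's high-level Python API with tensor parallelism across two GPUs. It assumes a recent TensorRT-LLM release where the top-level `LLM` class, its `tensor_parallel_size` argument and `SamplingParams` are available; the model identifier is a placeholder, and this is not Perplexity's production setup.

```python
# Minimal sketch: tensor (model) parallelism with TensorRT-LLM's high-level LLM API.
# The model's weights are sharded across two GPUs that cooperate on every token.
# Model id is a placeholder; assumes a recent TensorRT-LLM release with this API.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder Hugging Face model id
    tensor_parallel_size=2,                       # split the model across 2 GPUs
)

outputs = llm.generate(
    ["What is speculative decoding?"],
    SamplingParams(max_tokens=64),
)

for output in outputs:
    print(output.outputs[0].text)
```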
Reducing Response Times With Recurrent Drafter (ReDrafter)
Open-source research advancements are helping to democratize AI inference. Recently, NVIDIA incorporated ReDrafter, an open-source approach to speculative decoding published by Apple, into NVIDIA TensorRT-LLM.
ReDrafter uses smaller “draft” modules to predict tokens in parallel, which are then validated by the main model. This technique significantly reduces response times for LLMs, particularly during periods of low traffic.
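ReDrafter's recurrent drafting scheme is more involved than this, but the draft-then-verify idea at the heart of speculative decoding can be shown with a toy example. The sketch below uses two trivial stand-in "models" implemented as plain Python functions; it is a conceptual illustration only, not Apple's or NVIDIA's implementation.

```python
# Toy illustration of the draft-then-verify idea behind speculative decoding.
# A cheap "draft model" proposes several tokens at once; the expensive "target
# model" checks them in one pass and keeps the longest agreeing prefix.
# Both models here are trivial stand-ins, not real networks.

def draft_model(context, k=4):
    # Cheap guesser: propose the next k tokens (here, echo a fixed phrase).
    canned = ["the", "cat", "sat", "on", "the", "mat"]
    return canned[len(context):len(context) + k]

def target_model(context, proposed):
    # Expensive verifier: accept tokens it agrees with; on the first
    # disagreement, emit its own correction and stop.
    truth = ["the", "dog", "sat", "on", "the", "mat"]
    accepted = []
    for i, tok in enumerate(proposed):
        if truth[len(context) + i] == tok:
            accepted.append(tok)
        else:
            accepted.append(truth[len(context) + i])  # correction token
            break
    return accepted

context = []
while len(context) < 6:
    proposal = draft_model(context)
    context += target_model(context, proposal)

print(" ".join(context))  # -> "the dog sat on the mat"
```

Because several proposed tokens can be accepted per verification pass, the expensive model runs far fewer times than it would generating one token at a time — the source of ReDrafter's latency savings.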
Transforming Agreement Management With Docusign
Docusign, a leader in digital agreement management, turned to NVIDIA to supercharge its Intelligent Agreement Management platform. With over 1.5 million customers globally, Docusign needed to optimize throughput and manage infrastructure expenses while delivering AI-driven insights.
NVIDIA Triton provided a unified inference platform for all frameworks, accelerating time to market and boosting productivity by transforming agreement data into actionable insights. Docusign’s adoption of the NVIDIA inference platform underscores the positive impact of scalable AI infrastructure on customer experiences and operational efficiency.
“NVIDIA Triton makes our lives easier,” said Alex Zakhvatov, senior product manager at Docusign. “We no longer need to deploy bespoke, framework-specific inference servers for our AI models. We leverage Triton as a unified inference server for all AI frameworks and also use it to identify the right production scenario to optimize cost- and performance-saving engineering efforts.”
Enhancing Customer Care in Telco With Amdocs
Amdocs, a leading provider of software and services for communications and media providers, built amAIz, a domain-specific generative AI platform for telcos as an open, secure, cost-effective and LLM-agnostic framework. Amdocs is using NVIDIA DGX Cloud and NVIDIA AI Enterprise software to provide solutions based on commercially available LLMs as well as domain-adapted models, enabling service providers to build and deploy enterprise-grade generative AI applications.
Using NVIDIA NIM, Amdocs reduced the number of tokens consumed for deployed use cases by up to 60% in data preprocessing and 40% in inferencing, offering the same level of accuracy with a significantly lower cost per token, depending on various factors and volumes used. The collaboration also reduced query latency by approximately 80%, ensuring that end users experience near real-time responses. This acceleration enhances user experiences across commerce, customer service, operations and beyond.
Revolutionizing Retail With AI on Snap
Shopping for the perfect outfit has never been easier, thanks to Snap’s Screenshop feature. Integrated into Snapchat, this AI-powered tool helps users find fashion items seen in photos. NVIDIA Triton played a pivotal role in enabling Screenshop’s pipeline, which processes images using multiple frameworks, including TensorFlow and PyTorch.
By consolidating its pipeline onto a single inference serving platform, Snap significantly reduced development time and costs while ensuring seamless deployment of updated models. The result is a frictionless user experience powered by AI.
“We didn’t want to deploy bespoke inference serving platforms for our Screenshop pipeline, a TF-serving platform for TensorFlow and a TorchServe platform for PyTorch,” explained Ke Ma, a machine learning engineer at Snap. “Triton’s framework-agnostic design and support for multiple backends like TensorFlow, PyTorch and ONNX was very compelling. It allowed us to serve our end-to-end pipeline using a single inference serving platform, which reduces our inference serving costs and the number of developer days needed to update our models in production.”
Following the successful launch of the Screenshop service on NVIDIA Triton, Ma and his team turned to NVIDIA TensorRT to further enhance their system’s performance. By applying the default NVIDIA TensorRT settings during the compilation process, the Screenshop team immediately saw a 3x surge in throughput, estimated to deliver a staggering 66% cost reduction.
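The "default settings" path Ma describes roughly corresponds to compiling an exported model into a TensorRT engine without any custom tuning. Below is a minimal sketch of that flow for an ONNX model using the TensorRT Python API; the file names are placeholders, the API details assume TensorRT 8.x/9.x (newer releases deprecate the explicit-batch flag), and this is not Snap's actual build script.

```python
# Minimal sketch: compile an ONNX model into a TensorRT engine with default
# builder settings. File names are placeholders; assumes TensorRT 8.x/9.x.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()  # default optimization settings
engine_bytes = builder.build_serialized_network(network, config)

with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```

The resulting `.plan` engine can then be served directly by Triton's TensorRT backend.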
Financial Freedom Powered by AI With Wealthsimple
Wealthsimple, a Canadian investment platform managing over C$30 billion in assets, redefined its approach to machine learning with NVIDIA’s AI inference platform. By standardizing its infrastructure, Wealthsimple slashed model delivery time from months to under 15 minutes, eliminating downtime and empowering teams to deliver machine learning as a service.
By adopting NVIDIA Triton and running its models through AWS, Wealthsimple achieved 99.999% uptime, ensuring seamless predictions for over 145 million transactions annually. This transformation highlights how robust AI infrastructure can revolutionize financial services.
“NVIDIA’s AI inference platform has been the linchpin in our organization’s ML success story, revolutionizing our model deployment, reducing downtime and enabling us to deliver unparalleled service to our clients,” said Mandy Gu, senior software development manager at Wealthsimple.
Elevating Creative Workflows With Let’s Enhance
AI-powered image generation has transformed creative workflows and can be applied to enterprise use cases such as creating personalized content and imaginative backgrounds for marketing visuals. While diffusion models are powerful tools for enhancing creative workflows, the models can be computationally expensive.
To optimize its workflows using the Stable Diffusion XL model in production, Let’s Enhance, a pioneering AI startup, chose the NVIDIA AI inference platform.
Let’s Enhance’s latest product, AI Photoshoot, uses the SDXL model to transform plain product photos into beautiful visual assets for e-commerce websites and marketing campaigns.
With NVIDIA Triton’s robust support for various frameworks and backends, coupled with its dynamic batching feature set, Let’s Enhance was able to seamlessly integrate the SDXL model into existing AI pipelines with minimal involvement from engineering teams, freeing up their time for research and development efforts.
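Dynamic batching is switched on per model in Triton's `config.pbtxt`. The hypothetical fragment below shows the relevant fields; the batch sizes and queue delay are illustrative values that would be tuned against a diffusion model's latency budget, not settings taken from Let's Enhance's deployment.

```
# Illustrative fragment of a Triton model config.pbtxt enabling dynamic batching.
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 2, 4, 8 ]
  max_queue_delay_microseconds: 100000
}
```

With these fields set, Triton transparently groups concurrent requests into larger batches, raising GPU utilization without changes to client code.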
Accelerating Cloud-Based Vision AI With OCI
Oracle Cloud Infrastructure (OCI) integrated NVIDIA Triton to power its Vision AI service, enhancing prediction throughput by up to 76% and reducing latency by 51%. These optimizations improved customer experiences with applications including automating toll billing for transit agencies and streamlining invoice recognition for global businesses.
With Triton’s hardware-agnostic capabilities, OCI has expanded its AI services portfolio, offering robust and efficient solutions across its global data centers.
“Our AI platform is Triton-aware for the benefit of our customers,” said Tzvi Keisar, a director of product management for OCI’s data science service, which handles machine learning for Oracle’s internal and external users.
Real-Time Contextualized Intelligence and Search Efficiency With Microsoft
Azure offers one of the broadest selections of virtual machines powered and optimized by NVIDIA AI. These virtual machines encompass multiple generations of NVIDIA GPUs, including NVIDIA Blackwell and NVIDIA Hopper systems.
Building on this rich history of engineering collaboration, NVIDIA GPUs and NVIDIA Triton now help accelerate AI inference in Copilot for Microsoft 365. Available as a dedicated physical keyboard key on Windows PCs, Microsoft 365 Copilot combines the power of LLMs with proprietary enterprise data to deliver real-time contextualized intelligence, enabling users to enhance their creativity, productivity and skills.
Microsoft Bing also used NVIDIA inference solutions to address challenges including latency, cost and speed. By integrating NVIDIA TensorRT-LLM techniques, Microsoft significantly improved inference performance for its Deep Search feature, which powers optimized web results.
Deep Search walkthrough courtesy of Microsoft
Microsoft Bing Visual Search enables people around the world to find content using photographs as queries. The heart of this capability is Microsoft’s TuringMM visual embedding model that maps images and text into a shared high-dimensional space. Because it operates on billions of images across the web, performance is critical.
Microsoft Bing optimized the TuringMM pipeline using NVIDIA TensorRT and NVIDIA acceleration libraries including CV-CUDA and nvImageCodec. These efforts resulted in a 5.13x speedup and significant TCO reduction.
Unlocking the Full Potential of AI Inference With Hardware Innovation
Improving the efficiency of AI inference workloads is a multifaceted challenge that demands innovative technologies across hardware and software.
NVIDIA GPUs are at the forefront of AI enablement, offering high efficiency and performance for AI models. They’re also the most energy efficient: over the past decade, NVIDIA accelerated computing has cut the energy used to generate each token by 100,000x for inference of trillion-parameter AI models, culminating in the NVIDIA Blackwell architecture.
The NVIDIA Grace Hopper Superchip, which combines NVIDIA Grace CPU and Hopper GPU architectures using NVIDIA NVLink-C2C, delivers substantial inference performance improvements across industries.
Meta Andromeda is using the superchip for efficient and high-performing personalized ads retrieval. By creating deep neural networks with increased compute complexity and parallelism, it has achieved an 8% ad quality improvement on select segments and a 6% recall improvement on Facebook and Instagram.
With optimized retrieval models and low-latency, high-throughput and memory-IO aware GPU operators, Andromeda offers a 100x improvement in feature extraction speed compared to previous CPU-based components. This integration of AI at the retrieval stage has allowed Meta to lead the industry in ads retrieval, addressing challenges like scalability and latency for a better user experience and higher return on ad spend.
As cutting-edge AI models continue to grow in size, the amount of compute required to generate each token also grows. To run state-of-the-art LLMs in real time, enterprises need multiple GPUs working in concert. Tools like the NVIDIA Collective Communication Library, or NCCL, enable multi-GPU systems to quickly exchange large amounts of data between GPUs with minimal communication time.
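As a concrete, minimal illustration of the kind of collective operation NCCL accelerates, the sketch below uses PyTorch's distributed package with its NCCL backend to sum a tensor across all GPUs on one host. It assumes a multi-GPU machine and a `torchrun` launch, and is not tied to any specific model discussed above.

```python
# Minimal sketch: an NCCL all-reduce across GPUs via PyTorch's distributed package.
# Launch with:  torchrun --nproc_per_node=<num_gpus> allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")     # NCCL handles GPU-to-GPU communication
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
    torch.cuda.set_device(local_rank)

    # Each rank contributes its own tensor; all_reduce sums them in place on every GPU.
    x = torch.full((4,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The same collective primitives underpin multi-GPU LLM serving, where partial results computed on each GPU must be combined for every generated token.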
Future AI Inference Innovations
The future of AI inference promises significant advances in both performance and cost.
The combination of NVIDIA software, novel techniques and advanced hardware will enable data centers to handle increasingly complex and diverse workloads. AI inference will continue to drive advancements in industries such as healthcare and finance by enabling more accurate predictions, faster decision-making and better user experiences.
Learn more about how NVIDIA is delivering breakthrough inference performance results and stay up to date with the latest AI inference performance updates.