India Enterprises Serve Over a Billion Local Language Speakers Using LLMs Built With NVIDIA AI

Namaste, vanakkam, sat sri akaal — these are just three forms of greeting in India, a country with 22 constitutionally recognized languages and over 1,500 more recorded by the country’s census. Around 10% of its residents speak English, the internet’s most common language.

As India, the world’s most populous country, forges ahead with rapid digitalization efforts, its enterprises and local startups are developing multilingual AI models that enable more Indians to interact with technology in their primary language. It’s a case study in sovereign AI — the development of domestic AI infrastructure that is built on local datasets and reflects a region’s specific dialects, cultures and practices.

These projects are building language models for Indic languages and English that can power customer service AI agents for businesses, rapidly translate content to broaden access to information, and enable services to more easily reach a diverse population of over 1.4 billion individuals.

To support initiatives like these, NVIDIA has released a small language model for Hindi, India’s most prevalent language with over half a billion speakers. Now available as an NVIDIA NIM microservice, the model, dubbed Nemotron-4-Mini-Hindi-4B, can be easily deployed on any NVIDIA GPU-accelerated system for optimized performance.
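
As a rough sketch of how a deployed Nemotron Hindi NIM microservice could be queried, the snippet below sends a Hindi prompt through the OpenAI-compatible API that NIM microservices expose. The local endpoint URL and model identifier are assumptions for illustration only; confirm them against your deployment and the NIM documentation.

```python
# Hedged sketch: query a locally deployed Nemotron Hindi NIM microservice.
# NIM microservices expose an OpenAI-compatible REST API; the base URL and
# model name below are assumptions for illustration, not confirmed values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local NIM endpoint
    api_key="not-used",                   # local deployments typically ignore the key
)

response = client.chat.completions.create(
    model="nvidia/nemotron-4-mini-hindi-4b-instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "भारत की राजधानी क्या है?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```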

Tech Mahindra, an Indian IT services and consulting company, is the first to use the Nemotron Hindi NIM microservice to develop an AI model called Indus 2.0, which is focused on Hindi and dozens of its dialects. Indus 2.0 harnesses Tech Mahindra’s high-quality fine-tuning data to further boost model accuracy, unlocking opportunities for clients in banking, education, healthcare and other industries to deliver localized services.

Tech Mahindra will showcase Indus 2.0 at the NVIDIA AI Summit, taking place Oct. 23-25 in Mumbai. The company also uses NVIDIA NeMo to develop its sovereign large language model (LLM) platform, TeNo.

NVIDIA NIM Makes AI Adoption for Hindi as Easy as Ek, Do, Teen

The Nemotron Hindi model has 4 billion parameters and is derived from Nemotron-4 15B, a 15-billion-parameter multilingual language model developed by NVIDIA. The model was pruned, distilled and trained with a combination of real-world Hindi data, synthetic Hindi data and an equal amount of English data using NVIDIA NeMo, an end-to-end, cloud-native framework and suite of microservices for developing generative AI.

The dataset was created with NVIDIA NeMo Curator, which improves generative AI model accuracy by processing high-quality multimodal data at scale for training and customization. NeMo Curator uses NVIDIA RAPIDS libraries to accelerate data processing pipelines on multi-node GPU systems, lowering processing time and total cost of ownership. It also provides pre-built pipelines and building blocks for synthetic data generation, data filtering, classification and deduplication to process high-quality data.
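
As a concrete, minimal illustration of one curation building block, the sketch below performs exact-hash deduplication in plain Python. It is not the NeMo Curator API; NeMo Curator applies this kind of step, along with filtering and classification, at scale on multi-node GPU systems.

```python
# Minimal sketch of exact deduplication, one building block of data curation.
# Plain Python for illustration only -- not the NeMo Curator API.
import hashlib

def dedupe_exact(documents):
    """Drop documents whose normalized text has already been seen."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["नमस्ते दुनिया", "नमस्ते दुनिया", "वणक्कम"]
print(dedupe_exact(docs))  # the duplicate greeting is removed
```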

After fine-tuning with NeMo, the final model leads on multiple accuracy benchmarks for AI models with up to 8 billion parameters. Packaged as a NIM microservice, it can be easily harnessed to support use cases across industries such as education, retail and healthcare.

It’s available as part of the NVIDIA AI Enterprise software platform, which gives businesses access to additional resources, including technical support and enterprise-grade security, to streamline AI development for production environments.

Bevy of Businesses Serves Multilingual Population

Innovators, major enterprises and global systems integrators across India are building customized language models using NVIDIA NeMo.

Companies in the NVIDIA Inception program for cutting-edge startups are using NeMo to develop AI models for several Indic languages.

Sarvam AI offers enterprise customers speech-to-text, text-to-speech, translation and data parsing models. The company developed Sarvam 1, India’s first homegrown, multilingual LLM, which was trained from scratch on domestic AI infrastructure powered by NVIDIA H100 Tensor Core GPUs.

Sarvam 1 — developed using NVIDIA AI Enterprise software including NeMo Curator and NeMo Framework — supports English and 10 major Indian languages, including Bengali, Marathi, Tamil and Telugu.

Sarvam AI also uses NVIDIA NIM microservices, NVIDIA Riva for conversational AI, NVIDIA TensorRT-LLM software and NVIDIA Triton Inference Server to optimize and deploy conversational AI agents with sub-second latency.

Another Inception startup, Gnani.ai, built a multilingual speech-to-speech LLM that powers AI customer service assistants that handle around 10 million real-time voice interactions daily for over 150 banking, insurance and financial services companies across India and the U.S. The model supports 14 languages and was trained on over 14 million hours of conversational speech data using NVIDIA Hopper GPUs and NeMo Framework.

Gnani.ai uses TensorRT-LLM, Triton Inference Server and Riva NIM microservices to optimize its AI for virtual customer service assistants and speech analytics.

Large enterprises building LLMs with NeMo include:

  • Flipkart, a major Indian ecommerce company majority-owned by Walmart, is integrating NeMo Guardrails, an open-source toolkit that enables developers to add programmable guardrails to LLMs, to enhance the safety of its conversational AI systems.
  • Krutrim, part of the Ola Group of businesses that includes one of India’s top ride-booking platforms, is developing a multilingual Indic foundation model using Mistral NeMo 12B, a state-of-the-art LLM developed by Mistral AI and NVIDIA.
  • Zoho Corporation, a global technology company based in Chennai, will use NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server to optimize and deliver language models for its over 700,000 customers. The company will use NeMo running on NVIDIA Hopper GPUs to pretrain narrow, small, medium and large models from scratch for over 100 business applications.

India’s top global systems integrators are also offering NVIDIA NeMo-accelerated solutions to their customers.

  • Infosys will work on specific tools and solutions using the NVIDIA AI stack. The company’s center of excellence is also developing AI-powered small language models that will be offered to customers as a service.
  • Tata Consultancy Services has developed AI solutions based on NVIDIA NIM Agent Blueprints for the telecommunications, retail, manufacturing, automotive and financial services industries. TCS’ offerings include NeMo-powered, domain-specific language models that can be customized to address customer queries and answer company-specific questions from employees across enterprise functions such as IT, HR and field operations.
  • Wipro is using NVIDIA AI Enterprise software including NIM Agent Blueprints and NeMo to help businesses easily develop custom conversational AI solutions such as digital humans to support customer service interactions.

Wipro and TCS also use NeMo Curator’s synthetic data generation pipelines to generate data in languages other than English to customize LLMs for their clients.

To learn more about NVIDIA’s collaboration with businesses and developers in India, watch the replay of company founder and CEO Jensen Huang’s fireside chat at the NVIDIA AI Summit.

How to Accelerate Larger LLMs Locally on RTX With LM Studio

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for GeForce RTX PC and NVIDIA RTX workstation users.

Large language models (LLMs) are reshaping productivity. They’re capable of drafting documents, summarizing web pages and, having been trained on vast quantities of data, accurately answering questions about nearly any topic.

LLMs are at the core of many emerging use cases in generative AI, including digital assistants, conversational avatars and customer service agents.

Many of the latest LLMs can run locally on PCs or workstations. This is useful for a variety of reasons: users can keep conversations and content private on-device, use AI without the internet, or simply take advantage of the powerful NVIDIA GeForce RTX GPUs in their system. Other models, because of their size and complexity, don’t fit into the local GPU’s video memory (VRAM) and require hardware in large data centers.

However, it’s possible to accelerate part of the workload for a data-center-class model locally on RTX-powered PCs using a technique called GPU offloading. This allows users to benefit from GPU acceleration without being as limited by GPU memory constraints.

Size and Quality vs. Performance

There’s a tradeoff between model size, response quality and performance. In general, larger models deliver higher-quality responses but run more slowly. With smaller models, performance goes up while quality goes down.

This tradeoff isn’t always straightforward. For some use cases, performance matters more than quality; for others, the reverse. Users may prioritize accuracy for tasks like content generation, since it can run in the background. A conversational assistant, meanwhile, needs to be fast while still providing accurate responses.

The most accurate LLMs, designed to run in the data center, are tens of gigabytes in size, and may not fit in a GPU’s memory. This would traditionally prevent the application from taking advantage of GPU acceleration.

However, GPU offloading uses part of the LLM on the GPU and part on the CPU. This allows users to take maximum advantage of GPU acceleration regardless of model size.

Optimize AI Acceleration With GPU Offloading and LM Studio

LM Studio is an application that lets users download and host LLMs on their desktop or laptop computer, with an easy-to-use interface that allows for extensive customization in how those models operate. LM Studio is built on top of llama.cpp, so it’s fully optimized for use with GeForce RTX and NVIDIA RTX GPUs.

LM Studio’s GPU offloading takes advantage of GPU acceleration to boost the performance of a locally hosted LLM, even if the model can’t be fully loaded into VRAM.

With GPU offloading, LM Studio divides the model into smaller chunks, or “subgraphs,” which represent layers of the model architecture. Subgraphs aren’t permanently fixed on the GPU, but loaded and unloaded as needed. With LM Studio’s GPU offloading slider, users can decide how many of these layers are processed by the GPU.

LM Studio’s interface makes it easy to decide how much of an LLM should be loaded to the GPU.
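
Since LM Studio is built on llama.cpp, the layer-offloading mechanism behind the slider can also be illustrated with the llama-cpp-python bindings, as a hedged sketch rather than LM Studio’s own code. The model path and layer count below are placeholders; in LM Studio itself the same choice is made with the GPU offload slider.

```python
# Hedged sketch of layer-level GPU offloading with llama-cpp-python, the same
# mechanism LM Studio exposes through its GPU offload slider. The model path
# and layer count are placeholders for illustration.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2-27b-it-Q4_K_M.gguf",  # placeholder: any local GGUF file
    n_gpu_layers=24,   # number of model layers to place on the GPU; 0 = CPU only
    n_ctx=4096,        # context window
)

output = llm("Summarize GPU offloading in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```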

For example, imagine using this GPU offloading technique with a large model like Gemma 2 27B. “27B” refers to the number of parameters in the model, informing an estimate as to how much memory is required to run the model.

With 4-bit quantization, a technique for reducing the size of an LLM without significantly reducing accuracy, each parameter takes up half a byte of memory. This means the model’s weights alone should require about 13.5 billion bytes, or 13.5GB, plus some overhead, which generally ranges from 1-5GB.
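
That back-of-the-envelope estimate can be written out directly; the overhead range in the sketch below is the rough 1-5GB figure mentioned above, not a measured value.

```python
# Back-of-the-envelope VRAM estimate for a 4-bit quantized model.
params = 27e9            # Gemma 2 27B parameters
bytes_per_param = 0.5    # 4-bit quantization: half a byte per parameter
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.1f} GB")                            # ~13.5 GB
print(f"With overhead: ~{weights_gb + 1:.1f}-{weights_gb + 5:.1f} GB")   # ~14.5-18.5 GB
```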

Accelerating this model entirely on the GPU requires 19GB of VRAM, available on the GeForce RTX 4090 desktop GPU. With GPU offloading, the model can run on a system with a lower-end GPU and still benefit from acceleration.

The table above shows how to run several popular models of increasing size across a range of GeForce RTX and NVIDIA RTX GPUs. The maximum level of GPU offload is indicated for each combination. Note that even with GPU offloading, users still need enough system RAM to fit the whole model.

In LM Studio, it’s possible to assess the performance impact of different levels of GPU offloading, compared with CPU only. The below table shows the results of running the same query across different offloading levels on a GeForce RTX 4090 desktop GPU.

Depending on the percentage of the model offloaded to the GPU, users see increasing throughput compared with running on the CPU alone. For the Gemma 2 27B model, performance goes from an anemic 2.1 tokens per second to increasingly usable speeds as more of the model runs on the GPU. This enables users to benefit from the performance of larger models that they otherwise would’ve been unable to run.

On this particular model, even users with an 8GB GPU can enjoy a meaningful speedup versus running only on CPUs. Of course, an 8GB GPU can always run a smaller model that fits entirely in GPU memory and get full GPU acceleration.

Achieving Optimal Balance

LM Studio’s GPU offloading feature is a powerful tool for unlocking the full potential of LLMs designed for the data center, like Gemma 2 27B, locally on RTX AI PCs. It makes larger, more complex models accessible across the entire lineup of PCs powered by GeForce RTX and NVIDIA RTX GPUs.

Download LM Studio to try GPU offloading on larger models, or experiment with a variety of RTX-accelerated LLMs running locally on RTX AI PCs and workstations.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Read More

How to Accelerate Larger LLMs Locally on RTX With LM Studio

How to Accelerate Larger LLMs Locally on RTX With LM Studio

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for GeForce RTX PC and NVIDIA RTX workstation users.

Large language models (LLMs) are reshaping productivity. They’re capable of drafting documents, summarizing web pages and, having been trained on vast quantities of data, accurately answering questions about nearly any topic.

LLMs are at the core of many emerging use cases in generative AI, including digital assistants, conversational avatars and customer service agents.

Many of the latest LLMs can run locally on PCs or workstations. This is useful for a variety of reasons: users can keep conversations and content private on-device, use AI without the internet, or simply take advantage of the powerful NVIDIA GeForce RTX GPUs in their system. Other models, because of their size and complexity, do no’t fit into the local GPU’s video memory (VRAM) and require hardware in large data centers.

However, Iit i’s possible to accelerate part of a prompt on a data-center-class model locally on RTX-powered PCs using a technique called GPU offloading. This allows users to benefit from GPU acceleration without being as limited by GPU memory constraints.

Size and Quality vs. Performance

There’s a tradeoff between the model size and the quality of responses and the performance. In general, larger models deliver higher-quality responses, but run more slowly. With smaller models, performance goes up while quality goes down.

This tradeoff isn’t always straightforward. There are cases where performance might be more important than quality. Some users may prioritize accuracy for use cases like content generation, since it can run in the background. A conversational assistant, meanwhile, needs to be fast while also providing accurate responses.

The most accurate LLMs, designed to run in the data center, are tens of gigabytes in size, and may not fit in a GPU’s memory. This would traditionally prevent the application from taking advantage of GPU acceleration.

However, GPU offloading uses part of the LLM on the GPU and part on the CPU. This allows users to take maximum advantage of GPU acceleration regardless of model size.

Optimize AI Acceleration With GPU Offloading and LM Studio

LM Studio is an application that lets users download and host LLMs on their desktop or laptop computer, with an easy-to-use interface that allows for extensive customization in how those models operate. LM Studio is built on top of llama.cpp, so it’s fully optimized for use with GeForce RTX and NVIDIA RTX GPUs.

LM Studio and GPU offloading takes advantage of GPU acceleration to boost the performance of a locally hosted LLM, even if the model can’t be fully loaded into VRAM.

With GPU offloading, LM Studio divides the model into smaller chunks, or “subgraphs,” which represent layers of the model architecture. Subgraphs aren’t permanently fixed on the GPU, but loaded and unloaded as needed. With LM Studio’s GPU offloading slider, users can decide how many of these layers are processed by the GPU.

LM Studio’s interface makes it easy to decide how much of an LLM should be loaded to the GPU.

For example, imagine using this GPU offloading technique with a large model like Gemma 2 27B. “27B” refers to the number of parameters in the model, informing an estimate as to how much memory is required to run the model.

According to 4-bit quantization, a technique for reducing the size of an LLM without significantly reducing accuracy, each parameter takes up a half byte of memory. This means that the model should require about 13.5 billion bytes, or 13.5GB — plus some overhead, which generally ranges from 1-5GB.

Accelerating this model entirely on the GPU requires 19GB of VRAM, available on the GeForce RTX 4090 desktop GPU. With GPU offloading, the model can run on a system with a lower-end GPU and still benefit from acceleration.

The table above shows how to run several popular models of increasing size across a range of GeForce RTX and NVIDIA RTX GPUs. The maximum level of GPU offload is indicated for each combination. Note that even with GPU offloading, users still need enough system RAM to fit the whole model.

In LM Studio, it’s possible to assess the performance impact of different levels of GPU offloading, compared with CPU only. The below table shows the results of running the same query across different offloading levels on a GeForce RTX 4090 desktop GPU.

Depending on the percent of the model offloaded to GPU, users see increasing throughput performance compared with running on CPUs alone. For the Gemma 2 27B model, performance goes from an anemic 2.1 tokens per second to increasingly usable speeds the more the GPU is used. This enables users to benefit from the performance of larger models that they otherwise would’ve been unable to run.

On this particular model, even users with an 8GB GPU can enjoy a meaningful speedup versus running only on CPUs. Of course, an 8GB GPU can always run a smaller model that fits entirely in GPU memory and get full GPU acceleration.

Achieving Optimal Balance

LM Studio’s GPU offloading feature is a powerful tool for unlocking the full potential of LLMs designed for the data center, like Gemma 2 27B, locally on RTX AI PCs. It makes larger, more complex models accessible across the entire lineup of PCs powered by GeForce RTX and NVIDIA RTX GPUs.

Download LM Studio to try GPU offloading on larger models, or experiment with a variety of RTX-accelerated LLMs running locally on RTX AI PCs and workstations.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Read More

How to Accelerate Larger LLMs Locally on RTX With LM Studio

How to Accelerate Larger LLMs Locally on RTX With LM Studio

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for GeForce RTX PC and NVIDIA RTX workstation users.

Large language models (LLMs) are reshaping productivity. They’re capable of drafting documents, summarizing web pages and, having been trained on vast quantities of data, accurately answering questions about nearly any topic.

LLMs are at the core of many emerging use cases in generative AI, including digital assistants, conversational avatars and customer service agents.

Many of the latest LLMs can run locally on PCs or workstations. This is useful for a variety of reasons: users can keep conversations and content private on-device, use AI without the internet, or simply take advantage of the powerful NVIDIA GeForce RTX GPUs in their system. Other models, because of their size and complexity, do no’t fit into the local GPU’s video memory (VRAM) and require hardware in large data centers.

However, Iit i’s possible to accelerate part of a prompt on a data-center-class model locally on RTX-powered PCs using a technique called GPU offloading. This allows users to benefit from GPU acceleration without being as limited by GPU memory constraints.

Size and Quality vs. Performance

There’s a tradeoff between the model size and the quality of responses and the performance. In general, larger models deliver higher-quality responses, but run more slowly. With smaller models, performance goes up while quality goes down.

This tradeoff isn’t always straightforward. There are cases where performance might be more important than quality. Some users may prioritize accuracy for use cases like content generation, since it can run in the background. A conversational assistant, meanwhile, needs to be fast while also providing accurate responses.

The most accurate LLMs, designed to run in the data center, are tens of gigabytes in size, and may not fit in a GPU’s memory. This would traditionally prevent the application from taking advantage of GPU acceleration.

However, GPU offloading uses part of the LLM on the GPU and part on the CPU. This allows users to take maximum advantage of GPU acceleration regardless of model size.

Optimize AI Acceleration With GPU Offloading and LM Studio

LM Studio is an application that lets users download and host LLMs on their desktop or laptop computer, with an easy-to-use interface that allows for extensive customization in how those models operate. LM Studio is built on top of llama.cpp, so it’s fully optimized for use with GeForce RTX and NVIDIA RTX GPUs.

LM Studio and GPU offloading takes advantage of GPU acceleration to boost the performance of a locally hosted LLM, even if the model can’t be fully loaded into VRAM.

With GPU offloading, LM Studio divides the model into smaller chunks, or “subgraphs,” which represent layers of the model architecture. Subgraphs aren’t permanently fixed on the GPU, but loaded and unloaded as needed. With LM Studio’s GPU offloading slider, users can decide how many of these layers are processed by the GPU.

LM Studio’s interface makes it easy to decide how much of an LLM should be loaded to the GPU.

For example, imagine using this GPU offloading technique with a large model like Gemma 2 27B. “27B” refers to the number of parameters in the model, informing an estimate as to how much memory is required to run the model.

According to 4-bit quantization, a technique for reducing the size of an LLM without significantly reducing accuracy, each parameter takes up a half byte of memory. This means that the model should require about 13.5 billion bytes, or 13.5GB — plus some overhead, which generally ranges from 1-5GB.

Accelerating this model entirely on the GPU requires 19GB of VRAM, available on the GeForce RTX 4090 desktop GPU. With GPU offloading, the model can run on a system with a lower-end GPU and still benefit from acceleration.

The table above shows how to run several popular models of increasing size across a range of GeForce RTX and NVIDIA RTX GPUs. The maximum level of GPU offload is indicated for each combination. Note that even with GPU offloading, users still need enough system RAM to fit the whole model.

In LM Studio, it’s possible to assess the performance impact of different levels of GPU offloading, compared with CPU only. The below table shows the results of running the same query across different offloading levels on a GeForce RTX 4090 desktop GPU.

Depending on the percent of the model offloaded to GPU, users see increasing throughput performance compared with running on CPUs alone. For the Gemma 2 27B model, performance goes from an anemic 2.1 tokens per second to increasingly usable speeds the more the GPU is used. This enables users to benefit from the performance of larger models that they otherwise would’ve been unable to run.

On this particular model, even users with an 8GB GPU can enjoy a meaningful speedup versus running only on CPUs. Of course, an 8GB GPU can always run a smaller model that fits entirely in GPU memory and get full GPU acceleration.

Achieving Optimal Balance

LM Studio’s GPU offloading feature is a powerful tool for unlocking the full potential of LLMs designed for the data center, like Gemma 2 27B, locally on RTX AI PCs. It makes larger, more complex models accessible across the entire lineup of PCs powered by GeForce RTX and NVIDIA RTX GPUs.

Download LM Studio to try GPU offloading on larger models, or experiment with a variety of RTX-accelerated LLMs running locally on RTX AI PCs and workstations.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Read More

How to Accelerate Larger LLMs Locally on RTX With LM Studio

How to Accelerate Larger LLMs Locally on RTX With LM Studio

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for GeForce RTX PC and NVIDIA RTX workstation users.

Large language models (LLMs) are reshaping productivity. They’re capable of drafting documents, summarizing web pages and, having been trained on vast quantities of data, accurately answering questions about nearly any topic.

LLMs are at the core of many emerging use cases in generative AI, including digital assistants, conversational avatars and customer service agents.

Many of the latest LLMs can run locally on PCs or workstations. This is useful for a variety of reasons: users can keep conversations and content private on-device, use AI without the internet, or simply take advantage of the powerful NVIDIA GeForce RTX GPUs in their system. Other models, because of their size and complexity, do no’t fit into the local GPU’s video memory (VRAM) and require hardware in large data centers.

However, Iit i’s possible to accelerate part of a prompt on a data-center-class model locally on RTX-powered PCs using a technique called GPU offloading. This allows users to benefit from GPU acceleration without being as limited by GPU memory constraints.

Size and Quality vs. Performance

There’s a tradeoff between the model size and the quality of responses and the performance. In general, larger models deliver higher-quality responses, but run more slowly. With smaller models, performance goes up while quality goes down.

This tradeoff isn’t always straightforward. There are cases where performance might be more important than quality. Some users may prioritize accuracy for use cases like content generation, since it can run in the background. A conversational assistant, meanwhile, needs to be fast while also providing accurate responses.

The most accurate LLMs, designed to run in the data center, are tens of gigabytes in size, and may not fit in a GPU’s memory. This would traditionally prevent the application from taking advantage of GPU acceleration.

However, GPU offloading uses part of the LLM on the GPU and part on the CPU. This allows users to take maximum advantage of GPU acceleration regardless of model size.

Optimize AI Acceleration With GPU Offloading and LM Studio

LM Studio is an application that lets users download and host LLMs on their desktop or laptop computer, with an easy-to-use interface that allows for extensive customization in how those models operate. LM Studio is built on top of llama.cpp, so it’s fully optimized for use with GeForce RTX and NVIDIA RTX GPUs.

LM Studio and GPU offloading takes advantage of GPU acceleration to boost the performance of a locally hosted LLM, even if the model can’t be fully loaded into VRAM.

With GPU offloading, LM Studio divides the model into smaller chunks, or “subgraphs,” which represent layers of the model architecture. Subgraphs aren’t permanently fixed on the GPU, but loaded and unloaded as needed. With LM Studio’s GPU offloading slider, users can decide how many of these layers are processed by the GPU.

LM Studio’s interface makes it easy to decide how much of an LLM should be loaded to the GPU.

For example, imagine using this GPU offloading technique with a large model like Gemma 2 27B. “27B” refers to the number of parameters in the model, informing an estimate as to how much memory is required to run the model.

According to 4-bit quantization, a technique for reducing the size of an LLM without significantly reducing accuracy, each parameter takes up a half byte of memory. This means that the model should require about 13.5 billion bytes, or 13.5GB — plus some overhead, which generally ranges from 1-5GB.

Accelerating this model entirely on the GPU requires 19GB of VRAM, available on the GeForce RTX 4090 desktop GPU. With GPU offloading, the model can run on a system with a lower-end GPU and still benefit from acceleration.

The table above shows how to run several popular models of increasing size across a range of GeForce RTX and NVIDIA RTX GPUs. The maximum level of GPU offload is indicated for each combination. Note that even with GPU offloading, users still need enough system RAM to fit the whole model.

In LM Studio, it’s possible to assess the performance impact of different levels of GPU offloading, compared with CPU only. The below table shows the results of running the same query across different offloading levels on a GeForce RTX 4090 desktop GPU.

Depending on the percent of the model offloaded to GPU, users see increasing throughput performance compared with running on CPUs alone. For the Gemma 2 27B model, performance goes from an anemic 2.1 tokens per second to increasingly usable speeds the more the GPU is used. This enables users to benefit from the performance of larger models that they otherwise would’ve been unable to run.

On this particular model, even users with an 8GB GPU can enjoy a meaningful speedup versus running only on CPUs. Of course, an 8GB GPU can always run a smaller model that fits entirely in GPU memory and get full GPU acceleration.

Achieving Optimal Balance

LM Studio’s GPU offloading feature is a powerful tool for unlocking the full potential of LLMs designed for the data center, like Gemma 2 27B, locally on RTX AI PCs. It makes larger, more complex models accessible across the entire lineup of PCs powered by GeForce RTX and NVIDIA RTX GPUs.

Download LM Studio to try GPU offloading on larger models, or experiment with a variety of RTX-accelerated LLMs running locally on RTX AI PCs and workstations.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Read More

How to Accelerate Larger LLMs Locally on RTX With LM Studio

How to Accelerate Larger LLMs Locally on RTX With LM Studio

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for GeForce RTX PC and NVIDIA RTX workstation users.

Large language models (LLMs) are reshaping productivity. They’re capable of drafting documents, summarizing web pages and, having been trained on vast quantities of data, accurately answering questions about nearly any topic.

LLMs are at the core of many emerging use cases in generative AI, including digital assistants, conversational avatars and customer service agents.

Many of the latest LLMs can run locally on PCs or workstations. This is useful for a variety of reasons: users can keep conversations and content private on-device, use AI without the internet, or simply take advantage of the powerful NVIDIA GeForce RTX GPUs in their system. Other models, because of their size and complexity, do no’t fit into the local GPU’s video memory (VRAM) and require hardware in large data centers.

However, Iit i’s possible to accelerate part of a prompt on a data-center-class model locally on RTX-powered PCs using a technique called GPU offloading. This allows users to benefit from GPU acceleration without being as limited by GPU memory constraints.

Size and Quality vs. Performance

There’s a tradeoff between the model size and the quality of responses and the performance. In general, larger models deliver higher-quality responses, but run more slowly. With smaller models, performance goes up while quality goes down.

This tradeoff isn’t always straightforward. There are cases where performance might be more important than quality. Some users may prioritize accuracy for use cases like content generation, since it can run in the background. A conversational assistant, meanwhile, needs to be fast while also providing accurate responses.

The most accurate LLMs, designed to run in the data center, are tens of gigabytes in size, and may not fit in a GPU’s memory. This would traditionally prevent the application from taking advantage of GPU acceleration.

However, GPU offloading uses part of the LLM on the GPU and part on the CPU. This allows users to take maximum advantage of GPU acceleration regardless of model size.

Optimize AI Acceleration With GPU Offloading and LM Studio

LM Studio is an application that lets users download and host LLMs on their desktop or laptop computer, with an easy-to-use interface that allows for extensive customization in how those models operate. LM Studio is built on top of llama.cpp, so it’s fully optimized for use with GeForce RTX and NVIDIA RTX GPUs.

LM Studio and GPU offloading takes advantage of GPU acceleration to boost the performance of a locally hosted LLM, even if the model can’t be fully loaded into VRAM.

With GPU offloading, LM Studio divides the model into smaller chunks, or “subgraphs,” which represent layers of the model architecture. Subgraphs aren’t permanently fixed on the GPU, but loaded and unloaded as needed. With LM Studio’s GPU offloading slider, users can decide how many of these layers are processed by the GPU.

LM Studio’s interface makes it easy to decide how much of an LLM should be loaded to the GPU.
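The same layer-level control is available programmatically through llama.cpp, the engine LM Studio is built on. The sketch below uses the llama-cpp-python bindings, whose n_gpu_layers argument plays the role of the slider; the GGUF path and layer count are placeholders.

```python
# Rough sketch of layer-level GPU offloading with llama-cpp-python, the Python
# bindings for llama.cpp. The GGUF path and layer count are placeholders --
# tune them to your model and GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gemma-2-27b-it-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=24,   # number of model layers to place on the GPU (-1 = all)
    n_ctx=4096,        # context window
)

output = llm("Explain why partial GPU offloading still helps.", max_tokens=128)
print(output["choices"][0]["text"])
```

Raising n_gpu_layers keeps more layers in VRAM and generally increases throughput, which is the behavior the slider exposes in the LM Studio interface.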

For example, imagine using this GPU offloading technique with a large model like Gemma 2 27B. “27B” refers to the number of parameters in the model, which gives an estimate of how much memory is required to run it.

With 4-bit quantization, a technique for reducing the size of an LLM without significantly reducing accuracy, each parameter takes up half a byte of memory. A 27-billion-parameter model therefore requires about 13.5 billion bytes, or 13.5GB, for its weights, plus some overhead, which generally ranges from 1-5GB.
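The arithmetic is easy to check; the short calculation below simply reproduces the estimate described above.

```python
# Back-of-the-envelope memory estimate for a 4-bit-quantized 27B model,
# reproducing the figures in the text.
params = 27e9                 # 27 billion parameters
bytes_per_param = 0.5         # 4-bit quantization = half a byte per weight
weights_gb = params * bytes_per_param / 1e9

overhead_low, overhead_high = 1, 5   # rough range for KV cache and runtime buffers
print(f"Weights: ~{weights_gb:.1f} GB")
print(f"Estimated total: ~{weights_gb + overhead_low:.1f}-{weights_gb + overhead_high:.1f} GB")
```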

Accelerating this model entirely on the GPU requires 19GB of VRAM, available on the GeForce RTX 4090 desktop GPU. With GPU offloading, the model can run on a system with a lower-end GPU and still benefit from acceleration.

The table above shows how to run several popular models of increasing size across a range of GeForce RTX and NVIDIA RTX GPUs. The maximum level of GPU offload is indicated for each combination. Note that even with GPU offloading, users still need enough system RAM to fit the whole model.

In LM Studio, it’s possible to assess the performance impact of different levels of GPU offloading compared with running on the CPU only. The table below shows the results of running the same query across different offloading levels on a GeForce RTX 4090 desktop GPU.

Depending on the percentage of the model offloaded to the GPU, users see increasing throughput compared with running on the CPU alone. For the Gemma 2 27B model, performance rises from an anemic 2.1 tokens per second to increasingly usable speeds as more of the model runs on the GPU. This lets users benefit from larger models that they otherwise would’ve been unable to run.
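For readers who want to reproduce this kind of comparison outside the LM Studio interface, a rough benchmark can be scripted against llama.cpp directly. The sketch below times streamed generation and reports tokens per second; the model path and layer count are placeholders, and each streamed chunk is counted as roughly one token.

```python
# Rough tokens-per-second benchmark with llama-cpp-python. Vary n_gpu_layers
# to compare offloading levels; the path and values are placeholders.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/gemma-2-27b-it-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=24,      # offloading level to benchmark
    n_ctx=2048,
    verbose=False,
)

start = time.perf_counter()
tokens = 0
for _ in llm("Explain GPU offloading in one paragraph.", max_tokens=256, stream=True):
    tokens += 1           # each streamed chunk is roughly one generated token
elapsed = time.perf_counter() - start

print(f"{tokens} tokens in {elapsed:.1f} s -> {tokens / elapsed:.1f} tokens/s")
```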

On this particular model, even users with an 8GB GPU can enjoy a meaningful speedup versus running only on CPUs. Of course, an 8GB GPU can always run a smaller model that fits entirely in GPU memory and get full GPU acceleration.

Achieving Optimal Balance

LM Studio’s GPU offloading feature is a powerful tool for unlocking the full potential of LLMs designed for the data center, like Gemma 2 27B, locally on RTX AI PCs. It makes larger, more complex models accessible across the entire lineup of PCs powered by GeForce RTX and NVIDIA RTX GPUs.

Download LM Studio to try GPU offloading on larger models, or experiment with a variety of RTX-accelerated LLMs running locally on RTX AI PCs and workstations.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Read More