NeMo Retriever Llama 3.2 text embedding and reranking NVIDIA NIM microservices now available in Amazon SageMaker JumpStart

Today, we are excited to announce that the NeMo Retriever Llama3.2 Text Embedding and Reranking NVIDIA NIM microservices are available in Amazon SageMaker JumpStart. With this launch, you can now deploy NVIDIA’s optimized reranking and embedding models to build, experiment, and responsibly scale your generative AI ideas on AWS.

In this post, we demonstrate how to get started with these models on SageMaker JumpStart.

About NVIDIA NIM on AWS

NVIDIA NIM microservices integrate closely with AWS managed services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to enable the deployment of generative AI models at scale. As part of NVIDIA AI Enterprise available in AWS Marketplace, NIM is a set of user-friendly microservices designed to streamline and accelerate the deployment of generative AI. These prebuilt containers support a broad spectrum of generative AI models, from open source community models to NVIDIA AI foundation models (FMs) and custom models. NIM microservices provide straightforward integration into generative AI applications using industry-standard APIs and can be deployed with just a few lines of code, or with a few clicks on the SageMaker JumpStart console. Engineered to facilitate seamless generative AI inferencing at scale, NIM helps you deploy your generative AI applications.

Overview of NVIDIA NeMo Retriever NIM microservices

In this section, we provide an overview of the NVIDIA NeMo Retriever NIM microservices discussed in this post.

NeMo Retriever text embedding NIM

The NVIDIA NeMo Retriever Llama3.2 embedding NIM is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). This model was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. In addition to enabling multilingual and cross-lingual question-answering retrieval, this model reduces the data storage footprint by 35-fold through dynamic embedding sizing and support for longer token length, making it feasible to handle large-scale datasets efficiently.
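
Matryoshka embeddings are designed so that a prefix of the full vector is itself a usable lower-dimensional embedding. The following minimal sketch (plain NumPy, independent of the NIM API) illustrates the typical client-side pattern of truncating a returned embedding and re-normalizing it before indexing; the helper name, the stand-in vector, and the 384-dimension target are illustrative assumptions, not model requirements.

import numpy as np

def truncate_embedding(embedding, dim=384):
    """Keep the first `dim` components of a Matryoshka-style embedding
    and L2-normalize the result so cosine similarity still applies."""
    vec = np.asarray(embedding, dtype=np.float32)[:dim]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Stand-in vector; in practice, use an embedding returned by the endpoint
full_embedding = np.random.rand(2048).tolist()
small_embedding = truncate_embedding(full_embedding, dim=384)
print(small_embedding.shape)  # (384,)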

NeMo Retriever text reranking NIM

The NVIDIA NeMo Retriever Llama3.2 reranking NIM is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens). This model was evaluated on the same 26 languages mentioned earlier.

SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art FMs for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks.

Solution overview

You can now discover and deploy the NeMo Retriever text embedding and reranking NIM microservices in Amazon SageMaker Studio or programmatically through the Amazon SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and in your virtual private cloud (VPC), helping to support data security for enterprise security needs.

In the following sections, we demonstrate how to deploy these microservices and run real-time and batch inference.

Make sure your SageMaker AWS Identity and Access Management (IAM) service role has the AmazonSageMakerFullAccess permission policy attached.

To deploy NeMo Retriever Llama3.2 embedding and reranking microservices successfully, confirm one of the following:

  • Make sure your IAM role has the following permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used (a sample policy setup sketch follows this list):
    • aws-marketplace:ViewSubscriptions
    • aws-marketplace:Unsubscribe
    • aws-marketplace:Subscribe
  • Alternatively, confirm your AWS account has a subscription to the model. If so, you can skip the following deployment instructions and start at the Subscribe to the model package section.
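
If you prefer to set these permissions programmatically, the following sketch attaches the AmazonSageMakerFullAccess managed policy and adds the AWS Marketplace actions as an inline policy using Boto3. The role name and inline policy name are placeholders; substitute the SageMaker execution role used in your account.

import json
import boto3

iam = boto3.client("iam")
role_name = "MySageMakerExecutionRole"  # placeholder: your SageMaker execution role

# Attach the AmazonSageMakerFullAccess managed policy
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
)

# Add the AWS Marketplace permissions required to subscribe to the model packages
marketplace_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:ViewSubscriptions",
                "aws-marketplace:Unsubscribe",
                "aws-marketplace:Subscribe",
            ],
            "Resource": "*",
        }
    ],
}
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="NimMarketplaceSubscriptionAccess",  # placeholder name
    PolicyDocument=json.dumps(marketplace_policy),
)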

Deploy NeMo Retriever microservices on SageMaker JumpStart

For those new to SageMaker JumpStart, we demonstrate using SageMaker Studio to access models on SageMaker JumpStart. The following screenshot shows the NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

Deployment starts when you choose the Deploy option. You might be prompted to subscribe to this model through AWS Marketplace. If you are already subscribed, then you can move forward with choosing the second Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

Deploy the NeMo Retriever microservice.

Subscribe to the model package

To subscribe to the model package, complete the following steps:

  1. Depending on the model you want to deploy, open the model package listing page for Llama-3.2-NV-EmbedQA-1B-v2 or Llama-3.2-NV-RerankQA-1B-v2.
  2. On the AWS Marketplace listing, choose Continue to subscribe.
  3. On the Subscribe to this software page, review and choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
  4. Choose Continue to configuration and then choose an AWS Region.

A product Amazon Resource Name (ARN) will be displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

Deploy NeMo Retriever microservices using the SageMaker SDK

In this section, we walk through deploying the NeMo Retriever text embedding NIM through the SageMaker SDK. A similar process can be followed for deploying the NeMo Retriever text reranking NIM as well.

Define the SageMaker model using the model package ARN

To deploy the model using the SDK, copy the product ARN from the previous step and specify it in the model_package_arn in the following code:

import json
import time

import boto3
import sagemaker

# SageMaker clients and execution role used throughout this walkthrough
# (assumes this code runs in a SageMaker notebook or Studio environment)
sm = boto3.client("sagemaker")
client = boto3.client("sagemaker-runtime")
role = sagemaker.get_execution_role()

# Define the model details
model_package_arn = "Specify the model package ARN here"
sm_model_name = "nim-llama-3-2-nv-embedqa-1b-v2"

# Create the SageMaker model from the AWS Marketplace model package
create_model_response = sm.create_model(
    ModelName=sm_model_name,
    PrimaryContainer={
        'ModelPackageName': model_package_arn
    },
    ExecutionRoleArn=role,
    EnableNetworkIsolation=True
)
print("Model Arn: " + create_model_response["ModelArn"])

Create the endpoint configuration

Next, we create an endpoint configuration specifying the instance type; in this case, we use an ml.g5.2xlarge instance type accelerated by NVIDIA A10G GPUs. Make sure your account-level service quota allows at least one ml.g5.2xlarge instance for endpoint usage. To request a service quota increase, refer to AWS service quotas. For further performance improvements, you can use NVIDIA Hopper GPUs (P5 instances) on SageMaker.

# Create the endpoint configuration
endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': sm_model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g5.2xlarge',
            'InferenceAmiVersion': 'al2-ami-sagemaker-inference-gpu-2',
            'RoutingConfig': {'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'},
            'ModelDataDownloadTimeoutInSeconds': 3600,  # Model download timeout in seconds
            'ContainerStartupHealthCheckTimeoutInSeconds': 3600,  # Container health check timeout in seconds
        }
    ]
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Create the endpoint

Using the preceding endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService after the deployment is successful.

# Create the endpoint
endpoint_name = endpoint_config_name
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Deploy the NIM microservice

The NIM container is deployed as part of endpoint creation. Use the following code to wait for the deployment to complete:

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

We get the following output:

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:611951037680:endpoint/nim-llama-3-2-nv-embedqa-1b-v2
Status: InService

After you deploy the model, your endpoint is ready for inference. In the following section, we use a sample text to do an inference request. For inference request format, NIM on SageMaker supports the OpenAI API inference protocol (at the time of writing). For an explanation of supported parameters, see Create an embedding vector from the input text.

Inference example with NeMo Retriever text embedding NIM microservice

The NVIDIA NeMo Retriever Llama3.2 embedding model is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). In this section, we provide examples of running real-time inference and batch inference.

Real-time inference example

The following code example illustrates how to perform real-time inference using the NeMo Retriever Llama3.2 embedding model:

import json
import pprint

pp1 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=3)

input_embedding = '''{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}'''

print("Example input data for embedding model endpoint:")
print(input_embedding)

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=input_embedding
)

print("\nEmbedding endpoint response:")
response = json.load(response["Body"])
pp1.pprint(response)

We get the following output:

Example input data for embedding model endpoint:
{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2", 
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}

Embedding endpoint response:
{ 'data': [ {'embedding': [...], 'index': 0, 'object': 'embedding'},
            {'embedding': [...], 'index': 1, 'object': 'embedding'}],
  'model': 'nvidia/llama-3.2-nv-embedqa-1b-v2',
  'object': 'list',
  'usage': {'prompt_tokens': 14, 'total_tokens': 14}}

Batch inference example

When you have many documents, you can vectorize each of them with a for loop. This will often result in many requests. Alternatively, you can send requests consisting of batches of documents to reduce the number of requests to the API endpoint. We use the following example with a dataset of 10 documents. Let’s test the model with a number of documents in different languages:

documents = [
    "El futuro de la computación cuántica en aplicaciones criptográficas.",
    "L’application des réseaux neuronaux dans les systèmes de véhicules autonomes.",
    "Analyse der Rolle von Big Data in personalisierten Gesundheitslösungen.",
    "L’evoluzione del cloud computing nello sviluppo di software aziendale.",
    "Avaliando o impacto da IoT na infraestrutura de cidades inteligentes.",
    "Потенциал граничных вычислений для обработки данных в реальном времени.",
    "评估人工智能在欺诈检测系统中的有效性。",
    "倫理的なAIアルゴリズムの開発における課題と機会。",
    "دمج تقنية الجيل الخامس (5G) في تعزيز الاتصال بالإنترنت للأشياء (IoT).",
    "सुरक्षित लेनदेन के लिए बायोमेट्रिक प्रमाणीकरण विधियों में प्रगति।"
]

The following code demonstrates how to group the documents into batches and invoke the endpoint repeatedly to vectorize the whole dataset. Specifically, the example code loops over the 10 documents in batches of size 5 (batch_size=5).

pp2 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=2)

encoded_data = []
batch_size = 5

# Loop over the documents in increments of the batch size
for i in range(0, len(documents), batch_size):
    input = json.dumps({
        "input": documents[i:i+batch_size],
        "input_type": "passage",
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    })

    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=input,
    )

    response = json.load(response["Body"])

    # Concatenate vectors into a single list; preserve the original index
    encoded_data.extend({"embedding": data[1]["embedding"], "index": data[0]} for
                        data in zip(range(i, i+batch_size), response["data"]))

# Print the response data
pp2.pprint(encoded_data)

We get the following output:

[ {'embedding': [...], 'index': 0}, {'embedding': [...], 'index': 1},
  {'embedding': [...], 'index': 2}, {'embedding': [...], 'index': 3},
  {'embedding': [...], 'index': 4}, {'embedding': [...], 'index': 5},
  {'embedding': [...], 'index': 6}, {'embedding': [...], 'index': 7},
  {'embedding': [...], 'index': 8}, {'embedding': [...], 'index': 9}]
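
To show how these stored vectors might be used, the following sketch embeds a query with the same endpoint and ranks the documents by cosine similarity. It assumes the encoded_data list, client, and endpoint_name from the preceding code; the helper function and the sample query are illustrative.

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed the query with the same endpoint, using input_type "query"
query_payload = json.dumps({
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input": ["How is AI used to detect fraud?"],
    "input_type": "query",
})
query_response = json.load(client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=query_payload,
)["Body"])
query_embedding = query_response["data"][0]["embedding"]

# Rank the previously encoded documents by similarity to the query
scored = sorted(
    ({"index": d["index"], "score": cosine_similarity(query_embedding, d["embedding"])}
     for d in encoded_data),
    key=lambda x: x["score"],
    reverse=True,
)
print(scored[:3])  # top three most similar documents

With this sample query about fraud detection, the Chinese document on AI in fraud detection systems would be expected to rank highest, reflecting the model's cross-lingual retrieval capability.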

Inference example with NeMo Retriever text reranking NIM microservice

The NVIDIA NeMo Retriever Llama3.2 reranking NIM microservice is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens).

In the following example, we create an input payload for a list of emails in multiple languages:

payload_model = "nvidia/llama-3.2-nv-rerankqa-1b-v2"
query = {"text": "What emails have been about returning items?"}
documents = [
    {"text":"Contraseña incorrecta. Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?"},
    {"text":"Confirmation Email Missed. Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?"},
    {"text":"أسئلة حول سياسة الإرجاع. مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب"},
    {"text":"Customer Support is Busy. Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?"},
    {"text":"Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken."},
    {"text":"Customer Service is Unavailable. Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?"},
    {"text":"Return Policy for Defective Product. Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
    {"text":"收到错误物品. 早上好,关于我最近的订单,我有一个问题。我收到了错误的商品,需要退货。"},
    {"text":"Return Defective Product. Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."}
]

payload = {
  "model": payload_model,
  "query": query,
  "passages": documents,
  "truncate": "END"
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(f'Documents: {response}')
print(json.dumps(output, indent=2))

In this example, the model returns raw relevance (logit) scores that are not bounded to a fixed range. Higher scores indicate stronger relevance to the query, and lower (more negative) scores indicate lower relevance.

Documents: {'ResponseMetadata': {'RequestId': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Tue, 04 Mar 2025 21:46:39 GMT', 'content-type': 'application/json', 'content-length': '349', 'connection': 'keep-alive'}, 'RetryAttempts': 0}, 'ContentType': 'application/json', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x7fbb00ff94b0>}
{
  "rankings": [
    {
      "index": 4,
      "logit": 0.0791015625
    },
    {
      "index": 8,
      "logit": -0.1904296875
    },
    {
      "index": 7,
      "logit": -2.583984375
    },
    {
      "index": 2,
      "logit": -4.71484375
    },
    {
      "index": 6,
      "logit": -5.34375
    },
    {
      "index": 1,
      "logit": -5.64453125
    },
    {
      "index": 5,
      "logit": -11.96875
    },
    {
      "index": 3,
      "logit": -12.2265625
    },
    {
      "index": 0,
      "logit": -16.421875
    }
  ],
  "usage": {
    "prompt_tokens": 513,
    "total_tokens": 513
  }
}

Let’s see the top-ranked document for our query:

# 1. Extract the array of rankings
rankings = output["rankings"]  # or output.get("rankings", [])

# 2. Get the top-ranked entry (highest logit)
top_ranked_entry = rankings[0]
top_index = top_ranked_entry["index"]  # e.g. 4 in your example

# 3. Retrieve the corresponding document
top_document = documents[top_index]

print("Top-ranked document:")
print(top_document)

The following is the top-ranked document based on the provided relevance scores:

Top-ranked document:
{'text': 'Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken.'}

This translates to the following:

"Wrong item received. Hello, I have a question about my last order. I received the wrong item and need to return it."

Based on the preceding results from the model, we see that a higher logit indicates stronger alignment with the query, whereas lower (or more negative) values indicate lower relevance. In this case, the document discussing receiving the wrong item (in German) was ranked first with the highest logit, confirming that the model quickly and effectively identified it as the most relevant passage regarding product returns.
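
If you need bounded scores, for example, to apply a relevance threshold before passing documents to an LLM, one common option is to map each logit through a sigmoid. The following sketch is illustrative only; it reuses the output and documents variables from the preceding code, and the sigmoid values are a monotonic transformation of the logits rather than calibrated probabilities, so the 0.4 threshold is arbitrary.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Pair each ranking with its document text and a bounded score in (0, 1)
ranked_docs = [
    {
        "text": documents[r["index"]]["text"],
        "logit": r["logit"],
        "score": sigmoid(r["logit"]),
    }
    for r in output["rankings"]
]

# Keep only documents above an (arbitrary) relevance threshold
relevant = [d for d in ranked_docs if d["score"] > 0.4]
for d in relevant:
    print(f'{d["score"]:.3f}  {d["text"][:60]}')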

Clean up

To clean up your resources, use the following commands:

sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

Conclusion

The NVIDIA NeMo Retriever Llama 3.2 NIM microservices bring powerful multilingual capabilities to enterprise search and retrieval systems. These models excel in diverse use cases, including cross-lingual search applications, enterprise knowledge bases, customer support systems, and content recommendation engines. The text embedding NIM’s dynamic embedding size (Matryoshka Embeddings) reduces storage footprint by 35-fold while supporting 26 languages and documents up to 8,192 tokens. The reranking NIM accurately scores document relevance across languages, enabling precise information retrieval even for multilingual content. For organizations managing global knowledge bases or customer-facing search experiences, these NVIDIA-optimized microservices provide a significant advantage in latency, accuracy, and efficiency, allowing developers to quickly deploy sophisticated search capabilities without compromising on performance or linguistic diversity.

SageMaker JumpStart provides a straightforward way to use state-of-the-art large language FMs for text embedding and reranking. Through the UI or just a few lines of code, you can deploy a highly accurate text embedding model to generate dense vector representations that capture semantic meaning and a reranking model to find semantic matches and retrieve the most relevant information from various data stores at scale and cost-efficiently.


About the Authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s in Computer Science and Bioinformatics.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within cloud platforms and enhancing user experience on accelerated computing.

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the machine learning and generative AI hub provided by Amazon SageMaker. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Chase Pinkerton is a Startups Solutions Architect at Amazon Web Services. He holds a Bachelor’s in Computer Science with a minor in Economics from Tufts University. He’s passionate about helping startups grow and scale their businesses. When not working, he enjoys road cycling, hiking, playing volleyball, and photography.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Read More

Accelerating AI Development With NVIDIA RTX PRO Blackwell Series GPUs and NVIDIA NIM Microservices for RTX

As generative AI capabilities expand, NVIDIA is equipping developers with the tools to seamlessly integrate AI into creative projects, applications and games to unlock groundbreaking experiences on NVIDIA RTX AI PCs and workstations.

At the NVIDIA GTC global AI conference this week, NVIDIA introduced the NVIDIA RTX PRO Blackwell series, a new generation of workstation and server GPUs built for complex AI-driven workloads, technical computing and high-performance graphics.

Alongside the new hardware, NVIDIA announced a suite of AI-powered tools, libraries and software development kits designed to accelerate AI development on PCs and workstations. With NVIDIA CUDA-X libraries for data science, developers can significantly accelerate data processing and machine learning tasks, enabling faster exploratory data analysis, feature engineering and model development with zero code changes. And with NVIDIA NIM microservices, developers can more seamlessly build AI assistants, productivity plug-ins and advanced content-creation workflows with peak performance.

AI at the Speed of NIM With RTX PRO Series GPUs

The RTX PRO Blackwell series is built to handle the most demanding AI-driven workflows, powering applications like AI agents, simulation, extended reality, 3D design and high-end visual effects. Whether for designing and engineering complex systems or creating sophisticated and immersive content, RTX PRO GPUs deliver the performance, efficiency and scalability professionals need.

The new lineup includes:

  • Desktop GPUs: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, NVIDIA RTX PRO 5000 Blackwell, NVIDIA RTX PRO 4500 Blackwell and NVIDIA RTX PRO 4000 Blackwell
  • Laptop GPUs: NVIDIA RTX PRO 5000 Blackwell, NVIDIA RTX PRO 4000 Blackwell, NVIDIA RTX PRO 3000 Blackwell, NVIDIA RTX PRO 2000 Blackwell, NVIDIA RTX PRO 1000 Blackwell and NVIDIA RTX PRO 500 Blackwell Laptop GPUs
  • Data center GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition

As AI and data science evolve, the ability to rapidly process and analyze massive datasets will become a key differentiator to enable breakthroughs across industries.

NVIDIA CUDA-X, built on CUDA, is a collection of libraries that deliver dramatically higher performance than CPU-only alternatives. With cuML 25.02 — now available in open beta — data scientists and researchers can accelerate scikit-learn, UMAP and HDBSCAN algorithms with zero code changes, unlocking new levels of performance and efficiency in machine learning tasks. This release extends the zero-code-change acceleration paradigm established by cuDF-pandas for DataFrame operations to machine learning, reducing iterations from hours to seconds.
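
As an illustration of the zero-code-change pattern, the following is ordinary scikit-learn code with no GPU-specific imports. The assumption, based on NVIDIA's cuML accelerator mode documentation, is that launching the same script through the cuml.accel module dispatches supported estimators such as HDBSCAN to the GPU without modification; the exact invocation may vary by release.

# Plain scikit-learn code: no GPU-specific imports or API changes.
# Assumption: running it under cuML's accelerator mode, e.g.
#   python -m cuml.accel cluster_demo.py
# transparently executes supported estimators on the GPU.
from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50_000, n_features=16, centers=12, random_state=0)

clusterer = HDBSCAN(min_cluster_size=50)
labels = clusterer.fit_predict(X)
print(f"Found {labels.max() + 1} clusters")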

Optimized AI software unlocks even greater possibilities. NVIDIA NIM microservices are prepackaged, high-performance AI models optimized across NVIDIA GPUs, from RTX-powered PCs and workstations to the cloud. Developers can use NIM microservices to build AI-powered app assistants, productivity tools and content-creation workflows that seamlessly integrate with RTX PRO GPUs. This makes AI more accessible and powerful than ever.

NIM microservices integrate top community and NVIDIA-built models, spanning capabilities and modalities important for PC and workstation use cases, including large language models (LLMs), images, speech and retrieval-augmented generation (RAG).

Announced at the CES trade show in January, NVIDIA AI Blueprints are advanced AI reference workflows built on NVIDIA NIM. With AI Blueprints, developers can create podcasts from PDF documents, generate stunning 4K images controlled and guided by 3D scenes, and incorporate digital humans into AI-powered use cases.

Coming soon to build.nvidia.com, the blueprints are extensible and provide everything needed to build and customize them for different use cases. These resources include source code, sample data, a demo application and documentation.

From cutting-edge hardware to optimized AI models and reference workflows, the RTX PRO series is redefining AI-powered computing — enabling professionals to push the limits of creativity, productivity and innovation. Learn about all the GTC announcements and the RTX PRO Blackwell series GPUs for laptops and workstations.

Create NIMble AI Chatbots With ChatRTX

AI-powered chatbots are changing how people interact with their content.

ChatRTX is a demo app that personalizes an LLM connected to a user’s content, whether documents, notes, images or other data. Using RAG, the NVIDIA TensorRT-LLM library and RTX acceleration, a user can query a custom chatbot to get contextually relevant answers. And because it all runs locally on Windows RTX PCs or RTX PRO workstations, they get fast and private results.

Today, the latest version of ChatRTX introduces support for NVIDIA NIM microservices, giving users access to new foundational models. NIM is expected to soon be available in additional top AI ecosystem apps. Download ChatRTX today.

Game On

Half-Life 2 owners can now download a free Half-Life 2 RTX demo from Steam, built with RTX Remix and featuring the latest neural rendering enhancements. RTX Remix supports a host of AI tools, including NVIDIA DLSS 4, RTX Neural Radiance Cache and the new community-published AI model PBRFusion 3, which upscales textures and generates high-quality normal, roughness and height maps for physically based materials.

The March NVIDIA Studio Driver is also now available for download, supporting recent app updates including last week’s RTX Remix launch. For automatic Studio Driver notifications, download the NVIDIA app.

In addition, NVIDIA RTX Kit, a suite of neural rendering technologies for game developers, is receiving major updates with Unreal Engine 5 support for the RTX Mega Geometry and RTX Hair features.

Learn more about the NVIDIA RTX PRO Blackwell GPUs by watching a replay of NVIDIA founder and CEO Jensen Huang’s GTC keynote and register to attend sessions from NVIDIA and industry leaders at the show, which runs through March 21.

Follow NVIDIA AI PC on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter.

Follow NVIDIA Workstation on LinkedIn and X.

See notice regarding software product information.

Read More

AI Factories Are Redefining Data Centers and Enabling the Next Era of AI

AI is fueling a new industrial revolution — one driven by AI factories.

Unlike traditional data centers, AI factories do more than store and process data — they manufacture intelligence at scale, transforming raw data into real-time insights. For enterprises and countries around the world, this means dramatically faster time to value — turning AI from a long-term investment into an immediate driver of competitive advantage. Companies that invest in purpose-built AI factories today will lead in innovation, efficiency and market differentiation tomorrow.

While a traditional data center typically handles diverse workloads and is built for general-purpose computing, AI factories are optimized to create value from AI. They orchestrate the entire AI lifecycle — from data ingestion to training, fine-tuning and, most critically, high-volume inference.

For AI factories, intelligence isn’t a byproduct; it’s the primary product. This intelligence is measured by AI token throughput — the real-time predictions that drive decisions, automation and entirely new services.

While traditional data centers aren’t disappearing anytime soon, whether they evolve into AI factories or connect to them depends on the enterprise business model.

Regardless of how enterprises choose to adapt, AI factories powered by NVIDIA are already manufacturing intelligence at scale, transforming how AI is built, refined and deployed.

The Scaling Laws Driving Compute Demand

Over the past few years, AI has revolved around training large models. But with the recent proliferation of AI reasoning models, inference has become the main driver of AI economics. Three key scaling laws highlight why:

  • Pretraining scaling: Larger datasets and model parameters yield predictable intelligence gains, but reaching this stage demands significant investment in skilled experts, data curation and compute resources. Over the last five years, pretraining scaling has increased compute requirements by 50 million times. However, once a model is trained, it significantly lowers the barrier for others to build on top of it.
  • Post-training scaling: Fine-tuning AI models for specific real-world applications requires 30x more compute during AI inference than pretraining. As organizations adapt existing models for their unique needs, cumulative demand for AI infrastructure skyrockets.
  • Test-time scaling (aka long thinking): Advanced AI applications such as agentic AI or physical AI require iterative reasoning, where models explore multiple possible responses before selecting the best one. This consumes up to 100x more compute than traditional inference.

Traditional data centers aren’t designed for this new era of AI. AI factories are purpose-built to optimize and sustain this massive demand for compute, providing an ideal path forward for AI inference and deployment.

Reshaping Industries and Economies With Tokens

Across the world, governments and enterprises are racing to build AI factories to spur economic growth, innovation and efficiency.

The European High Performance Computing Joint Undertaking recently announced plans to build seven AI factories in collaboration with 17 European Union member nations.

This follows a wave of AI factory investments worldwide, as enterprises and countries accelerate AI-driven economic growth across every industry and region.

These initiatives underscore a global reality: AI factories are quickly becoming essential national infrastructure, on par with telecommunications and energy.

Inside an AI Factory: Where Intelligence Is Manufactured

Foundation models, secure customer data and AI tools provide the raw materials for fueling AI factories, where inference serving, prototyping and fine-tuning shape powerful, customized models ready to be put into production.

As these models are deployed into real-world applications, they continuously learn from new data, which is stored, refined and fed back into the system using a data flywheel. This cycle of optimization ensures AI remains adaptive, efficient and always improving — driving enterprise intelligence at an unprecedented scale.

AI factories powered by NVIDIA for manufacturing enterprise intelligence at scale.

An AI Factory Advantage With Full-Stack NVIDIA AI

NVIDIA delivers a complete, integrated AI factory stack where every layer — from the silicon to the software — is optimized for training, fine-tuning, and inference at scale. This full-stack approach ensures enterprises can deploy AI factories that are cost effective, high-performing and future-proofed for the exponential growth of AI.

With its ecosystem partners, NVIDIA has created building blocks for the full-stack AI factory, offering:

  • Powerful compute performance
  • Advanced networking
  • Infrastructure management and workload orchestration
  • The largest AI inference ecosystem
  • Storage and data platforms
  • Blueprints for design and optimization
  • Reference architectures
  • Flexible deployment for every enterprise

Powerful Compute Performance

The heart of any AI factory is its compute power. From NVIDIA Hopper to NVIDIA Blackwell, NVIDIA provides the world’s most powerful accelerated computing for this new industrial revolution. With the NVIDIA Blackwell Ultra-based GB300 NVL72 rack-scale solution, AI factories can achieve up to 50X the output for AI reasoning, setting a new standard for efficiency and scale.

The NVIDIA DGX SuperPOD is the exemplar of the turnkey AI factory for enterprises, integrating the best of NVIDIA accelerated computing. NVIDIA DGX Cloud provides an AI factory that delivers NVIDIA accelerated compute with high performance in the cloud.

Global systems partners are building full-stack AI factories for their customers based on NVIDIA accelerated computing — now including the NVIDIA GB200 NVL72 and GB300 NVL72 rack-scale solutions.

Advanced Networking 

Moving intelligence at scale requires seamless, high-performance connectivity across the entire AI factory stack. NVIDIA NVLink and NVLink Switch enable high-speed, multi-GPU communication, accelerating data movement within and across nodes.

AI factories also demand a robust network backbone. The NVIDIA Quantum InfiniBand, NVIDIA Spectrum-X Ethernet, and NVIDIA BlueField networking platforms reduce bottlenecks, ensuring efficient, high-throughput data exchange across massive GPU clusters. This end-to-end integration is essential for scaling out AI workloads to million-GPU levels, enabling breakthrough performance in training and inference.

Infrastructure Management and Workload Orchestration

Businesses need a way to harness the power of AI infrastructure with the agility, efficiency and scale of a hyperscaler, but without the burdens of cost, complexity and expertise placed on IT.

With NVIDIA Run:ai, organizations can benefit from seamless AI workload orchestration and GPU management, optimizing resource utilization while accelerating AI experimentation and scaling workloads. NVIDIA Mission Control software, which includes NVIDIA Run:ai technology, streamlines AI factory operations from workloads to infrastructure while providing full-stack intelligence that delivers world-class infrastructure resiliency.

NVIDIA Mission Control streamlines workflows across the AI factory stack.

The Largest AI Inference Ecosystem

AI factories need the right tools to turn data into intelligence. The NVIDIA AI inference platform, spanning the NVIDIA TensorRT ecosystem, NVIDIA Dynamo and NVIDIA NIM microservices — all part (or soon to be part) of the NVIDIA AI Enterprise software platform — provides the industry’s most comprehensive suite of AI acceleration libraries and optimized software. It delivers maximum inference performance, ultra-low latency and high throughput.

Storage and Data Platforms

Data fuels AI applications, but the rapidly growing scale and complexity of enterprise data often make it too costly and time-consuming to harness effectively. To thrive in the AI era, enterprises must unlock the full potential of their data.

The NVIDIA AI Data Platform is a customizable reference design to build a new class of AI infrastructure for demanding AI inference workloads. NVIDIA-Certified Storage partners are collaborating with NVIDIA to create customized AI data platforms that can harness enterprise data to reason and respond to complex queries.

Blueprints for Design and Optimization

To design and optimize AI factories, teams can use the NVIDIA Omniverse Blueprint for AI factory design and operations. The blueprint enables engineers to design, test and optimize AI factory infrastructure before deployment using digital twins. By reducing risk and uncertainty, the blueprint helps prevent costly downtime — a critical factor for AI factory operators.

For a 1 gigawatt-scale AI factory, every day of downtime can cost over $100 million. By solving complexity upfront and enabling siloed teams in IT, mechanical, electrical, power and network engineering to work in parallel, the blueprint accelerates deployment and ensures operational resilience.

Reference Architectures

NVIDIA Enterprise Reference Architectures and NVIDIA Cloud Partner Reference Architectures provide a roadmap for partners designing and deploying AI factories. They help enterprises and cloud providers build scalable, high-performance and secure AI infrastructure based on NVIDIA-Certified Systems with the NVIDIA AI software stack and partner ecosystem.

NVIDIA full-stack AI factories, built on NVIDIA reference architectures. (*NVIS = NVIDIA infrastructure specialists)

Every layer of the AI factory stack relies on efficient computing to meet growing AI demands. NVIDIA accelerated computing serves as the foundation across the stack, delivering the highest performance per watt to ensure AI factories operate at peak energy efficiency. With energy-efficient architecture and liquid cooling, businesses can scale AI while keeping energy costs in check.

Flexible Deployment for Every Enterprise

With NVIDIA’s full-stack technologies, enterprises can easily build and deploy AI factories, aligning with customers’ preferred IT consumption models and operational needs.

Some organizations opt for on-premises AI factories to maintain full control over data and performance, while others use cloud-based solutions for scalability and flexibility. Many also turn to their trusted global systems partners for pre-integrated solutions that accelerate deployment.

The NVIDIA DGX GB300 is the highest-performing, largest-scale AI factory infrastructure available for enterprises, built for the era of AI reasoning.

On Premises

NVIDIA DGX SuperPOD is a turnkey AI factory infrastructure solution that provides accelerated infrastructure with scalable performance for the most demanding AI training and inference workloads. It features a design-optimized combination of AI compute, network fabric, storage and NVIDIA Mission Control software, empowering enterprises to get AI factories up and running in weeks instead of months — and with best-in-class uptime, resiliency and utilization.

AI factory solutions are also offered through the NVIDIA global ecosystem of enterprise technology partners with NVIDIA-Certified Systems. They deliver leading hardware and software technology, combined with data center systems expertise and liquid-cooling innovations, to help enterprises de-risk their AI endeavors and accelerate the return on investment of their AI factory implementations.

These global systems partners are providing full-stack solutions based on NVIDIA reference architectures — integrated with NVIDIA accelerated computing, high-performance networking and AI software — to help customers successfully deploy AI factories and manufacture intelligence at scale.

In the Cloud

For enterprises looking to use a cloud-based solution for their AI factory, NVIDIA DGX Cloud delivers a unified platform on leading clouds to build, customize and deploy AI applications. Every layer of DGX Cloud is optimized and fully managed by NVIDIA, offering the best of NVIDIA AI in the cloud. It features enterprise-grade software and large-scale, contiguous clusters on leading cloud providers, with scalable compute resources ideal for even the most demanding AI training workloads.

DGX Cloud also includes a dynamic and scalable serverless inference platform that delivers high throughput for AI tokens across hybrid and multi-cloud environments, significantly reducing infrastructure complexity and operational overhead.

By providing a full-stack platform that integrates hardware, software, ecosystem partners and reference architectures, NVIDIA is helping enterprises build AI factories that are cost effective, scalable and high-performing — equipping them to meet the next industrial revolution.

Learn more about NVIDIA AI factories.

See notice regarding software product information.

Read More

NVIDIA Accelerated Quantum Research Center to Bring Quantum Computing Closer

As quantum computers continue to develop, they will integrate with AI supercomputers to form accelerated quantum supercomputers capable of solving some of the world’s hardest problems.

Integrating quantum processing units (QPUs) into AI supercomputers is key for developing new applications, helping unlock breakthroughs critical to running future quantum hardware and enabling developments in quantum error correction and device control.

The NVIDIA Accelerated Quantum Research Center, or NVAQC, announced today at the NVIDIA GTC global AI conference, is where these developments will happen. With an NVIDIA GB200 NVL72 system and the NVIDIA Quantum-2 InfiniBand networking platform, the facility will house a supercomputer with 576 NVIDIA Blackwell GPUs dedicated to quantum computing research.

“The NVAQC draws on much-needed and long-sought-after tools for scaling quantum computing to next-generation devices,” said Tim Costa, senior director of computer-aided engineering, quantum and CUDA-X at NVIDIA. “The center will be a place for large-scale simulations of quantum algorithms and hardware, tight integration of quantum processors, and both training and deployment of AI models for quantum.”

The NVAQC will host a GB200 NVL72 system.

Quantum computing innovators like Quantinuum, QuEra and Quantum Machines, along with academic partners from the Harvard Quantum Initiative and the Engineering Quantum Systems group at the MIT Center for Quantum Engineering, will work on projects with NVIDIA at the center to explore how AI supercomputing can accelerate the path toward quantum computing.

“The NVAQC is a powerful tool that will be instrumental in ushering in the next generation of research across the entire quantum ecosystem,” said William Oliver, professor of electrical engineering and computer science, and of physics, leader of the EQuS group and director of the MIT Center for Quantum Engineering. “NVIDIA is a critical partner for realizing useful quantum computing.”

There are several key quantum computing challenges where the NVAQC is already set to have a dramatic impact.

Protecting Qubits With AI Supercomputing

Qubit interactions are a double-edged sword. While qubits must interact with their surroundings to be controlled and measured, these same interactions are also a source of noise — unwanted disturbances that affect the accuracy of quantum calculations. Quantum algorithms can only work if the resulting noise is kept in check.

Quantum error correction provides a solution, encoding noiseless, logical qubits within many noisy, physical qubits. By processing the outputs from repeated measurements on these noisy qubits, it’s possible to identify, track and correct qubit errors — all without destroying the delicate quantum information needed by a computation.

The process of figuring out where errors occurred and what corrections to apply is called decoding. Decoding is an extremely difficult task that must be performed by a conventional computer within a narrow time frame to prevent noise from snowballing out of control.

A key goal of the NVAQC will be exploring how AI supercomputing can accelerate decoding. Studying how to collocate quantum hardware within the center will allow the development of low-latency, parallelized and AI-enhanced decoders, running on NVIDIA GB200 Grace Blackwell Superchips.

The NVAQC will also tackle other challenges in quantum error correction. QuEra will work with NVIDIA to accelerate its search for new, improved quantum error correction codes, assessing the performance of candidate codes through demanding simulations of complex quantum circuits.

“The NVAQC will be an essential tool for discovering, testing and refining new quantum error correction codes and decoders capable of bringing the whole industry closer to useful quantum computing,” said Mikhail Lukin, Joshua and Beth Friedman University Professor at Harvard and a codirector of the Harvard Quantum Initiative.

Developing Applications for Accelerated Quantum Supercomputers

The majority of useful quantum algorithms draw equally from classical and quantum computing resources, ultimately requiring an accelerated quantum supercomputer that unifies both kinds of hardware.

For example, the output of classical supercomputers is often needed to prime quantum computations. The NVAQC provides the heterogeneous compute infrastructure needed for research on developing and improving such hybrid algorithms.

Accelerated quantum supercomputers will connect quantum and classical processors to execute hybrid algorithms.

New AI-based compilation techniques will also be explored at the NVAQC, with the potential to accelerate the runtime of all quantum algorithms, including through work with Quantinuum. Quantinuum will build on its previous integration work with NVIDIA, offering its hardware and emulators through the NVIDIA CUDA-Q platform. Users of CUDA-Q are currently offered unrestricted access to Quantinuum’s QNTM H1-1 hardware and emulator for 90 days.
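
To give a sense of what CUDA-Q programs look like, the following is a minimal sketch of a Bell-state kernel using CUDA-Q's Python API. The target-selection line is an assumption based on CUDA-Q's documented Quantinuum backend; a real submission requires Quantinuum credentials, and the settings may need adjusting for your environment.

import cudaq

# Assumption: the "quantinuum" target with emulate=True runs against the
# local emulator; submitting to real hardware requires Quantinuum credentials.
cudaq.set_target("quantinuum", emulate=True)

@cudaq.kernel
def bell():
    qubits = cudaq.qvector(2)
    h(qubits[0])
    x.ctrl(qubits[0], qubits[1])
    mz(qubits)

counts = cudaq.sample(bell, shots_count=1000)
print(counts)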

“We’re excited to deepen our work with NVIDIA via this center,” said Rajeeb Hazra, president and CEO of Quantinuum. “By combining Quantinuum’s powerful quantum systems with NVIDIA’s cutting-edge accelerated computing, we’re pushing the boundaries of hybrid quantum-classical computing and unlocking exciting new possibilities.”

QPU Integration

Integrating quantum hardware with AI supercomputing is one of the major remaining hurdles on the path to running useful quantum hardware.

The requirements of such an integration can be extremely demanding. The decoding required by quantum error correction can only function if data from millions of qubits can be sent between quantum and classical hardware at ultralow latencies.

Quantum Machines will work with NVIDIA at the NVAQC to develop and hone new controller technologies supporting rapid, high-bandwidth interfaces between quantum processors and GB200 superchips.

“We’re excited to see NVIDIA’s growing commitment to accelerating the realization of useful quantum computers, providing researchers with the most advanced infrastructure to push the boundaries of quantum-classical computing,” said Itamar Sivan, CEO of Quantum Machines.

The NVIDIA DGX Quantum system comprises an NVIDIA GH200 superchip coupled with Quantum Machines’ OPX1000 control system.

Key to integrating quantum and classical hardware is a platform that lets researchers and developers quickly shift context between these two disparate computing paradigms within a single application. The NVIDIA CUDA-Q platform will be the entry point for researchers to harness the NVAQC’s quantum-classical integration.

Building on tools like NVIDIA DGX Quantum — a reference architecture for integrating quantum and classical hardware — and CUDA-Q, the NVAQC is set to be an epicenter for next-generation developments in quantum computing, seeding the evolution of qubits into impactful quantum computers.

Learn more about NVIDIA quantum computing.

Read More

Full Steam Ahead: NVIDIA-Certified Program Expands to Enterprise Storage for Faster AI Factory Deployment

AI deployments thrive on speed, data and scale. That’s why NVIDIA is expanding NVIDIA-Certified Systems to include enterprise storage certification — for streamlined AI factory deployments in the enterprise with accelerated computing, networking, software and storage.

As enterprises build AI factories, access to high-quality data is imperative to ensure optimal performance and reliability for AI models. The new NVIDIA-Certified Storage program announced today at the NVIDIA GTC global AI conference validates that enterprise storage systems meet stringent performance and scalability data requirements for AI and high-performance computing workloads.

Leading enterprise data platform and storage providers are already onboard, ensuring businesses have trusted options from day one. These include DDN, Dell Technologies, Hewlett Packard Enterprise, Hitachi Vantara, IBM, NetApp, Nutanix, Pure Storage, VAST Data and WEKA.

Building Blocks for a New Class of Enterprise Infrastructure

At GTC, NVIDIA also announced the NVIDIA AI Data Platform, a customizable reference design to build a new class of enterprise infrastructure for demanding agentic AI workloads.

The NVIDIA-Certified Storage designation is a prerequisite for partners developing agentic AI infrastructure solutions built on the NVIDIA AI Data Platform. Each of these NVIDIA-Certified Storage partners will deliver customized AI data platforms, in collaboration with NVIDIA, that can harness enterprise data to reason and respond to complex queries.

NVIDIA-Certified was created more than four years ago as the industry’s first certification program dedicated to tuning and optimizing AI systems to ensure optimal performance, manageability and scalability. Each NVIDIA-Certified system is rigorously tested and validated to deliver enterprise-grade AI performance.

There are now 50+ partners providing 500+ NVIDIA-Certified systems, helping enterprises reduce time, cost and complexity by giving them a wide selection of performance-optimized systems to power their accelerated computing workloads.

NVIDIA Enterprise Reference Architectures (RAs) were introduced last fall to provide partners with AI infrastructure best practices and configuration guidance for deploying NVIDIA-Certified servers, NVIDIA Spectrum-X networking and NVIDIA AI Enterprise software.

Solutions based on NVIDIA Enterprise RAs are available from the world’s leading systems providers to reduce the time, cost and complexity of enterprise AI deployments. Enterprise RAs are now available for a wide range of NVIDIA Hopper and NVIDIA Blackwell platforms, including NVIDIA HGX B200 systems and the new NVIDIA RTX PRO 6000 Blackwell Server Edition GPU.

These NVIDIA technologies and partner solutions are the building blocks for enterprise AI factories, representing a new class of enterprise infrastructure for high-performance AI deployments at scale.

Enterprise AI Needs Scalable Storage

As the pace of AI innovation and adoption accelerates, secure and reliable access to high-quality enterprise data is becoming more important than ever. Data is the fuel for the AI factory. With enterprise data creation projected to reach 317 zettabytes annually by 2028*, AI workloads require storage architectures built to handle massive, unstructured and multimodal datasets.

NVIDIA’s expanded storage certification program is designed to meet this need and help enterprises build AI factories with a foundation of high-performance, reliable data storage solutions. The program includes performance testing as well as validation that partner storage systems adhere to design best practices, optimizing performance and scalability for enterprise AI workloads.

NVIDIA-Certified Storage will be incorporated into NVIDIA Enterprise RAs, providing enterprise-grade data storage for AI factory deployments with full-stack solutions from global systems partners.

Certified Storage for Every Deployment

This certification builds on existing NVIDIA DGX systems and NVIDIA Cloud Partner (NCP) storage programs, expanding the data ecosystem for AI infrastructure.

These storage certification programs are aligned with their deployment models and architectures:

  • NVIDIA DGX BasePOD and DGX SuperPOD Storage Certification — designed for enterprise AI factory deployments with NVIDIA DGX systems.
  • NCP Storage Certification — designed for large-scale NCP Reference Architecture AI factory deployments with cloud providers.
  • NVIDIA-Certified Storage Certification — designed for enterprise AI factory deployments with NVIDIA-Certified servers available from global partners, based on NVIDIA Enterprise RA guidelines.

With this framework, organizations of all sizes — from cloud hyperscalers to enterprises — can build AI factories that process massive amounts of data, train models faster and drive more accurate, reliable AI outcomes.

Learn more about how NVIDIA-Certified Systems deliver seamless, high-speed performance and attend related sessions at GTC.

*Source: IDC, Worldwide IDC Global DataSphere Forecast, 2024–2028: AI Everywhere, But Upsurge in Data Will Take Time, doc #US52076424, May 2024

Read More

From AT&T to the United Nations, AI Agents Redefine Work With NVIDIA AI Enterprise

AI agents are transforming work, delivering time and cost savings by helping people resolve complex challenges in new ways.

Whether developed for humanitarian aid, customer service or healthcare, AI agents built with the NVIDIA AI Enterprise software platform make up a new digital workforce helping professionals accomplish their goals faster — at lower costs and for greater impact.

AI Agents Enable Growth and Education

AI can instantly translate, summarize and process multimodal content in hundreds of languages. Integrated into agentic systems, the technology enables international organizations to engage and educate global stakeholders more efficiently.

The United Nations (UN) is working with Accenture to develop a multilingual research agent that supports over 150 languages and promotes local economic sustainability. The agent will act like a researcher, answering questions about the UN’s Sustainable Development Goals and fostering awareness and engagement toward its agenda of global peace and prosperity.

Mercy Corps, in collaboration with Cloudera, has deployed an AI-driven Methods Matcher tool that supports humanitarian aid experts in more than 40 countries by providing research, summaries, best-practice guidelines and data-driven crisis responses, enabling faster aid delivery in disaster situations.

Wikimedia Deutschland, using the DataStax AI Platform built with NVIDIA AI, can process and embed 10 million Wikidata items in just three days, with 30x faster ingestion performance.
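
To make the embedding step concrete, the following is a minimal sketch of batch-embedding text records through an OpenAI-compatible /v1/embeddings endpoint, such as one served by a NeMo Retriever embedding NIM. The endpoint URL, model id and input_type parameter are illustrative assumptions, not details of the Wikimedia Deutschland or DataStax deployment.

```python
# Minimal, illustrative sketch: batch-embedding text records through an
# OpenAI-compatible /v1/embeddings endpoint (for example, a locally hosted
# NeMo Retriever embedding NIM). The endpoint URL, model id and the
# "input_type" extra parameter are assumptions for illustration only.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local NIM endpoint
    api_key="not-used",                   # local endpoints typically ignore the key
)

def embed_batch(texts, batch_size=64):
    """Yield one embedding vector per input text, in batches."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        resp = client.embeddings.create(
            model="nvidia/llama-3.2-nv-embedqa-1b-v2",  # illustrative model id
            input=batch,
            extra_body={"input_type": "passage"},       # index documents as passages
        )
        for item in resp.data:
            yield item.embedding

# Example: embed a few hypothetical records before writing them to a vector store.
vectors = list(embed_batch(["Wikidata item: Douglas Adams", "Wikidata item: Berlin"]))
print(len(vectors), len(vectors[0]))
```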

AI Agents Provide Tailored Customer Service Across Industries

Agentic AI enhances customer service with real-time, highly accurate insights for more effective user experiences. AI agents provide 24/7 support, handling common inquiries with more personalized responses while freeing human agents to address more complex issues.

Intelligent-routing capabilities categorize and prioritize requests so customers can be quickly directed to the right specialists. Plus, AI agents’ predictive-analytics capabilities enable proactive support by anticipating issues and empowering human agents with data-driven insights.

Companies across industries including telecommunications, finance, healthcare and sports are already tapping into AI agents to achieve massive benefits.

AT&T, in collaboration with Quantiphi, developed and deployed a new Ask AT&T AI agent to its call center, leading to an 84% decrease in call center analytics costs.

Southern California Edison, working with WWT, is driving Project Orca to improve data processing and predictions for 100,000+ network assets, using agents to reduce downtime, enhance network reliability and enable faster, more efficient ticket resolution.

With the adoption of ServiceNow Dispute Management, built with Visa, banks can use the solution’s AI agents to achieve up to a 28% reduction in call center volumes and a 30% decrease in time to resolution.

The Ottawa Hospital, working with Deloitte, deployed a team of 24/7 patient-care agents to provide preoperative support and answer patient questions regarding upcoming procedures for over 1.2 million people in eastern Ontario, Canada.

With the VAST Data Platform, the National Hockey League can unlock over 550,000 hours of historical game footage. This supports sponsorship analysis, helps video producers quickly create broadcast clips and enhances personalized fan content.

State-of-the-Art AI Agents Built With NVIDIA AI Enterprise

AI agents have emerged as versatile tools that can be adapted and adopted across a wide range of industries. These agents connect to organizational knowledge bases to understand the business context they’re deployed in. Their core functionalities — such as question-answering, translation, data processing, predictive analytics and automation — can be tailored by any organization, in any industry, to improve productivity and save time and costs.

NVIDIA AI Enterprise provides the building blocks for enterprise AI agents. It includes NVIDIA NIM microservices for efficient inference of state-of-the-art models — including the new NVIDIA Llama Nemotron reasoning model family — and NVIDIA NeMo tools to streamline data processing, model customization, system evaluation, retrieval-augmented generation and guardrailing.
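
Because NIM microservices expose OpenAI-compatible APIs, querying a hosted model takes only a few lines of client code. Below is a minimal sketch assuming a NIM endpoint at http://localhost:8000/v1 and an illustrative Llama Nemotron model id; both are assumptions for illustration rather than a prescribed NVIDIA AI Enterprise workflow.

```python
# Minimal sketch of calling a NIM microservice through its OpenAI-compatible
# chat completions API. The endpoint URL, model id and prompts are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",  # illustrative reasoning model id
    messages=[
        {"role": "system", "content": "You are an enterprise support agent."},
        {"role": "user", "content": "Summarize yesterday's unresolved tickets."},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(response.choices[0].message.content)
```

A sketch like this could sit behind an agent framework or an NVIDIA Blueprint; the inference call itself stays the same regardless of how the surrounding agentic system is orchestrated.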

NVIDIA Blueprints are reference workflows that showcase best practices for developing high-performance agentic systems. With the AI-Q NVIDIA AI Blueprint, developers can build AI agents into larger agentic systems that can reason, then connect these systems to enterprise data to tackle complex problems, harness other tools, collaborate and operate with greater autonomy.

Learn more about AI agent development by watching the NVIDIA GTC keynote and registering for sessions from NVIDIA and industry leaders at the show, which runs through March 21.



NVIDIA Aerial Expands With New Tools for Building AI-Native Wireless Networks

The telecom industry is increasingly embracing AI to deliver seamless connections — even in conditions of poor signal strength — while maximizing sustainability and spectral efficiency, the amount of information that can be transmitted per unit of bandwidth.

Advancements in AI-RAN technology have set the course toward AI-native wireless networks for 6G, built using AI and accelerated computing from the start, to meet the demands of billions of AI-enabled connected devices, sensors, robots, cameras and autonomous vehicles.

To help developers and telecom leaders pioneer these networks, NVIDIA today unveiled new tools in the NVIDIA Aerial Research portfolio.

The expanded portfolio of solutions includes the Aerial Omniverse Digital Twin on NVIDIA DGX Cloud, the Aerial Commercial Test Bed on NVIDIA MGX, the NVIDIA Sionna 1.0 open-source library and the Sionna Research Kit on NVIDIA Jetson — helping accelerate AI-RAN and 6G research.

Industry leaders like Amdocs, Ansys, Capgemini, DeepSig, Fujitsu, Keysight, Kyocera, MathWorks, MediaTek, Samsung Research, SoftBank and VIAVI Solutions, as well as more than 150 higher education and research institutions from the U.S. and around the world — including Northeastern University, Rice University, The University of Texas at Austin, ETH Zurich, the Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut (HHI), the Singapore University of Technology and Design, and the University of Oulu — are harnessing the NVIDIA Aerial Research portfolio to develop, train, simulate and deploy groundbreaking AI-native wireless innovations.

New Tools for Research and Development

The Aerial Research portfolio provides exceptional flexibility and ease of use for developers at every stage of their research — from early experimentation to commercial deployment. Its offerings include:

  • Aerial Omniverse Digital Twin (AODT): A simulation platform to test and fine-tune algorithms in physically precise digital replicas of entire wireless systems, now available on NVIDIA DGX Cloud. Developers can now access AODT everywhere, whether on premises, on laptops, via the public cloud or on an NVIDIA cloud service.
  • Aerial Commercial Test Bed (aka ARC-OTA): A full-stack AI-RAN deployment system that enables developers to deploy new AI models over the air and test them in real time, now available on NVIDIA MGX and available through manufacturers including Supermicro or as a managed offering via Sterling Skywave. ARC-OTA integrates commercial-grade Aerial CUDA-accelerated RAN software with open-source L2+ and 5G core from OpenAirInterface (OAI) and O-RAN-compliant 7.2 split open radio units from WNC and LITEON Technology to enable an end-to-end system for AI-RAN commercial testing.
  • Sionna 1.0: The most widely used GPU-accelerated open-source library for research in communication systems, with more than 135,000 downloads. The latest release of Sionna features a lightning-fast ray tracer for radio propagation, a versatile link-level simulator and new system-level simulation capabilities.
  • Sionna Research Kit: Powered by the NVIDIA Jetson platform, it integrates accelerated computing for AI and machine learning workloads and a software-defined RAN built on OAI. With the kit, researchers can connect 5G equipment and begin prototyping AI-RAN algorithms for next-generation wireless networks in just a few hours.

NVIDIA Aerial Research Ecosystem for AI-RAN and 6G

The NVIDIA Aerial Research portfolio includes the NVIDIA 6G Developer Program, an open community that serves more than 2,000 members, representing leading technology companies, academia, research institutions and telecom operators using NVIDIA technologies to complement their AI-RAN and 6G research.

Testing and simulation will play an essential role in developing AI-native wireless networks. Companies such as Amdocs, Ansys, Keysight, MathWorks and VIAVI are enhancing their simulation solutions with NVIDIA AODT, while operators have created digital twins of their radio access networks to optimize performance with changing traffic scenarios.

Nine out of 10 demonstrations chosen by the AI-RAN Alliance for Mobile World Congress were developed using the NVIDIA Aerial Research portfolio, leading to breakthrough results.

SoftBank and Fujitsu demonstrated an up to 50% throughput gain in poor radio environments using AI-based uplink channel interpolation.

DeepSig developed OmniPHY, an AI-native air interface that eliminates traditional pilot overhead, harnessing neural networks to achieve up to 70% throughput gains in certain scenarios. Using the NVIDIA AI Aerial platform, OmniPHY integrates machine learning into modulation, reception and demodulation to optimize spectral efficiency, reduce power consumption and enhance wireless network performance.

“AI-native signal processing is transforming wireless networks, delivering real-world results,” said Jim Shea, cofounder and CEO of DeepSig. “By integrating deep learning to the air interface and leveraging NVIDIA’s tools, we’re redefining how AI-native wireless networks are designed and built.”

Beyond the Aerial Research portfolio, developers can use the open ecosystem of NVIDIA CUDA-X libraries, built on CUDA, to create applications that deliver dramatically higher performance.

Join the NVIDIA 6G Developer Program to access NVIDIA Aerial Research platform tools.



Telecom Leaders Call Up Agentic AI to Improve Network Operations

Global telecommunications networks can support millions of user connections per day, generating more than 3,800 terabytes of data per minute on average.

That massive, continuous flow of data generated by base stations, routers, switches and data centers — including network traffic information, performance metrics, configuration and topology — is unstructured and complex. Not surprisingly, traditional automation tools have often fallen short in handling massive, real-time workloads involving such data.

To help address this challenge, NVIDIA today announced at the GTC global AI conference that its partners are developing new large telco models (LTMs) and AI agents custom-built for the telco industry using NVIDIA NIM and NeMo microservices within the NVIDIA AI Enterprise software platform. These LTMs and AI agents enable the next generation of AI in network operations.

LTMs — customized, multimodal large language models (LLMs) trained specifically on telco network data — are core elements in the development of network AI agents, which automate complex decision-making workflows, improve operational efficiency, boost employee productivity and enhance network performance.

SoftBank and Tech Mahindra have built new LTMs and AI agents, while Amdocs, BubbleRAN and ServiceNow are dialing up their network operations and optimization with new AI agents, all using NVIDIA AI Enterprise.

It’s important work at a time when 40% of respondents in a recent NVIDIA-run telecom survey noted they’re deploying AI into their network planning and operations.

LTMs Understand the Language of Networks

Just as LLMs understand and generate human language, and NVIDIA BioNeMo NIM microservices understand the language of biological data for drug discovery, LTMs now enable AI agents to master the language of telecom networks.

The new partner-developed LTMs powered by NVIDIA AI Enterprise are:

  • Specialized in network intelligence — the LTMs can understand real-time network events, predict failures and automate resolutions.
  • Optimized for telco workloads — tapping into NVIDIA NIM microservices, the LTMs are tuned for efficiency, accuracy and low latency.
  • Suited for continuous learning and adaptation — with post-training scalability, the LTMs can use NVIDIA NeMo to learn from new events, alerts and anomalies to enhance future performance.

NVIDIA AI Enterprise provides additional tools and blueprints to build AI agents that simplify network operations and deliver cost savings and operational efficiency, while improving network key performance indicators (KPIs), such as:

  • Reduced downtime — AI agents can predict failures before they happen, delivering network resilience.
  • Improved customer experiences — AI-driven optimizations lead to faster networks, fewer outages and seamless connectivity.
  • Enhanced security — as it continuously scans for threats, AI can help mitigate cyber risks in real time.

Industry Leaders Launch LTMs and AI Agents

Leading companies across telecommunications are using NVIDIA AI Enterprise to advance their latest technologies.

SoftBank has developed a new LTM based on a large-scale LLM base model, trained on its own network data. Initially focused on network configuration, the model — which is available as an NVIDIA NIM microservice — can automatically reconfigure the network to adapt to changes in network traffic, including during mass events at stadiums and other venues. SoftBank is also introducing network agent blueprints to help accelerate AI adoption across telco operations.

Tech Mahindra has developed an LTM with the NVIDIA agentic AI tools to help address critical network operations. Tapping into this LTM, the company’s Adaptive Network Insights Studio provides a 360-degree view of network issues, generating automated reports at various levels of detail to inform and assist IT teams, network engineers and company executives.

In addition, Tech Mahindra’s Proactive Network Anomaly Resolution Hub is powered by the LTM to automatically resolve a significant portion of its network events, lightening engineers’ workloads and enhancing their productivity.

Amdocs’ Network Assurance Agent, powered by amAIz Agents, automates repetitive tasks such as fault prediction. It also conducts impact analysis and prevention methods for network issues, providing step-by-step guidance on resolving any problems that occur. Plus, the company’s Network Deployment Agent simplifies open radio access network (RAN) adoption by automating integration, deployment tasks and interoperability testing, and providing insights to network engineers.

BubbleRAN is developing an autonomous multi-agent RAN intelligence platform on a cloud-native infrastructure, where LTMs can observe the network state, configuration, availability and KPIs to facilitate monitoring and troubleshooting. The platform also automates the process of network reconfiguration and policy enforcement through a high-level set of action tools. The company’s AI agents satisfy user needs by tapping into advanced retrieval-augmented generation pipelines and telco-specific application programming interfaces, answering real-time, 5G deployment-specific questions.

ServiceNow’s AI agents in telecom — built with NVIDIA AI Enterprise on NVIDIA DGX Cloud — drive productivity by generating resolution playbooks and predicting potential network disruptions before they occur. This helps communications service providers reduce resolution time and improve customer satisfaction. The new, ready-to-use AI agents also analyze network incidents, identifying root causes of disruptions so they can be resolved faster and avoided in the future.

Learn more about the latest agentic AI advancements at NVIDIA GTC, running through Friday, March 21, in San Jose, California.


AI on the Menu: Yum! Brands and NVIDIA Partner to Accelerate Restaurant Industry Innovation

The quick-service restaurant industry is a marvel of modern logistics, where speed, teamwork and kitchen operations are key ingredients for every order. Yum! Brands is now introducing AI-powered agents at select Pizza Hut and Taco Bell locations to assist and enhance the team member experience.

Today at the NVIDIA GTC conference, Yum! Brands announced a strategic partnership with NVIDIA with a goal of deploying multiple AI solutions using NVIDIA technology in 500 restaurants this year.

World’s Largest Restaurant Company Advances AI Adoption

Spanning more than 61,000 locations, Yum! operates more restaurants than any other company in the world. Globally, customers are drawn to the food, value, service and digital convenience from iconic brands like KFC, Taco Bell, Pizza Hut and Habit Burger & Grill.

Yum!’s industry-leading digital technology team continues to pioneer the company’s AI-accelerated strategy with the recent announcement of Byte by Yum!, the company’s proprietary, AI-driven digital restaurant technology platform.

Generative AI-powered customer-facing experiences like automated ordering can help speed operations — but they’re often difficult to scale because of complexity and costs.

To manage that complexity, developers at Byte by Yum! harnessed NVIDIA NIM microservices and NVIDIA Riva to build new AI-accelerated voice ordering agents in under four months. The voice AI is deployed on Amazon EC2 P4d instances accelerated by NVIDIA A100 GPUs, which enables the agents to understand natural speech, process complex menu orders and suggest add-ons — increasing accuracy and customer satisfaction and helping reduce bottlenecks in high-volume locations.
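
To illustrate the speech-recognition building block behind such voice agents, here is a minimal sketch using the nvidia-riva-client Python package against an assumed Riva server at localhost:50051. The server address, audio file and configuration values are illustrative assumptions; this is not Byte by Yum!’s actual implementation.

```python
# Minimal sketch of offline speech-to-text with the nvidia-riva-client package,
# assuming a Riva server is reachable at localhost:50051. The audio file name
# and configuration values are hypothetical, chosen only to show the shape of
# an ASR call a voice ordering agent might start from.
import riva.client

auth = riva.client.Auth(uri="localhost:50051")  # assumed server address
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    language_code="en-US",
    sample_rate_hertz=16000,
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

# Hypothetical recording of a spoken order.
with open("drive_thru_order.wav", "rb") as f:
    audio_bytes = f.read()

response = asr.offline_recognize(audio_bytes, config)
for result in response.results:
    print(result.alternatives[0].transcript)
```

In a production voice agent, the transcript would then be passed to a language model for menu understanding and order confirmation, with text-to-speech closing the loop back to the customer.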

The new collaboration with NVIDIA will help Yum! advance its ongoing efforts to keep its engineering and data science teams in control of their own intelligence — and to keep inference costs scalable, making large-scale deployments possible.

“At Yum, we have a bold vision to deliver leading-edge, AI-powered technology capabilities to our customers and team members globally,” said Joe Park, chief digital and technology officer of Yum! Brands, Inc. and president of Byte by Yum!. “We are thrilled to partner with a pioneering company like NVIDIA to help us accelerate this ambition. This partnership will enable us to harness the rich consumer and operational datasets on our Byte by Yum! integrated platform to build smarter AI engines that will create easier experiences for our customers and team members.”

Rollout of AI Solutions Underway

Yum!’s voice AI agents are already being deployed across its brands, including in call centers to handle phone orders when demand surges during events like game days. An expanded rollout of AI solutions at up to 500 restaurants is expected this year.

Computer Vision and Restaurant Intelligence

Beyond AI-accelerated ordering, Yum! is also testing NVIDIA computer vision software to analyze drive-thru traffic and explore new use cases for AI to perceive, alert and adjust staffing, with the goal of optimizing service speed.

Another initiative focuses on NVIDIA AI-accelerated restaurant operational intelligence. Using NIM microservices, Yum! can deploy applications analyzing performance metrics across thousands of locations to generate customized recommendations for managers, identifying what top-performing stores do differently and applying those insights system-wide.

With the NVIDIA AI Enterprise software platform — available on AWS Marketplace — Byte by Yum! is streamlining AI development and deployment through scalable NVIDIA infrastructure in the cloud.

The bottom line: AI is making restaurant operations and dining experiences easier, faster and more personal for the world’s largest restaurant company.
