March 2025 – Page 13

Benchmarking Amazon Nova and GPT-4o models with FloTorch

Based on original post by Dr. Hemant Joshi, CTO, FloTorch.ai

A recent evaluation conducted by FloTorch compared the performance of Amazon Nova models with OpenAI’s GPT-4o.

Amazon Nova is a new generation of state-of-the-art foundation models (FMs) that deliver frontier intelligence and industry-leading price-performance. The Amazon Nova family of models includes Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro, which support text, image, and video inputs while generating text-based outputs. These models offer enterprises a range of capabilities, balancing accuracy, speed, and cost-efficiency.

Using its enterprise software, FloTorch conducted an extensive comparison between Amazon Nova models and OpenAI’s GPT-4o models with the Comprehensive Retrieval Augmented Generation (CRAG) benchmark dataset. FloTorch’s evaluation focused on three critical factors—latency, accuracy, and cost—across five diverse topics.

Key findings from the benchmark study:

GPT-4o demonstrated a slight advantage in accuracy over Amazon Nova Pro
Amazon Nova Pro outperformed GPT-4o in efficiency, operating 97% faster while being 65.26% more cost-effective
Amazon Nova Micro and Amazon Nova Lite outperformed GPT-4o-mini by 2 percentage points in accuracy
In terms of affordability, Amazon Nova Micro and Amazon Nova Lite were 10% and 56.59% cheaper than GPT-4o-mini, respectively
Amazon Nova Micro and Amazon Nova Lite also demonstrated faster response times, with 48% and 26.60% improvements, respectively

In this post, we discuss the findings from this benchmarking in more detail.

The growing need for cost-effective AI models

The landscape of generative AI is rapidly evolving. OpenAI launched GPT-4o in May 2024, and Amazon introduced Amazon Nova models at AWS re:Invent in December 2024. Although GPT-4o has gained traction in the AI community, enterprises are showing increased interest in Amazon Nova due to its lower latency and cost-effectiveness.

Large language models (LLMs) are generally proficient in responding to user queries, but they sometimes generate overly broad or inaccurate responses. Additionally, LLMs might provide answers that extend beyond the company-specific context, making them unsuitable for certain enterprise use cases.

One of the most critical applications for LLMs today is Retrieval Augmented Generation (RAG), which enables AI models to ground responses in enterprise knowledge bases such as PDFs, internal documents, and structured data. This is a crucial requirement for enterprises that want their AI systems to provide responses strictly within a defined scope.

To better serve the enterprise customers, the evaluation aimed to answer three key questions:

How does Amazon Nova Pro compare to GPT-4o in terms of latency, cost, and accuracy?
How do Amazon Nova Micro and Amazon Nova Lite perform against GPT-4o mini in these same metrics?
How well do these models handle RAG use cases across different industry domains?

By addressing these questions, the evaluation provides enterprises with actionable insights into selecting the right AI models for their specific needs—whether optimizing for speed, accuracy, or cost-efficiency.

Overview of the CRAG benchmark dataset

The CRAG dataset was released by Meta for testing with factual queries across five domains with eight question types and a large number of question-answer pairs. Five domains in CRAG dataset are Finance, Sports, Music, Movie, and Open (miscellaneous). The eight different question types are simple, simple_w_condition, comparison, aggregation, set, false_premise, post-processing, and multi-hop. The following table provides example questions with their domain and question type.

Domain	Question	Question Type
Sports	Can you carry less than the maximum number of clubs during a round of golf?	`simple`
Music	Can you tell me how many grammies were won by arlo guthrie until 60th grammy (2017)?	`simple_w_condition`
Open	Can i make cookies in an air fryer?	`simple`
Finance	Did meta have any mergers or acquisitions in 2022?	`simple_w_condition`
Movie	In 2016, which movie was distinguished for its visual effects at the oscars?	`simple_w_condition`

The evaluation considered 200 queries from this dataset representing five domains and two question types, simple and simple_w_condition. Both types of questions are common from users, and a typical Google search for the query such as “Can you tell me how many grammies were won by arlo guthrie until 60th grammy (2017)?” will not give you the correct answer (one Grammy). FloTorch used these queries and their ground truth answers to create a subset benchmark dataset. The CRAG dataset also provides top five search result pages for each query. These five webpages act as a knowledge base (source data) to limit the RAG model’s response. The goal is to index these five webpages dynamically using a common embedding algorithm and then use a retrieval (and reranking) strategy to retrieve chunks of data from the indexed knowledge base to infer the final answer.

Evaluation setup

The RAG evaluation pipeline consists of the several key components, as illustrated in the following diagram.

In this section, we explore each component in more detail.

Knowledge base

FloTorch used the top five HTML webpages provided with the CRAG dataset for each query as the knowledge base source data. HTML pages were parsed to extract text for the embedding stage.

Chunking strategy

FloTorch used a fixed chunking strategy with a chunk size of 512 tokens (four characters is usually around one token) and a 10% overlap between chunks. Further experiments with different chunking strategies, chunk sizes, and percent overlap will be done in coming weeks and will update this post.

Embedding strategy

FloTorch used the Amazon Titan Text Embeddings V2 model on Amazon Bedrock with an output vector size of 1024. With a maximum input token limit of 8,192 for the model, the system successfully embedded chunks from the knowledge base source data as well as short queries from the CRAG dataset efficiently. Amazon Bedrock APIs make it straightforward to use Amazon Titan Text Embeddings V2 for embedding data.

Vector database

FloTorch selected Amazon OpenSearch Service as a vector database for its high-performance metrics. The implementation included a provisioned three-node sharded OpenSearch Service cluster. Each provisioned node was r7g.4xlarge, selected for its availability and sufficient capacity to meet the performance requirements. FloTorch used HSNW indexing in OpenSearch Service.

Retrieval (and reranking) strategy

FloTorch used a retrieval strategy with a k-nearest neighbor (k-NN) of five for retrieved chunks. The experiments excluded reranking algorithms to make sure retrieved chunks remained consistent for both models when inferring the answer to the provided query. The following code snippet embeds the given query and passes the embeddings to the search function:

def search_results(interaction_ids: List[str], queries: List[str], k: int):
   """Retrieve search results for queries."""
   results = []
   embedding_max_length = int(os.getenv("EMBEDDING_MAX_LENGTH", 1024))
   normalize_embeddings = os.getenv("NORMALIZE_EMBEDDINGS", "True").lower() == "true"

   for interaction_id, query in zip(interaction_ids, queries):
       try:
           _, _, embedding = create_embeddings_with_titan_bedrock(query, embedding_max_length, normalize_embeddings)
           results.append(search(interaction_id + '_titan', embedding, k))
       except Exception as e:
           logger.error(f"Error processing query {query}: {e}")
           results.append(None)
   return results

Inferencing

FloTorch used the GPT-4o model from OpenAI using the API key available and used the Amazon Nova Pro model with conversation APIs. GPT-4o supports a context window of 128,000 compared to Amazon Nova Pro with a context window of 300,000. The maximum output token limit of GPT-4o is 16,384 vs. the Amazon Nova Pro maximum output token limit of 5,000. The benchmarking experiments were conducted without Amazon Bedrock Guardrails functionality. The implementation used the universal gateway provided by the FloTorch enterprise version to enable consistent API calls using the same function and to track token count and latency metrics uniformly. The inference function code is as follows:

def generate_responses(dataset_path: str, model_name: str, batch_size: int, api_endpoint: str, auth_header: str,
                        max_tokens: int, search_k: int, system_prompt: str):
   """Generate response for queries."""
   results = []

   for batch in tqdm(load_data_in_batches(dataset_path, batch_size), desc="Generating responses"):
       interaction_ids = [item["interaction_id"] for item in batch]
       queries = [item["query"] for item in batch]
       search_results_list = search_results(interaction_ids, queries, search_k)

       for i, item in enumerate(batch):
           item["search_results"] = search_results_list[i]

       responses = send_batch_request(batch, model_name, api_endpoint, auth_header, max_tokens, system_prompt)

       for i, response in enumerate(responses):
           results.append({
               "interaction_id": interaction_ids[i],
               "query": queries[i],
               "prediction": response.get("choices", [{}])[0].get("message", {}).get("content") if response else None,
               "response_time": response.get("response_time") if response else None,
               "response": response,
           })

   return results

Evaluation

Both models were evaluated by running batch queries. A batch of eight was selected to comply with Amazon Bedrock quota limits as well as GPT-4o rate limits. The query function code is as follows:

def send_batch_request(batch: List[Dict], model_name: str, api_endpoint: str, auth_header: str, max_tokens: int,
                      system_prompt: str):
   """Send batch queries to the API."""
   headers = {"Authorization": auth_header, "Content-Type": "application/json"}
   responses = []

   for item in batch:
       query = item["query"]
       query_time = item["query_time"]
       retrieval_results = item.get("search_results", [])

       references = "# References n" + "n".join(
           [f"Reference {_idx + 1}:n{res['text']}n" for _idx, res in enumerate(retrieval_results)])
       user_message = f"{references}n------nnUsing only the references listed above, answer the following question:nQuestion: {query}n"

       payload = {
           "model": model_name,
           "messages": [{"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_message}],
           "max_tokens": max_tokens,
       }

       try:
           start_time = time.time()
           response = requests.post(api_endpoint, headers=headers, json=payload, timeout=25000)
           response.raise_for_status()
           response_json = response.json()
           response_json['response_time'] = time.time() - start_time
           responses.append(response_json)
       except requests.RequestException as e:
           logger.error(f"API request failed for query: {query}. Error: {e}")
           responses.append(None)

   return responses

Benchmarking on the CRAG dataset

In this section, we discuss the latency, accuracy, and cost measurements of benchmarking on the CRAG dataset.

Latency

Latency measurements for each query response were calculated as the difference between two timestamps: the timestamp when the API call is made to the inference LLM, and a second timestamp when the entire response is received from the inference endpoint. The difference between these two timestamps determines the latency. A lower latency indicates a faster-performing LLM, making it suitable for applications requiring rapid response times. The study indicates that latency can be further reduced for both models through optimizations and caching techniques; however, the evaluation focused on measuring out-of-the-box latency performance for both models.

Accuracy

FloTorch used a modified version of the local_evaluation.py script provided with the CRAG benchmark for accuracy evaluations. The script was enhanced to provide proper categorization of correct, incorrect, and missing responses. The default GPT-4o evaluation LLM in the evaluation script was replaced with the mixtral-8x7b-instruct-v0:1 model API. Additional modifications to the script enabled monitoring of input and output tokens and latency as described earlier.

Cost

Cost calculations were straightforward because both Amazon Nova Pro and GPT-4o have published price per million input and output tokens separately. The calculation methodology involved multiplying input tokens by corresponding rates and applying the same process for output tokens. The total cost for running 200 queries was determined by combining input token and output token costs. OpenSearch Service provisioned cluster costs were excluded from this analysis because the cost comparison focused solely on the inference level between Amazon Nova Pro and GPT-4o LLMs.

Results

The following table summarizes the results.

.	Amazon Nova Pro	GPT-4o	Observation
Accuracy on subset of the CRAG dataset	51.50% (103 correct responses out of 200)	53.00% (106 correct responses out of 200)	GPT-4o outperforms Amazon Nova Pro by 1.5% on accuracy
Cost for running inference for 200 queries	$0.00030205	$0.000869537	Amazon Nova Pro saves 65.26% in costs compared to GPT-4o
Average latency (seconds)	1.682539835	2.15615045	Amazon Nova Pro is 21.97% faster than GPT-4o
Average of input and output tokens	1946.621359	1782.707547	Typical GPT-4o responses are shorter than Amazon Nova responses

For simple queries, Amazon Nova Pro and GPT-4o have similar accuracies (55 and 56 correct responses, respectively) but for simple queries with conditions, GPT-4o performs slightly better than Amazon Nova Pro (50 vs. 48 correct answers). Imagine you are part of an organization running an AI assistant service that handles 1,000 questions per month from 10,000 users (10,000,000 queries per month). Amazon Nova Pro will save your organization $5,674.88 per month ($68,098 per year) compared to GPT-4o.

Let’s look at similar results for Amazon Nova Micro, Amazon Nova Lite, and GPT-4o mini models on the same dataset.

	Amazon Nova Lite	Nove Micro	GPT-4o mini	Observation
Accuracy on subset of the CRAG dataset	52.00% (104 correct responses out of 200)	54.00% (108 correct responses out of 200)	50.00% (100 correct responses out of 200)	Both Amazon Nova Lite and Amazon Nova Micro outperform GPT-4o mini by 2 and 4 points, respectively
Cost for running inference for 200 queries	$0.00002247 (56.59% cheaper than GPT-4o mini)	$0.000013924 (73.10% cheaper than GPT-4o mini)	$0.000051768	Amazon Nova Lite and Amazon Nova Micro are cheaper than GPT-4o mini by 56.59% and 73.10%, respectively
Average latency (seconds)	1.553371465 (26.60% faster than GPT-4o mini)	1.6828564 (20.48% faster than GPT-4o mini)	2.116291895	Amazon Nova models are at least 20% faster than GPT-4o mini
Average of input and output tokens	1930.980769	1940.166667	1789.54	GPT-4o mini returns shorter answers

Amazon Nova Micro is significantly faster and less expensive compared to GPT-4o mini while providing more accurate answers. If you are running a service that handles about 10 million queries each month, it will save you on average 73% of what you will be paying for slightly less accurate results from the GPT-4o mini model.

Conclusion

Based on these tests for RAG cases, Amazon Nova models produce comparable or higher accuracy at significantly lower cost and latency compared to GPT-4o and GPT-4o mini models. FloTorch is continuing further experimentation with other relevant LLMs for comparison. Future research will include additional experiments with various query types such as comparison, aggregation, set, false_premise, post-processing, and multi-hop queries.

Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.

About FloTorch

FloTorch.ai is helping enterprise customers design and manage agentic workflows in a secure and scalable manner. FloTorch’s mission is to help enterprises make data-driven decisions in the end-to-end generative AI pipeline, including but not limited to model selection, vector database selection, and evaluation strategies. FloTorch offers an open source version for customers with scalable experimentation with different chunking, embedding, retrieval, and inference strategies. The open source version works on a customer’s AWS account so you can experiment on your AWS account with your proprietary data. Interested users are invited to try out FloTorch from AWS Marketplace or from GitHub. FloTorch also offers an enterprise version of this product for scalable experimentation with LLM models and vector databases on cloud platforms. The enterprise version also includes a universal gateway with model registry to custom define new LLMs and recommendation engine to suggest ew LLMs and agent workflows. For more information, contact us at info@flotorch.ai.

About the author

Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.

Dr. Hemant Joshi has over 20 years of industry experience building products and services with AI/ML technologies. As CTO of FloTorch, Hemant is engaged with customers to implement State of the Art GenAI solutions and agentic workflows for enterprises.

Deploy DeepSeek-R1 distilled models on Amazon SageMaker using a Large Model Inference container

DeepSeek-R1 is a large language model (LLM) developed by DeepSeek AI that uses reinforcement learning to enhance reasoning capabilities through a multi-stage training process from a DeepSeek-V3-Base foundation. A key distinguishing feature is its reinforcement learning step, which was used to refine the model’s responses beyond the standard pre-training and fine-tuning process. By incorporating RL, DeepSeek-R1 can adapt more effectively to user feedback and objectives, ultimately enhancing both relevance and clarity. In addition, DeepSeek-R1 employs a chain-of-thought (CoT) approach, meaning it’s equipped to break down complex queries and reason through them in a step-by-step manner. This guided reasoning process allows the model to produce more accurate, transparent, and detailed answers. This model combines RL-based fine-tuning with CoT capabilities, aiming to generate structured responses while focusing on interpretability and user interaction. With its wide-ranging capabilities, DeepSeek-R1 has captured the industry’s attention as a versatile text-generation model that can be integrated into various workflows such as agents, logical reasoning, and data interpretation tasks.

DeepSeek-R1 uses a Mixture of Experts (MoE) architecture and is 671 billion parameters in size. The MoE architecture allows activation of 37 billion parameters, enabling efficient inference by routing queries to the most relevant expert clusters. This approach allows the model to specialize in different problem domains while maintaining overall efficiency.

DeepSeek-R1 distilled models bring the reasoning capabilities of the main R1 model to more efficient architectures based on popular open models like Meta’s Llama (8B and 70B) and Hugging Face’s Qwen (1.5B, 7B, 14B, and 32B). Distillation refers to a process of training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model, using it as a teacher model. For example, DeepSeek-R1-Distill-Llama-8B offers an excellent balance of performance and efficiency. By integrating this model with Amazon SageMaker AI, you can benefit from the AWS scalable infrastructure while maintaining high-quality language model capabilities.

In this post, we show how to use the distilled models in SageMaker AI, which offers several options to deploy the distilled versions of the R1 model.

Solution overview

You can use DeepSeek’s distilled models within the AWS managed machine learning (ML) infrastructure. We demonstrate how to deploy these models on SageMaker AI inference endpoints.

SageMaker AI offers a choice of which serving container to use for deployments:

LMI container – A Large Model Inference (LMI) container with different backends (vLLM, TensortRT-LLM, and Neuron). See the following GitHub repo for more details.
TGI container – A Hugging Face Text Generation Interface (TGI) container. You can find more details in the following GitHub repo.

In the following code snippets, we use the LMI container example. See the following GitHub repo for more deployment examples using TGI, TensorRT-LLM, and Neuron.

LMI containers

LMI containers are a set of high-performance Docker containers purpose built for LLM inference. With these containers, you can use high-performance open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on SageMaker endpoints. These containers bundle together a model server with open source inference libraries to deliver an all-in-one LLM serving solution.

LMI containers provide many features, including:

Optimized inference performance for popular model architectures like Meta Llama, Mistral, Falcon, and more
Integration with open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX
Continuous batching for maximizing throughput at high concurrency
Token streaming
Quantization through AWQ, GPTQ, FP8, and more
Multi-GPU inference using tensor parallelism
Serving LoRA fine-tuned models
Text embedding to convert text data into numeric vectors
Speculative decoding support to decrease latency

LMI containers provide these features through integrations with popular inference libraries. A unified configuration format enables you to use the latest optimizations and technologies across libraries. To learn more about the LMI components, see Components of LMI.

Prerequisites

To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created. For details, refer to Create an AWS account.

If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you might need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, you host the base model and multiple adapters on the same SageMaker endpoint, so you will use an ml.g5.2xlarge SageMaker hosting instance.

Deploy DeepSeek-R1 for inference

The following is a step-by-step example that demonstrates how to programmatically deploy DeepSeek-R1-Distill-Llama-8B for inference. The code for deploying the model is provided in the GitHub repo. You can clone the repo and run the notebook from SageMaker AI Studio.

Configure the SageMaker execution role and import the necessary libraries:

!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2

import json
import boto3
import sagemaker

# Set up IAM Role
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

There are two ways to deploy an LLM like DeepSeek-R1 or its distilled variants on SageMaker:

Deploy uncompressed model weights from an Amazon S3 bucket – In this scenario, you need to set the HF_MODEL_ID variable to the Amazon Simple Storage Service (Amazon S3) prefix that has model artifacts. This method is generally much faster, with the model typically downloading in just a couple of minutes from Amazon S3.
Deploy directly from Hugging Face Hub (requires internet access) – To do this, set HF_MODEL_ID to the Hugging Face repository or model ID (for example, “deepseek-ai/DeepSeek-R1-Distill-Llama-8B”). However, this method tends to be slower and can take significantly longer to download the model compared to using Amazon S3. This approach will not work if enable_network_isolation is enabled, because it requires internet access to retrieve model artifacts from the Hugging Face Hub.

In this example, we deploy the model directly from the Hugging Face Hub:

vllm_config = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
}

The OPTION_MAX_ROLLING_BATCH_SIZE parameter limits number of concurrent requests that can be processed by the endpoint. We set it to 16 to limit GPU memory requirements. You should adjust it based on your latency and throughput requirements.

Create and deploy the model:

# Create a Model object
lmi_model = sagemaker.Model(
    image_uri = inference_image_uri,
    env = vllm_config,
    role = role,
    name = model_name,
    enable_network_isolation=True, # Ensures model is isolated from the internet
    vpc_config={
        "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
        "SecurityGroupIds": ["sg-zzzzzzzz"]
    }
)
# Deploy to SageMaker
lmi_model.deploy(
    initial_instance_count = 1,
    instance_type = "ml.g5.2xlarge",
    container_startup_health_check_timeout = 1600,
    endpoint_name = endpoint_name,
)

Make inference requests:

sagemaker_client = boto3.client('sagemaker-runtime', region_name='us-east-1')
endpoint_name = predictor.endpoint_name

input_payload = {
    "inputs": "What is Amazon SageMaker? Answer concisely.",
    "parameters": {"max_new_tokens": 250, "temperature": 0.1}
}

serialized_payload = json.dumps(input_payload)

response = sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=serialized_payload
)

Performance and cost considerations

The ml.g5.2xlarge instance provides a good balance of performance and cost. For large-scale inference, use larger batch sizes for real-time inference to optimize cost and performance. You can also use batch transform for offline, large-volume inference to reduce costs. Monitor endpoint usage to optimize costs.

Clean up

Clean up your resources when they’re no longer needed:

predictor.delete_endpoint()

Security

You can configure advanced security and infrastructure settings for the DeepSeek-R1 model, including virtual private cloud (VPC) networking, service role permissions, encryption settings, and EnableNetworkIsolation to restrict internet access. For production deployments, it’s essential to review these settings to maintain alignment with your organization’s security and compliance requirements.

By default, the model runs in a shared AWS managed VPC with internet access. To enhance security and control access, you should explicitly configure a private VPC with appropriate security groups and IAM policies based on your requirements.

SageMaker AI provides enterprise-grade security features to help keep your data and applications secure and private. We do not share your data with model providers, unless you direct us to, providing you full control over your data. This applies to all models—both proprietary and publicly available, including DeepSeek-R1 on SageMaker.

For more details, see Configure security in Amazon SageMaker AI.

Logging and monitoring

You can monitor SageMaker AI using Amazon CloudWatch, which collects and processes raw data into readable, near real-time metrics. These metrics are retained for 15 months, allowing you to analyze historical trends and gain deeper insights into your application’s performance and health.

Additionally, you can configure alarms to monitor specific thresholds and trigger notifications or automated actions when those thresholds are met, helping you proactively manage your deployment.

For more details, see Metrics for monitoring Amazon SageMaker AI with Amazon CloudWatch.

Best practices

It’s always recommended to deploy your LLMs endpoints inside your VPC and behind a private subnet, without internet gateways, and preferably with no egress. Ingress from the internet should also be blocked to minimize security risks.

Always apply guardrails to make sure incoming and outgoing model responses are validated for safety, bias, and toxicity. You can guard your SageMaker endpoints model responses with Amazon Bedrock Guardrails. See DeepSeek-R1 model now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart for more details.

Inference performance evaluation

In this section, we focus on inference performance of DeepSeek-R1 distilled variants on SageMaker AI. Evaluating the performance of LLMs in terms of end-to-end latency, throughput, and resource efficiency is crucial for providing responsiveness, scalability, and cost-effectiveness in real-world applications. Optimizing these metrics directly impacts user experience, system reliability, and deployment feasibility at scale. For this post, we test all DeepSeek-R1 distilled variants—1.5B, 7B, 8B, 14B, 32B, and 70B—along four performance metrics:

End-to-end latency (time between sending a request and receiving the response)
Throughput tokens
Time to first token
Inter-token latency

The main purpose of this performance evaluation is to give you an indication about relative performance of distilled R1 models on different hardware for generic traffic patterns. We didn’t try to optimize the performance for each model/hardware/use case combination. These results should not be treated like a best possible performance of a particular model on a particular instance type. You should always perform your own testing using your own datasets and traffic patterns as well as I/O sequence length.

Scenarios

We tested the following scenarios:

Container/model configuration – We used LMI container v14 with default parameters, except MAX_MODEL_LEN, which was set to 10000 (no chunked prefix and no prefix caching). On instances with multiple accelerators, we sharded the model across all available GPUs.
Tokens – We evaluated SageMaker endpoint hosted DeepSeek-R1 distilled variants on performance benchmarks using two sample input token lengths. We ran both tests 50 times each before measuring the average across the different metrics. Then we repeated the test with concurrency 10.
- Short-length test – 512 input tokens and 256 output tokens.
- Medium-length test – 3072 input tokens and 256 output tokens.
Hardware – We tested the distilled variants on a variety of instance types ranging from 1, 4, or 8 GPUs per instance. In the following table, a green cell indicates that a model was tested on that particular instance type, and red indicates that a model wasn’t tested with that instance type, either because the instance was excessive for a given model size or too small to fit the model in memory.

Box plots

In the following sections, we use a box plot to visualize model performance. A box is a concise visual summary that displays a dataset’s median, interquartile range (IQR), and potential outliers using a box for the middle 50% of the data, with whiskers extending to the smallest and largest non-outlier values. By examining the median’s placement within the box, the box’s size, and the whiskers’ lengths, you can quickly assess the data’s central tendency, variability, and skewness, as illustrated in the following figure.

DeepSeek-R1-Distill-Qwen-1.5B

This model can be deployed on a single GPU instance. The results indicate that the ml.g5.xlarge instance outperforms the ml.g6.xlarge instance across all measured performance criteria and concurrency settings.