This post was written by Claudiu Bota, Oleg Yurchenko, and Vladyslav Melnyk of AWS Partner Automat-it.
As organizations adopt AI and machine learning (ML), they’re using these technologies to improve processes and enhance products. AI use cases include video analytics, market predictions, fraud detection, and natural language processing, all of which rely on models that analyze data efficiently. Although these models achieve impressive accuracy with low latency, they often demand significant computational resources, including GPUs, to run inference. Therefore, maintaining the right balance between performance and cost is essential, especially when deploying models at scale.
One of our customers encountered this exact challenge. To address it, they engaged Automat-it, an AWS Premier Tier Partner, to design and implement their platform on AWS, specifically using Amazon Elastic Kubernetes Service (Amazon EKS). Automat-it specializes in helping startups and scaleups grow through hands-on cloud DevOps, MLOps, and FinOps services. The collaboration aimed to achieve scalability and performance while optimizing costs. Their platform requires highly accurate models with low latency, and the costs for such demanding tasks escalate quickly without proper optimization.
In this post, we explain how Automat-it helped this customer achieve a more than twelvefold cost saving while keeping AI model performance within the required thresholds. This was accomplished through careful tuning of the architecture, algorithm selection, and infrastructure management.
Customer challenge
Our customer specializes in developing AI models for video intelligence solutions using YOLOv8 and the Ultralytics library. An end-to-end YOLOv8 deployment consists of three stages, illustrated in the sketch after the list:
- Preprocessing – Prepares raw video frames through resizing, normalization, and format conversion
- Inference – The YOLOv8 model generates predictions by detecting and classifying objects in the prepared video frames
- Postprocessing – Predictions are refined using techniques such as non-maximum suppression (NMS), filtering, and output formatting
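As a rough illustration, the following Python sketch shows how these three stages map onto the Ultralytics API. The frame source, input size, and thresholds are assumptions for illustration, not the customer’s production code.

```python
# Minimal sketch of the three stages with Ultralytics YOLOv8. The input size,
# confidence/IoU thresholds, and frame handling are illustrative assumptions.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # YOLOv8n weights, as used in the tests below

def process_frame(frame):
    # Preprocessing: resize the raw frame and convert BGR -> RGB
    resized = cv2.resize(frame, (640, 640))
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)

    # Inference: detect and classify objects in the prepared frame
    # (Ultralytics also applies its own letterboxing and NMS internally)
    results = model.predict(rgb, conf=0.25, iou=0.45, verbose=False)

    # Postprocessing: keep only the fields downstream consumers need
    detections = []
    for box in results[0].boxes:
        detections.append({
            "class_id": int(box.cls),
            "confidence": float(box.conf),
            "xyxy": box.xyxy[0].tolist(),
        })
    return detections
```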
They provide their clients with models that analyze live video streams and extract valuable insights from the captured frames, each customized to a specific use case. Initially, the solution required each model to run on a dedicated GPU at runtime, which meant provisioning GPU instances for every customer. This setup led to underutilized GPU resources and elevated operational costs.
Therefore, our primary objective was to optimize GPU utilization while lowering overall platform costs and keeping data processing time as low as possible. Specifically, we aimed to limit AWS infrastructure costs to $30 per camera per month while keeping the total processing time (preprocessing, inference, and postprocessing) under 500 milliseconds. Achieving these savings without degrading model performance, particularly inference latency, was essential to providing the desired level of service for each customer.
Initial approach
Our initial approach followed a client-server architecture, splitting the YOLOv8 end-to-end deployment into two components. The client component, running on CPU instances, handled the preprocessing and postprocessing stages. Meanwhile, the server component, running on GPU instances, was dedicated to inference and responded to requests from the client. This functionality was implemented using a custom gRPC wrapper, providing efficient communication between the components.
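The customer’s wrapper itself isn’t reproduced here, but the following minimal Python sketch conveys the idea of the split, assuming a generic gRPC service with a single Predict method, pickle-based serialization, and a placeholder inference body; the actual service and message definitions may differ.

```python
# Minimal sketch of the client-server split, not the customer's actual
# wrapper: the service name, method name, port, pickle serialization, and
# the placeholder inference body are all assumptions.
import pickle
from concurrent import futures

import grpc

# --- Server component (GPU instance): inference only ---
def predict(preprocessed_frame, context):
    # Placeholder: the YOLOv8 forward pass would run here on the GPU
    return {"boxes": [], "input_shape": getattr(preprocessed_frame, "shape", None)}

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    handler = grpc.method_handlers_generic_handler(
        "Inference",
        {
            "Predict": grpc.unary_unary_rpc_method_handler(
                predict,
                request_deserializer=pickle.loads,
                response_serializer=pickle.dumps,
            )
        },
    )
    server.add_generic_rpc_handlers((handler,))
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()

# --- Client component (CPU instance): calls the server for inference only,
# --- running preprocessing before and postprocessing after this call
def remote_inference(preprocessed_frame, server_address="inference-server:50051"):
    channel = grpc.insecure_channel(server_address)
    predict_rpc = channel.unary_unary(
        "/Inference/Predict",
        request_serializer=pickle.dumps,
        response_deserializer=pickle.loads,
    )
    return predict_rpc(preprocessed_frame)
```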
The goal of this approach was to reduce costs by using GPUs exclusively for the inference stage rather than for the entire end-to-end deployment. Additionally, we assumed that client-server communication latency would have a minimal impact on the overall inference time. To assess the effectiveness of this architecture, we conducted performance tests using the following baseline parameters:
- Inference was performed on g4dn.xlarge GPU-based instances, because the customer’s models were optimized to run on NVIDIA T4 GPUs
- The customer’s models used the YOLOv8n model with Ultralytics version 8.2.71
The results were evaluated based on the following key performance indicators (KPIs); a brief sketch of how such per-stage timings can be captured follows the list:
- Preprocessing time – The amount of time required to prepare the input data for the model
- Inference time – The duration taken by the YOLOv8 model to process the input and produce results
- Postprocessing time – The time needed to finalize and format the model’s output for use
- Network communication time – The duration of communication between the client component running on CPU instances and the server component running on GPU instances
- Total time – The overall duration from when an image is sent to the YOLOv8 model until results are received, including all processing stages
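Timings like these can be collected with simple wall-clock measurements around each stage. The following is a minimal sketch rather than the harness we used; the stage functions are passed in as placeholders.

```python
# Minimal sketch of per-stage KPI measurement; the stage functions are
# passed in as placeholders rather than the customer's implementation.
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000

def measure_pipeline(frame, preprocess, infer, postprocess):
    timings = {}
    prepared, timings["preprocess_ms"] = timed(preprocess, frame)
    # In the client-server setup, timing the remote call on the client side
    # captures inference plus network communication time together.
    predictions, timings["inference_ms"] = timed(infer, prepared)
    _, timings["postprocess_ms"] = timed(postprocess, predictions)
    timings["total_ms"] = sum(timings.values())
    return timings
```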
The findings were as follows:
| | Preprocess (ms) | Inference (ms) | Postprocess (ms) | Network communication (ms) | Total (ms) |
|---|---|---|---|---|---|
| Custom gRPC | 2.7 | 7.9 | 1.1 | 10.26 | 21.96 |
The GPU-based instance completed inference in 7.9 ms. However, the network communication overhead of 10.26 ms increased the total processing time. Although the total processing time was acceptable, each model required a dedicated GPU-based instance to run, resulting in unacceptable costs for the customer. Specifically, the inference cost per camera was $353.03 monthly, exceeding the customer’s budget.
Finding a better solution
Although the performance results were promising, even with the added latency from network communication, costs per camera were still too high, so our solution needed further optimization. Additionally, the custom gRPC wrapper lacked an automatic scaling mechanism to accommodate the addition of new models and required ongoing maintenance, adding to its operational complexity.
To address these challenges, we moved away from the client-server approach and implemented GPU time-slicing (fractionalization), which involves dividing GPU access into discrete time intervals. This approach allows AI models to share a single GPU, each utilizing a virtual GPU during its assigned slice. It is similar to CPU time-slicing between processes, optimizing resource allocation without degrading performance. This approach was inspired by several AWS blog posts listed in the references section.
We implemented GPU time-slicing in the EKS cluster by using the NVIDIA Kubernetes device plugin. This allowed us to use Kubernetes-native scaling mechanisms, simplifying the scaling process to accommodate new models and reducing operational overhead. Moreover, by relying on the plugin, we avoided the need to maintain custom code, streamlining both implementation and long-term maintenance.
In this setup, the GPU instance was configured to be split into 60 time-sliced virtual GPUs. We used the same KPIs as in the previous setup to measure efficiency and performance under these optimized conditions, making sure that cost reduction aligned with our service quality benchmarks.
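For reference, time-slicing in the NVIDIA Kubernetes device plugin is driven by a small configuration that sets the number of virtual GPU replicas per physical GPU. The following sketch applies such a configuration as a ConfigMap using the Kubernetes Python client; the namespace and ConfigMap name are assumptions, and the device plugin (for example, through its Helm chart) must be set up to read it.

```python
# Hedged sketch: the time-slicing configuration consumed by the NVIDIA
# Kubernetes device plugin, applied here as a ConfigMap with the Kubernetes
# Python client. The namespace and ConfigMap name are assumptions.
import yaml
from kubernetes import client, config

# 60 virtual GPUs advertised per physical GPU, matching the setup above
TIME_SLICING_CONFIG = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                {"name": "nvidia.com/gpu", "replicas": 60},
            ]
        }
    },
}

config.load_kube_config()
core_v1 = client.CoreV1Api()
core_v1.create_namespaced_config_map(
    namespace="kube-system",
    body=client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="nvidia-device-plugin-config"),
        data={"config.yaml": yaml.safe_dump(TIME_SLICING_CONFIG)},
    ),
)
# Each inference pod still requests a whole unit of nvidia.com/gpu; with
# time-slicing enabled, that unit maps to one of the 60 virtual GPUs.
```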
We conducted the tests in three stages, as described in the following sections.
Stage 1
In this stage, we ran one pod on a g4dn.xlarge GPU-based instance. Each pod runs the three phases of the end-to-end YOLOv8 deployment on the GPU and processes video frames from a single camera. The findings are shown in the following graph and table.
| | Preprocess (ms) | Inference (ms) | Postprocess (ms) | Total (ms) |
|---|---|---|---|---|
| 1 pod | 2 | 7.8 | 1 | 10.8 |
We successfully achieved an inference time of 7.8 ms and a total processing time of 10.8 ms, which aligned with the project’s requirements. The GPU memory usage for a single pod was 247 MiB, and the GPU processor utilization was 12%. The memory usage per pod indicated that we could run approximately 60 processes (or pods) on a 16 GiB GPU.
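GPU memory and utilization figures like these can be read with NVIDIA’s management library; the following is a minimal sketch using the nvidia-ml-py (pynvml) bindings, although tools such as nvidia-smi or DCGM exporters report the same metrics.

```python
# Minimal sketch of reading GPU memory and utilization figures like the ones
# quoted here, using the nvidia-ml-py (pynvml) bindings.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the single GPU on a g4dn instance

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)

used_mib = mem.used / (1024 ** 2)
total_mib = mem.total / (1024 ** 2)
print(f"GPU memory: {used_mib:.0f}/{total_mib:.0f} MiB, utilization: {util.gpu}%")

# Rough capacity estimate from this stage: at ~247 MiB per pod, a 16 GiB GPU
# fits on the order of 16384 / 247, i.e. roughly 60 pods.
pynvml.nvmlShutdown()
```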
Stage 2
In this stage, we ran 20 pods on a g4dn.2xlarge GPU-based instance. We changed the instance type from g4dn.xlarge to g4dn.2xlarge due to CPU overload associated with data processing and loading. The findings are shown in the following graph and table.
| | Preprocess (ms) | Inference (ms) | Postprocess (ms) | Total (ms) |
|---|---|---|---|---|
| 20 pods | 11 | 42 | 55 | 108 |
At this stage, GPU memory usage reached 7,244 MiB, with GPU processor utilization peaking between 95% and 99%. The 20 pods used roughly half of the GPU’s 16 GiB of memory and fully consumed the GPU processor, leading to increased processing times. Although both inference and total processing times rose, this outcome was anticipated and deemed acceptable. The next objective was to determine the maximum number of pods the GPU could support given its memory capacity.
Stage 3
In this stage, we aimed to run 60 pods on a g4dn.2xlarge GPU-based instance to maximize GPU memory utilization. However, data processing and loading overloaded the instance’s CPU, which prompted us to switch to instance types that still had a single GPU but offered more vCPUs. We therefore changed the instance type from g4dn.2xlarge to g4dn.4xlarge and then to g4dn.8xlarge.
The findings are shown in the following graph and table.
| | Preprocess (ms) | Inference (ms) | Postprocess (ms) | Total (ms) |
|---|---|---|---|---|
| 54 pods | 21 | 56 | 128 | 205 |
The GPU memory usage was 14,780 MiB, and the GPU processor utilization was 99–100%. Despite these adjustments, we encountered GPU out-of-memory errors that prevented us from scheduling all 60 pods. Ultimately, we could accommodate 54 pods, representing the maximum number of AI models that could fit on a single GPU.
In this scenario, the GPU inference cost was $27.81 per camera per month, a twelvefold reduction compared to the initial approach. By adopting this approach, we successfully met the customer’s per-camera monthly cost requirement while maintaining acceptable performance levels.
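As a sanity check on figures like this, the per-camera GPU cost is simply the instance’s monthly cost divided by the number of cameras (pods) it serves. The following sketch uses placeholder values rather than actual AWS rates.

```python
# Hedged sketch of the per-camera cost arithmetic. The hourly price is a
# placeholder, not an actual AWS rate; check current EC2 pricing for g4dn.
HOURS_PER_MONTH = 730  # common approximation of hours in a month

def monthly_cost_per_camera(instance_hourly_usd: float, cameras_per_instance: int) -> float:
    return instance_hourly_usd * HOURS_PER_MONTH / cameras_per_instance

# Example with placeholder inputs: one GPU instance shared by 54 cameras
print(monthly_cost_per_camera(instance_hourly_usd=2.0, cameras_per_instance=54))
```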
Conclusion
In this post, we explored how Automat-it helped one of our customers achieve a twelvefold cost reduction while keeping the performance of their YOLOv8-based AI models within acceptable ranges. The test results demonstrate that GPU time-slicing enables many AI models to run efficiently on a single GPU, significantly reducing costs while maintaining high performance. Furthermore, this method requires minimal maintenance and few modifications to the model code, enhancing scalability and ease of use.
References
To learn more, refer to the following resources:
AWS
- GPU sharing on Amazon EKS with NVIDIA time-slicing and accelerated EC2 instances
- Delivering video content with fractional GPUs in containers on Amazon EKS
Disclaimer
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
About the authors
Claudiu Bota is a Senior Solutions Architect at Automat-it, helping customers across the entire EMEA region migrate to AWS and optimize their workloads. He specializes in containers, serverless technologies, and microservices, focusing on building scalable and efficient cloud solutions. Outside of work, Claudiu enjoys reading, traveling, and playing chess.
Oleg Yurchenko is the DevOps Director at Automat-it, where he spearheads the company’s expertise in DevOps best practices and solutions. His focus areas include containers, Kubernetes, serverless, Infrastructure as Code, and CI/CD. With over 20 years of hands-on experience in system administration, DevOps, and cloud technologies, Oleg is a passionate advocate for his customers, guiding them in building modern, scalable, and cost-effective cloud solutions.
Vladyslav Melnyk is a Senior MLOps Engineer at Automat-it. He is a seasoned deep learning enthusiast with a passion for artificial intelligence, guiding AI products through their lifecycle from experimentation to production. With over 9 years of experience in AI within AWS environments, he is also a big fan of leveraging cool open-source tools. Result-oriented and ambitious, with a strong focus on MLOps, Vladyslav ensures smooth transitions and efficient model deployment. He is skilled in delivering deep learning models, always learning and adapting to stay ahead in the field.