Amazon Search’s vision is to enable customers to search effortlessly. Our spelling correction helps you find what you want even if you don’t know the exact spelling of the intended words. In the past, we used classical machine learning (ML) algorithms with manual feature engineering for spelling correction. To make the next generational leap in spelling correction performance, we are embracing a number of deep learning approaches, including sequence-to-sequence models. Deep learning (DL) models are compute-intensive in both training and inference, and these costs have historically made DL models impractical in a production setting at Amazon’s scale. In this post, we present the results of an inference optimization experiment in which we overcome those obstacles and achieve a 534% inference speedup for the popular Hugging Face T5 Transformer.
Challenge
The Text-to-Text Transfer Transformer (T5, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Raffel et al.) is a state-of-the-art natural language processing (NLP) model architecture. T5 is a promising architecture for spelling correction, which we found to perform well in our experiments. T5 models are easy to research, develop, and train, thanks to open-source deep learning frameworks and ongoing academic and enterprise research.
However, it’s difficult to achieve production-grade, low-latency inference with T5. For example, a single inference with a PyTorch T5 takes 45 milliseconds on one of the four NVIDIA V100 Tensor Core GPUs that equip an Amazon Elastic Compute Cloud (Amazon EC2) p3.8xlarge instance. (All inference numbers reported are for an input of 9 tokens and output of 11 tokens. The latency of T5 architectures is sensitive to both input and output lengths.)
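The following is a minimal sketch of how such a baseline latency measurement can be taken with the Hugging Face PyTorch API. It uses the public t5-small checkpoint and a made-up misspelled query as stand-ins; the Amazon Search model, prompt format, and exact inputs are not shown here, and absolute numbers depend on the GPU and on input and output lengths.

```python
import time

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Stand-ins for illustration: the public t5-small checkpoint and a dummy query
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval().cuda()

inputs = tokenizer("wireles hedphones for runing", return_tensors="pt").to("cuda")

with torch.no_grad():
    # Warm up so CUDA initialization and autotuning don't skew the timing
    for _ in range(10):
        model.generate(**inputs, max_length=16)
    torch.cuda.synchronize()

    start = time.perf_counter()
    model.generate(**inputs, max_length=16)
    torch.cuda.synchronize()
    print(f"Single-request latency: {(time.perf_counter() - start) * 1e3:.1f} ms")
```

Because T5 generation is autoregressive, each output token adds a decoder pass, which is why latency grows with output length.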
Low-latency, cost-efficient T5 inference at scale is a known challenge that has been reported by several AWS customers beyond Amazon Search, which boosts our motivation to contribute this post. To go from an offline, scientific achievement to a customer-facing production service, Amazon Search faces the following challenges:
- Latency – How to realize T5 inference in less than 50-millisecond P99 latency
- Throughput – How to handle large-scale concurrent inference requests
- Cost efficiency – How to keep costs under control
In the rest of this post, we explain how the NVIDIA inference optimization stack—namely the NVIDIA TensorRT compiler and the open source NVIDIA Triton Inference Server—solves those challenges. Read NVIDIA’s press release to learn about the updates.
NVIDIA TensorRT: Reducing costs and latency with inference optimization
Deep learning frameworks make it convenient to iterate quickly on the science, and come with numerous functionalities for scientific modeling, data loading, and training optimization. However, most of those tools are suboptimal for inference, which only requires a minimal set of operators for matrix multiplication and activation functions. Therefore, significant gains can be realized by using a specialized, prediction-only application instead of running inference in the deep learning development framework.
NVIDIA TensorRT is an SDK for high-performance deep learning inference. TensorRT delivers both an optimized runtime, using low-level optimized kernels available on NVIDIA GPUs, and an inference-only model graph, which rearranges inference computation in an optimized order.
In the following list, we describe what TensorRT does behind the scenes and how it speeds up performance; a minimal compilation sketch follows the list.
- Reduced Precision maximizes throughput with FP16 or INT8 by quantizing models while maintaining correctness.
- Layer and Tensor Fusion optimizes use of GPU memory and bandwidth by fusing nodes in a kernel to avoid kernel launch latency.
- Kernel Auto-Tuning selects best data layers and algorithms based on the target GPU platform and data kernel shapes.
- Dynamic Tensor Memory minimizes memory footprint by freeing unnecessary memory consumption of intermediate results and reuses memory for tensors efficiently.
- Multi-Stream Execution uses a scalable design to process multiple input streams in parallel with dedicated CUDA streams.
- Time Fusion optimizes recurrent neural networks over time steps with dynamically generated kernels.
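As an illustration of the compilation step, the following is a minimal sketch of building a TensorRT engine from an ONNX export with the TensorRT 8.x Python API, with FP16 enabled. The file name t5_encoder.onnx is a hypothetical placeholder; in practice, T5 is typically split into separate encoder and decoder engines (as in NVIDIA’s T5 demo), dynamic input shapes additionally require an optimization profile, and the runtime wrappers for generation are beyond the scope of this sketch.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, fp16=True):
    """Compile an ONNX graph (for example, an exported T5 encoder) into a serialized TensorRT engine."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parsing failed: " + "; ".join(errors))

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB of scratch space for tactic selection
    if fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # enable Tensor Core mixed precision

    # Returns a serialized engine that can be written to disk and loaded by the TensorRT runtime
    return builder.build_serialized_network(network, config)

# Hypothetical usage: t5_encoder.onnx is a placeholder for your exported graph
with open("t5_encoder.plan", "wb") as f:
    f.write(build_engine("t5_encoder.onnx"))
```

For the full, supported T5 workflow, refer to the NVIDIA TensorRT demo linked later in this post.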
T5 uses transformer layers as building blocks for its architectures. The latest release of NVIDIA TensorRT 8.2 introduces new optimizations for the T5 and GPT-2 models for real-time inference. In the following table, we can see the speedup with TensorRT on some public T5 models running on Amazon EC2 G4dn instances, powered by NVIDIA T4 GPUs, and Amazon EC2 G5 instances, powered by NVIDIA A10G GPUs.
| Model | Instance | PyTorch FP32 encoder (ms) | PyTorch FP32 decoder (ms) | PyTorch FP32 end-to-end (ms) | TensorRT 8.2 FP32 encoder (ms) | TensorRT 8.2 FP32 decoder (ms) | TensorRT 8.2 FP32 end-to-end (ms) | TensorRT 8.2 FP16 encoder (ms) | TensorRT 8.2 FP16 decoder (ms) | TensorRT 8.2 FP16 end-to-end (ms) | FP32 end-to-end speedup vs. HF baseline | FP16 end-to-end speedup vs. HF baseline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| t5-small | g4dn.xlarge | 5.98 | 9.74 | 30.71 | 1.28 | 2.25 | 7.54 | 0.93 | 1.59 | 5.91 | 407.40% | 519.34% |
| t5-small | g5.xlarge | 4.63 | 7.56 | 24.22 | 0.61 | 1.05 | 3.99 | 0.47 | 0.80 | 3.19 | 606.66% | 760.01% |
| t5-base | g4dn.xlarge | 11.61 | 19.05 | 78.44 | 3.18 | 5.45 | 19.59 | 3.15 | 2.96 | 13.76 | 400.48% | 569.97% |
| t5-base | g5.xlarge | 8.59 | 14.23 | 59.98 | 1.55 | 2.47 | 11.32 | 1.54 | 1.65 | 8.46 | 530.05% | 709.20% |
For more information about optimizations and replication of the attached performance, refer to Optimizing T5 and GPT-2 for Real-Time Inference with NVIDIA TensorRT.
It is important to note that compilation preserves model accuracy: it operates on the inference environment and the computation scheduling, leaving the model science unaltered, unlike weight-removal compression techniques such as distillation or pruning. NVIDIA TensorRT also allows you to combine compilation with quantization for further gains. Quantization has a double benefit on recent NVIDIA hardware: it reduces memory usage, and it enables the use of NVIDIA Tensor Cores, DL-specialized units that run a fused matrix multiply-add in mixed precision.
In the case of the Amazon Search experiments with the Hugging Face T5 model, replacing PyTorch with TensorRT for model inference increased inference speed by 534%.
NVIDIA Triton: Low-latency, high-throughput inference serving
Modern model serving solutions can transform offline trained models into customer-facing ML-powered products. To maintain reasonable costs at such a scale, it’s important to keep serving overhead low (HTTP handling, preprocessing and postprocessing, CPU-GPU communication), and fully take advantage of the parallel processing ability of GPUs.
NVIDIA Triton is inference serving software that offers wide support for model runtimes (NVIDIA TensorRT, ONNX, PyTorch, and XGBoost, among others) and infrastructure backends, including GPUs, CPUs, and AWS Inferentia.
ML practitioners love Triton for multiple reasons. Its dynamic batching ability allows it to accumulate inference requests during a user-defined delay and up to a user-defined maximum batch size, so that GPU inference is batched and the CPU-GPU communication overhead is amortized. Note that dynamic batching happens server side and within very short time frames, so the requesting client still has a synchronous, near-real-time invocation experience. Triton users also enjoy its concurrent model execution capability. GPUs are powerful multitaskers that excel at executing compute-intensive workloads in parallel. Triton maximizes GPU utilization and throughput by using CUDA streams to run multiple model instances concurrently. These model instances can be different models from different frameworks for different use cases, or direct copies of the same model. This translates to a direct throughput improvement when you have enough idle GPU memory. Also, because Triton isn’t tied to a specific DL development framework, it allows scientists to fully express themselves in the tools of their choice.
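To make this concrete, the following is an illustrative Triton model configuration (config.pbtxt) for a TensorRT engine, with dynamic batching and two GPU instances of the model enabled. The model name, batch sizes, and queue delay are example values, not the settings used by Amazon Search.

```
name: "t5_spelling_trt"
platform: "tensorrt_plan"
max_batch_size: 32

# Accumulate requests for up to 100 microseconds to form larger GPU batches
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100
}

# Run two copies of the model concurrently on the same GPU via CUDA streams
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```

Clients still issue ordinary synchronous requests; Triton assembles and splits the batches transparently on the server side.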
With Triton on AWS, Amazon Search expects to better serve Amazon.com customers and meet latency requirements at low cost. The tight integration between the TensorRT runtime and the Triton server facilitates the development experience. Using AWS Cloud infrastructure allows scaling up or down in minutes based on throughput requirements, while maintaining a high bar for reliability and security.
How AWS lowers the barrier to entry
While Amazon Search conducted this experiment on Amazon EC2 infrastructure, other AWS services exist to facilitate the development, training and hosting of state-of-the-art deep learning solutions.
For example, AWS and NVIDIA have collaborated to release a managed implementation of Triton Inference Server in Amazon SageMaker; for more information, see Deploy fast and scalable AI with NVIDIA Triton Inference Server in Amazon SageMaker. AWS also collaborated with Hugging Face to develop a managed, optimized integration between Amazon SageMaker and Hugging Face Transformers, the open-source framework from which the Amazon Search T5 model is derived; read more at https://aws.amazon.com/machine-learning/hugging-face/.
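As an illustration of the managed path, the following is a minimal sketch of deploying a Hugging Face model artifact from Amazon S3 to a GPU-backed SageMaker real-time endpoint with the SageMaker Python SDK. The S3 path, IAM role, and framework versions are placeholder example values, not the setup used by Amazon Search; serving a TensorRT-compiled model would instead go through the Triton-on-SageMaker integration described in the linked post.

```python
from sagemaker.huggingface import HuggingFaceModel

# Placeholder values for illustration only
huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/t5-spelling/model.tar.gz",   # hypothetical model archive
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # hypothetical IAM role
    transformers_version="4.12",
    pytorch_version="1.9",
    py_version="py38",
)

# Deploy to a single GPU instance and run a real-time prediction
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)
print(predictor.predict({"inputs": "wireles hedphones for runing"}))
```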
We encourage customers with latency-sensitive CPU and GPU deep learning serving applications to consider NVIDIA TensorRT and Triton on AWS. Let us know what you build!
Passionate about deep learning and building deep learning-based solutions for Amazon Search? Check out our careers page.
About the Authors
RJ is an engineer on the Search M5 team, leading the efforts to build large-scale deep learning systems for training and inference. Outside of work, he explores different cuisines and plays racquet sports.
Hemant Pugaliya is an Applied Scientist at Search M5. He works on applying latest natural language processing and deep learning research to improve customer experience on Amazon shopping worldwide. His research interests include natural language processing and large-scale machine learning systems. Outside of work, he enjoys hiking, cooking and reading.
Andy Sun is a Software Engineer and Technical Lead for Search Spelling Correction. His research interests include optimizing deep learning inference latency, and building rapid experimentation platforms. Outside of work, he enjoys filmmaking, and acrobatics.
Le Cai is a Software Engineer at Amazon Search. He works on improving Search Spelling Correction performance to help customers with their shopping experience. He is focusing on high-performance online inference and distributed training optimization for deep learning models. Outside of work, he enjoys skiing, hiking, and cycling.
Anthony Ko is currently working as a software engineer at Search M5 in Palo Alto, CA. He works on building tools and products for model deployment and inference optimization. Outside of work, he enjoys cooking and playing racquet sports.
Olivier Cruchant is a Machine Learning Specialist Solutions Architect at AWS, based in France. Olivier helps AWS customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.
Anish Mohan is a Machine Learning Architect at NVIDIA and the technical lead for ML and DL engagements with its customers in the greater Seattle region.
Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.
Eliuth Triana is a Developer Relations Manager at NVIDIA. He connects Amazon and AWS product leaders, developers, and scientists with NVIDIA technologists and product leaders to accelerate Amazon ML/DL workloads, EC2 products, and AWS AI services. In addition, Eliuth is a passionate mountain biker, skier, and poker player.