This is a guest post co-written by Juan Francisco Fernandez, ML Engineer at Adevinta Spain, and AWS AI/ML Specialist Solutions Architects Antonio Rodriguez and João Moura.
InfoJobs, a subsidiary of the Adevinta group, matches candidates looking for their next job with employers looking for the best hire for their open positions. To do this, we use natural language processing (NLP) models such as BERT, built with PyTorch, to automatically extract relevant information from users’ CVs at the moment they upload them to our portal.
Given the complexity and variety of the fields we extract, inference with these NLP models can take several seconds when they are hosted on typical CPU-based instances, which hurts the user experience in the job listing web portal. Hosting the models on GPU-based instances instead can prove costly, making that option unfeasible for our business. We were therefore looking for a way to minimize prediction latency while keeping costs to a minimum.
To solve this challenge, we initially considered some possible solutions along two axes:
- Vertical scaling by using bigger general-purpose instances as well as GPU-powered instances.
- Optimizing our models using openly available techniques such as quantization or open tools such as ONNX.
Neither option, whether individually or combined, was able to provide the needed performance at an affordable cost. After benchmarking our full range of options with the help of AWS AI/ML Specialists, we found that compiling our PyTorch models with AWS Neuron and using AWS Inferentia to host them on Amazon SageMaker endpoints offered a reduction of up to 92% in prediction latency, at 75% lower cost when compared to our best initial alternatives. It was, in other words, like having the best of GPU power at CPU cost.
Amazon Comprehend is a plug-and-play managed NLP service that uses machine learning to automatically uncover valuable insights and connections in text. However, in this particular case, we wanted to use our own fine-tuned models for the task.
In this post, we share a summary of the benchmarks performed and an example of how to use AWS Inferentia with SageMaker to compile and host NLP models. We also describe how InfoJobs is using this solution to optimize the inference performance of NLP models, extracting key information from users’ CVs in a cost-efficient way.
Overview of solution
First, we had to evaluate the different options available on AWS to find the best balance between performance and cost to host our NLP models. The following diagram summarizes the most common alternatives for real-time inference, most of which were explored during our collaboration with AWS.
Hosting options benchmark on SageMaker
We started our tests with a publicly available pre-trained model from the Hugging Face model hub, bert-base-multilingual-uncased. This is the same base model used by InfoJobs’s CV key value extraction model. For this purpose, we deployed this model to a SageMaker endpoint using different combinations of instance types: CPU-based, GPU-based, or AWS Inferentia-based. We also explored optimization with Amazon SageMaker Neo and compilation with AWS Neuron where appropriate.
In this scenario, deploying our model to a SageMaker endpoint with an AWS Inferentia instance yielded 96% faster inference times compared to CPU instances and 44% faster inference times compared to GPU instances in the same range of cost and specs. This allows us to respond to 15 times more inferences than using CPU instances, or 4 times more inferences than using GPU instances at the same cost.
Based on the encouraging first results, our next step was to validate our tests on the actual model used by InfoJobs. This is a more complex model that requires PyTorch quantization for performance improvement, so we expected worse results compared to the previous standard case with bert-base-multilingual-uncased. The results of our tests for this model are summarized in the following table (based on public pricing in Region us-east-1 as of February 20, 2022).
| Category | Mode | Instance type example | p50 inference latency (ms) | TPS | Cost per hour (USD) | Inferences per hour | Cost per million inferences (USD) |
|---|---|---|---|---|---|---|---|
| CPU | Normal | m5.xlarge | 1400 | 2 | 0.23 | 5606 | 41.03 |
| CPU | Optimized | m5.xlarge | 1105 | 2 | 0.23 | 7105 | 32.37 |
| GPU | Normal | g4dn.xlarge | 800 | 18 | 0.736 | 64800 | 11.36 |
| GPU | Optimized | g4dn.xlarge | 700 | 21 | 0.736 | 75600 | 9.74 |
| AWS Inferentia | Compiled | inf1.xlarge | 57 | 33 | 0.297 | 120000 | 2.48 |
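For reference, the last column of the table follows directly from the hourly cost and hourly throughput. The following is a minimal sketch of that calculation for the AWS Inferentia row:

```python
# Cost per million inferences = cost per hour / inferences per hour * 1,000,000
cost_per_hour = 0.297          # USD per hour for inf1.xlarge (us-east-1, Feb 2022)
inferences_per_hour = 120_000
cost_per_million = cost_per_hour / inferences_per_hour * 1_000_000
print(round(cost_per_million, 2))  # ~2.48 USD, matching the table
```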
The following graph shows real-time inference response times for the InfoJobs model (lower is better). In this case, inference latency is 75-92% lower than with the CPU and GPU options.
This also translates into 4-13 times lower cost for running inferences compared to the CPU and GPU options, as shown in the following graph of cost per million inferences.
We must highlight that no further optimizations were made to the inference code during these non-exhaustive tests. However, the performance and cost benefits we saw from using AWS Inferentia exceeded our initial expectations and enabled us to proceed to production. In the future, we will continue to optimize with other features of Neuron, such as NeuronCore Pipeline or the PyTorch-specific DataParallel API. We encourage you to explore and compare the results for your own use case and model.
Compiling for AWS Inferentia with SageMaker Neo
You don’t need to use the Neuron SDK directly to compile your model and be able to host it on AWS Inferentia instances.
SageMaker Neo automatically optimizes machine learning (ML) models for inference on cloud instances and edge devices to run faster with no loss in accuracy. In particular, Neo is capable of compiling a wide variety of transformer-based models, making use of the Neuron SDK in the background. This allows you to get the benefit of AWS Inferentia by using APIs that are integrated with the familiar SageMaker SDK, with no required context switch.
In this section, we go through an example in which we show you how to compile a BERT model with Neo for AWS Inferentia. We then deploy that model to a SageMaker endpoint. You can find a sample notebook describing the whole process in detail on GitHub.
First, we need to create a sample input to trace our model with PyTorch and create a tar.gz file, with the model being its only content. This is a required step to have Neo compile our model artifact (for more information, see Prepare Model for Compilation). For demonstration purposes, the model is initialized as a mock model for sequence classification that hasn’t been fine-tuned on the task at all. In reality, you would replace the model identifier with your selected model from the Hugging Face model hub or a locally saved model artifact. See the following code:
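The following is a minimal sketch of this step. The maximum sequence length of 128 and the model.pth and model.tar.gz file names are illustrative assumptions; adapt them to your own model and payloads:

```python
import tarfile

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "bert-base-multilingual-uncased"  # replace with your fine-tuned model
max_length = 128  # assumed maximum sequence length

tokenizer = AutoTokenizer.from_pretrained(model_id)
# return_dict=False so the traced model returns tuples instead of dictionaries
model = AutoModelForSequenceClassification.from_pretrained(model_id, return_dict=False)
model.eval()

# Create a sample input to trace the model with PyTorch
sample = tokenizer(
    "This is a sample CV sentence",
    max_length=max_length,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
example_inputs = (sample["input_ids"], sample["attention_mask"])

traced_model = torch.jit.trace(model, example_inputs)
traced_model.save("model.pth")

# Package the traced model as the only content of model.tar.gz, as required by Neo
with tarfile.open("model.tar.gz", "w:gz") as f:
    f.add("model.pth")
```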
It’s important to set the return_dict parameter to False when loading a pre-trained model, because Neuron compilation does not support dictionary-based model outputs. We upload our model.tar.gz file to Amazon Simple Storage Service (Amazon S3), saving its location in a variable named traced_model_url.
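A sketch of that upload using the SageMaker SDK follows; the default bucket and the prefix shown are placeholders:

```python
import sagemaker
from sagemaker.s3 import S3Uploader

sess = sagemaker.Session()

# Upload the packaged model to S3 and keep its location for the next step
traced_model_url = S3Uploader.upload(
    local_path="model.tar.gz",
    desired_s3_uri=f"s3://{sess.default_bucket()}/neuron-bert/model",
)
```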
We then use the PyTorchModel SageMaker API to instantiate and compile our model:
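A sketch of that step follows. The framework version, source directory, input shape, and output path are illustrative assumptions; check the SageMaker Neo documentation for the framework versions currently supported for the ml_inf1 target:

```python
import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.utils import name_from_base

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

pytorch_model = PyTorchModel(
    model_data=traced_model_url,
    role=role,
    entry_point="inference_inf1.py",
    source_dir="code",            # assumed folder containing the inference script
    framework_version="1.9",      # example version supported by Neo for Inferentia
    py_version="py3",
    sagemaker_session=sess,
)

# Compile for the AWS Inferentia (ml_inf1) target with SageMaker Neo
compiled_model = pytorch_model.compile(
    target_instance_family="ml_inf1",
    input_shape={"input_ids": [1, 128], "attention_mask": [1, 128]},  # must match the traced shapes
    output_path=f"s3://{sess.default_bucket()}/neuron-bert/compiled",
    role=role,
    framework="pytorch",
    framework_version="1.9",
    job_name=name_from_base("bert-inf1"),  # unique compilation job name
)
```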
Compilation may take a few minutes. As you can see, our entry_point to model inference is our inference_inf1.py script. It determines how our model is loaded, how input and output are preprocessed, and how the model is used for prediction. Check out the full script on GitHub.
Finally, we can deploy our model to a SageMaker endpoint on an AWS Inferentia instance, and get predictions from it in real time:
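A sketch of deployment and invocation follows. The JSON payload, including the sequence field, is a hypothetical example; the actual format depends on how inference_inf1.py parses the request:

```python
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

# Deploy the compiled model to a real-time endpoint backed by an Inferentia instance
predictor = compiled_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf1.xlarge",
)

predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

# Hypothetical payload; adapt to the input format expected by your inference script
result = predictor.predict({"sequence": "Data engineer with 5 years of Python and Spark experience"})
print(result)
```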
As you can see, we were able to get all the benefits of using AWS Inferentia instances on SageMaker by using simple APIs that complement the standard flow of the SageMaker SDK.
Final solution
The following architecture illustrates the solution deployed in AWS.
All the testing and evaluation described in this post was done with the help of AWS AI/ML Specialist Solutions Architects in under 3 weeks, thanks to the ease of use of SageMaker and AWS Inferentia.
Conclusion
In this post, we shared how InfoJobs (Adevinta) uses AWS Inferentia with SageMaker endpoints to optimize NLP inference performance in a cost-effective way, reducing inference times by up to 92% at 75% lower cost than our best initial alternative. You can follow the process and code shared here to easily compile and deploy your own models using SageMaker, the Neuron SDK for PyTorch, and AWS Inferentia.
The results of the benchmarking tests performed between AWS AI/ML Specialist Solutions Architects and InfoJobs engineers were also validated in InfoJobs’s environment. This solution is now being deployed in production, handling the processing of all the CVs uploaded by users to the InfoJobs portal in real time.
As a next step, we will be exploring ways to optimize model training and our ML pipeline with SageMaker by relying on the Hugging Face integration with SageMaker and SageMaker Training Compiler, among other features.
We encourage you to try out AWS Inferentia with SageMaker, and connect with AWS to discuss your specific ML needs. For more examples on SageMaker and AWS Inferentia, you can also check out SageMaker examples on GitHub and AWS Neuron tutorials.
About the Authors
Juan Francisco Fernandez is an ML Engineer with Adevinta Spain. He joined InfoJobs to tackle the challenge of automating model development, thereby providing more time for data scientists to think about new experiments and models and freeing them of the burden of engineering tasks. In his spare time, he enjoys spending time with his son, playing basketball and video games, and learning languages.
Antonio Rodriguez is an AI & ML Specialist Solutions Architect at Amazon Web Services. He helps companies solve their challenges through innovation with the AWS Cloud and AI/ML services. Apart from work, he loves to spend time with his family and play sports with his friends.
João Moura is an AI & ML Specialist Solutions Architect at Amazon Web Services. He focuses mostly on NLP use cases and helping customers optimize deep learning model deployments.