Hebrew University of Jerusalem Professor Sergiu Hart discusses the research shared in two papers that were awarded the ACM SIGecom Test of Time and Doctoral Dissertation awards.
SIGIR: How information retrieval and natural-language processing overcame their rivalry
Alexa principal scientist Alessandro Moschitti describes the changes that have swept both fields in the 20 years since he first attended the conference.
Extracting custom entities from documents with Amazon Textract and Amazon Comprehend
Amazon Textract is a machine learning (ML) service that makes it easy to extract text and data from scanned documents. Textract goes beyond simple optical character recognition (OCR) to identify the contents of fields in forms and information stored in tables. This allows you to use Amazon Textract to instantly “read” virtually any type of document and accurately extract text and data without needing any manual effort or custom code.
Amazon Textract has multiple applications in a variety of fields. For example, talent management companies can use Amazon Textract to automate the process of extracting a candidate’s skill set. Healthcare organizations can extract patient information from documents to fulfill medical claims.
When your organization processes a variety of documents, you sometimes need to extract entities from unstructured text in the documents. A contract document, for example, can have paragraphs of text where names and other contract terms are listed in the body of the paragraph instead of as a key/value or form structure. Amazon Comprehend is a natural language processing (NLP) service that can extract key phrases, places, names, organizations, events, sentiment, and more from unstructured text. With custom entity recognition, you can identify new entity types beyond the preset generic entity types. This allows you to extract business-specific entities to address your needs.
In this post, we show how to extract custom entities from scanned documents using Amazon Textract and Amazon Comprehend.
Use case overview
For this post, we process resume documents from the Resume Entities for NER dataset to get insights such as candidates’ skills by automating this workflow. We use Amazon Textract to extract text from these resumes and Amazon Comprehend custom entity recognition to detect skills such as AWS, C, and C++ as custom entities. The following screenshot shows a sample input document.
The following screenshot shows the corresponding output generated using Amazon Textract and Amazon Comprehend.
Solution overview
The following diagram shows a serverless architecture that processes incoming documents for custom entity extraction using Amazon Textract and a custom model trained using Amazon Comprehend. When a document is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, it triggers an AWS Lambda function. The function calls the Amazon Textract DetectDocumentText API to extract the text and then calls Amazon Comprehend with the extracted text to detect custom entities.
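The following is a minimal sketch of such a Lambda handler using boto3; the COMPREHEND_ENDPOINT_ARN environment variable, the event parsing, and the truncation length are illustrative assumptions, and the sketch presumes you have created a real-time endpoint for the trained custom recognizer (an asynchronous custom entities detection job, shown later in this post, is an alternative).
import json
import os
import boto3
textract = boto3.client('textract')
comprehend = boto3.client('comprehend')
# Hypothetical environment variable holding the ARN of a real-time endpoint
# created for the trained Amazon Comprehend custom entity recognizer
ENDPOINT_ARN = os.environ['COMPREHEND_ENDPOINT_ARN']
def lambda_handler(event, context):
    # The S3 event notification carries the bucket and key of the uploaded document
    record = event['Records'][0]['s3']
    bucket, key = record['bucket']['name'], record['object']['key']
    # Extract the text with the Amazon Textract DetectDocumentText API
    blocks = textract.detect_document_text(
        Document={'S3Object': {'Bucket': bucket, 'Name': key}}
    )['Blocks']
    text = '\n'.join(b['Text'] for b in blocks if b['BlockType'] == 'LINE')
    # Detect custom entities with the trained recognizer; the synchronous API
    # has a text size limit, so long documents are truncated here for simplicity
    entities = comprehend.detect_entities(Text=text[:4500], EndpointArn=ENDPOINT_ARN)['Entities']
    return {'statusCode': 200, 'body': json.dumps(entities)}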
The solution consists of two parts:
- Training:
- Extract text from PDF documents using Amazon Textract
- Label the resulting data using Amazon SageMaker Ground Truth
- Train custom entity recognition using Amazon Comprehend with the labeled data
- Inference:
- Send the document to Amazon Textract for data extraction
- Send the extracted data to the Amazon Comprehend custom model for entity extraction
Launching your AWS CloudFormation stack
For this post, we use an AWS CloudFormation stack to deploy the solution and create the resources it needs. These resources include an S3 bucket, Amazon SageMaker instance, and the necessary AWS Identity and Access Management (IAM) roles. For more information about stacks, see Walkthrough: Updating a stack.
- Download the following CloudFormation template and save to your local disk.
- Sign in to the AWS Management Console with your IAM user name and password.
- On the AWS CloudFormation console, choose Create Stack.
Alternatively, you can choose Launch Stack directly.
- On the Create Stack page, choose Upload a template file and upload the CloudFormation template you downloaded.
- Choose Next.
- On the next page, enter a name for the stack.
- Leave all other settings at their defaults.
- On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Choose Create stack.
- Wait for the stack to finish running.
You can examine various events from the stack creation process on the Events tab. After the stack creation is complete, look at the Resources tab to see all the resources the template created.
- On the Outputs tab of the CloudFormation stack, record the Amazon SageMaker instance URL.
Running the workflow on a Jupyter notebook
To run your workflow, complete the following steps:
- Open the Amazon SageMaker instance URL that you saved from the previous step.
- Under the New drop-down menu, choose Terminal.
- On the terminal, clone the GitHub repository:
cd SageMaker; git clone <GitHub repository URL>
You can check the folder structure (see the following screenshot).
- Open Textract_Comprehend_Custom_Entity_Recognition.ipynb.
- Run the cells.
Code walkthrough
Upload the documents to your S3 bucket.
The PDFs are now ready for Amazon Textract to perform OCR. Start the process with a StartDocumentTextDetection asynchronous API call.
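As a reference, a hedged sketch of that asynchronous call with boto3 follows; the bucket and key names are placeholders, pagination of the results is omitted, and a production pipeline would typically rely on an Amazon SNS notification rather than the simple polling loop shown here.
import time
import boto3
textract = boto3.client('textract')
# Start the asynchronous text detection job for a PDF stored in S3 (names are placeholders)
job_id = textract.start_document_text_detection(
    DocumentLocation={'S3Object': {'Bucket': 'my-resume-bucket', 'Name': 'resumes/resume_1.pdf'}}
)['JobId']
# Poll until the job finishes, then collect the detected lines of text
while True:
    result = textract.get_document_text_detection(JobId=job_id)  # pagination via NextToken omitted
    if result['JobStatus'] in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(5)
lines = [b['Text'] for b in result.get('Blocks', []) if b['BlockType'] == 'LINE']
print('\n'.join(lines))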
For this post, we process two resumes in PDF format for demonstration, but you can process all 220 if needed. The results have all been processed and are ready for you to use.
Because we need to train a custom entity recognition model with Amazon Comprehend (as with any ML model), we need training data. In this post, we use Ground Truth to label our entities. By default, Amazon Comprehend can recognize entities like person, title, and organization. For more information, see Detect Entities. To demonstrate custom entity recognition capability, we focus on candidate skills as entities inside these resumes. We have the labeled data from Ground Truth. The data is available in the GitHub repo (see entity_list.csv). For instructions on labeling your data, see Developing NER models with Amazon SageMaker Ground Truth and Amazon Comprehend.
Now we have our raw and labeled data and are ready to train our model. To start the process, use the create_entity_recognizer
API call. When the training job is submitted, you can see the recognizer being trained on the Amazon Comprehend console.
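For reference, a minimal sketch of that call with boto3 follows; the entity type name (SKILL), the S3 URIs, and the IAM role ARN are placeholders you would replace with your own values.
import boto3
comprehend = boto3.client('comprehend')
# Train a custom entity recognizer from the Textract output and the entity list
# produced by Ground Truth (the name, ARNs, and S3 URIs below are placeholders)
response = comprehend.create_entity_recognizer(
    RecognizerName='resume-skills-recognizer',
    LanguageCode='en',
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendDataAccessRole',
    InputDataConfig={
        'EntityTypes': [{'Type': 'SKILL'}],
        'Documents': {'S3Uri': 's3://my-bucket/training/raw_text/'},
        'EntityList': {'S3Uri': 's3://my-bucket/training/entity_list.csv'},
    },
)
recognizer_arn = response['EntityRecognizerArn']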
During training, Amazon Comprehend sets aside some of the data for testing. When the recognizer is trained, you can see the performance of each entity and of the recognizer overall.
We have prepared a small sample of text to test out the newly trained custom entity recognizer. We run the same step to perform OCR, then upload the Amazon Textract output to Amazon S3 and start a custom recognizer job.
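A sketch of submitting that analysis job is shown below; the recognizer ARN comes from the training step, while the role ARN and S3 paths are placeholders.
import boto3
comprehend = boto3.client('comprehend')
# Run the trained recognizer over the Textract output uploaded to S3
# (the role ARN and S3 URIs below are placeholders)
job = comprehend.start_entities_detection_job(
    JobName='resume-skills-analysis',
    EntityRecognizerArn=recognizer_arn,
    LanguageCode='en',
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendDataAccessRole',
    InputDataConfig={
        'S3Uri': 's3://my-bucket/inference/raw_text/',
        'InputFormat': 'ONE_DOC_PER_FILE',
    },
    OutputDataConfig={'S3Uri': 's3://my-bucket/inference/output/'},
)
print(job['JobId'], job['JobStatus'])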
When the job is submitted, you can see the progress on the Amazon Comprehend console under Analysis Jobs.
When the analysis job is complete, you can download the output and see the results. For this post, we converted the JSON result into table format for readability.
Conclusion
ML and artificial intelligence allow organizations to be agile and to automate manual tasks to improve efficiency. In this post, we demonstrated an end-to-end architecture for extracting entities such as a candidate’s skills from their resume by using Amazon Textract and Amazon Comprehend. This post showed you how to use Amazon Textract for data extraction and how to use Amazon Comprehend to train a custom entity recognizer from your own dataset and recognize custom entities. You can apply this process to a variety of industries, such as healthcare and financial services.
To learn more about different text and data extraction features of Amazon Textract, see How Amazon Textract Works.
About the Authors
Yuan Jiang is a Solutions Architect with a focus on machine learning. He is a member of the Amazon Computer Vision Hero program.
Sonali Sahu is a Solutions Architect and a member of the Amazon Machine Learning Technical Field Community. She is also a member of the Amazon Computer Vision Hero program.
Kashif Imran is a Principal Solutions Architect and the leader of the Amazon Computer Vision Hero program.
Increasing engagement with personalized online sports content
This is a guest post by Mark Wood at Pulselive. In their own words, “Pulselive, based out of the UK, is the proud digital partner to some of the biggest names in sports.”
At Pulselive, we create experiences sports fans can’t live without; whether that’s the official Cricket World Cup website or the English Premier League’s iOS and Android apps.
One of the key things our customers measure us on is fan engagement with digital content such as videos. But until recently, the videos each fan saw were based on a most recently published list, which wasn’t personalized.
Sports organizations are trying to understand who their fans are and what they want. The wealth of digital behavioral data that can be collected for each fan tells a story of how unique they are and how they engage with our content. Based on the increase of available data and the increasing presence of machine learning (ML), Pulselive was asked by customers to provide tailored content recommendations.
In this post, we share our experience of adding Amazon Personalize to our platform as our new recommendation engine and how we increased video consumption by 20%.
Implementing Amazon Personalize
Before we could start, Pulselive had two main challenges: we didn’t have any data scientists on staff, and we needed a solution that our engineers, who had minimal ML experience, could understand and that would still produce measurable results. We considered engaging external companies to assist (expensive), using tools such as Amazon SageMaker (still quite a learning curve), or using Amazon Personalize.
We ultimately chose to use Amazon Personalize for several reasons:
- The barrier to entry was low, both technically and financially.
- We could quickly conduct an A/B test to demonstrate the value of a recommendation engine.
- We could create a simple proof of concept (PoC) with minimal disruption to the existing site.
- We were more concerned about the impact and improving the results than having a clear understanding of what was going on under the hood of Amazon Personalize.
Like any other business, we couldn’t afford to have an adverse impact on our daily operations, but still needed the confidence that the solution would work for our environment. Therefore, we started out with A/B testing in a PoC that we could spin up and execute in a matter of days.
Working with the Amazon Prototyping team, we narrowed down a range of options for our first integration to one that would require minimal changes to the website and be easily A/B tested. After examining all locations where a user is presented with a list of videos, we decided that re-ranking the list of videos to watch next would be the quickest way to implement personalized content. For this prototype, we used an AWS Lambda function and Amazon API Gateway to provide a new API that would intercept the request for more videos and re-rank them using the Amazon Personalize GetPersonalizedRanking API.
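As an illustration, the core of such a Lambda function might look like the following sketch; the campaign ARN environment variable, the shape of the incoming request, and the field names are assumptions rather than Pulselive’s actual implementation.
import json
import os
import boto3
personalize_runtime = boto3.client('personalize-runtime')
# Placeholder for the ARN of the deployed Amazon Personalize campaign
CAMPAIGN_ARN = os.environ['PERSONALIZE_CAMPAIGN_ARN']
def lambda_handler(event, context):
    # Assume the API Gateway request body carries the user ID and the candidate
    # video IDs produced by the existing "most recently published" query
    body = json.loads(event['body'])
    user_id = body['userId']
    candidate_video_ids = body['videoIds']
    # Re-rank the candidate list for this user
    ranking = personalize_runtime.get_personalized_ranking(
        campaignArn=CAMPAIGN_ARN,
        userId=user_id,
        inputList=candidate_video_ids,
    )['personalizedRanking']
    return {
        'statusCode': 200,
        'body': json.dumps([item['itemId'] for item in ranking]),
    }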
To be considered successful, the experiment needed to demonstrate statistically significant improvements to either total video views or completion percentage. To make this possible, we needed to test across a sufficiently long period of time to make sure that we covered days with multiple sporting events and quieter days with no matches. We hoped to eliminate any behavior that would be dependent on the time of day or whether a match had recently been played by testing across different usage patterns. We set a time frame of 2 weeks to gather initial data. All users were part of the experiment and randomly assigned to either the control group or the test group. To keep the experiment as simple as possible, all videos were part of the experiment. The following diagram illustrates the architecture of our solution.
To get started, we needed to build an Amazon Personalize solution that provided us with the starting point for the experiment. Amazon Personalize requires a user-item interactions dataset to be able to define a solution and create a campaign to recommend videos to a user. We satisfied these requirements by creating a CSV file that contains a timestamp, user ID, and video ID for each video view across several weeks of usage. Uploading the interaction history to Amazon Personalize was a simple process, and we could immediately test the recommendations on the AWS Management Console. To train the model, we used a dataset of 30,000 recent interactions.
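For reference, importing such a CSV into an interactions dataset can be done with a call along the following lines; the dataset ARN, IAM role ARN, and S3 location are placeholders, not the values we used.
import boto3
personalize = boto3.client('personalize')
# Import the USER_ID, ITEM_ID, TIMESTAMP interactions CSV from S3
# (the ARNs and S3 location below are placeholders)
personalize.create_dataset_import_job(
    jobName='video-interactions-import',
    datasetArn='arn:aws:personalize:eu-west-1:123456789012:dataset/video-recs/INTERACTIONS',
    dataSource={'dataLocation': 's3://my-bucket/interactions/interactions.csv'},
    roleArn='arn:aws:iam::123456789012:role/PersonalizeS3AccessRole',
)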
To compare metrics for total videos viewed and video completion percentage, we built a second API to record all video interactions in Amazon DynamoDB. This second API solved the problem of telling Amazon Personalize about new interactions via the PutEvents API, which helped keep the ML model up to date.
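A sketch of recording a view event might look like the following; the event tracker ID, the session handling, and the event type name are assumptions.
import json
import time
import boto3
personalize_events = boto3.client('personalize-events')
def record_video_view(user_id, session_id, video_id):
    # Stream the new interaction to Amazon Personalize so the model stays current
    # (the tracking ID comes from a Personalize event tracker and is a placeholder here)
    personalize_events.put_events(
        trackingId='00000000-0000-0000-0000-000000000000',
        userId=user_id,
        sessionId=session_id,
        eventList=[{
            'sentAt': int(time.time()),
            'eventType': 'video-view',
            'properties': json.dumps({'itemId': video_id}),
        }],
    )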
We tracked all video views and what prompted video views for all users in the experiment. Video prompts included direct linking (for example, from social media), linking from another part of the website, and linking from a list of videos. Each time a user viewed a video page, they were presented with the current list of videos or the new re-ranked list, depending on whether they were in the control or test group. We started our experiment with 5% of total users in the test group. When our approach showed no problems (no obvious drop in video consumption or increase in API errors), we increased this to 50%, with the remaining users acting as the control group, and started to collect data.
Learning from our experiment
After two weeks of A/B testing, we pulled the KPIs we collected from DynamoDB and compared the two variants we tested across several KPIs. We opted to use a few simple KPIs for this initial experiment, but other organizations’ KPIs may vary.
Our first KPI was the number of video views per user per session. Our initial hypothesis was that we wouldn’t see meaningful change given that we were re-ranking a list of videos; however, we measured an increase in views per user by 20%. The following graph summarizes our video views for each group.
In addition to measuring total view count, we wanted to make sure that users were watching videos in full. We tracked this by sending an event for each 25% of the video a user viewed. For each video, we found that the average completion percentage didn’t change very much based on whether the video was recommended by Amazon Personalize or by the original list view. In combination with the number of videos viewed, we concluded that overall viewing time had increased for each user when presented with a personalized list of recommended videos.
We also tracked the position of each video in users’ “recommended video” bar and which item they selected. This allowed us to compare the ranking of a personalized list vs. a publication ordered list. We found that this didn’t make much difference between the two variants, which suggested that our users would most likely select a video that was visible on their screen rather than scrolling to see the entire list.
After we analyzed the results of the experiment, we presented them to the customer with the recommendation that we enable Amazon Personalize as the default method of ranking videos in the future.
Lessons learned
We learned the following lessons on our journey, which may help you when implementing your own solution:
- Gather your historical data of user-item interactions; we used about 30,000 interactions.
- Focus on recent historical data. Although your instinct may be to gather as much historical data as you can, recent interactions are more valuable than older interactions. If you have a very large dataset of historical interactions, you can filter out older interactions to reduce the size of the dataset and training time (see the sketch after this list).
- Make sure you can give all users a consistent and unique ID, either by using your SSO solution or by generating session IDs.
- Find a spot in your site or app where you can run an A/B test either re-ranking an existing list or displaying a list of recommended items.
- Update your API to call Amazon Personalize and fetch the new list of items.
- Deploy the A/B test and gradually increase the percentage of users in the experiment.
- Instrument and measure so that you can understand the outcome of your experiment.
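For example, a quick way to trim an interactions CSV to a recent window is shown below (a sketch; the column names follow the Amazon Personalize interactions schema, and the 90-day cutoff is an arbitrary choice):
import time
import pandas as pd
# Keep only the last 90 days of interactions before uploading to Amazon Personalize
interactions = pd.read_csv('interactions.csv')  # columns: USER_ID, ITEM_ID, TIMESTAMP
cutoff = int(time.time()) - 90 * 24 * 3600
recent = interactions[interactions['TIMESTAMP'] >= cutoff]
recent.to_csv('interactions_recent.csv', index=False)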
Conclusion and future steps
We were thrilled by our first foray into the world of ML with Amazon Personalize. We found the entire process of integrating a trained model into our workflow was incredibly simple; and we spent far more time making sure that we had the right KPIs and data capture to prove the usefulness of the experiment than we did implementing Amazon Personalize.
In the future, we will be developing the following enhancements:
- Integrating Amazon Personalize throughout our workflow by giving our development teams the opportunity to use it everywhere a list of content is presented.
- Expanding the use cases beyond re-ranking to include recommended items. This should allow us to surface older items that are likely to be more popular with each user.
- Experimenting with how often the model should be retrained—inserting new interactions into the model in real time is a great way to keep things fresh, but the model still needs daily retraining to be most effective.
- Exploring options for how we can use Amazon Personalize with all of our customers to help improve fan engagement by recommending the most relevant content in all forms.
- Using recommendation filters to expand the range of parameters available for each request. We will soon be targeting additional options such as filtering to include videos of your favorite players.
About the Author
Mark Wood is the Product Solutions Director at Pulselive. Mark has been at Pulselive for over 6 years and has held both Technical Director as well as Software Engineer roles during his tenure with the company. Prior to Pulselive, Mark was a Senior Engineer at Roke and a Developer at Querix. Mark is a graduate from the University of Southampton with a degree in Mathematics with Computer Science.
Deploying custom models built with Gluon and Apache MXNet on Amazon SageMaker
When you build models with the Apache MXNet deep learning framework, you can take advantage of the expansive model zoo provided by GluonCV to quickly train state-of-the-art computer vision algorithms for image and video processing. A typical development environment for training consists of a Jupyter notebook hosted on a compute instance configured by the operating data scientist. To make sure this environment is replicated during use in production, the environment is wrapped inside a Docker container, which is launched and scaled according to the expected load. Hosting the deep learning model is a challenge that generally involves knowledge of server hosting, cluster management, web API protocols, and network security.
In this post, we demonstrate how Amazon SageMaker supports these libraries and how their integration simplifies the deployment of complex algorithms without having to build expertise in web app infrastructure. Whether inference constraints require real-time predictions with low latency, or irregularly-timed batch jobs with a large number of samples, optimal hosting solutions are available and easy to build.
With Amazon SageMaker, most of the undifferentiated heavy lifting is already done. There is no need to build out a container image from scratch or set up a REST API. Instead, you only need to specify various model functions to process inference data in a manner consistent with the training pipeline. You can follow this post with an end-to-end example, in which we train an object detection model using open-source Apache tools.
Creating a notebook instance
You can run the example code we provide in this post. It’s recommended to run the code inside an Amazon SageMaker instance type of ml.p3.2xlarge
or larger to accelerate training time. To create a notebook instance, complete the following steps:
- On the Amazon SageMaker console, choose Notebook instances.
- Choose Create notebook instance.
- Enter the name of your notebook instance, such as
mxnet-gluon-deployment
. - Set the instance type to p3.2xlarge.
- Choose Additional configuration.
- Set the volume size to 20 GB.
- Choose Create notebook instance.
- When the instance is ready, choose Open in JupyterLab.
- From the launcher, you can open a terminal and run the provided code.
Generating the model
For this use case, you build an object detection model using a pretrained Faster R-CNN architecture from the GluonCV model zoo on the Pascal VOC dataset. The first step is to obtain the data, which you can do by running the data preparation script pascal_voc.py for use with GluonCV. The script downloads 8.4 GB of annotated images to ~/.mxnet/datasets/voc/. With the dataset in place, run the training script train_faster_rcnn.py from this GluonCV example.
Model parameters are saved after each epoch, with the best performing model indicated by the suffix _best.params.
Preparing the inference container image
To make sure that the compute environment for the inference instance is set according to our needs, run the model within a Docker container that specifies the required configuration. Containers provide a portable, efficient, standalone package of software for flexible deployment. In most cases, using the default MXNet inference container image in Amazon SageMaker is sufficient for hosting Apache MXNet models. However, we built a computer vision model using GluonCV, which isn’t included in the default image. You can now modify the MXNet inference container image to include GluonCV, which you use for deployment.
Our instance requires Docker for the following steps, which is included in Amazon SageMaker instances. First clone the Amazon SageMaker MXNet serving container GitHub repository:
git clone https://github.com/aws/sagemaker-mxnet-serving-container.git
cd sagemaker-mxnet-serving-container
Included in the repo is a Dockerfile that serves our configuration with MXNet 1.6.0, GluonCV 0.6.0, and Python 3.6.8. You can verify the software versions in ./docker/1.6.0/py3/Dockerfile.gpu
:
...
ARG MX_URL=https://aws-mxnet-pypi.s3-us-west-2.amazonaws.com/1.6.0/aws_mxnet_cu101mkl-1.6.0-py2.py3-none-manylinux1_x86_64.whl
...
RUN ${PIP} install --no-cache-dir \
    ${MX_URL} \
    git+git://github.com/dmlc/gluon-nlp.git@v0.9.0 \
    gluoncv==0.6.0 \
    mxnet-model-server==$MMS_VERSION \
    keras-mxnet==2.2.4.1 \
    numpy==1.17.4 \
    onnx==1.4.1 \
    "sagemaker-mxnet-inference<2"
...
There is no need to edit this file for this post, but you can add additional packages to the preceding code as needed.
Now you build the container image. Before executing the docker build command, copy the necessary artifacts to the ./docker/1.6.0/py3
directory. In the following example code, we use gluoncv-mxnet-serving:1.6.0-gpu-py3
as the name and the tag. Note the .
at the end of the last command:
cp -r docker/artifacts/* docker/1.6.0/py3
cd docker/1.6.0/py3
docker build -t gluoncv-mxnet-serving:1.6.0-gpu-py3 -f Dockerfile.gpu .
To test the container was built successfully, you can run the container locally. In the following code, replace <docker image id> and <container id> with the output from the commands docker images
and docker ps
:
# find docker image id
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
gluoncv-mxnet-serving 1.6.0-gpu-py3 0012f8ebdcab 24 hours ago 6.56GB
nvidia/cuda 10.1-cudnn7-runtime-ubuntu16.04 e11e11484e2e 3 months ago 1.71GB
# start the docker container
$ docker run <docker image id>
In a separate terminal, access the shell of the running container:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
af357bce0c53 0012f8ebdcab "python /usr/local/b…" 7 hours ago Up 7 hours 8080-8081/tcp musing_napier
# access shell of the running docker
$ docker exec -it <container id> /bin/bash
To escape the terminals and tear down the resources, enter exit in the shell accessing the container and enter CTRL+C in the terminal running the container.
Now you’re ready to upload the new MXNet inference container image to Amazon Elastic Container Registry (Amazon ECR) so you can point to this container image when you deploy the model on Amazon SageMaker. For more information, see Pushing an image.
You first authenticate Docker to the Amazon ECR registry with get-login
. Assuming the AWS Command Line Interface (AWS CLI) version is prior to 1.17.0, enter the following code to get the authenticated docker login
command:
$ aws ecr get-login --region <AWS Region> --no-include-email
For instructions on using AWS CLI version 1.17.0 or higher, see Using an Authorization Token.
Copy the output of the command, then paste and execute it to authenticate your Docker installation into Amazon ECR. Replace <AWS Region> with the appropriate Region. For example, to use the US East (N. Virginia) Region, replace it with us-east-1.
Create a repository in Amazon ECR using the AWS CLI by running aws ecr create-repository
. For this use case, use gluoncv
for <repository name>:
$ aws ecr create-repository --repository-name <repository name> --region <AWS Region>
Before pushing the local image to Amazon ECR, tag it with the name of the target repository. The image ID is retrieved with the docker images
command and named with the docker tag
command and the repository URI, which you can also retrieve on the Amazon ECR console. See the following code:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
gluoncv-mxnet-serving 1.6.0-gpu-py3 cb0a03065295 7 minutes ago 4.09GB
nvidia/cuda 10.1-cudnn7-runtime-ubuntu16.04 e11e11484e2e 3 months ago 1.71GB
$ docker tag <image id> <AWS account ID>.dkr.ecr.<AWS Region>.amazonaws.com/<repository name>
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
<AWS account id>.dkr.ecr.<AWS Region>.amazonaws.com/gluoncv latest cb0a03065295 9 minutes ago 4.09GB
gluoncv-mxnet-serving 1.6.0-gpu-py3 cb0a03065295 9 minutes ago 4.09GB
nvidia/cuda 10.1-cudnn7-runtime-ubuntu16.04 e11e11484e2e 3 months ago 1.71GB
To push the image to the Amazon ECR repository so that it’s available for hosting on Amazon SageMaker endpoints, use the docker push command. You can confirm that the image is successfully pushed using the aws ecr list-images
AWS CLI command:
$ docker push <AWS account ID>.dkr.ecr.<AWS Region>.amazonaws.com/<repository name>
$ aws ecr list-images --repository-name gluoncv
{
"imageIds": [
{
"imageDigest": "sha256:66bc1759a4d2e94daff4dd02446024a11c5af29d9259175f11701a0b9ee2d2d1",
"imageTag": "latest"
}
]
}
Alternatively, you can verify the image exists in the repository by checking on the Amazon ECR console.
When deploying the model, use the image URI as the argument to image. You can run the following code to set up the image programmatically from a Jupyter notebook:
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name
ecr_repository = 'mxnet-gluoncv'
tag = ':latest'
image_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)
# Create ECR repository and push docker image
!docker build -t $ecr_repository -f ./docker/Dockerfile.gpu ./docker -q
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $image_uri
!docker push $image_uri
Deploying the model
You can optimize compute resources according to inference requirements based on your use case. If you collect batches of data intermittently and don’t need immediate predictions, you can run batch jobs over the acquired data: spin up a compute instance when necessary, process the mass of data, store the predictions, and tear down the instance.
Alternatively, you may require that calls for inference be answered immediately. In this case, spin up a compute instance for real-time inference at an endpoint that consumes data over an API call and returns the model output. You only pay for time when the compute instance is running. We provide details for both use cases in this section.
Prepare the model artifacts by compressing them into a tarball and uploading it to Amazon S3, from which the deployed model is read. Because you’re using an architecture that already exists in the GluonCV model zoo, you only need to upload the weights. The .params file from the previous step should ultimately live in s3://<bucket_name>/<prefix>/model.tar.gz.
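The following is a minimal packaging sketch; the parameter file name is the one produced by training, while the bucket name and prefix are placeholders that should match the values used in the deployment code below.
import tarfile
import boto3
# Package only the trained weights; the network definition comes from the GluonCV model zoo
with tarfile.open('model.tar.gz', 'w:gz') as tar:
    tar.add('faster_rcnn_resnet50_v1b_voc_best.params')
# Upload the tarball to the location the model object reads from (placeholder bucket and prefix)
boto3.client('s3').upload_file('model.tar.gz', '<bucket_name>', '<prefix>/model.tar.gz')
You execute deployment via the Amazon SageMaker SDK. See the following code: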
import sagemaker
from sagemaker.mxnet import MXNetModel
model = MXNetModel(
entry_point='./source_directory/entrypoint.py',
model_data='s3://{}/{}/{}'.format(bucket_name, s3_prefix, tar_file_name),
framework_version='1.6.0',
py_version='py3',
source_dir='./source_directory/',
image='<AWS account id>.dkr.ecr.<AWS Region>.amazonaws.com/<repository name>:latest',
role=sagemaker.get_execution_role()
)
The image argument is the URI of the image you uploaded to the Amazon ECR repository in the preceding section. Make sure that the Region of the Amazon ECR repository and the Amazon SageMaker model are the same. Most of the processing, inference, and configuration resides in the following entrypoint.py script, which defines the model and the steps necessary to decode the payload so that the MXNet backend properly interprets the data:
entrypoint.py
## import packages ##
import base64
import json
import mxnet as mx
from mxnet import gpu
import numpy as np
import sys
import gluoncv as gcv
from gluoncv import data as gdata
## SageMaker loading function ##
def model_fn(model_dir):
    """
    Load the pretrained model
    Args:
        model_dir (str): directory where model artifacts are saved/loaded
    """
    model = gcv.model_zoo.get_model('faster_rcnn_resnet50_v1b_voc', pretrained_base=False)
    ctx = mx.gpu(0)
    model.load_parameters(f'{model_dir}/faster_rcnn_resnet50_v1b_voc_best.params', ctx, ignore_extra=True)
    print('Loaded gluoncv model')
    return model, ctx
## SageMaker inference function ##
def transform_fn(net, data, input_content_type, output_content_type):
    ## retrieve the model and context from the first parameter, net ##
    model, ctx = net
    ## decode image ##
    # for endpoint API calls
    if type(data) == str:
        parsed = json.loads(data)
        img = mx.nd.array(parsed)
    # for batch transform jobs
    else:
        img = mx.img.imdecode(data)
    ## preprocess ##
    # normalization values taken from gluoncv
    # https://gluon-cv.mxnet.io/_modules/gluoncv/data/transforms/presets/rcnn.html
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    img = gdata.transforms.image.imresize(img, 800, 600)
    img = mx.nd.image.to_tensor(img)
    img = mx.nd.image.normalize(img, mean=mean, std=std)
    nda = img.expand_dims(0)
    nda = nda.copyto(ctx)
    ## inference ##
    cid, score, bbox = model(nda)
    # predictions to lists
    cid = cid.asnumpy().tolist()
    score = score.asnumpy().tolist()
    bbox = bbox.asnumpy().tolist()
    # format predictions
    response = []
    for x, y, z in zip(cid[0], score[0], bbox[0]):
        if x[0] == -1.0:
            continue
        response.append([x[0], y[0], z[0]/800, z[1]/600, z[2]/800, z[3]/600])
    predictions = {'prediction': response}
    predictionslist = [predictions]
    return predictionslist
After you import the supporting libraries for model inference and data processing, define the model in model_fn()
by loading the Faster R-CNN architecture and the trained weights you uploaded to Amazon S3. The file name passed to model.load_parameters()
must match the name of the parameters file that you trained and uploaded to Amazon S3 earlier in the tarball. For this use case, the parameters are stored in faster_rcnn_resnet50_v1b_voc_best.params
. To utilize the GPU, you must explicitly set the context as such when loading the parameters.
Instructions to run predictions over the model are written in transform_fn()
. You can call inference from a living endpoint API or launch it on schedule for batch jobs. The corresponding data type sent to the model varies between these two options. When sent for a real-time prediction over the endpoint API, the transform function receives a string that you can load and interpret according to its underlying data type. Batch transform jobs, on the other hand, send the data directly as a serialized image, which you need to decode with MXNet utilities. You can handle both cases by checking the type of the data object.
The loaded data is normalized according to the default preprocessing steps that GluonCV implements, as enforced in the normalize()
function in the entry point script. Lastly, the data is passed through the neural network for inference with the output formatted such that the return payload includes the predicted class ID, confidence of the bounding box, and bounding box attributes.
With all the setup in place, you’re now ready to deploy. See the following code:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.p3.2xlarge')
Testing
With the deployed endpoint up and running, you can make a real-time inference with the returned object from the preceding step. After loading an image into a NumPy array, fire it off for inference:
## inference via endpoint API
import os
import imageio
import numpy as np
home_path = os.path.expanduser('~')
test_image = home_path + '/.mxnet/datasets/voc/VOC2012/JPEGImages/2010_001453.jpg'
# load as a numpy array
test_image_data = np.asarray(imageio.imread(test_image))
# serialize the data and make a prediction request to the SageMaker endpoint
endpoint_response = predictor.predict(test_image_data)
To visualize the output, draw from the metadata included in the response. See the following code:
## visualize on a test image
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.patches as patches
img = mpimg.imread(test_image)
fig, ax = plt.subplots(1, dpi=120)
ax.imshow(img)
for box in endpoint_response[0]['prediction']:
    class_id, confidence, xmin, ymin, xmax, ymax = box
    xmin = xmin*img.shape[1]
    xmax = xmax*img.shape[1]
    ymin = ymin*img.shape[0]
    ymax = ymax*img.shape[0]
    if confidence > 0.9:
        height = ymax-ymin
        width = xmax-xmin
        rect = patches.Rectangle(
            (xmin, ymin), width, height, linewidth=1, edgecolor='yellow', facecolor='none')
        ax.add_patch(rect)
ax.axis('off')
plt.show()
After 20 epochs of training, you can see bounding boxes that accurately identify various objects in the model response. See the following screenshot.
The purpose of maintaining an endpoint API is to keep a model available for real-time predictions. It’s unnecessary to pay for a running endpoint instance if inference jobs are scheduled in advance. For this use case, you send a list of images for prediction to a batch transform job, which spins up a compute instance to run the model and tears it down upon completion. You only pay for the runtime of the instance, which saves costs on downtime. Set up and launch a batch transform job by uploading images to Amazon S3 and defining the data and model paths, along with a few other settings, in a dictionary. See the following code:
## inference via batch transform
# upload a sample of images to SageMaker
test_images = ['/.mxnet/datasets/voc/VOC2012/JPEGImages/2010_003939.jpg',
'/.mxnet/datasets/voc/VOC2012/JPEGImages/2008_004205.jpg',
'/.mxnet/datasets/voc/VOC2012/JPEGImages/2009_001139.jpg',
'/.mxnet/datasets/voc/VOC2012/JPEGImages/2010_001453.jpg',
'/.mxnet/datasets/voc/VOC2012/JPEGImages/2011_000148.jpg',
'/.mxnet/datasets/voc/VOC2012/JPEGImages/2011_005806.jpg',
'/.mxnet/datasets/voc/VOC2012/JPEGImages/2012_004299.jpg']
s3_test_prefix = 'test_images'
for test_image in test_images:
    test_image = home_path + test_image
    s3_client.upload_file(test_image, bucket_name, s3_test_prefix+'/'+test_image.split('/')[-1])
model_name = predictor.endpoint
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
batch_job_name = "test-batch-job" + timestamp
request = {
"TransformJobName": batch_job_name,
"ModelName": model_name,
"MaxConcurrentTransforms": 1,
"MaxPayloadInMB": 6,
"BatchStrategy": "SingleRecord",
"TransformOutput": {
"S3OutputPath": 's3://{}/test/{}/'.format(bucket_name, batch_job_name)
},
"TransformInput": {
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri":'s3://{}/test_images/'.format(bucket_name)
}
},
"ContentType": "application/x-image",
"SplitType": "None",
"CompressionType": "None"
},
"TransformResources": {
"InstanceType": "ml.p3.2xlarge",
"InstanceCount": 1
}
}
## launch batch transform job
sm_client = boto3.client('sagemaker')
sm_client.create_transform_job(**request)
print("Created Transform job with name: ", batch_job_name)
while(True):
    batch_response = sm_client.describe_transform_job(TransformJobName=batch_job_name)
    status = batch_response['TransformJobStatus']
    if status == 'Completed':
        print("Transform job ended with status: " + status)
        break
    if status == 'Failed':
        message = batch_response['FailureReason']
        print('Transform failed with the following error: {}'.format(message))
        raise Exception('Transform job failed')
    time.sleep(30)
You can verify the output of the batch transform job by comparing the output of the real-time inference, endpoint_response
, to the output from the batch transform job, which was saved to s3://<bucket_name>/test/<batch_job_name>/2010_001453.jpg.out
as specified in the S3OutputPath
parameter.
Cleaning up
To finish up this walkthrough, tear down the endpoint instance and remove the Amazon SageMaker model. For more information about additional helper methods, see Using Estimators. Delete the Amazon ECR repository and its images through the Amazon ECR client. See the following code:
# tear down the SageMaker endpoint and endpoint configuration
predictor.delete_endpoint()
# delete the SageMaker model
predictor.delete_model()
# delete ECR repository
ecr_client = boto3.client('ecr')
ecr_client.delete_repository(repositoryName='gluoncv', force=True)
Conclusion
Although training models is a data scientist’s primary objective, the deployment process is equally crucial. Amazon SageMaker offers efficient methods to put these algorithms into production. Built-in algorithms can accelerate the training process, but you may need custom modeling for your use case. When building a model with MXNet, you must specify the configuration and processing steps necessary to run it in production. For this post, we outlined the steps to load our model to Amazon SageMaker and run inference for real-time predictions and in batch jobs.
About the Authors
Hussain Karimi is a data scientist at the Machine Learning Solutions Lab, where he works with customers across various verticals to initiate and build automated, algorithmic models that generate business value.
Will Gleave is a Machine Learning Consultant with the NatSec team at AWS Professional Services. In his spare time, he enjoys reading, watching sports, and traveling.
Muhyun Kim is a data scientist at the Amazon Machine Learning Solutions Lab. He solves customers’ various business problems by applying machine learning and deep learning, and also helps them build their skills.
Deploying TensorFlow OpenPose on AWS Inferentia-based Inf1 instances for significant price performance improvements
In this post, you will compile an open-source TensorFlow version of OpenPose using AWS Neuron and fine-tune its inference performance for AWS Inferentia-based instances. You will set up a benchmarking environment, measure the image processing pipeline throughput, and quantify the price-performance improvements as compared to a GPU-based instance.
About OpenPose
Human pose estimation is a machine learning (ML) and computer vision (CV) technology supporting many applications, from pedestrian intent estimation to motion tracking for AR and gaming. At its core, pose estimation identifies coordinates on an image (joints and keypoints), that, when connected, form a representation of an individual skeleton. The representation of body orientation enables tasks such as teaching a robot to interact with humans or quantifying how good yoga asanas really are.
Amongst the many methods that can be used for human pose estimation, the deep learning (DL) bottom-up approach taken by OpenPose—released by the Perceptual Computing Lab of Carnegie Mellon University in 2018—has gained a lot of users. OpenPose is a multi-person 2D pose estimation model that employs a technique called Part Affinity Fields (PAF) to associate body parts and form multiple individual skeletons on the image. In the bottom-up approach, the model identifies the key points and pieces together the skeleton.
To achieve that, OpenPose uses a two-step process. First, it extracts image features using a VGG-19 model and passes those features through a pair of convolutional neural networks (CNN) running in parallel.
One of the CNNs in the pair computes confidence maps to detect body parts. The other computes the PAF and combines the parts to form the individual’s skeleton. You can repeat these parallel branches many times to refine the predictions of the confidence maps and PAF.
The following diagram shows features F from a VGG feeding the PAF and confidence map branches of the OpenPose model. (Source: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields)
The original OpenPose code relies on a Caffe model and pre-compiled C++ libraries. For ease of use and portability of our walkthrough, we work with a reimplementation of the neural networks of OpenPose using TensorFlow 1.15 from the tf-pose-estimation GitHub repo. This repo also provides ML pipeline scripts to pre- and post-process images and videos using OpenPose.
Prerequisites
For this walkthrough, you need an AWS account with access to the AWS Management Console and the ability to create Amazon Elastic Compute Cloud (Amazon EC2) instances with public-facing IP and Amazon Simple Storage Service (Amazon S3) buckets.
Working knowledge of AWS Deep Learning AMIs and Jupyter notebooks with Conda environments is beneficial, but not required.
About AWS Inferentia and Neuron SDK
AWS Inferentia chips are custom built by AWS to provide high-performance inference, with the lowest cost of inference in the cloud, and make it easy for you to integrate ML as part of your standard application features and capabilities.
AWS Neuron is a software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimize the ML inference performance for the Inferentia chips. Neuron is integrated with popular ML frameworks such as TensorFlow, PyTorch, and MXNet and comes pre-installed in AWS Deep Learning AMIs. Deploying deep learning models on AWS Inferentia is done in the same familiar environment used in other platforms, and you can enjoy the boost in performance and lowest cost.
The latest Neuron release, available on the AWS Neuron GitHub, adds support for more models like OpenPose, which we focus on in this post. It also upgrades Neuron PyTorch to the latest stable version (1.5.1), which allows for a wider range of models to compile and run on AWS Inferentia.
Compiling a TensorFlow OpenPose model with the Neuron SDK
You can start the compilation process by setting up an EC2 instance in AWS for compiling the model. We recommend a z1d.xlarge, due to its good single-core performance and memory size. Use the AWS Deep Learning AMI (Ubuntu 18.04) Version 29.0—ami-043f9aeaf108ebc37
—in the US East (N. Virginia) Region. This AMI comes pre-packaged with the Neuron SDK and the required Neuron runtime for AWS Inferentia.
For more information about running AWS Deep Learning AMIs on EC2 instances, see Launching and Configuring a DLAMI.
After you connect to the instance through SSH, activate the aws_neuron_tensorflow_p36
Conda environment and update the Neuron Compiler to the latest release. The compilation script depends on requirements listed in the file requirements-compile.txt
. For compilation scripts and requirements files, see the GitHub repo. Download and install them in the environment with the following code:
source activate aws_neuron_tensorflow_p36
pip install neuron-cc --upgrade --extra-index-url=https://pip.repos.neuron.amazonaws.com
git clone https://github.com/aws/aws-neuron-sdk.git /tmp/aws-neuron-sdk && cp /tmp/aws-neuron-sdk/src/examples/tensorflow/<name_of_the_new_folder>/* . && rm -rf /tmp/aws-neuron-sdk/
pip install -r requirements-compile.txt
You can then start working on the compilation process. You compile the tf-pose-estimation
network frozen graph, available on the GitHub repo. You can adapt the original download script to a single-line wget
command:
wget -c --tries=2 $( wget -q -O - http://www.mediafire.com/file/qlzzr20mpocnpa3/graph_opt.pb | grep -o 'http*://download[^"]*' | tail -n 1 ) -O graph_opt.pb
When the download is complete, run the convert_graph_opt.py
script to compile it for the AWS Inferentia chip. Because Neuron is an ahead-of-time (AOT) compiler, you need to define a specific image size prior to compilation. You can adjust the network input image resolution with the argument --net_resolution
(for example, net_resolution=656x368
).
The compiled model can accept arbitrary batch size inputs at inference runtime. This property enables benchmarking large-scale deployments of the model; however, the pipeline available for image and video processing in the tf-pose-estimation
repo utilizes batch size 1.
To start the compilation process, enter the following code:
python convert_graph_opt.py graph_opt.pb graph_opt_neuron_656x368.pb
The compilation process can take up to 20 minutes to complete. During this time, the compiler optimizes the TensorFlow graph operations and provides the AWS Inferentia version of the saved model. During the process you can expect detailed logs such as the following:
2020-07-15 21:44:43.008627: I bazel-out/k8-opt/bin/tensorflow/neuron/convert/segment.cc:460] There are 11 ops of 7 different types in the graph that are not compiled by neuron-cc: Const, NoOp, Placeholder, RealDiv, Sub, Cast, Transpose, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-tensorflow.md).
INFO:tensorflow:fusing subgraph neuron_op_ed41d2deb8c54255 with neuron-cc
INFO:tensorflow:Number of operations in TensorFlow session: 474
INFO:tensorflow:Number of operations after tf.neuron optimizations: 474
INFO:tensorflow:Number of operations placed on Neuron runtime: 465
Before you can measure the performance of the compiled model, you need to switch to an EC2 Inf1 instance, powered by the AWS Inferentia chip. To share the compiled model between the two instances, create an S3 bucket with the following code:
aws s3 mb s3://<MY_BUCKET_NAME>
aws s3 cp graph_opt_neuron_656x368.pb s3://<MY_BUCKET_NAME>/graph_model.pb
Benchmarking the inference time with a Jupyter notebook on AWS EC2 Inf1 instances
After you have the compiled graph_model.pb
in your S3 bucket, you modify the ML pipeline scripts on the GitHub repo to estimate human poses from images and videos.
To set up the benchmarking Inf1 instance, you can repeat the steps you took to provision the compilation z1d instance. You use the same AMI but change the instance type to inf1.xlarge. A similar setup on a g4dn.xlarge instance might be useful to compare the performance of the base tf-pose-estimation
model on GPUs against the compiled model for AWS Inferentia.
Throughout this post, you interact with this instance and the model using a Jupyter Lab server. For more information about provisioning a Jupyter Lab on Amazon EC2, see Set Up a Jupyter Notebook Server.
Setting up the Conda Environment for tf-pose
After you log in to the Jupyter Lab server, clone the GitHub repo containing the TensorFlow version of OpenPose.
On the Jupyter Launcher page, under Other, choose Terminal.
In the terminal, activate the aws_neuron_tensorflow_p36
environment, which contains the Neuron SDK. Activating the environment and cloning are done with the following code:
conda activate aws_neuron_tensorflow_p36
git clone https://github.com/ildoonet/tf-pose-estimation.git
cd tf-pose-estimation
When the cloning is complete, we recommend following the Package Install instructions to install the repo. From the same terminal screen, you customize the environment by installing opencv-python
and the dependencies listed in the requirements.txt
of the GitHub repo.
You run two pip commands: the first takes care of opencv-python
and the second completes the installation of the requirements.txt
:
pip install opencv-python
pip install -r requirements.txt
You’re now ready to build the notebooks.
On the repo’s root directory, create a new Jupyter notebook by choosing Notebook, Environment (conda_aws_neuron_tensorflow_p36
). On the first cell of the notebook, import the library as defined in the run.py
script, which is the reference pipeline for image processing. In the following cell, create a logger to record the benchmarking. See the following code:
import argparse
import logging
import sys
import time
from tf_pose import common
import cv2
import numpy as np
from tf_pose.estimator import TfPoseEstimator
from tf_pose.networks import get_graph_path, model_wh
logger = logging.getLogger('TfPoseEstimatorRun')
logger.handlers.clear()
logger.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] [%(levelname)s] %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)
Define the main inferencing function main()
and a helper plotter function plotter()
. These functions directly replicate the OpenPose inference pipeline from run.py
. One simple modification is the addition of a repeats
argument, which allows you to run many inference steps in sequence and improve the measure of the average model throughput (measured in seconds per image):
def main(argString='--image ./images/contortion1.jpg --model cmu', repeats=10):
    parser = argparse.ArgumentParser(description='tf-pose-estimation run')
    parser.add_argument('--image', type=str, default='./images/apink2.jpg')
    parser.add_argument('--model', type=str, default='cmu',
                        help='cmu / mobilenet_thin / mobilenet_v2_large / mobilenet_v2_small')
    parser.add_argument('--resize', type=str, default='0x0',
                        help='if provided, resize images before they are processed. '
                             'default=0x0, Recommends : 432x368 or 656x368 or 1312x736 ')
    parser.add_argument('--resize-out-ratio', type=float, default=2.0,
                        help='if provided, resize heatmaps before they are post-processed. default=1.0')
    args = parser.parse_args(argString.split())

    w, h = model_wh(args.resize)
    if w == 0 or h == 0:
        e = TfPoseEstimator(get_graph_path(args.model), target_size=(432, 368))
    else:
        e = TfPoseEstimator(get_graph_path(args.model), target_size=(w, h))

    # estimate human poses from a single image !
    image = common.read_imgfile(args.image, None, None)
    if image is None:
        logger.error('Image can not be read, path=%s' % args.image)
        sys.exit(-1)

    t = time.time()
    for _ in range(repeats):
        humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=args.resize_out_ratio)
    elapsed = time.time() - t

    logger.info('%d times inference on image: %s at %.4f seconds/image.' % (repeats, args.image, elapsed/repeats))

    image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
    return image, e

def plotter(image):
    try:
        import matplotlib.pyplot as plt
        fig = plt.figure(figsize=(12, 12))
        a = fig.add_subplot(1, 1, 1)
        a.set_title('Result')
        plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    except Exception as e:
        logger.warning('matplitlib error, %s' % e)
        cv2.imshow('result', image)
        cv2.waitKey()
Additionally, you can modify the same code structure for inferencing on videos or batches of images, based on the run_video.py
or run_directory.py
, if you’re feeling adventurous!
The main()
function takes as input the same string of arguments as described in the Test Inference section of the GitHub repo. To test the notebook implementation, you use a reference set of arguments (make sure to download the cmu
model using the original download script):
img, e = main('--model cmu --resize 656x368 --image=./images/ski.jpg --resize-out-ratio 2.0')
plotter(img)
The logs show your first multi-person pose analyzed:
‘[TfPoseEstimatorRun] [INFO] 10 times inference on image: ./images/ski.jpg at 1.5624 seconds/image.’
This results in lower than one frame per second (FPS) throughput, which is not great performance. In this use case, you’re running a TensorFlow graph, --model cmu, without a GPU. The performance of such a model isn’t optimal on CPU. If you repeat the setup and run the environment on a g4dn.xlarge instance, with one NVIDIA T4 GPU, the result is quite different:
, without a GPU. The performance of such a model isn’t optimal on CPU. If you repeat the setup and run the environment on a g4dn.xlarge instance, with one NVIDIA T4 GPU, the result is quite different:
‘[TfPoseEstimatorRun] [INFO] 10 times inference on image: ./images/ski.jpg at 0.1708 seconds/image’
The result is 5.85 FPS, which is much better.
Using the Neuron compiled CMU model
So far, you’ve used model artifacts that came with the repo. Instead of using the original download script to retrieve the CMU model, copy the Neuron compiled model into ./models/graph/cmu/graph_model.pb
and rerun the test:
aws s3 cp s3://<MY_BUCKET_NAME>/graph_opt.pb ./models/graph/cmu/graph_model.pb
Make sure to restart the Python kernel on the notebook if you previously ran a test of the non-Neuron compiled model. Restarting the kernel helps make sure all TensorFlow sessions are closed and get a fresh start for the benchmark. Running the same notebook again results in the following log entry:
‘[TfPoseEstimatorRun] [INFO] 10 times inference on image: ./images/ski.jpg at 0.1709 seconds/image.’
The results show the same frame rate as the g4dn.xlarge instance, in an environment that costs approximately 30% less on demand. Despite the cost benefit of moving the workload to an AWS Inferentia-based instance, this throughput doesn’t match the large performance gains observed in other reported results. For example, the Amazon Alexa text-to-speech team cut its inference cost by 50% when migrating to AWS Inferentia.
We decided to profile our version of the compiled graph and look for opportunities to fine-tune the end-to-end inference performance of the OpenPose pipeline. The integration of Neuron with TensorFlow gives access to native profiling libraries. To profile the Neuron compiled graph, we instrumented the TensorFlow session run command on the estimator method using the TensorFlow Python profiler:
from tensorflow.core.protobuf import config_pb2
from tensorflow.python.profiler import model_analyzer, option_builder
run_options = config_pb2.RunOptions(trace_level=config_pb2.RunOptions.FULL_TRACE)
run_metadata = config_pb2.RunMetadata()
peaks, heatMat_up, pafMat_up = self.persistent_sess.run(
    [self.tensor_peaks, self.tensor_heatMat_up, self.tensor_pafMat_up],
    feed_dict={
        self.tensor_image: [img], self.upsample_size: upsample_size
    },
    options=run_options, run_metadata=run_metadata
)
options = option_builder.ProfileOptionBuilder.time_and_memory()
model_analyzer.profile(self.persistent_sess.graph, run_metadata, op_log=None, cmd='scope', options=options)
The model_analyzer.profile
method prints on StdErr
the time and memory consumption of each operation on the TensorFlow graph. With the original code, the Neuron operation and a smoothing operation dominated the total graph runtime. The following output from the StdErr
log shows that the total graph runtime took 108.02 milliseconds, of which the smoothing operation took 43.07 milliseconds:
node name | requested bytes | total execution time | accelerator execution time | cpu execution time
_TFProfRoot (--/16.86MB, --/108.02ms, --/0us, --/108.02ms)
…
TfPoseEstimator/conv5_2_CPM_L1/weights/neuron_op_ed41d2deb8c54255 (430.01KB/430.01KB, 58.42ms/58.42ms, 0us/0us, 58.42ms/58.42ms)
…
smoothing (0B/2.89MB, 0us/43.07ms, 0us/0us, 0us/43.07ms)
smoothing/depthwise (2.85MB/2.85MB, 43.05ms/43.05ms, 0us/0us, 43.05ms/43.05ms)
smoothing/gauss_weight (47.50KB/47.50KB, 18us/18us, 0us/0us, 18us/18us)
…
The smoothing method provides a gaussian blur of the confidence maps calculated by OpenPose. By optimizing this operation, we can extract even more performance out of our end-to-end pose estimation. We modified the filter argument of the smoother in the estimator.py script from 25 to 5. This new configuration brought the total runtime down to 67.44 milliseconds, of which the smoother now only takes 2.37 milliseconds—a 37% reduction in total runtime! On a g4dn, this same optimization had little effect on the runtime. You can also optimize your version of the end-to-end pipeline by changing the same parameters and reinstalling the tf-pose-estimation
repo from your local copy.
We ran the same benchmark across seven different instances types and sizes to evaluate the performance and cost of inference of our optimized end-to-end image processing pipeline. For comparison, we also show the On-Demand instance pricing from Amazon EC2 Pricing.
The throughput on the smallest Inf1 instance size, xlarge, is two times higher than that of the largest g4dn instance evaluated, 8xlarge, at roughly 12 times lower cost per 1,000 images. Comparing the two best options, inf1.xlarge and g4dn.xlarge, inf1 offers 72% lower cost per 1,000 images, or 3.57 times better price-performance than the lowest-cost GPU option. The following table summarizes these findings.
|  | inf1.xlarge | inf1.2xlarge | inf1.6xlarge | g4dn.xlarge | g4dn.2xlarge | g4dn.4xlarge | g4dn.8xlarge |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Image process time [seconds/image] | 0.0703 | 0.0677 | 0.0656 | 0.1708 | 0.1526 | 0.1477 | 0.1427 |
| Throughput [FPS] | 14.22 | 14.77 | 15.24 | 5.85 | 6.55 | 6.77 | 7.01 |
| 1,000 images processing time [seconds] | 70.3 | 67.7 | 65.6 | 170.8 | 152.6 | 147.7 | 142.7 |
| On-Demand cost [$/hr] | $0.368 | $0.584 | $1.904 | $0.526 | $0.752 | $1.204 | $2.176 |
| Cost per 1,000 images [$] | $0.007 | $0.011 | $0.035 | $0.025 | $0.032 | $0.049 | $0.086 |
The chart below summarizes the throughput and cost per 1000 images results for the xlarge and 2xlarge instance sizes.
We further reduced the image-processing cost and increased the throughput of tf-pose-estimation on an Inf1 instance by taking a data-parallel approach to the end-to-end pipeline. The values shown in the preceding table relate to the use of a single AWS Inferentia processing core, a NeuronCore. The benchmarked instance has four, so using only one is wasteful. Our test with an embarrassingly parallel implementation of the main() function call using the Python joblib library showed linear scaling up to four threads. This pattern increased the throughput to 56.88 FPS and decreased the cost per 1,000 images to below $0.002. This is a good indication that a better batching strategy can further improve the price-performance ratio of OpenPose on AWS Inferentia.
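A minimal sketch of that data-parallel pattern follows; the load_compiled_estimator and run_pose_pipeline helpers are hypothetical stand-ins for loading the Neuron-compiled model and running the end-to-end pipeline in your copy of the code:
import glob
from joblib import Parallel, delayed

NUM_CORES = 4  # one worker per NeuronCore on the inf1.xlarge instance

def run_inference(image_paths):
    # Hypothetical worker: load a separate copy of the Neuron-compiled model
    # so the Neuron runtime can place it on a free NeuronCore, then run the
    # end-to-end pipeline on this worker's share of the images.
    estimator = load_compiled_estimator()                            # assumed helper
    return [run_pose_pipeline(estimator, p) for p in image_paths]    # assumed helper

# Split the image list round-robin across the four workers.
image_list = sorted(glob.glob('./images/*.jpg'))
chunks = [image_list[i::NUM_CORES] for i in range(NUM_CORES)]
results = Parallel(n_jobs=NUM_CORES)(delayed(run_inference)(c) for c in chunks)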
The larger CMU model also provides good pose estimation performance. For example, see the following image of the multi-pose detection using the Neuron SDK compiled model, on a scene with subjects at multiple depths.
Safely shutting down and cleaning up
On the Amazon EC2 console, select the compilation and inference instances, and choose Terminate from the Actions drop-down menu. You persisted the compiled model in your s3://<MY_BUCKET_NAME> bucket, so it can be reused later. If you've made changes to the code inside the instances, remember to persist those as well. Terminating the instances discards only the data stored on the instances' home volumes.
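If you prefer to script the cleanup instead of using the console, here is a minimal Boto3 sketch; the instance IDs below are placeholders for your compilation and inference instances:
import boto3

# Region and credentials are resolved from your standard AWS configuration.
ec2 = boto3.client('ec2')

# Replace the placeholder IDs with your compilation and inference instance IDs.
ec2.terminate_instances(InstanceIds=['i-0123456789abcdef0', 'i-0abcdef1234567890'])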
Conclusion
In this post, you walked through the steps of compiling an open-source OpenPose TensorFlow model, updating a custom end-to-end image processing pipeline, and identifying tools to profile and further optimize your ML inference time on an EC2 Inf1 instance. When tuned, the Neuron compiled TensorFlow model was 72% less expensive than the cheapest GPU instance, with consistently better performance. The steps described in this post also apply to other ML model types and frameworks. For more information, see the AWS Neuron SDK GitHub repo.
Learn more about the AWS Inferentia chip and the Amazon EC2 Inf1 instances to get started with running your own custom ML pipelines on AWS Inferentia using the Neuron SDK.
About the Authors
Fabio Nonato de Paula is a Principal Solutions Architect for Autonomous Computing in AWS. He works with large-scale deployments of machine learning and AI for autonomous and intelligent systems. Fabio is passionate about democratizing access to accelerated computing and distributed ML. Outside of work, you can find Fabio riding his motorcycle on the hills of Livermore valley or reading ComiXology.
Haichen Li is a software development engineer in the AWS Neuron SDK team. He works on integrating machine learning frameworks with the AWS Neuron compiler and runtime systems, as well as developing deep learning models that benefit particularly from the Inferentia hardware.
Amazon scientists applying deep neural networks to custom skills
Amazon scientists are seeing increases in accuracy from an approach that uses a new scalable embedding scheme.Read More
Science innovations power Alexa Conversations dialogue management
Dialogue simulator and conversations-first modeling architecture provide ability for customers to interact with Alexa in a natural and conversational manner.Read More
Translating presentation files with Amazon Translate
As solutions architects working in Brazil, we often translate technical content from English to other languages. Doing so manually takes a lot of time, especially when dealing with presentations—in contrast to plain text documents, their content is spread across various areas in multiple slides. To solve that, we wrote a script that translates Microsoft PowerPoint files using Amazon Translate. This post discusses how the translation script works.
Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. When working with Amazon Translate, you provide text in the source language and receive text translated into the target language. For more information about the languages Amazon Translate supports, see Supported Languages and Language Codes.
The translation script is written in Python and relies on an open-source library to parse the presentation files.
Solution
The script requires three arguments:
- Source language
- Target language
- File path of a .pptx file
The script then performs the following functions:
- Parses the file
- Extracts its texts
- Invokes the Amazon Translate API for each text
- Saves a new file with the translated texts that the API returns
The following command translates a presentation from English to Portuguese:
$ python pptx-translator.py en pt example.pptx
Translating example.pptx from en to pt...
Slide 1 of 7
Slide 2 of 7
Slide 3 of 7
Slide 4 of 7
Slide 5 of 7
Slide 6 of 7
Slide 7 of 7
Saving example-pt.pptx...
To interact with Amazon Translate, the script uses Boto, the AWS SDK for Python. Boto is configurable in multiple ways; regardless of the method, you must have AWS credentials and a Region set to make requests to AWS. For more information, see the configuration and credentials documentation for Boto.
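As a minimal sketch, creating the client used in the calls below only requires the following; credentials and the Region are resolved through the standard Boto configuration chain unless you pass them explicitly:
import boto3

# Credentials and Region come from environment variables, the shared
# credentials/config files, or an attached IAM role.
translate = boto3.client('translate')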
To handle presentation files, the script uses python-pptx, an open-source library available on GitHub. When you provide a presentation file path as input, the library returns a Presentation object. See the following code:
from pptx import Presentation

presentation = Presentation(args.input_file_path)
Within a Presentation object, there are slides and, within the slides, shapes and their paragraphs. You can iterate over all paragraphs and invoke the Amazon Translate API for each text. Amazon Translate offers two translation processing modes: real-time translation and asynchronous batch processing. The script uses the former, which allows you to call Amazon Translate on a piece of text and synchronously get a response with the corresponding translation. See the following code:
# Walk the slide/shape/paragraph hierarchy and translate each text run.
for slide in presentation.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for index, paragraph_run in enumerate(paragraph.runs):
                response = translate.translate_text(
                    Text=paragraph_run.text,
                    SourceLanguageCode=source_language_code,
                    TargetLanguageCode=target_language_code,
                    TerminologyNames=terminology_names)
You then get the translated text from the API to replace the original text. See the following code:
paragraph.runs[index].text = response.get('TranslatedText')
The script replaces not only the visible presentation text, but also its comments. Moreover, the script has a map to update the language identifiers. That’s necessary to indicate the correct language so Microsoft PowerPoint can properly check the spelling. See the following code:
paragraph.runs[index].font.language_id = LANGUAGE_CODE_TO_LANGUAGE_ID[target_language_code]
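As an illustration of such a map (the entries shown are an assumed subset; the script's actual map may differ), python-pptx exposes the PowerPoint language identifiers through its MSO_LANGUAGE_ID enumeration:
from pptx.enum.lang import MSO_LANGUAGE_ID

# Assumed subset of the mapping from Amazon Translate language codes to
# PowerPoint language identifiers used for spell checking.
LANGUAGE_CODE_TO_LANGUAGE_ID = {
    'en': MSO_LANGUAGE_ID.ENGLISH_US,
    'pt': MSO_LANGUAGE_ID.BRAZILIAN_PORTUGUESE,
}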
In addition to passing the text, the source language, and the target language, you can use the Custom Terminology feature of Amazon Translate, which ensures that specific terms are translated exactly the way you want. To do this, you pass a list of pre-translated custom terminology names when invoking the API. For instance, to customize the translation of technical terms, put those terms and their respective translations in a CSV file and pass its path as an optional argument to the script. The script reads the file and imports its content into Amazon Translate. See the following code:
with open(terminology_file_path, 'rb') as f:
translate.import_terminology(
Name=TERMINOLOGY_NAME,
MergeStrategy='OVERWRITE',
TerminologyData={'File': bytearray(f.read()), 'Format': 'CSV'})
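After the import, the terminology is referenced by name on each translation call; in the earlier snippet, that is what the terminology_names list carries. A sketch consistent with the constant used above:
# Reference the imported terminology by name in subsequent translate_text calls.
terminology_names = [TERMINOLOGY_NAME]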
After translating all the slides, you save the presentation as a new file with the following code:
presentation.save(output_file_path)
The script is straightforward, but very useful. For the full code, see the GitHub repo.
Conclusion
This post described a script-based solution to translate presentation files using Amazon Translate into a variety of languages. For more information, see What Is Amazon Translate?
About the Authors
Lidio Ramalho is Senior Manager on the AWS R&D and Innovation Solutions Architecture team. He works with customers to build innovative prototypes in AI/ML, Robotics, AR VR, IoT, Satellite, and Blockchain disciplines.
Rafael Werneck is a Solutions Architect at AWS R&D, based in Brazil. Previously, he worked as a Software Development Engineer on Amazon.com.br and Amazon RDS Performance Insights.
Atlassian continuously profiles services in production with Amazon CodeGuru Profiler
This is a guest post by the Jira Cloud Performance Team at Atlassian. In their own words, Atlassian’s mission is to unleash the potential in every team. Our products help teams organize, discuss, and complete their work. And what teams do can change the world. We have helped NASA teams design the Mars Rover, Cochlear teams develop hearing implants and hundreds of thousands of other teams do amazing things. We have an incredible opportunity to help millions more teams in organizations across nearly every industry. Teamwork is hard. We make it easier.
The products we build at Atlassian have hundreds of developers working on them, composed of a mixture of monolithic applications and microservices. When an incident occurs, it can be hard to diagnose the root cause due to the high rate of change within the codebases. Profiling can speed up root cause diagnosis significantly and is an effective technique to identify runtime hotspots and latency contributors within an application. Without profiling, you commonly need to implement custom and ad hoc latency instrumentation of the code, which can be error-prone or cause other side effects.
At Atlassian, we’ve always had tooling to profile services in production, such as using Linux perf or async-profiler, and while these are highly valuable, our methods had some limitations:
- Intervention from a person (or system) was required to capture a profile at the right time, which meant transient problems were often missed
- Ad hoc profiling didn’t provide a baseline profile to compare with
- For security and reliability, a limited number of people had access to run these tools in production
These limitations led us to look into continuous profiling.
In addition to helping diagnose where a service is spending CPU cycles (or time), we wanted a profiling solution that provided visualizations like flame graphs, which are a great diagnostic aid when trying to understand call paths in a complex and dynamic application, and can also be used to aid a developer's understanding of the system.
Our existing in-house profiling solution comprised scripts deployed alongside our services that can generate profiles using Linux perf or async-profiler. A subset of privileged developers (and SREs) could run these scripts on production nodes using AWS Systems Manager. Our use of Linux perf and async-profiler came with several advantages, including:
- Data in a format that we could visualize as a flame graph (which is easy to interpret)
- The ability to profile either a single process or a whole node
- Profiling across different dimensions such as CPU, memory, and I/O
Our initial continuous profiling solution comprised a scheduled job that ran async-profiler (or Linux perf) regularly, uploaded the raw results to a microservice that transformed the raw data into a columnar data format (Parquet), and wrote the result to Amazon Simple Storage Service (Amazon S3).
We defined a schema in AWS Glue allowing developers to query the profile data for a particular service using Amazon Athena. Athena empowered developers to write complex SQL queries to filter profile data on dimensions like time range and stack frames.
We also started building a UI to run the Athena queries and render the results as flame graphs using SpeedScope.
Even with the effort we already employed for this solution, we still had significant work ahead of us to build out an optimal solution.
Meanwhile, the announcement of Amazon CodeGuru Profiler caught our attention—the service offering was highly relevant to us and largely overlapped with our existing capability. After a successful spike and evaluation, we decided to stop building out our solution and integrate CodeGuru Profiler instead.
We chose to define a single profiling group for each of our smaller services. For our larger services, which are partitioned into shards (a separate Auto Scaling group per shard), we chose to create one profiling group per shard.
You can integrate the Java profiler via two available modes: agent and code mode. To ensure a safe rollout, we decided to use the code mode, launching the agent from within our application code. This allowed us to control when to start (or stop) the agent via our existing feature flag mechanism.
We have now integrated CodeGuru Profiler at a platform level, enabling any Atlassian service team to easily take advantage of this capability.
Inspecting latency
One of the first ways we utilized CodeGuru Profiler was to identify code paths that show obvious or well-known problems in terms of CPU utilization or latency. We searched for different forms of synchronization in the profiled data. One interesting case was an EnumMap that was wrapped in a Collections.synchronizedMap. The following screenshot shows the thread states of the stack frames in this code path for a span of 24 hours.
Although the involved stack trace consumes less than 0.5% of all runtime, when we visualized the latency of thread states, we saw that it spent twice as much time in a BLOCKED state as in a RUNNABLE state. To increase the ratio of time spent in a RUNNABLE state, we moved away from the EnumMap to an instance of ConcurrentHashMap.
The following screenshot shows a profile of a similar 24-hour period. After we implemented the change, the relevant stack trace is now all in a RUNNABLE state.
Recommendation reports
CodeGuru Profiler also provides a recommendation report on every profiling group, which identifies commonly known anti-patterns from a performance perspective and suggests known solutions. One such report we received (see the following screenshot) highlighted an issue with how we used the Jackson ObjectMapper.
Upon receipt of this report, we quickly identified and resolved the problem code.
Conclusion
Integration with CodeGuru Profiler has been a major step forward for us, enabling every developer within Atlassian to own and take action on performance engineering.
Since enabling CodeGuru Profiler, we’ve already gained the following benefits:
- Any Atlassian developer can look up a profile from any point in time to understand the call paths that took place in production. This helps developers understand complex applications and aids us when investigating performance issues.
- The time to diagnose the root cause of performance issues in production has significantly reduced, and our developers no longer need to inject custom instrumentation code when diagnosing problems.
- Open availability of profile data across the organization has helped increase developer ownership of performance optimization.
We’re excited by what the CodeGuru Profiler team has built, and are looking forward to the profiling technologies and capabilities that they’ll build next.
About the Authors
- Behrooz Nobakht, Senior Software Engineer
- Matthew Ponsford, Engineering Manager
- Narayanaswamy Anandapadmanabhan, Senior Software Engineer
We are Jira Cloud Performance from Atlassian. We make tools like Jira and Trello that are used by thousands of teams worldwide. We’re serious about creating amazing products, practices, and open work for all teams. Jira Cloud Performance is a specialized working group focused on enabling Jira and Atlassian teams to better observe, monitor, and enhance the performance of their products and services.