Deploying custom models built with Gluon and Apache MXNet on Amazon SageMaker

When you build models with the Apache MXNet deep learning framework, you can take advantage of the expansive model zoo provided by GluonCV to quickly train state-of-the-art computer vision algorithms for image and video processing. A typical development environment for training consists of a Jupyter notebook hosted on a compute instance configured by the operating data scientist. To make sure this environment is replicated in production, it is wrapped inside a Docker container, which is launched and scaled according to the expected load. Hosting a deep learning model is a challenge that generally involves knowledge of server hosting, cluster management, web API protocols, and network security.

In this post, we demonstrate how Amazon SageMaker supports these libraries and how their integration simplifies the deployment of complex algorithms without having to build expertise in web app infrastructure. Whether inference constraints require real-time predictions with low latency or irregularly timed batch jobs with a large number of samples, optimal hosting solutions are available and easy to build.

With Amazon SageMaker, most of the undifferentiated heavy lifting is already done. There is no need to build a container image from scratch or set up a REST API. Instead, you only need to specify the model functions that process inference data in a manner consistent with the training pipeline. You can follow along with this post using an end-to-end example, in which we train an object detection model using open-source Apache tools.

Creating a notebook instance

You can run the example code we provide in this post. We recommend running the code on an Amazon SageMaker notebook instance of type ml.p3.2xlarge or larger to accelerate training. To create a notebook instance, complete the following steps:

  1. On the Amazon SageMaker console, choose Notebook instances.
  2. Choose Create notebook instance.
  3. Enter the name of your notebook instance, such as mxnet-gluon-deployment.
  4. Set the instance type to ml.p3.2xlarge.
  5. Choose Additional configuration.
  6. Set the volume size to 20 GB.
  7. Choose Create notebook instance.
  8. When the instance is ready, choose Open in JupyterLab.
  9. From the launcher, you can open a terminal and run the provided code.

Generating the model

For this use case, you build an object detection model using a pretrained Faster R-CNN architecture from the GluonCV model zoo on the Pascal VOC dataset. The first step is to obtain the data, which you can do by running the data preparation script pascal_voc.py for use with GluonCV. The script downloads 8.4 GB of annotated images to ~/.mxnet/datasets/voc/. With the dataset in place, run the training script train_faster_rcnn.py from this GluonCV example.

Model parameters are saved after each epoch, with the best performing model indicated by the suffix _best.params.
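
Before moving on, you can optionally confirm that the best checkpoint loads and produces predictions. The following is a minimal sketch, assuming a GPU is available and that the parameters file and a sample VOC image sit in the working directory (the image path is a placeholder); it mirrors the preprocessing used in the inference script later in this post:

import mxnet as mx
import gluoncv as gcv

ctx = mx.gpu(0)
model = gcv.model_zoo.get_model('faster_rcnn_resnet50_v1b_voc', pretrained_base=False)
model.load_parameters('faster_rcnn_resnet50_v1b_voc_best.params', ctx=ctx, ignore_extra=True)

# preprocess one image the same way the deployment entry point does
img = mx.image.imread('sample.jpg')
img = gcv.data.transforms.image.imresize(img, 800, 600)
img = mx.nd.image.to_tensor(img)
img = mx.nd.image.normalize(img, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))

class_ids, scores, bboxes = model(img.expand_dims(0).copyto(ctx))
print(class_ids.shape, scores.shape, bboxes.shape)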

Preparing the inference container image

To make sure that the compute environment for the inference instance is set according to our needs, run the model within a Docker container that specifies the required configuration. Containers provide a portable, efficient, standalone package of software for flexible deployment. In most cases, using the default MXNet inference container image in Amazon SageMaker is sufficient for hosting Apache MXNet models. However, we built a computer vision model using GluonCV, which isn’t included in the default image. You can now modify the MXNet inference container image to include GluonCV, which you use for deployment.

The following steps require Docker, which is already included in Amazon SageMaker notebook instances. First, clone the Amazon SageMaker MXNet serving container GitHub repository:

git clone https://github.com/aws/sagemaker-mxnet-serving-container.git
cd sagemaker-mxnet-serving-container

Included in the repo is a Dockerfile that serves our configuration with MXNet 1.6.0, GluonCV 0.6.0, and Python 3.6.8. You can verify the software versions in ./docker/1.6.0/py3/Dockerfile.gpu:

...
ARG MX_URL=https://aws-mxnet-pypi.s3-us-west-2.amazonaws.com/1.6.0/aws_mxnet_cu101mkl-1.6.0-py2.py3-none-manylinux1_x86_64.whl
...
RUN ${PIP} install --no-cache-dir \
    ${MX_URL} \
    git+git://github.com/dmlc/gluon-nlp.git@v0.9.0 \
    gluoncv==0.6.0 \
    mxnet-model-server==$MMS_VERSION \
    keras-mxnet==2.2.4.1 \
    numpy==1.17.4 \
    onnx==1.4.1 \
    "sagemaker-mxnet-inference<2"
...

There is no need to edit this file for this post, but you can add additional packages to the preceding code as needed.

Now you build the container image. Before executing the docker build command, copy the necessary artifacts to the ./docker/1.6.0/py3 directory. In the following example code, we use gluoncv-mxnet-serving:1.6.0-gpu-py3 as the name and the tag. Note the . at the end of the last command:

cp -r docker/artifacts/* docker/1.6.0/py3
cd docker/1.6.0/py3
docker build -t gluoncv-mxnet-serving:1.6.0-gpu-py3 -f Dockerfile.gpu .

To test that the container was built successfully, you can run the container locally. In the following code, replace <docker image id> and <container id> with the output from the commands docker images and docker ps:

# find docker image id
$ docker images
REPOSITORY                                            TAG                               IMAGE ID            CREATED             SIZE
gluoncv-mxnet-serving                                 1.6.0-gpu-py3                     0012f8ebdcab        24 hours ago        6.56GB
nvidia/cuda                                           10.1-cudnn7-runtime-ubuntu16.04   e11e11484e2e        3 months ago        1.71GB

# start the docker container
$ docker run <docker image id>

In a separate terminal, access the shell of the running container:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
af357bce0c53        0012f8ebdcab        "python /usr/local/b…"   7 hours ago         Up 7 hours          8080-8081/tcp       musing_napier

# access shell of the running docker
$ docker exec -it <container id> /bin/bash
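
From the container shell, you can quickly confirm that the expected library versions are present. The following is a minimal check run with the container's Python interpreter:

import mxnet
import gluoncv

print(mxnet.__version__, gluoncv.__version__)  # expect 1.6.0 and 0.6.0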

To escape the terminals and tear down the resources, enter exit in the shell accessing the container and enter CTRL+C in the terminal running the container.

Now you’re ready to upload the new MXNet inference container image to Amazon Elastic Container Registry (Amazon ECR) so you can point to this container image when you deploy the model on Amazon SageMaker. For more information, see Pushing an image.

You first authenticate Docker to the Amazon ECR registry with get-login. Assuming the AWS Command Line Interface (AWS CLI) version is prior to 1.17.0, enter the following code to get the authenticated docker login command:

$ aws ecr get-login --region <AWS Region> --no-include-email

For instructions on using AWS CLI version 1.17.0 or higher, see Using an Authorization Token.

Copy the output of the command, then paste and execute it to authenticate your Docker installation with Amazon ECR. Replace <AWS Region> with the appropriate Region; for example, to use the US East (N. Virginia) Region, use us-east-1.

Create a repository in Amazon ECR using the AWS CLI by running aws ecr create-repository. For this use case, use gluoncv for <repository name>:

$ aws ecr create-repository --repository-name <repository name> --region <AWS Region>

Before pushing the local image to Amazon ECR, tag it with the name of the target repository. The image ID is retrieved with the docker images command and named with the docker tag command and the repository URI, which you can also retrieve on the Amazon ECR console. See the following code:

$ docker images
REPOSITORY                                            TAG                               IMAGE ID            CREATED             SIZE
gluoncv-mxnet-serving                                 1.6.0-gpu-py3                     cb0a03065295        7 minutes ago       4.09GB
nvidia/cuda                                           10.1-cudnn7-runtime-ubuntu16.04   e11e11484e2e        3 months ago        1.71GB

$ docker tag <image id> <AWS account ID>.dkr.ecr.<AWS Region>.amazonaws.com/<repository name>

$ docker images
REPOSITORY                                             TAG                               IMAGE ID            CREATED             SIZE
<AWS account id>.dkr.ecr.<AWS Region>.amazonaws.com/gluoncv   latest                            cb0a03065295        9 minutes ago       4.09GB
gluoncv-mxnet-serving                                  1.6.0-gpu-py3                     cb0a03065295        9 minutes ago       4.09GB
nvidia/cuda                                            10.1-cudnn7-runtime-ubuntu16.04   e11e11484e2e        3 months ago        1.71GB

To push the image to the Amazon ECR repository so that it’s available for hosting on Amazon SageMaker endpoints, use the docker push command. You can confirm that the image is successfully pushed using the aws ecr list-images AWS CLI command:

$ docker push <AWS account ID>.dkr.ecr.<AWS Region>.amazonaws.com/<repository name>

$ aws ecr list-images --repository-name gluoncv
{
    "imageIds": [
        {
            "imageDigest": "sha256:66bc1759a4d2e94daff4dd02446024a11c5af29d9259175f11701a0b9ee2d2d1",
            "imageTag": "latest"
        }
    ]
}

Alternatively, you can verify the image exists in the repository by checking on the Amazon ECR console.

When deploying the model, use this image URI as the argument to image. You can run the following code to set up the image programmatically from a Jupyter notebook:

import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name
ecr_repository = 'mxnet-gluoncv'
tag = ':latest'
image_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)

# Create ECR repository and push docker image
!docker build -t $ecr_repository -f ./docker/Dockerfile.gpu ./docker -q
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $image_uri
!docker push $image_uri

Deploying the model

You can optimize compute resources according to the inference requirements of your use case. If you collect batches of data intermittently and don’t need immediate predictions, you can run batch jobs over the acquired data by spinning up a compute instance when necessary, processing the data in bulk, storing the predictions, and tearing down the instance.

Alternatively, you may require that calls for inference be answered immediately. In this case, spin up a compute instance for real-time inference at an endpoint that consumes data over an API call and returns the model output. You only pay for time when the compute instance is running. We provide details for both use cases in this section.

Prepare the model artifacts by compressing them into a tarball and uploading it to Amazon S3, from which the deployed model is read. Because you’re using an architecture that already exists in the GluonCV model zoo, you only need to upload the weights. The .params file from the previous step should ultimately live in s3://<bucket_name>/<prefix>/model.tar.gz.
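
A minimal sketch of this packaging step, assuming the trained parameters file is in the working directory and that bucket_name and s3_prefix are already defined (the variable and file names here match the deployment code that follows):

import tarfile
import boto3

params_file = 'faster_rcnn_resnet50_v1b_voc_best.params'  # best checkpoint from training
tar_file_name = 'model.tar.gz'

# compress the weights into a tarball
with tarfile.open(tar_file_name, 'w:gz') as tar:
    tar.add(params_file)

# upload the tarball to S3, where the deployed model reads it
s3_client = boto3.client('s3')
s3_client.upload_file(tar_file_name, bucket_name, '{}/{}'.format(s3_prefix, tar_file_name))

You execute deployment via the Amazon SageMaker SDK. See the following code: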

import sagemaker
from sagemaker.mxnet import MXNetModel
model = MXNetModel(
    entry_point='./source_directory/entrypoint.py',
    model_data='s3://{}/{}/{}'.format(bucket_name, s3_prefix, tar_file_name),
    framework_version='1.6.0',
    py_version='py3',
    source_dir='./source_directory/',
    image='<AWS account id>.dkr.ecr.<AWS Region>.amazonaws.com/<repository name>:latest',
    role=sagemaker.get_execution_role()
)

The image argument is the URI of the container image you uploaded to the Amazon ECR repository in the preceding section. Make sure that the Amazon ECR repository and the Amazon SageMaker model are in the same Region. Most of the processing, inference, and configuration resides in the following entrypoint.py script, which defines the model and the steps necessary to decode the payload so that the MXNet backend properly interprets the data:

entrypoint.py

## import packages ##
import base64
import json
import mxnet as mx
from mxnet import gpu
import numpy as np
import sys
import gluoncv as gcv
from gluoncv import data as gdata


## SageMaker loading function ##
def model_fn(model_dir):
    """
    Load the pretrained model 
    
    Args:
        model_dir (str): directory where model artifacts are saved/loaded
    """
    model = gcv.model_zoo.get_model('faster_rcnn_resnet50_v1b_voc',  pretrained_base=False)
    ctx = mx.gpu(0)
    model.load_parameters(f'{model_dir}/faster_rcnn_resnet50_v1b_voc_best.params', ctx, ignore_extra=True)
    print('Loaded gluoncv model')
    return model, ctx


## SageMaker inference function ##
def transform_fn(net, data, input_content_type, output_content_type):

    ## retrieve the model and context from the first parameter, net
    model, ctx = net

    ## decode image ##
    # for endpoint API calls
    if type(data) == str:
        parsed = json.loads(data)
        img = mx.nd.array(parsed)
    # for batch transform jobs
    else:
        img = mx.img.imdecode(data)
        
        
    ## preprocess ##
    
    # normalization values taken from gluoncv
    # https://gluon-cv.mxnet.io/_modules/gluoncv/data/transforms/presets/rcnn.html
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    img = gdata.transforms.image.imresize(img, 800, 600)
    img = mx.nd.image.to_tensor(img)
    img = mx.nd.image.normalize(img, mean=mean, std=std)
    nda = img.expand_dims(0)  
    nda = nda.copyto(ctx)
    
    
    ## inference ##
    cid, score, bbox = model(nda)
    
    # predictions to lists
    cid = cid.asnumpy().tolist()
    score = score.asnumpy().tolist()
    bbox = bbox.asnumpy().tolist()
    
    # format predictions 
    response = []
    for x,y,z in zip(cid[0], score[0], bbox[0]):
        if x[0] == -1.0:
            continue
        response.append([x[0], y[0], z[0]/800, z[1]/600, z[2]/800, z[3]/600])
        
    predictions = {'prediction':response}
    predictionslist = [predictions]
    
    return predictionslist

After you import the supporting libraries for model inference and data processing, define the model in model_fn() by loading the Faster R-CNN architecture and the trained weights you uploaded to Amazon S3. The file name passed to model.load_parameters() must match the name of the parameters file that you trained and uploaded to Amazon S3 earlier in the tarball. For this use case, the parameters are stored in faster_rcnn_resnet50_v1b_voc_best.params. To utilize the GPU, you must explicitly set the context when loading the parameters.

Instructions to run predictions over the model are written in transform_fn(). You can call inference from a live endpoint API or launch it on a schedule for batch jobs. The corresponding data type sent to the model varies between these two options. When sent for a real-time prediction over the endpoint API, the transform function receives a string that you can load and interpret according to its underlying data type. Batch transform jobs, on the other hand, send the data directly as a serialized image, which you need to decode with MXNet utilities. You can handle both cases by checking the type of the data object.

The loaded data is normalized according to the default preprocessing steps that GluonCV implements, as enforced in the normalize() function in the entry point script. Lastly, the data is passed through the neural network for inference with the output formatted such that the return payload includes the predicted class ID, confidence of the bounding box, and bounding box attributes.

With all the setup in place, you’re now ready to deploy. See the following code:

predictor = model.deploy(initial_instance_count=1, instance_type='ml.p3.2xlarge')

Testing

With the deployed endpoint up and running, you can make a real-time inference with the returned object from the preceding step. After loading an image into a NumPy array, fire it off for inference:

## inference via endpoint API
import os

import imageio
import numpy as np

home_path = os.path.expanduser('~')
test_image = home_path + '/.mxnet/datasets/voc/VOC2012/JPEGImages/2010_001453.jpg'

# load the image as a numpy array
test_image_data = np.asarray(imageio.imread(test_image))

# Serializes data and makes a prediction request to the SageMaker endpoint
endpoint_response = predictor.predict(test_image_data)
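
If you need to invoke the endpoint from outside the Amazon SageMaker Python SDK, a minimal sketch using the low-level runtime client looks like the following (the JSON serialization matches what transform_fn() expects for endpoint calls):

import json
import boto3

runtime = boto3.client('sagemaker-runtime')
raw_response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint,
    ContentType='application/json',
    Body=json.dumps(test_image_data.tolist()))
predictions = json.loads(raw_response['Body'].read())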

To visualize the output, draw from the metadata included in the response. See the following code:

## visualize predictions on a test image
import matplotlib.image as mpimg
import matplotlib.patches as patches
import matplotlib.pyplot as plt

img = mpimg.imread(test_image)
fig, ax = plt.subplots(1, dpi=120)
ax.imshow(img)
for box in endpoint_response[0]['prediction']:
    class_id, confidence, xmin, ymin, xmax, ymax = box
    xmin = xmin*img.shape[1]
    xmax = xmax*img.shape[1]
    ymin = ymin*img.shape[0]
    ymax = ymax*img.shape[0]
    if confidence > 0.9:
        height = ymax-ymin
        width = xmax-xmin
        rect = patches.Rectangle(
            (xmin,ymin), width, height, linewidth=1, edgecolor='yellow', facecolor='none')
        ax.add_patch(rect)
ax.axis('off')
plt.show()

After 20 epochs of training, you can see bounding boxes that accurately identify various objects in the model response. See the following screenshot.

 

The purpose of maintaining an endpoint API is to keep a model available for real-time predictions. It’s unnecessary to pay for a running endpoint instance if inference jobs are scheduled in advance. For this use case, you send a list of images for prediction to a batch transform job, which spins up a compute instance to run the model and tears it down upon completion. You only pay for the runtime of the instance, which saves costs on downtime. Set up and launch a batch transform job by uploading images to Amazon S3 and defining the data and model paths, along with a few other settings, in a dictionary. See the following code:

## inference via batch transform

# upload a sample of images to SageMaker
test_images = ['/.mxnet/datasets/voc/VOC2012/JPEGImages/2010_003939.jpg',
               '/.mxnet/datasets/voc/VOC2012/JPEGImages/2008_004205.jpg',
               '/.mxnet/datasets/voc/VOC2012/JPEGImages/2009_001139.jpg',
               '/.mxnet/datasets/voc/VOC2012/JPEGImages/2010_001453.jpg',
               '/.mxnet/datasets/voc/VOC2012/JPEGImages/2011_000148.jpg',
               '/.mxnet/datasets/voc/VOC2012/JPEGImages/2011_005806.jpg',
               '/.mxnet/datasets/voc/VOC2012/JPEGImages/2012_004299.jpg']

s3_test_prefix = 'test_images'
for test_image in test_images:
    test_image = home_path + test_image
    s3_client.upload_file(test_image, bucket_name, s3_test_prefix+'/'+test_image.split('/')[-1])

model_name = predictor.endpoint
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
batch_job_name = "test-batch-job" + timestamp
request = {
    "TransformJobName": batch_job_name,
    "ModelName": model_name,
    "MaxConcurrentTransforms": 1,
    "MaxPayloadInMB": 6,
    "BatchStrategy": "SingleRecord",
    "TransformOutput": {
        "S3OutputPath": 's3://{}/test/{}/'.format(bucket_name, batch_job_name)
    },
    "TransformInput": {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri":'s3://{}/test_images/'.format(bucket_name)
            }
        },
        "ContentType": "application/x-image",
        "SplitType": "None",
        "CompressionType": "None"
    },
    "TransformResources": {
            "InstanceType": "ml.p3.2xlarge",
            "InstanceCount": 1
    }
}

## launch batch transform job
sm_client = boto3.client('sagemaker')

sm_client.create_transform_job(**request)

print("Created Transform job with name: ", batch_job_name)

while(True):
    batch_response = sm_client.describe_transform_job(TransformJobName=batch_job_name)
    status = batch_response['TransformJobStatus']
    if status == 'Completed':
        print("Transform job ended with status: " + status)
        break
    if status == 'Failed':
        message = batch_response['FailureReason']
        print('Transform failed with the following error: {}'.format(message))
        raise Exception('Transform job failed') 
    time.sleep(30)

You can verify the batch transform results by comparing the real-time inference output, endpoint_response, to the batch output, which was saved to s3://<bucket_name>/test/<batch_job_name>/2010_001453.jpg.out as specified in the S3OutputPath parameter.
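
A minimal sketch of reading one batch output back for comparison, assuming the bucket_name and batch_job_name variables defined earlier and that the .out file contains the JSON response produced by transform_fn():

import json
import boto3

s3_client = boto3.client('s3')
output_key = 'test/{}/2010_001453.jpg.out'.format(batch_job_name)
batch_output = s3_client.get_object(Bucket=bucket_name, Key=output_key)
batch_predictions = json.loads(batch_output['Body'].read())

# compare the first prediction with the real-time response for the same image
print(batch_predictions[0]['prediction'][0])
print(endpoint_response[0]['prediction'][0])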

Cleaning up

To finish up this walkthrough, tear down the endpoint instance and remove the Amazon SageMaker model. For more information about additional helper methods, see Using Estimators. Delete the Amazon ECR repository and its images through the Amazon ECR client. See the following code:

# tear down the SageMaker endpoint and endpoint configuration
predictor.delete_endpoint()

# delete the SageMaker model
predictor.delete_model()
    
# delete ECR repository
ecr_client = boto3.client('ecr')
ecr_client.delete_repository(repositoryName='gluoncv', force=True)

Conclusion

Although training models is a data scientist’s primary objective, the deployment process is equally crucial. Amazon SageMaker offers efficient methods to put these algorithms into production. Built-in algorithms can accelerate the training process, but you may need custom modeling for your use case. When building a model with MXNet, you must specify the configuration and processing steps necessary to run it in production. For this post, we outlined the steps to load our model to Amazon SageMaker and run inference for real-time predictions and in batch jobs.


About the Authors

Hussain Karimi is a data scientist at the Machine Learning Solutions Lab, where he works with customers across various verticals to initiate and build automated, algorithmic models that generate business value.

Will Gleave is a Machine Learning Consultant with the NatSec team at AWS Professional Services. In his spare time, he enjoys reading, watching sports, and traveling.

Muhyun Kim is a data scientist at the Amazon Machine Learning Solutions Lab. He solves customers’ various business problems by applying machine learning and deep learning, and also helps them get skilled.

Read More

Deploying TensorFlow OpenPose on AWS Inferentia-based Inf1 instances for significant price performance improvements

In this post, you compile an open-source TensorFlow version of OpenPose using AWS Neuron and fine-tune its inference performance for AWS Inferentia-based instances. You set up a benchmarking environment, measure the image processing pipeline throughput, and quantify the price-performance improvements compared to a GPU-based instance.

About OpenPose

Human pose estimation is a machine learning (ML) and computer vision (CV) technology supporting many applications, from pedestrian intent estimation to motion tracking for AR and gaming. At its core, pose estimation identifies coordinates on an image (joints and keypoints) that, when connected, form a representation of an individual skeleton. The representation of body orientation enables tasks such as teaching a robot to interact with humans or quantifying how good yoga asanas really are.

Among the many methods that can be used for human pose estimation, the deep learning (DL) bottom-up approach taken by OpenPose, released by the Perceptual Computing Lab of Carnegie Mellon University in 2018, has gained a lot of users. OpenPose is a multi-person 2D pose estimation model that employs a technique called Part Affinity Fields (PAF) to associate body parts and form multiple individual skeletons on the image. In the bottom-up approach, the model first identifies the key points and then pieces together the skeletons.

To achieve that, OpenPose uses a two-step process. First, it extracts image features using a VGG-19 model and passes those features through a pair of convolutional neural networks (CNN) running in parallel.

One of the CNNs in the pair computes confidence maps to detect body parts. The other computes the PAF and combines the parts to form the individual’s skeleton. You can repeat these parallel branches many times to refine the predictions of the confidence maps and PAF.

The following diagram shows features F from a VGG feeding the PAF and confidence map branches of the OpenPose model. (Source: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields)

The original OpenPose code relies on a Caffe model and pre-compiled C++ libraries. For ease of use and portability of our walkthrough, we work with a reimplementation of the neural networks of OpenPose using TensorFlow 1.15 from the tf-pose-estimation GitHub repo. This repo also provides ML pipeline scripts to pre- and post-process images and videos using OpenPose.

Prerequisites

For this walkthrough, you need an AWS account with access to the AWS Management Console and the ability to create Amazon Elastic Compute Cloud (Amazon EC2) instances with public-facing IP and Amazon Simple Storage Service (Amazon S3) buckets.

Working knowledge of AWS Deep Learning AMIs and Jupyter notebooks with Conda environments is beneficial, but not required.

About AWS Inferentia and Neuron SDK

AWS Inferentia chips are custom built by AWS to provide high-performance inference, with the lowest cost of inference in the cloud, and make it easy for you to integrate ML as part of your standard application features and capabilities.

AWS Neuron is a software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimize the ML inference performance for the Inferentia chips. Neuron is integrated with popular ML frameworks such as TensorFlow, PyTorch, and MXNet and comes pre-installed in AWS Deep Learning AMIs. Deploying deep learning models on AWS Inferentia is done in the same familiar environment used in other platforms, and you can enjoy the boost in performance and lowest cost.

The latest Neuron release, available on the AWS Neuron GitHub, adds support for more models like OpenPose, which we focus on in this post. It also upgrades Neuron PyTorch to the latest stable version (1.5.1), which allows for a wider range of models to compile and run on AWS Inferentia.

Compiling a TensorFlow OpenPose model with the Neuron SDK

You can start the compilation process by setting up an EC2 instance in AWS for compiling the model. We recommend a z1d.xlarge, due to its good single-core performance and memory size. Use the AWS Deep Learning AMI (Ubuntu 18.04) Version 29.0—ami-043f9aeaf108ebc37—in the US East (N. Virginia) Region. This AMI comes pre-packaged with the Neuron SDK and the required Neuron runtime for AWS Inferentia.

For more information about running AWS Deep Learning AMIs on EC2 instances, see Launching and Configuring a DLAMI.

After you connect to the instance through SSH, activate the aws_neuron_tensorflow_p36 Conda environment and update the Neuron compiler to the latest release. The compilation script depends on requirements listed in the file requirements-compile.txt. For compilation scripts and requirements files, see the GitHub repo. Download and install them in the environment with the following code:

source activate aws_neuron_tensorflow_p36
pip install neuron-cc --upgrade --extra-index-url=https://pip.repos.neuron.amazonaws.com
git clone https://github.com/aws/aws-neuron-sdk.git /tmp/aws-neuron-sdk && cp /tmp/aws-neuron-sdk/src/examples/tensorflow/<name_of_the_new_folder>/* . && rm -rf /tmp/aws-neuron-sdk/
pip install -r requirements-compile.txt

You can then start working on the compilation process. You compile the tf-pose-estimation network frozen graph, available on the GitHub repo. You can adapt the original download script to a single-line wget command:

wget -c --tries=2 $( wget -q -O - http://www.mediafire.com/file/qlzzr20mpocnpa3/graph_opt.pb | grep -o 'http*://download[^"]*' | tail -n 1 ) -O graph_opt.pb

When the download is complete, run the convert_graph_opt.py script to compile it for the AWS Inferentia chip. Because Neuron is an ahead-of-time (AOT) compiler, you need to define a specific image size prior to compilation. You can adjust the network input image resolution with the argument --net_resolution (for example, net_resolution=656x368).

The compiled model can accept arbitrary batch size inputs at inference runtime. This property enables benchmarking large-scale deployments of the model; however, the pipeline available for image and video processing in the tf-pose-estimation repo uses a batch size of 1.

To start the compilation process, enter the following code:

python convert_graph_opt.py graph_opt.pb graph_opt_neuron_656x368.pb

The compilation process can take up to 20 minutes to complete. During this time, the compiler optimizes the TensorFlow graph operations and provides the AWS Inferentia version of the saved model. During the process you can expect detailed logs such as the following:

2020-07-15 21:44:43.008627: I bazel-out/k8-opt/bin/tensorflow/neuron/convert/segment.cc:460] There are 11 ops of 7 different types in the graph that are not compiled by neuron-cc: Const, NoOp, Placeholder, RealDiv, Sub, Cast, Transpose, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-tensorflow.md).
INFO:tensorflow:fusing subgraph neuron_op_ed41d2deb8c54255 with neuron-cc
INFO:tensorflow:Number of operations in TensorFlow session: 474
INFO:tensorflow:Number of operations after tf.neuron optimizations: 474
INFO:tensorflow:Number of operations placed on Neuron runtime: 465

Before you can measure the performance of the compiled model, you need to switch to an EC2 Inf1 instance, powered by the AWS Inferentia chip. To share the compiled model between the two instances, create an S3 bucket with the following code:

aws s3 mb s3://<MY_BUCKET_NAME>
aws s3 cp graph_opt_neuron_656x368.pb s3://<MY_BUCKET_NAME>/graph_model.pb

Benchmarking the inference time with a Jupyter notebook on AWS EC2 Inf1 instances

After you have the compiled graph_model.pb in your S3 bucket, you modify the ML pipeline scripts on the GitHub repo to estimate human poses from images and videos.

To set up the benchmarking Inf1 instance, you can repeat the steps you took to provision the compilation z1d instance. You use the same AMI but change the instance type to inf1.xlarge. A similar setup on a g4dn.xlarge instance might be useful to compare the performance of the base tf-pose-estimation model on GPUs against the compiled model for AWS Inferentia.

Throughout this post, you interact with this instance and the model using a Jupyter Lab server. For more information about provisioning a Jupyter Lab on Amazon EC2, see Set Up a Jupyter Notebook Server.

Setting up the Conda Environment for tf-pose

After you log in to the Jupyter Lab server, clone the GitHub repo containing the TensorFlow version of OpenPose.

On the Jupyter Launcher page, under Other, choose Terminal.

In the terminal, activate the aws_neuron_tensorflow_p36 environment, which contains the Neuron SDK. Activating the environment and cloning are done with the following code:

conda activate aws_neuron_tensorflow_p36
git clone https://github.com/ildoonet/tf-pose-estimation.git
cd tf-pose-estimation

When the cloning is complete, we recommend following the Package Install instructions to install the repo. From the same terminal screen, customize the environment by installing opencv-python and the dependencies listed in the requirements.txt file of the GitHub repo.

You run two pip commands: the first takes care of opencv-python and the second completes the installation of the requirements.txt:

pip install opencv-python 
pip install -r requirements.txt

You’re now ready to build the notebooks.

In the repo’s root directory, create a new Jupyter notebook by choosing Notebook, Environment (conda_aws_neuron_tensorflow_p36). In the first cell of the notebook, import the libraries as defined in the run.py script, which is the reference pipeline for image processing. In the following cell, create a logger to record the benchmarking. See the following code:

import argparse
import logging
import sys
import time

from tf_pose import common
import cv2
import numpy as np
from tf_pose.estimator import TfPoseEstimator
from tf_pose.networks import get_graph_path, model_wh
logger = logging.getLogger('TfPoseEstimatorRun')
logger.handlers.clear()
logger.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] [%(levelname)s] %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)

Define the main inferencing function main() and a helper plotter function plotter(). These functions directly replicate the OpenPose inference pipeline from run.py. One simple modification is the addition of a repeats argument, which allows you to run many inference steps in sequence and improve the measure of the average model throughput (measured in seconds per image):

def main(argString='--image ./images/contortion1.jpg --model cmu', repeats=10):
    parser = argparse.ArgumentParser(description='tf-pose-estimation run')
    parser.add_argument('--image', type=str, default='./images/apink2.jpg')
    parser.add_argument('--model', type=str, default='cmu',
                        help='cmu / mobilenet_thin / mobilenet_v2_large / mobilenet_v2_small')
    parser.add_argument('--resize', type=str, default='0x0',
                        help='if provided, resize images before they are processed. '
                             'default=0x0, Recommends : 432x368 or 656x368 or 1312x736 ')
    parser.add_argument('--resize-out-ratio', type=float, default=2.0,
                        help='if provided, resize heatmaps before they are post-processed. default=1.0')

    args = parser.parse_args(argString.split())

    w, h = model_wh(args.resize)
    if w == 0 or h == 0:
        e = TfPoseEstimator(get_graph_path(args.model), target_size=(432, 368))
    else:
        e = TfPoseEstimator(get_graph_path(args.model), target_size=(w, h))

    # estimate human poses from a single image !
    image = common.read_imgfile(args.image, None, None)
    if image is None:
        logger.error('Image can not be read, path=%s' % args.image)
        sys.exit(-1)

    t = time.time()
    for _ in range(repeats):
        humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=args.resize_out_ratio)
    elapsed = time.time() - t

    logger.info('%d times inference on image: %s at %.4f seconds/image.' % (repeats, args.image, elapsed/repeats))

    image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
    return image, e
def plotter(image):
    try:
        import matplotlib.pyplot as plt

        fig = plt.figure(figsize=(12,12))
        a = fig.add_subplot(1, 1, 1)
        a.set_title('Result')
        plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        
    except Exception as e:
        logger.warning('matplitlib error, %s' % e)
        cv2.imshow('result', image)
        cv2.waitKey()

Additionally, you can modify the same code structure for inferencing on videos or batches of images, based on the run_video.py or run_directory.py, if you’re feeling adventurous!

The main() function takes as input the same string of arguments as described in the Test Inference section of the GitHub repo. To test the notebook implementation, you use a reference set of arguments (make sure to download the cmu model using the original download script):

img, e = main('--model cmu --resize 656x368 --image=./images/ski.jpg --resize-out-ratio 2.0')
plotter(img)

The logs show your first multi-person pose analyzed:

‘[TfPoseEstimatorRun] [INFO] 10 times inference on image: ./images/ski.jpg at 1.5624 seconds/image.’

This results in a throughput of less than one frame per second (FPS), which is not great performance. In this use case, you’re running a TensorFlow graph, --model cmu, without a GPU. The performance of such a model isn’t optimal on CPU. If you repeat the setup and run the environment on a g4dn.xlarge instance, with one NVIDIA T4 GPU, the result is quite different:

‘[TfPoseEstimatorRun] [INFO] 10 times inference on image: ./images/ski.jpg at 0.1708 seconds/image’ 

The result is 5.85 FPS, which is much better.

Using the Neuron compiled CMU model

So far, you’ve used model artifacts that came with the repo. Instead of using the original download script to retrieve the CMU model, copy the Neuron compiled model into ./models/graph/cmu/graph_model.pb and rerun the test:

aws s3 cp s3://<MY_BUCKET_NAME>/graph_model.pb ./models/graph/cmu/graph_model.pb

Make sure to restart the Python kernel on the notebook if you previously ran a test of the non-Neuron compiled model. Restarting the kernel helps make sure all TensorFlow sessions are closed and the benchmark gets a fresh start. Running the same notebook again results in the following log entry:

‘[TfPoseEstimatorRun] [INFO] 10 times inference on image: ./images/ski.jpg at 0.1709 seconds/image.’

The results show the same frame rate as the g4dn.xlarge instance, in an environment that costs approximately 30% less on demand. Despite the cost benefit of moving the workload to an AWS Inferentia-based instance, this throughput doesn’t yet reflect the large performance gains reported elsewhere. For example, the Amazon Alexa text-to-speech team cut its inference cost by 50% when migrating to AWS Inferentia.

We decided to profile our version of the compiled graph and look for opportunities to fine-tune the end-to-end inference performance of the OpenPose pipeline. The integration of Neuron with TensorFlow gives access to native profiling libraries. To profile the Neuron compiled graph, we instrumented the TensorFlow session run command on the estimator method using the TensorFlow Python profiler:

from tensorflow.core.protobuf import config_pb2
from tensorflow.python.profiler import model_analyzer, option_builder

run_options = config_pb2.RunOptions(trace_level=config_pb2.RunOptions.FULL_TRACE)
run_metadata = config_pb2.RunMetadata()

peaks, heatMat_up, pafMat_up = self.persistent_sess.run(
    [self.tensor_peaks, self.tensor_heatMat_up, self.tensor_pafMat_up], feed_dict={
        self.tensor_image: [img], self.upsample_size: upsample_size
    }, 
    options=run_options, run_metadata=run_metadata
)

options = option_builder.ProfileOptionBuilder.time_and_memory()
model_analyzer.profile(self.persistent_sess.graph, run_metadata, op_log=None, cmd='scope', options=options)

The model_analyzer.profile method prints on StdErr the time and memory consumption of each operation on the TensorFlow graph. With the original code, the Neuron operation and a smoothing operation dominated the total graph runtime. The following output from the StdErr log shows that the total graph runtime took 108.02 milliseconds, of which the smoothing operation took 43.07 milliseconds:

node name | requested bytes | total execution time | accelerator execution time | cpu execution time
_TFProfRoot (--/16.86MB, --/108.02ms, --/0us, --/108.02ms)
…
   TfPoseEstimator/conv5_2_CPM_L1/weights/neuron_op_ed41d2deb8c54255 (430.01KB/430.01KB, 58.42ms/58.42ms, 0us/0us, 58.42ms/58.42ms)
…
smoothing (0B/2.89MB, 0us/43.07ms, 0us/0us, 0us/43.07ms)
   smoothing/depthwise (2.85MB/2.85MB, 43.05ms/43.05ms, 0us/0us, 43.05ms/43.05ms)
   smoothing/gauss_weight (47.50KB/47.50KB, 18us/18us, 0us/0us, 18us/18us)
…

The smoothing method provides a Gaussian blur of the confidence maps calculated by OpenPose. By optimizing this operation, we can extract even more performance out of our end-to-end pose estimation. We modified the filter argument of the smoother in the estimator.py script from 25 to 5. This new configuration brought the total runtime down to 67.44 milliseconds, of which the smoother now takes only 2.37 milliseconds, a 37% reduction in total runtime. On a g4dn, this same optimization had little effect on the runtime. You can also optimize your version of the end-to-end pipeline by changing the same parameter and reinstalling the tf-pose-estimation repo from your local copy.

We ran the same benchmark across seven different instances types and sizes to evaluate the performance and cost of inference of our optimized end-to-end image processing pipeline. For comparison, we also show the On-Demand instance pricing from Amazon EC2 Pricing.

The throughput on the smallest Inf1 instance (inf1.xlarge) is twice that of the largest g4dn instance evaluated (g4dn.8xlarge), at roughly one-twelfth the cost per 1,000 images. Comparing the two best options, inf1.xlarge and g4dn.xlarge, inf1 has a 72% lower cost per 1,000 images, or 3.57 times better price performance than the lowest-cost GPU option. The following table summarizes these findings.

Metric                                  inf1.xlarge  inf1.2xlarge  inf1.6xlarge  g4dn.xlarge  g4dn.2xlarge  g4dn.4xlarge  g4dn.8xlarge
Image process time [seconds/image]      0.0703       0.0677        0.0656        0.1708       0.1526        0.1477        0.1427
Throughput [FPS]                        14.22        14.77         15.24         5.85         6.55          6.77          7.01
1,000-image processing time [seconds]   70.3         67.7          65.6          170.8        152.6         147.7         142.7
On-Demand cost [$/hr]                   $0.368       $0.584        $1.904        $0.526       $0.752        $1.204        $2.176
Cost per 1,000 images [$]               $0.007       $0.011        $0.035        $0.025       $0.032        $0.049        $0.086

The chart below summarizes the throughput and cost per 1000 images results for the xlarge and 2xlarge instance sizes.

We further reduced the image-processing cost and increased the throughput of tf-pose-estimation on an Inf1 instance by taking a data-parallel approach to the end-to-end pipeline. The values shown in the preceding table relate to the use of a single AWS Inferentia processing core, a Neuron core. The benchmarked inf1.xlarge instance has four, so it’s wasteful to use only one. Our test with an embarrassingly parallel implementation of the main() function call using the Python joblib library showed linear scaling up to four threads. This pattern increased the throughput to 56.88 FPS and decreased the cost per 1,000 images to below $0.002. This is a good indication that a better batching strategy can further improve the price-performance ratio of OpenPose on AWS Inferentia.
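
A minimal sketch of that data-parallel pattern, assuming the main() function defined earlier in the notebook and using one thread per Neuron core on an inf1.xlarge (the image paths are examples from the repo):

from joblib import Parallel, delayed

image_paths = ['./images/ski.jpg', './images/apink2.jpg', './images/contortion1.jpg']

# one worker per Neuron core; each call runs the full inference pipeline on one image
results = Parallel(n_jobs=4, backend='threading')(
    delayed(main)('--model cmu --resize 656x368 --image={} --resize-out-ratio 2.0'.format(path))
    for path in image_paths)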

The larger CMU model also provides good pose estimation performance. For example, see the following image of the multi-pose detection using the Neuron SDK compiled model, on a scene with subjects at multiple depths.

Safely shutting down and cleaning up

On the Amazon EC2 console, choose the compilation and inference instances, and choose Terminate from the Actions drop-down menu. You persisted the compiled model in your s3://<MY_BUCKET_NAME> bucket, so it can be reused later. If you’ve made changes to the code inside the instances, remember to persist those as well. Terminating an instance discards data stored only in the instance’s home volume.

Conclusion

In this post, you walked through the steps of compiling an open-source OpenPose TensorFlow model, updating a custom end-to-end image processing pipeline, and identifying tools to profile and further optimize your ML inference time on an EC2 Inf1 instance. When tuned, the Neuron compiled TensorFlow model was 72% less expensive than the cheapest GPU instance, with consistently better performance. The steps described in this post also apply to other ML model types and frameworks. For more information, see the AWS Neuron SDK GitHub repo.

Learn more about the AWS Inferentia chip and the Amazon EC2 Inf1 instances to get started with running your own custom ML pipelines on AWS Inferentia using the Neuron SDK.


About the Authors

Fabio Nonato de Paula is a Principal Solutions Architect for Autonomous Computing in AWS. He works with large-scale deployments of machine learning and AI for autonomous and intelligent systems. Fabio is passionate about democratizing access to accelerated computing and distributed ML. Outside of work, you can find Fabio riding his motorcycle on the hills of Livermore valley or reading ComiXology.

Haichen Li is a software development engineer in the AWS Neuron SDK team. He works on integrating machine learning frameworks with the AWS Neuron compiler and runtime systems, as well as developing deep learning models that benefit particularly from the Inferentia hardware.

Read More

All the Right Moves: How PredictionNet Helps Self-Driving Cars Anticipate Future Traffic Trajectories

Driving requires the ability to predict the future. Every time a car suddenly cuts into a lane or multiple cars arrive at the same intersection, drivers must make predictions as to how others will act to safely proceed.

While humans rely on driver cues and personal experience to read these situations, self-driving cars can use AI to anticipate traffic patterns and safely maneuver in a complex environment.

We have trained the PredictionNet deep neural network to understand the driving environment around a car in top-down or bird’s-eye view, and to predict the future trajectories of road users based on both live perception and map data.

PredictionNet analyzes past movements of all road agents, such as cars, buses, trucks, bicycles and pedestrians, to predict their future movements. The DNN looks into the past to take in previous road user positions, and also takes in positions of fixed objects and landmarks on the scene, such as traffic lights, traffic signs and lane line markings provided by the map.

Based on these inputs, which are rasterized in top-down view, the DNN predicts road user trajectories into the future, as shown in figure 1.

Predicting the future has inherent uncertainty. PredictionNet captures this by also providing the prediction statistics of the future trajectory predicted for each road user, as also shown in figure 1.

Figure 1. PredictionNet results visualized in top-down view. Gray lines denote the map, dotted white lines represent vehicle trajectories predicted by the DNN, while white boxes represent ground truth trajectory data. The colorized clouds represent the probability distributions for predicted vehicle trajectories, with warmer colors representing points that are closer in time to the present, and cooler colors representing points further in the future.

A Top-Down Convolutional RNN-Based Approach

Previous approaches to predicting future trajectories for self-driving cars have leveraged both imitation learning and generative models that sample future trajectories, as well as convolutional neural networks and recurrent neural networks for processing perception inputs and predicting future trajectories.

For PredictionNet, we adopt an RNN-based architecture that uses two-dimensional convolutions. This structure is highly scalable for arbitrary input sizes, including the number of road users and prediction horizons.

As is typically the case with any RNN, different time steps are fed into the DNN sequentially. Each time step is represented by a top-down view image that shows the vehicle surroundings at that time, including both dynamic obstacles observed via live perception, and fixed landmarks provided by a map.

This top-down view image is processed by a set of 2D convolutions before being passed to the RNN. In the current implementation, PredictionNet is able to confidently predict one to five seconds into the future, depending on the complexity of the scene (for example, highway versus urban).

The PredictionNet model also lends itself to a highly efficient runtime implementation in the TensorRT deep learning inference SDK, with 10 ms end-to-end inference times achieved on an NVIDIA TITAN RTX GPU.

Scalable Results

Results thus far have shown PredictionNet to be highly promising for several complex traffic scenarios. For example, the DNN can predict which cars will proceed straight through an intersection versus which will turn. It’s also able to correctly predict the car’s behavior in highway merging scenarios.

We have also observed that PredictionNet is able to learn velocities and accelerations of vehicles on the scene. This enables it to correctly predict speeds of both fast-moving and fully stopped vehicles, as well as to predict stop-and-go traffic patterns.

PredictionNet is trained on highly accurate lidar data to achieve higher prediction accuracy. However, the inference-time perception input to the DNN can be based on any sensor input combination (that is, camera, radar or lidar data) without retraining. This also means that the DNN’s prediction capabilities can be leveraged for various sensor configurations and levels of autonomy, from level 2+ systems all the way to level 4/level 5.

PredictionNet’s ability to anticipate behavior in real time can be used to create an interactive training environment for reinforcement learning-based planning and control policies for features such as automatic cruise control, lane changes, or intersection handling.

By using PredictionNet to simulate how other road users will react to an autonomous vehicle’s behavior based on real-world experiences, we can train a more safe, robust and courteous AI driver.

The post All the Right Moves: How PredictionNet Helps Self-Driving Cars Anticipate Future Traffic Trajectories appeared first on The Official NVIDIA Blog.

Read More

Translating presentation files with Amazon Translate

As solutions architects working in Brazil, we often translate technical content from English to other languages. Doing so manually takes a lot of time, especially when dealing with presentations—in contrast to plain text documents, their content is spread across various areas in multiple slides. To solve that, we wrote a script that translates Microsoft PowerPoint files using Amazon Translate. This post discusses how the translation script works.

Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. When working with Amazon Translate, you provide text in the source language and receive text translated into the target language. For more information about the languages Amazon Translate supports, see Supported Languages and Language Codes.

The translation script is written in Python and relies on an open-source library to parse the presentation files.

Solution

The script requires three arguments:

  • Source language
  • Target language
  • File path of a .pptx file

The script then performs the following functions:

  • Parses the file
  • Extracts its texts
  • Invokes the Amazon Translate API for each text
  • Saves a new file with the translated texts that the API returns

The following command translates a presentation from English to Portuguese:

$ python pptx-translator.py en pt example.pptx
Translating example.pptx from en to pt...
Slide 1 of 7
Slide 2 of 7
Slide 3 of 7
Slide 4 of 7
Slide 5 of 7
Slide 6 of 7
Slide 7 of 7
Saving example-pt.pptx...

To interact with Amazon Translate, the script uses Boto, the AWS SDK for Python. Boto is configurable in multiple ways, but regardless of the method, you must have AWS credentials and a Region set to make requests to AWS. For more information about configuring credentials, see the Boto documentation.
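
For example, a minimal sketch of creating the Amazon Translate client used by the script (the Region name here is a placeholder):

import boto3

translate = boto3.client('translate', region_name='us-east-1')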

To handle presentation files, the script uses python-pptx, an open-source library available on GitHub. When you provide a presentation file path as input, the library returns a Presentation object. See the following code:

presentation = Presentation(args.input_file_path)

Within a Presentation object, there are slides and, within the slides, shapes and their paragraphs. You can iterate over all paragraphs and invoke the Amazon Translate API for each text. Amazon Translate offers two different translation processing modes: real-time translation and asynchronous batch processing. The script uses the former, which allows you to call Amazon Translate on a piece of text and synchronously get a response with the corresponding translation. See the following code:

for paragraph in shape.text_frame.paragraphs:
    for index, paragraph_run in enumerate(paragraph.runs):
        response = translate.translate_text(
                Text=paragraph_run.text,
                SourceLanguageCode=source_language_code,
                TargetLanguageCode=target_language_code,
                TerminologyNames=terminology_names)

You then get the translated text from the API to replace the original text. See the following code:

paragraph.runs[index].text = response.get('TranslatedText')

The script replaces not only the visible presentation text, but also its comments. Moreover, the script has a map to update the language identifiers. That’s necessary to indicate the correct language so Microsoft PowerPoint can properly check the spelling. See the following code:

paragraph.runs[index].font.language_id = LANGUAGE_CODE_TO_LANGUAGE_ID[target_language_code]

In addition to passing a text, the source language, and the target language, you can use the Custom Terminology feature of Amazon Translate, which makes sure that terms are translated exactly the way you want. To do this, you pass a list of pre-translated custom terminology when invoking the API. For instance, you can customize the translation of technical terms by putting those terms and their respective translations in a CSV file and passing its path as an optional argument to the script. The script reads the file and imports its content into Amazon Translate. See the following code:

with open(terminology_file_path, 'rb') as f:
    translate.import_terminology(
            Name=TERMINOLOGY_NAME,
            MergeStrategy='OVERWRITE',
            TerminologyData={'File': bytearray(f.read()), 'Format': 'CSV'})

After translating all the slides, you save the presentation as a new file with the following code:

presentation.save(output_file_path)

The script is straightforward, but very useful. For the full code, see the GitHub repo.

Conclusion

This post described a script-based solution to translate presentation files using Amazon Translate into a variety of languages. For more information, see What Is Amazon Translate?


About the Authors

Lidio Ramalho is Senior Manager on the AWS R&D and Innovation Solutions Architecture team. He works with customers to build innovative prototypes in AI/ML, Robotics, AR/VR, IoT, Satellite, and Blockchain disciplines.

Rafael Werneck is a Solutions Architect at AWS R&D, based in Brazil. Previously, he worked as a Software Development Engineer on Amazon.com.br and Amazon RDS Performance Insights.

Read More

Improving Holistic Scene Understanding with Panoptic-DeepLab

Posted by Bowen Cheng, Student Researcher, and Liang-Chieh Chen, Research Scientist, Google Research

Real-world computer vision applications, such as self-driving cars and robotics, rely on two core tasks — instance segmentation and semantic segmentation. Instance segmentation identifies the class and extent of individual “things” in an image (i.e., countable objects such as people, animals, cars, etc.) and assigns unique identifiers to each (e.g., car_1 and car_2). This is complemented by semantic segmentation, which labels all pixels in an image, including the “things” that are present as well as the surrounding “stuff” (e.g., amorphous regions of similar texture or material, such as grass, sky or road). This latter task, however, does not differentiate between pixels of the same class that belong to different instances of that class.

Panoptic segmentation represents the unification of these two approaches with the goal of assigning a unique value to every pixel in an image that encodes both semantic label and instance ID. Most existing panoptic segmentation algorithms are based on Mask R-CNN, which treats semantic and instance segmentation separately. The instance segmentation step identifies objects in an image, but it often produces object instance masks that overlap one another. To settle the conflict between overlapping instance masks, one commonly employs a heuristic that resolves the discrepancy either based on the mask with the higher confidence score or by use of a pre-defined pairwise relationship between categories (e.g., a tie should always be worn on a person’s front). Additionally, the discrepancies between semantic and instance segmentation results are sorted out by favoring the instance predictions. While these methods generally produce good results, they also introduce heavy latency, which makes it challenging to apply them in real-time applications.

Driven by the need for a real-time panoptic segmentation model, we propose “Panoptic-DeepLab: a simple, fast and strong system for panoptic segmentation”, accepted to CVPR 2020. In this work, we extend the commonly used modern semantic segmentation model, DeepLab, to perform panoptic segmentation using only a small number of additional parameters and marginal computational overhead. The resulting model, Panoptic-DeepLab, produces semantic and instance segmentation in parallel and without overlap, avoiding the need for the manually designed heuristics adopted by other methods. Additionally, we develop a computationally efficient operation that merges the semantic and instance segmentation results, enabling near real-time end-to-end panoptic segmentation prediction. Unlike methods based on Mask R-CNN, Panoptic-DeepLab does not generate bounding box predictions and requires only three loss functions during training, significantly fewer than current state-of-the-art methods, such as UPSNet, which can have up to eight. Finally, Panoptic-DeepLab has demonstrated state-of-the-art performance on several academic datasets.

Panoptic segmentation results obtained by Panoptic-DeepLab. Left: Video frames used as input to the panoptic segmentation model. Right: Results overlaid on video frames. Each object instance has a unique label, e.g., car_1, car_2, etc.

Overview
Panoptic-DeepLab is simple both conceptually and architecturally. At a high level, it predicts three outputs. The first is semantic segmentation, in which it assigns a semantic class (e.g., car or grass) to each pixel. However, it does not differentiate between multiple instances of the same class. So, for example, if one car is partly behind another, the pixels associated with both would have the same associated class and would be indistinguishable from one another. This can be addressed by the other two outputs from the model: a center-of-mass prediction for each instance and instance center regression, where the model learns to regress each instance pixel to its center of mass. This latter step ensures that the model associates pixels of a given class with the appropriate instance. The class-agnostic instance segmentation, obtained by grouping predicted foreground pixels to their closest predicted instance centers, is then fused with semantic segmentation by a majority-vote rule to generate the final panoptic segmentation.

Overview of Panoptic-DeepLab. Semantic segmentation associates pixels in the image with general classes, while the class-agnostic instance segmentation step identifies the pixels associated with an individual object, regardless of the class. Taken together, one gets the final panoptic segmentation image.
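
To make the merging step described above concrete, the following is a minimal NumPy sketch (not the authors’ implementation) of class-agnostic instance grouping followed by majority-vote fusion, assuming the model’s semantic map, predicted instance centers, and per-pixel center offsets are already available as arrays:

import numpy as np

def fuse_panoptic(semantic, centers, offsets, thing_classes):
    """Minimal sketch of instance grouping + majority-vote fusion.

    semantic: (H, W) array of predicted semantic class ids
    centers:  (K, 2) array of predicted instance center coordinates (y, x)
    offsets:  (H, W, 2) array of predicted offsets from each pixel to its center
    thing_classes: iterable of class ids that are countable "things"
    """
    h, w = semantic.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each pixel votes for a center location: its own coordinate plus its offset
    voted = np.stack([ys + offsets[..., 0], xs + offsets[..., 1]], axis=-1)
    # Assign every pixel to the nearest predicted instance center (ids 1..K)
    dists = np.linalg.norm(voted[:, :, None, :] - centers[None, None, :, :], axis=-1)
    instance_ids = dists.argmin(axis=-1) + 1
    # Only "thing" pixels keep an instance id; "stuff" pixels get id 0
    instance_ids = np.where(np.isin(semantic, list(thing_classes)), instance_ids, 0)
    # Majority vote: each instance takes the most common semantic label of its pixels
    instance_classes = {}
    for inst in np.unique(instance_ids):
        if inst == 0:
            continue
        labels, counts = np.unique(semantic[instance_ids == inst], return_counts=True)
        instance_classes[inst] = int(labels[counts.argmax()])
    return instance_ids, instance_classes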

Neural Network Design
Panoptic-DeepLab consists of four components: (1) an encoder backbone pre-trained on ImageNet, shared by both the semantic segmentation and instance segmentation branches of the architecture; (2) atrous spatial pyramid pooling (ASPP) modules, similar to those used by DeepLab, which are deployed independently in each branch in order to perform segmentation at a range of spatial scales; (3) similarly decoupled decoder modules specific to each segmentation task; and (4) task-specific prediction heads.

The encoder backbone (1), which has been pre-trained on ImageNet, extracts feature maps that are shared by both the semantic segmentation and instance segmentation branches of the architecture. Typically, the feature map is generated by the backbone model using a standard convolution, which reduces the resolution of the output map to 1/32nd that of the input image and is too coarse for accurate image segmentation. In order to preserve the details of object boundaries, we instead employ atrous convolution, which better retains important features like edges, to generate a feature map with a resolution of 1/16th the original. This is then followed by two ASPP modules (2), one for each branch, which capture multi-scale information for segmentation.
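
As a small aside for readers unfamiliar with atrous (dilated) convolution, the sketch below (illustrative only, not the paper’s code, written with MXNet Gluon) shows how a dilation rate of 2 enlarges the receptive field of a 3 × 3 kernel without adding parameters or reducing spatial resolution:

import mxnet as mx
from mxnet.gluon import nn

# A standard 3x3 convolution vs. an atrous (dilated) 3x3 convolution.
# With dilation=2 the kernel samples a 5x5 neighborhood while keeping the
# same number of parameters; padding=2 preserves the spatial resolution.
standard = nn.Conv2D(channels=256, kernel_size=3, padding=1)
atrous = nn.Conv2D(channels=256, kernel_size=3, padding=2, dilation=2)
standard.initialize()
atrous.initialize()

x = mx.nd.random.uniform(shape=(1, 3, 64, 64))  # dummy NCHW input
print(standard(x).shape, atrous(x).shape)       # both (1, 256, 64, 64)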

The light-weight decoder modules (3) follow those used in the most recent DeepLab version (DeepLabV3+), but with two modifications. First, we reintroduce an additional low-level feature map (1/8th scale) to the decoder, which helps to preserve spatial information from the original image (e.g., object boundaries) that can be significantly degraded in the final feature map output by the backbone. Second, instead of using the typical 3 × 3 kernel, the decoder employs a 5 × 5 depthwise-separable convolution, which yields somewhat better performance at only a minimal cost in additional overhead.
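
A depthwise-separable convolution factorizes a standard convolution into a per-channel (depthwise) convolution followed by a pointwise 1 × 1 convolution. The following is a rough MXNet Gluon sketch of the 5 × 5 variant described above (illustrative only, not the paper’s code; channel sizes are made up):

import mxnet as mx
from mxnet.gluon import nn

def separable_conv5x5(in_channels, out_channels):
    """5x5 depthwise-separable convolution: depthwise 5x5 followed by pointwise 1x1."""
    block = nn.HybridSequential()
    block.add(
        nn.Conv2D(channels=in_channels, kernel_size=5, padding=2,
                  groups=in_channels, in_channels=in_channels),  # depthwise
        nn.Conv2D(channels=out_channels, kernel_size=1))          # pointwise
    return block

decoder_conv = separable_conv5x5(in_channels=256, out_channels=256)
decoder_conv.initialize()
y = decoder_conv(mx.nd.random.uniform(shape=(1, 256, 128, 128)))
print(y.shape)  # (1, 256, 128, 128)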

The two prediction heads (4) are tailored to their task. The semantic segmentation head employs a weighted version of the standard bootstrapped cross entropy loss function, which weights each pixel differently and has proven to be more effective for segmentation of small-scale objects. The instance segmentation head is trained to predict the offsets between the center of mass of an object instance and the surrounding pixels, without knowledge of the object class, forming the class-agnostic instance masks.
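
To make the semantic loss concrete, a bootstrapped cross-entropy averages only the hardest pixels rather than all of them, and the weighted variant scales each pixel’s loss by a per-pixel weight (e.g., larger for pixels of small instances). The following NumPy sketch illustrates the idea; the exact weighting scheme and top-k fraction used in the paper may differ:

import numpy as np

def weighted_bootstrapped_ce(probs, labels, weights, top_k_fraction=0.15):
    """Per-pixel weighted cross entropy where only the hardest pixels contribute.

    probs:   (H, W, C) softmax probabilities
    labels:  (H, W) integer ground-truth class ids
    weights: (H, W) per-pixel weights (e.g., larger for pixels of small instances)
    """
    h, w, c = probs.shape
    flat_labels = labels.reshape(-1)
    # Probability assigned to the true class of every pixel
    true_class_probs = probs.reshape(-1, c)[np.arange(h * w), flat_labels]
    pixel_loss = -np.log(np.clip(true_class_probs, 1e-12, 1.0)) * weights.reshape(-1)
    # Bootstrapping: average only the top-k hardest (highest-loss) pixels
    k = max(1, int(top_k_fraction * h * w))
    return np.sort(pixel_loss)[-k:].mean()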

Results
To demonstrate the effectiveness of Panoptic-DeepLab, we conduct experiments on three popular academic datasets: Cityscapes, Mapillary Vistas, and COCO. With a simple architecture, Panoptic-DeepLab ranks first on Cityscapes for all three tasks (semantic, instance and panoptic segmentation) without any task-specific fine-tuning. Additionally, Panoptic-DeepLab won the Best Result, Best Paper, and Most Innovative awards on the Mapillary Panoptic Segmentation track at the ICCV 2019 Joint COCO and Mapillary Recognition Challenge Workshop. It outperforms the 2018 winner by a healthy margin of 1.5%. Finally, Panoptic-DeepLab sets new state-of-the-art bottom-up (i.e., box-free) panoptic segmentation results on the COCO dataset, and is also comparable to other methods based on Mask R-CNN.

Accuracy (PQ) vs. Speed (GPU inference time) across three datasets.

Conclusion
With a simple architecture and only three training loss functions, Panoptic-DeepLab achieves state-of-the-art performance while being faster than other methods based on Mask R-CNN. To summarize, we develop the first single-shot panoptic segmentation model that attains state-of-the-art performance on several public benchmarks, and delivers near real-time end-to-end inference speed. We hope that our simple and effective Panoptic-DeepLab can establish a solid baseline and further benefit the research community.

Acknowledgements
We would like to thank Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, and Florian Schroff, as well as the Google Mobile Vision team, for their support and valuable discussions.

Read More

University of Florida, NVIDIA to Build Fastest AI Supercomputer in Academia

The University of Florida and NVIDIA Tuesday unveiled a plan to build the world’s fastest AI supercomputer in academia, delivering 700 petaflops of AI performance.

The effort is anchored by a $50 million gift: $25 million from alumnus and NVIDIA co-founder Chris Malachowsky and $25 million in hardware, software, training and services from NVIDIA.

“We’ve created a replicable, powerful model of public-private cooperation for everyone’s benefit,” said Malachowsky, who serves as an NVIDIA Fellow, in an online event featuring leaders from both UF and NVIDIA.

UF will invest an additional $20 million to create an AI-centric supercomputing and data center.

The $70 million public-private partnership promises to make UF one of the leading AI universities in the country, advance academic research and help address some of the state’s most complex challenges.

“This is going to be a tremendous partnership,” Florida Gov. Ron DeSantis said. “As we look to keep our best talent in state, this will be a significant carrot. You’ll also see people around the country want to come to Florida.”

Working closely with NVIDIA, UF will boost the capabilities of its existing supercomputer, HiPerGator, with the recently announced NVIDIA DGX SuperPOD architecture. The system will be up and running by early 2021, just a few weeks after it’s delivered.

This gives faculty and students within and beyond UF the tools to apply AI across a multitude of areas to address major challenges such as rising seas, aging populations, data security, personalized medicine, urban transportation and food insecurity. UF expects to create 30,000 AI-enabled graduates by 2030.

“The partnership here with the UF, the state of Florida, and NVIDIA, anchored by Chris’ generous donation, goes beyond just money,” said NVIDIA CEO Jensen Huang, who founded NVIDIA in 1993 along with Malachowsky and Curtis Priem. “We are excited to contribute NVIDIA’s expertise to work together to make UF a national leader in AI and help address not only the region’s, but the nation’s challenges.”

UF, ranked seventh among public universities in the United States by US News & World Report and aiming to break into the top five, offers an extraordinarily broad range of disciplines, Malachowsky said.

The region is also a “living laboratory for some of society’s biggest challenges,” Malachowsky said.

Regional, National AI Leadership

The effort aims to help define a research landscape to deal with the COVID-19 pandemic, which has seen supercomputers take a leading role.

“Our vision is to become the nation’s first AI university,” University of Florida President Kent Fuchs said. “I am so grateful again to Mr. Malachowsky and NVIDIA CEO Jensen Huang.”

State and regional leaders already look to the university to bring its capabilities to bear on an array of regional and national issues.

Among them: supporting agriculture in a time of climate change, addressing the needs of an aging population, and managing the effects of rising sea levels in a state with more than 1,300 miles of coastline.

And to ensure no community is left behind, UF plans to promote wide accessibility to these computing capabilities.

As part of this, UF will:

  • Establish UF’s Equitable AI program, to bring faculty members across the university together to create standards and certifications for developing tools and solutions that are cognizant of bias, unethical practice and legal and moral issues.
  • Partner with industry and other academic groups, such as the Inclusive Engineering Consortium, whose students will work with members to conduct research and recruitment to UF graduate programs.

Broad Range of AI Initiatives

Malachowsky has served in a number of leadership roles as NVIDIA has grown from a startup to the global leader in visual and parallel computing. A recognized authority on integrated-circuit design and methodology, he has authored close to 40 patents.

In addition to holding a BSEE from the University of Florida, he has an MSCS from Santa Clara University. He has been named a distinguished alumnus of both universities, in addition to being inducted last year into the Florida Inventors Hall of Fame.

UF is the first institution of higher learning in the U.S. to receive NVIDIA DGX A100 systems. These systems are based on the modular architecture of the NVIDIA DGX SuperPOD, which enables the rapid deployment and scaling of massive AI infrastructure.

UF’s HiPerGator 3 supercomputer will integrate 140 NVIDIA DGX A100 systems powered by a combined 1,120 NVIDIA A100 Tensor Core GPUs. It will include 4 petabytes of high-performance storage. An NVIDIA Mellanox HDR 200Gb/s InfiniBand network will provide high-throughput, extremely low-latency network connectivity.

DGX A100 systems are built to make the most of these capabilities as a single software-defined platform. NVIDIA DGX systems are already used by eight of the ten top US national universities.

That platform includes the most advanced suite of AI application frameworks in the world. It’s a software suite that covers data analytics, AI training and inference acceleration, and recommendation systems. Its multi-modal capabilities combine sound, vision, speech and a contextual understanding of the world around us.

Together, these tools have already had a significant impact on healthcare, transportation, science, interactive appliances, the internet and other areas.

More Than Just a Machine

The announcement, however, goes beyond any single, if singular, machine.

NVIDIA will also contribute its AI expertise to UF through ongoing support and collaboration across the following initiatives:

  • The NVIDIA Deep Learning Institute will collaborate with UF on developing new curriculum and coursework for both students and the community, including programming tuned to address the needs of young adults and teens to encourage their interest in STEM and AI, better preparing them for future educational and employment opportunities.
  • UF will become the site of the latest NVIDIA AI Technology Center, where UF Graduate Fellows and NVIDIA employees will work together to advance AI.
  • NVIDIA solution architects and product engineers will partner with UF on the installation, operation and optimization of the NVIDIA-based supercomputing resources on campus, including the latest AI software applications.

UF will also make investments all around its new machine, well beyond the $20 million targeted at upgrading its data center.

Collectively, all of the data sciences-related activities and programs — and UF’s new supercomputer — will support the university’s broader AI-related aspirations.

To support that effort, the university has committed to fill 100 new faculty positions in AI and related fields, making it one of the top AI universities in the country.

That’s in addition to the 500 recently hired faculty across disciplines, many of whom will weave AI into their teaching and research.

“It’s been thrilling to watch all this,” Malachowsky said. “It provides a blueprint for how other states can work with their region’s resources to make similar investments that bring their residents the benefits of AI, while bolstering our nation’s competitiveness, capabilities, and expertise.”

Read More

Atlassian continuously profiles services in production with Amazon CodeGuru Profiler

This is a guest post by the Jira Cloud Performance Team at Atlassian. In their own words, Atlassian’s mission is to unleash the potential in every team. Our products help teams organize, discuss, and complete their work. And what teams do can change the world. We have helped NASA teams design the Mars Rover, Cochlear teams develop hearing implants and hundreds of thousands of other teams do amazing things. We have an incredible opportunity to help millions more teams in organizations across nearly every industry. Teamwork is hard. We make it easier.

The products we build at Atlassian, a mixture of monolithic applications and microservices, have hundreds of developers working on them. When an incident occurs, it can be hard to diagnose the root cause due to the high rate of change within the codebases. Profiling can speed up root cause diagnosis significantly and is an effective technique to identify runtime hotspots and latency contributors within an application. Without profiling, you commonly need to implement custom and ad hoc latency instrumentation of the code, which can be error-prone or cause other side effects.

At Atlassian, we’ve always had tooling to profile services in production, such as using Linux perf or async-profiler, and while these are highly valuable, our methods had some limitations:

  • Intervention from a person (or system) was required to capture a profile at the right time, which meant transient problems were often missed
  • Ad hoc profiling didn’t provide a baseline profile to compare with
  • For security and reliability, a limited number of people had access to run these tools in production

These limitations led us to look into continuous profiling.

In addition to helping diagnose where a service is spending CPU cycles (or time), we wanted a profiling solution that provided visualizations like flame graphs, which are a great diagnostic aid when trying to understand call paths in a complex and dynamic application, and can also be used to aid a developer’s understanding of the system.

Our existing in-house profiling solution comprised scripts deployed alongside our services that can generate profiles using Linux perf or async-profiler. A subset of privileged developers (and SREs) could run these scripts on production nodes using AWS Systems Manager. Our use of Linux perf and async-profiler came with several advantages, including:

  • Data in a format that we could visualize as a flame graph (which is easy to interpret)
  • The ability to profile either a single process or a whole node
  • Profiling across different dimensions such as CPU, memory, and I/O

Our initial continuous profiling solution comprised a scheduled job that ran async-profiler (or Linux perf) regularly and uploaded the raw results to a microservice, which transformed the raw data into a columnar data format (Parquet) and wrote the result to Amazon Simple Storage Service (Amazon S3).

We defined a schema in AWS Glue, allowing developers to query the profile data for a particular service using Amazon Athena. Athena empowered developers to write complex SQL queries to filter profile data on dimensions like time range and stack frames.
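
As a rough, hypothetical sketch of that pipeline stage (the bucket name, columns, and file paths below are invented for illustration), collapsed stack samples could be flattened into a columnar table, written as Parquet, and uploaded to Amazon S3, where a Glue table over the prefix makes the data queryable from Athena:

import boto3
import pandas as pd

# Hypothetical collapsed-stack samples: (stack frames, sample count) pairs
samples = [
    ("main;handleRequest;serializeJson", 128),
    ("main;handleRequest;queryDatabase", 342),
]

# Flatten into a columnar table with the dimensions we want to filter on
df = pd.DataFrame(
    {
        "service": "example-service",
        "timestamp": pd.Timestamp.now(tz="UTC"),
        "stack": [s for s, _ in samples],
        "samples": [c for _, c in samples],
    }
)

# Write Parquet locally (requires pyarrow) and upload to S3;
# a Glue table over this prefix would let Athena query it with SQL
df.to_parquet("/tmp/profile.parquet", engine="pyarrow")
boto3.client("s3").upload_file(
    "/tmp/profile.parquet", "example-profiles-bucket", "profiles/profile.parquet"
)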

We also started building a UI to run the Athena queries and render the results as flame graphs using SpeedScope.

Even with the effort we had already put into this solution, we still had significant work ahead of us to make it complete and optimal.

Meanwhile, the announcement of Amazon CodeGuru Profiler caught our attention—the service offering was highly relevant to us and largely overlapped with our existing capability. After a successful spike and evaluation, we decided to stop building out our solution and integrate CodeGuru Profiler instead.

We chose to define a single profiling group for each of our smaller services. For our larger services, which are partitioned into shards (a separate Auto Scaling group per shard), we chose to create one profiling group per shard.

You can integrate the Java profiler in two available modes: agent mode and code mode. To ensure a safe rollout, we decided to use code mode, launching the agent from within our application code. This allowed us to control when to start (or stop) the agent via our existing feature flag mechanism.
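
The Atlassian integration uses the Java agent’s code mode; purely as an illustration of the same start-from-code pattern, the CodeGuru Profiler Python agent can be launched behind a feature flag in a similar way (the flag helper and profiling group name below are hypothetical):

from codeguru_profiler_agent import Profiler

def is_feature_enabled(flag_name):
    # Hypothetical stand-in for an existing feature flag mechanism
    return True

# Start profiling only when the rollout flag is on, so the agent can be
# enabled or disabled without a redeploy
if is_feature_enabled("codeguru-profiler-enabled"):
    Profiler(profiling_group_name="example-service-prod").start()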

We have now integrated CodeGuru Profiler at a platform level, enabling any Atlassian service team to easily take advantage of this capability.

Inspecting Latency

One of the first ways we utilized CodeGuru Profiler was to identify code paths that show obvious or well-known problems in terms of CPU utilization or latency. We searched for different forms of synchronization in the profiled data. One interesting case was an EnumMap that was wrapped in a Collections.synchronizedMap. The following screenshot shows the thread states of the stack frames in this code path for a span of 24 hours.

Although the involved stack trace consumes less than 0.5% of all runtime, when we visualized the latency of thread states, we saw that it spent twice as much time in a BLOCKED state as in a RUNNABLE state. To increase the ratio of time spent in a RUNNABLE state, we moved away from using EnumMap to using an instance of ConcurrentHashMap.

The following screenshot shows a profile of a similar 24-hour period. After we implemented the change, the relevant stack trace is now all in a RUNNABLE state.

Recommendation Reports

CodeGuru Profiler also provides a recommendation report on every profiling group, which identifies commonly known anti-patterns from a performance perspective and suggests known solutions. One such report we received (see the following screenshot) highlighted an issue with how we used Jackson ObjectMapper.

Upon receipt of this report, we quickly identified and resolved the problem code.

Conclusion

Integration with CodeGuru Profiler has been a major step forward for us, enabling every developer within Atlassian to own and take action on performance engineering.

Since enabling CodeGuru Profiler, we’ve already gained the following benefits:

  • Any Atlassian developer can look up a profile from any point in time to understand the call paths that took place in production. This helps developers understand complex applications and aids us when investigating performance issues.
  • The time to diagnose the root cause of performance issues in production has significantly reduced, and our developers no longer need to inject custom instrumentation code when diagnosing problems.
  • Open availability of profile data across the organization has helped increase developer ownership of performance optimization.

We’re excited by what the CodeGuru Profiler team has built, and are looking forward to the profiling technologies and capabilities that they’ll build next.


About the Authors

Behrooz Nobakht

Senior Software Engineer

Matthew Ponsford

Engineering Manager

Narayanaswamy Anandapadmanabhan

Senior Software Engineer

We are Jira Cloud Performance from Atlassian. We make tools like Jira and Trello that are used by thousands of teams worldwide. We’re serious about creating amazing products, practices, and open work for all teams. Jira Cloud Performance is a specialized working group focused on enabling Jira and Atlassian teams to better observe, monitor, and enhance the performance of their products and services.

Read More