Bada Bing Bada Boom: Microsoft Turns to Turing-NLG, NVIDIA GPUs to Instantly Suggest Full-Phrase Queries

Hate hunting and pecking away at your keyboard every time you have a quick question? You’ll love this.

Microsoft’s Bing search engine has turned to Turing-NLG and NVIDIA GPUs to suggest full sentences for you as you type.

Turing-NLG is a cutting-edge, large-scale unsupervised language model that has achieved strong performance on language modeling benchmarks.

It’s just the latest example of an AI technique called unsupervised learning, which makes sense of vast quantities of data by extracting features and patterns without the need for humans to provide any pre-labeled data.

Microsoft calls this Next Phrase Prediction, and it can feel like magic, making full-phrase suggestions in real time for long search queries.

Turing-NLG is among several innovations — from model compression to state caching and hardware acceleration — that Bing has harnessed with Next Phrase Prediction.

Over the summer, Microsoft worked with engineers at NVIDIA to optimize Turing-NLG for its needs, accelerating the model on NVIDIA GPUs to power the feature for users worldwide.

A key part of this optimization was making the massive AI model run fast enough to power a real-time search experience. With a combination of hardware and model optimizations, Microsoft and NVIDIA achieved an average latency below 10 milliseconds.

By contrast, it takes more than 100 milliseconds to blink your eye.

Learn more about the next wave of AI innovations at Bing.

Before the introduction of Next Phrase Prediction, the approach for handling query suggestions for longer queries was limited to completing the current word being typed by the user.

Now type in “The best way to replace,” and you’ll immediately see three suggestions for completing the phrase: wood, plastic and metal. Type in “how can I replace a battery for,” and you’ll see “iphone, samsung, ipad and kindle” all suggested.

With Next Phrase Prediction, Bing can now present users with full-phrase suggestions.

The more characters you type, the closer Bing gets to what you probably want to ask.

And because these suggestions are generated by the model in real time, they aren’t limited to previously seen queries or to completing just the current word being typed.

So, for some queries, Bing won’t just save you a few keystrokes — but multiple words.

As a result of this work, the coverage of autosuggestion completions increases considerably, Microsoft reports, improving the overall user experience “significantly.”

Coronavirus Gets a Close-Up: Folding@home Live in NVIDIA Omniverse

For researchers like Max Zimmerman, it was a welcome pile-on to tackle a global pandemic.

A million citizen scientists donated time on their home systems so the Folding@home consortium could calculate the intricate movements of proteins inside the coronavirus. Then a team of NVIDIA simulation experts combined the best tools from multiple industries to let the researchers see their data in a whole new way.

“I’ve been repeatedly amazed with the unprecedented scale of scientific collaborations,” said Zimmerman, a postdoc fellow at the Washington University School of Medicine in St. Louis, which hosts one of eight labs that keep the Folding@home research network humming.

As a result, Zimmerman and colleagues published a paper on bioRxiv showing images of 17 weak spots in coronavirus proteins that antiviral drug makers can attack. And the high-res simulation of the work continues to educate researchers and the public alike about the bad actor behind the pandemic.

“We are in a position to make serious headway towards understanding the molecular foundations of health and disease,” he added.

An Antiviral Effort Goes Viral

In mid-March, the Folding@home team put many long-running projects on hold to focus on studying key proteins behind COVID. They issued a call for help, and by the end of the month the network swelled to become the world’s first exascale supercomputer, fueled in part by more than 280,000 NVIDIA GPUs.

Researchers harnessed that power to search for vulnerable moments in the rapid and intricate dance of the folding proteins, split-second openings drug makers could exploit. Within three months, computers found many promising motions that traditional experiments could not see.

“We’ve simulated nearly the entire proteome of the virus and discovered more than 50 new and novel targets to aid in the design of antivirals. We have also been simulating drug candidates in known targets, screening over 50,000 compounds to identify 300 drug candidates,” Zimmerman said.

The coronavirus uses cunning techniques to avoid human immune responses, like the Spike protein keeping its head down in a closed position. With the power of an exaflop at their disposal, researchers simulated the proteins folding for a full tenth of a second, orders of magnitude longer than prior work.

Though the time sampled was relatively short, the dataset to enable it was vast.

The SARS-CoV-2 spike protein alone consists of 442,881 atoms in constant motion. In just 1.2 microseconds, it generates about 300 billion timestamps, freeze frames that researchers must track.

Combined with the two dozen other coronavirus proteins they studied, Folding@home amassed the largest collection of molecular simulations in history.

Omniverse Simulates a Coronavirus Close Up

The dataset “ended up on my desk when someone asked what we could do with it using more than the typical scientific tools to really make it shine,” said Peter Messmer, who leads a scientific visualization team at NVIDIA.

Using Visual Molecular Dynamics, a standard tool for scientists, he pulled the data into NVIDIA Omniverse, a platform built for collaborative 3D graphics and simulation soon to be in open beta. Then the magic happened.

The team connected Autodesk’s Maya animation software to Omniverse to visualize a camera path, creating a fly-through of the proteins’ geometric intricacies. The platform’s core technologies such as NVIDIA Material Definition Language (MDL) let the team give tangible surface properties to molecules, creating translucent or glowing regions to help viewers see critical features more clearly.

With Omniverse, “researchers are not confined to scientific visualization tools, they can use the same tools the best artists and movie makers use to deliver a cinematic rendering — we’re bringing these two worlds together,” Messmer said.

Simulation Experts Share Their Story Live

The result was a visually stunning presentation where each spike on a coronavirus protein is represented with more than 1.8 million triangles, rendered by a bank of NVIDIA RTX GPUs.

Zimmerman and Messmer will co-host a live Q&A technical session Oct. 8 at 11 AM PDT to discuss how they developed the simulation that packs nearly 150 million triangles to represent a millisecond in a protein’s life.

The work validates the mission behind Omniverse to create a universal virtual environment that spans industries and disciplines. We’re especially proud to see the platform serve science in the fight against the pandemic.

The experience made Zimmerman “incredibly optimistic about the future of science. NVIDIA GPUs have been instrumental in generating our datasets, and now those GPUs running Omniverse are helping us see our work in a new and vivid way,” he said.

Visit NVIDIA’s COVID-19 Research Hub to learn more about how AI and GPU-accelerated technology continues to fight the pandemic. And watch NVIDIA CEO Jensen Huang describe in a portion of his GTC keynote below how Omniverse is playing a role.

GTI: Learning to Generalize Across Long-Horizon Tasks from Human Demonstrations

It takes a lot of data for robots to autonomously learn to perform simple manipulation tasks such as grasping and pushing. For example, prior work [1, 2] has leveraged Deep Reinforcement Learning to train robots to grasp and stack various objects. These tasks are usually short and relatively simple – for example, picking up a plastic bottle in a tray. However, because reinforcement learning relies on gaining experience through trial-and-error, hundreds of robot hours were required for the robot to learn to pick up objects reliably.

On the other hand, imitation learning can learn robot control policies directly from expert demonstrations without trial-and-error and thus requires far less data than reinforcement learning. In prior work, a handful of human demonstrations have been used to train a robot to perform different skills, such as pushing an object to a target location, from only image input [3].


Imitation Learning has been used to directly learn short-horizon skills from 100-300 demonstrations.

However, because the control policies are only trained with a fixed set of task demonstrations, it is difficult for the policies to generalize outside of the training data. In this work, we present a method for learning to solve new tasks by piecing together parts of training tasks that the robot has already seen in the demonstration data.

A Motivating Example

Consider the setup shown below. In the first task, the bread starts in the container, and the robot needs to remove the purple lid, retrieve the bread, put it into this green bowl, and then serve it on a plate.


In the first task, the robot needs to retrieve the bread from the covered container and serve it on a plate.

In the second task, the bread starts on the table, and it needs to be placed in the green bowl and then put into the oven for baking.


In the second task, the robot needs to pick the bread off the table and place it into the oven for baking.

We provide the robot with demonstrations of both tasks. Note that both tasks require the robot to place the bread into this green bowl! In other words, these task trajectories intersect in the state space! The robot should be able to generalize to new start and goal pairs by choosing different paths at the intersection, as shown in the picture. For example, the robot could retrieve the bread from the container and place the bread into the oven, instead of placing it on the plate.

The task demonstrations for both tasks will intersect in the state space since both tasks require the robot to place the bread into the green bowl. By leveraging this task intersection and composing pieces of different demonstrations together, the robot will be able to generalize to new start and goal pairs.

In summary, our key insights are:

  • Multi-task domains often contain task intersections.
  • It should be possible for a policy to generate new task trajectories by composing training tasks via the intersections.

Generalization Through Imitation

In this work, we introduce Generalization Through Imitation (GTI), a two-stage algorithm for enabling robots to generalize to new start and goal pairs through compositional imitation.

  • Stage 1: Train policies to generate diverse (potentially new) rollouts from human demonstrations. 
  • Stage 2: Use these rollouts to train goal-directed policies to achieve targeted new behaviors by self-imitation.

Generating Diverse Rollouts from Human Demonstrations

In Stage 1, we would like to train policies that are able to both reproduce the task trajectories in the data and also generate new task trajectories consisting of unseen start and goal pairs. This can be challenging – we need to encourage our trained policy to understand how to stop following one trajectory from the dataset and start following a different one in order to end up in a different goal state.

Here, we list two core technical challenges.

  • Mode Collapse. If we naively train imitation learning policies on the demonstration data of the two tasks, the policy tends to only go to a particular goal regardless of the initial state, as indicated by the red arrows in the picture below.
  • Spatio-temporal Variation. There is a large amount of spatio-temporal variation in human demonstrations on a real robot that must be modeled and accounted for.


Generating diverse rollouts from a fixed set of human demonstrations is difficult due to the potential for mode collapse (left) and because the policy must also model spatio-temporal variations in the data (right).

In order to get a better idea of how to encourage a policy to generate diverse rollouts, let’s take a closer look at the task space. The left image in the figure below shows the set of demonstrations. Consider a state near the beginning of a demonstration, as shown in the middle image. If we start in this state and try to set a goal for our policy to achieve, then according to the demonstration data the goals can be modeled by a Gaussian distribution. However, if we start at the intersection, the goals spread across the two tasks, and it would be better to model the goal distribution with a multimodal mixture of Gaussians.



Task intersections are better modeled with mixtures of Gaussians in order to capture the different possible future states.

Based on this observation, we design a hierarchical policy learning algorithm, where the high-level policy captures the distribution of future observations in a multimodal latent space, and the low-level policy conditions on the latent goal to fully explore the space of demonstrations.

GTI Algorithm Details

Let’s take a closer look at the learning architecture for our Stage 1 policy, shown below. The high-level planner is a conditional variational autoencoder [4] that attempts to learn the distribution of future image observations conditioned on the current image observation. The encoder encodes both a current and a future observation into a latent space. The decoder attempts to reconstruct the future observation from the latent. The latent space is regularized with a learned Gaussian mixture model prior, which encourages the model to learn a multimodal latent distribution over future observations. We can think of this latent space as modeling short-horizon subgoals. We train our low-level controller to imitate actions in the dataset that lead to particular subgoals.

The diagram above depicts the Stage 1 training architecture.
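To make the architecture concrete, here is a minimal PyTorch sketch of the Stage 1 idea: a conditional VAE over future observations with a learned Gaussian mixture prior on the latent subgoal space. The module names, feature dimensions, KL weight and single-sample KL estimate are illustrative assumptions, not the exact implementation from the paper.

# Minimal sketch of a conditional VAE with a learned GMM prior over latent subgoals.
# Assumes observations are already embedded into OBS_DIM-dimensional features.
import torch
import torch.nn as nn
import torch.distributions as D

OBS_DIM, LATENT_DIM, N_MODES = 64, 16, 4  # illustrative sizes


class Stage1CVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder q(z | s_t, s_{t+H}): current + future observation embeddings
        self.encoder = nn.Sequential(nn.Linear(2 * OBS_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, 2 * LATENT_DIM))
        # decoder reconstructs the future observation from (s_t, z)
        self.decoder = nn.Sequential(nn.Linear(OBS_DIM + LATENT_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, OBS_DIM))
        # learned GMM prior p(z): mixture weights, means and (log) scales
        self.prior_logits = nn.Parameter(torch.zeros(N_MODES))
        self.prior_means = nn.Parameter(torch.randn(N_MODES, LATENT_DIM))
        self.prior_logstd = nn.Parameter(torch.zeros(N_MODES, LATENT_DIM))

    def prior(self):
        mix = D.Categorical(logits=self.prior_logits)
        comp = D.Independent(D.Normal(self.prior_means, self.prior_logstd.exp()), 1)
        return D.MixtureSameFamily(mix, comp)

    def forward(self, obs_t, obs_future):
        mu, logstd = self.encoder(torch.cat([obs_t, obs_future], -1)).chunk(2, -1)
        q = D.Independent(D.Normal(mu, logstd.exp()), 1)
        z = q.rsample()
        recon = self.decoder(torch.cat([obs_t, z], -1))
        recon_loss = ((recon - obs_future) ** 2).mean()
        # KL(q || GMM prior) has no closed form, so use a single-sample estimate
        kl = (q.log_prob(z) - self.prior().log_prob(z)).mean()
        return recon_loss + 0.1 * kl  # 0.1 is an arbitrary KL weight


model = Stage1CVAE()
loss = model(torch.randn(8, OBS_DIM), torch.randn(8, OBS_DIM))
loss.backward()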

Next, we use the Stage 1 policy to collect a handful of self-generated diverse rollouts, shown below. Every 10 timesteps, we sample a new latent subgoal from the GMM prior, and use it to condition the low-level policy. The diversity captured in the GMM prior ensures that the Stage 1 policy will exhibit different behaviors at trajectory intersections, resulting in novel trajectories with unseen start and goal pairs.

The Stage 1 trained policy is used to generate a self-supervised dataset that covers the space of start and goal states by composing seen behaviors together.
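The rollout procedure itself can be summarized in a few lines. The sketch below is a hypothetical outline: env, low_level_policy and prior stand in for components that are not shown here, and the horizon, resampling interval and environment interface are assumptions.

# Illustrative Stage 1 rollout loop: resample a latent subgoal from the learned
# GMM prior every 10 steps and let the low-level policy act toward it.
import torch

def collect_diverse_rollout(env, low_level_policy, prior, horizon=400, resample_every=10):
    obs = env.reset()
    trajectory = []
    subgoal = None
    for t in range(horizon):
        if t % resample_every == 0:
            # a fresh subgoal lets the rollout branch at trajectory intersections
            subgoal = prior.sample()
        with torch.no_grad():
            action = low_level_policy(obs, subgoal)
        next_obs, done = env.step(action)
        trajectory.append((obs, action))
        obs = next_obs
        if done:
            break
    return trajectory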

Finally, the self-generated dataset is used to train a new, goal-directed policy that can perform intentional behaviors from these undirected rollouts.

Stage 2 policy learning is just goal-conditioned behavioral cloning from the Stage 1 dataset, where the goals are final image observations from the trajectories collected in Stage 1.
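A minimal sketch of that Stage 2 step, under the assumption that each rollout is stored as a list of (observation, action) tensor pairs, might look like the following; the network size and the simple mean-squared-error loss are illustrative choices.

# Goal-conditioned behavioral cloning on the self-generated Stage 1 rollouts.
# The last recorded observation of a rollout serves as the goal for every step.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=64, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))

    def forward(self, obs, goal):
        return self.net(torch.cat([obs, goal], dim=-1))

def bc_loss(policy, rollout):
    goal = rollout[-1][0]                                 # final observation as the goal
    obs = torch.stack([o for o, _ in rollout])
    actions = torch.stack([a for _, a in rollout])
    pred = policy(obs, goal.expand_as(obs))
    return ((pred - actions) ** 2).mean()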

Real Robot Experiments

Data Collection

This is our hardware setup. We used a Franka robotic arm and two cameras for data collection – a front view camera and a wrist-mounted camera.

Hardware setup used in our work.

We used the RoboTurk phone teleoperation interface [5, 6] to collect human demonstrations. We collected only 50 demonstrations for each of the two tasks, and data collection took less than an hour.

Results

Below, we show the final trained Stage 2 model. We ask the robot to start from the initial state of one task, bread-in-container, and reach the goal of the other task, which is to put the bread in the oven. The goal is specified by providing an image observation that shows the bread in the oven. We emphasize that the policy is performing closed-loop visuomotor control at 20 Hz purely from image observations. Note that this task requires accurate, contact-rich manipulation and is long-horizon. With only visual information, our method can perform intricate behaviors such as grasping, pushing the oven tray into the oven, or manipulating a constrained mechanism like closing the door of the oven.

Our Stage 1 policy can recover all start and goal combinations, including both behaviors seen in training and new, unseen behaviors.

Finally, we show that our method is robust to unexpected situations. In the case below (left), the policy gets stuck because of conflicting supervision. Sampling latent goals allows the policy to get unstuck and complete the task successfully. Our policy is also very reactive and can quickly recover from errors. In the case below (right), the policy failed to grasp the bread twice and finally succeeded on the third attempt. It also made two attempts to get a good grasp of the bowl, and completed the task successfully.

Summary

  • Imitation learning is an effective and safe technique to train robot policies in the real world because it does not depend on an expensive random exploration process. However, due to the lack of exploration, learning policies that generalize beyond the demonstrated behaviors is still an open challenge.
  • Our key insight is that multi-task domains often present a latent structure, where demonstrated trajectories for different tasks intersect at common regions of the state space.
  • We present Generalization Through Imitation (GTI), a two-stage offline imitation learning algorithm that exploits this intersecting structure to train goal-directed policies that generalize to unseen start and goal state combinations.
  • We validate GTI on a real robot kitchen domain and showcase the capacity of trained policies to solve both seen and unseen task configurations.

This blog post is based on our paper “Learning to Generalize Across Long-Horizon Tasks from Human Demonstrations.”

References:

  1. Quillen, D., Jang, E., Nachum, O., Finn, C., Ibarz, J., & Levine, S. (2018, May). Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 6284-6291). IEEE. 

  2. Cabi, S., Colmenarejo, S. G., Novikov, A., Konyushkova, K., Reed, S., Jeong, R., … & Sushkov, O. (2019). A Framework for Data-Driven Robotics. arXiv preprint arXiv:1909.12200. 

  3. Zhang, T., McCarthy, Z., Jow, O., Lee, D., Chen, X., Goldberg, K., & Abbeel, P. (2018, May). Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1-8). IEEE. 

  4. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. 

  5. Mandlekar, A., Zhu, Y., Garg, A., Booher, J., Spero, M., Tung, A., … & Savarese, S. (2018). Roboturk: A crowdsourcing platform for robotic skill learning through imitation. arXiv preprint arXiv:1811.02790. 

  6. Mandlekar, A., Booher, J., Spero, M., Tung, A., Gupta, A., Zhu, Y., … & Fei-Fei, L. (2019). Scaling Robot Supervision to Hundreds of Hours with RoboTurk: Robotic Manipulation Dataset through Human Reasoning and Dexterity. arXiv preprint arXiv:1911.04052. 

Achieving 1.85x higher performance for deep learning based object detection with an AWS Neuron compiled YOLOv4 model on AWS Inferentia

In this post, we show you how to deploy a TensorFlow-based YOLOv4 model, using Keras optimized for inference, on AWS Inferentia-based Amazon EC2 Inf1 instances. You will set up a benchmarking environment to evaluate throughput and precision, comparing Inf1 with comparable Amazon EC2 G4 GPU-based instances. Deploying YOLOv4 on AWS Inferentia provides the highest throughput, the lowest latency with minimal latency jitter, and the lowest cost per image.

The following charts show a 2-hour run in which Inf1 provides higher throughput and lower latency. The Inf1 instances achieved up to 1.85 times higher throughput and 37% lower cost per image when compared to the most optimized Amazon EC2 G4 GPU-based instances.

In addition, the following graph shows that P90 inference latency is 60% lower on Inf1, with significantly lower variance than on the G4 instances.

When you use the AWS Neuron data type auto-casting feature, there is no measurable degradation in accuracy. The compiler automatically converts the pipeline to mixed precision with BF16 data types for increased performance. The model reaches 48.7% mean average precision—thanks to the state-of-the-art YOLOv4 model implementation.

About AWS Inferentia and AWS Neuron SDK

AWS Inferentia chips are custom built by AWS to provide high inference performance at the lowest cost of inference in the cloud. They offer features such as automatic conversion of trained FP32 models to Bfloat16 and a flexible machine learning (ML) compute architecture that supports a wide range of model types, from image recognition and object detection to natural language processing (NLP) and modern recommender models.

AWS Neuron is a software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimize the ML inference performance of the Inferentia chips. Neuron is natively integrated with popular ML frameworks such as TensorFlow and PyTorch, and comes pre-installed in the AWS Deep Learning AMIs. Deploying deep learning models on AWS Inferentia is therefore done in the same familiar environment used on other platforms, and your applications benefit from the boost in performance at the lowest cost.

Since its launch, the Neuron SDK has seen dramatic improvement in the breadth of models that deliver high performance at a fraction of the cost. This includes NLP models like the popular BERT, image classification models (ResNet, VGG), and object detection models (OpenPose and SSD). The latest Neuron release (1.8.0) provides optimizations that improve performance of YOLO v3 and v4, VGG16, SSD300, and BERT. It also improves operational deployments of large-scale inference applications, with a session management agent incorporated into all supported ML frameworks and a new Neuron tool that lets you easily scale monitoring of large fleets of inference applications.

You Only Look Once

Object detection stands out as a computer vision (CV) task that has seen large accuracy improvements (average precision at 50 IoU > 70) thanks to deep learning model architectures. An object detection model tries to localize and classify objects in an image, allowing for applications ranging from real-time inspection of manufacturing defects to medical imaging and tracking your favorite player and the ball in a soccer match.

Addressing the real-time inference challenges of such computer vision tasks is key for deploying these models at scale.

YOLO is part of the deep learning (DL) single-stage object detection model family, which includes models such as Single-Shot Detector (SSD) and RetinaNet. These models are usually built from stacking a backbone, neck, and head neural network that together perform detection and classification tasks. The main predictions are bounding boxes for identified objects and associated classes.

The backbone network takes care of extracting features from the input image, while the head is trained on the supervised task of predicting the edges of the bounding boxes and classifying their contents. The addition of a neck neural network allows the head network to process features from intermediate stages of the backbone. The whole pipeline processes the image only once, hence the name You Only Look Once (YOLO).
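To make the wiring concrete, the following schematic (written with tf.keras purely for illustration; these are not YOLOv4’s real layers) shows how a neck can fuse an upsampled deep feature map with an intermediate backbone feature map before the head makes per-cell predictions.

# Schematic backbone/neck/head wiring for a single-stage detector (illustrative only).
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(416, 416, 3))

# backbone: progressively downsampled feature maps
c3 = layers.Conv2D(64, 3, strides=8, padding='same', activation='relu')(inputs)
c5 = layers.Conv2D(256, 3, strides=4, padding='same', activation='relu')(c3)

# neck: bring deep, coarse features back up and fuse them with the intermediate map
p5_up = layers.UpSampling2D(4)(c5)
neck = layers.Concatenate()([c3, p5_up])

# head: per-cell predictions, e.g. 3 anchors x (4 box coords + 1 objectness + 80 classes)
head = layers.Conv2D(3 * (4 + 1 + 80), 1)(neck)

model = tf.keras.Model(inputs, head)
model.summary()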

On the other hand, models with two-stage detectors further process features from the earlier convolutional layers to obtain region proposals before generating object class predictions. In this way, the network focuses on detecting and classifying objects in regions of high object probability.

The following diagram illustrates this architecture (from YOLOv4: Optimal Speed and Accuracy of Object Detection, arXiv:2004.10934v1).

Single-stage models allow for multiple predictions of the same object in a single image. These predictions are disambiguated later by a process called non-max suppression (NMS), which keeps only the highest-probability bounding box and label for each object. It’s a less computationally costly workflow than the two-stage approach.
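For reference, a compact greedy NMS can be written in a few lines of NumPy; this is a generic sketch of the idea, not the post-processing code used in the YOLOv4 pipeline.

# Greedy non-max suppression over boxes given as (x1, y1, x2, y2).
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]          # highest-confidence boxes first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # intersection of the kept box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        # keep only boxes that do not overlap the chosen box too heavily
        order = rest[iou <= iou_threshold]
    return keep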

Models like YOLO are all about performance. Its latest incarnation, version 4, aims at pushing prediction accuracy even further. The research paper YOLOv4: Optimal Speed and Accuracy of Object Detection shows how real-time inference can be achieved above the human-perception threshold of around 30 frames per second (FPS). In this post, you explore ways to push the performance of this model even further and use AWS Inferentia as a cost-effective hardware accelerator for real-time object detection.

Prerequisites

For this walkthrough, you need an AWS account with access to the AWS Management Console and the ability to create Amazon Elastic Compute Cloud (Amazon EC2) instances with public-facing IP.

Working knowledge of AWS Deep Learning AMIs and Jupyter notebooks with Conda environments is beneficial, but not required.

Building a YOLOv4 predictor from a pre-trained model

To start building the model, set up an inf1.2xlarge EC2 instance in AWS, with 8 vCPU cores and 16 GB of memory. The Inf1 family allows for optimizing the ratio between CPU and Inferentia devices through the selection of inf1.xlarge or inf1.2xlarge. We found that for YOLOv4, the optimal CPU-to-accelerator balance is achieved with inf1.2xlarge; going up to the second instance size improves throughput for a lower cost per image. Use the AWS Deep Learning AMI (Ubuntu 18.04) version 34.0—ami-06a25ee8966373068—in the US East (N. Virginia) Region. This AMI comes pre-packaged with the Neuron SDK and the required Neuron runtime for AWS Inferentia. For more information about running AWS Deep Learning AMIs on EC2 instances, see Launching and Configuring a DLAMI.

Next you can connect to the instance through SSH, activate the aws_neuron_tensorflow_p36 Conda environment, and update the Neuron compiler to the latest release. The compilation script depends on requirements listed in the YOLOv4 tutorial posted on the Neuron GitHub repo. Install them by running the following code in the terminal:

pip install neuron-cc tensorflow-neuron requests pillow matplotlib pycocotools==2.0.1 torch~=1.5.0 --force --extra-index-url=https://pip.repos.neuron.amazonaws.com

You can also run the following steps directly from the provided Jupyter notebook. If doing so, skip to the Running a performance benchmark on Inferentia section to explore the performance benefits of running YOLOv4 on AWS Inferentia.

Benchmarking the models requires an object detection validation dataset. Start by downloading the COCO 2017 validation dataset. COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset with over 300,000 images and 1.5 million object instances. The 2017 version of COCO contains 5,000 images for validation.

To download the dataset, enter the following code on the terminal:

curl -LO http://images.cocodataset.org/zips/val2017.zip
curl -LO http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip -q val2017.zip
unzip annotations_trainval2017.zip

When the download is complete, you should see a val2017 and an annotations folder available in your working directory. At this stage, you’re ready to build and compile the model.

The GitHub repo contains the script yolo_v4_coco_saved_model.py for downloading the pretrained weights of a PyTorch implementation of YOLOv4, and the model definition for YOLOv4 using TensorFlow 1.15 and Keras. The code was adapted from an earlier implementation and converts the PyTorch checkpoint to a Keras h5 saved model. This implementation of YOLOv4 is optimized to run on AWS Inferentia. For more information about optimizations, see Working with YOLO v4 using AWS Neuron SDK.

To download, convert, and save your Keras model to the yolo_v4_coco_saved_model folder, enter the following code:

python3 yolo_v4_coco_saved_model.py ./yolo_v4_coco_saved_model

To instantiate a new predictor from the saved model, use tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model') on your inference script.

The following code implements a single batch predictor and image annotation script, so you can test the saved model:

import json
import tensorflow as tf
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches

yolo_pred_cpu = tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model')
image_path = './val2017/000000581781.jpg'
with open(image_path, 'rb') as f:
    feeds = {'image': [f.read()]}

results = yolo_pred_cpu(feeds)

# load annotations to decode classification result
with open('./annotations/instances_val2017.json') as f:
    annotate_json = json.load(f)
label_info = {idx+1: cat['name'] for idx, cat in enumerate(annotate_json['categories'])}

# draw picture and bounding boxes
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(Image.open(image_path).convert('RGB'))

wanted = results['scores'][0] > 0.1

for xyxy, label_no_bg in zip(results['boxes'][0][wanted], results['classes'][0][wanted]):
    xywh = xyxy[0], xyxy[1], xyxy[2] - xyxy[0], xyxy[3] - xyxy[1]
    rect = patches.Rectangle((xywh[0], xywh[1]), xywh[2], xywh[3], linewidth=1, edgecolor='g', facecolor='none')
    ax.add_patch(rect)
    rx, ry = rect.get_xy()
    rx = rx + rect.get_width() / 2.0
    ax.annotate(label_info[label_no_bg + 1], (rx, ry), color='w', backgroundcolor='g', fontsize=10,
                ha='center', va='center', bbox=dict(boxstyle='square,pad=0.01', fc='g', ec='none', alpha=0.5))
plt.show()

The performance in this setup isn’t optimal because you ran YOLO only on CPU. Despite the native parallelization from TensorFlow, the eight cores aren’t enough to bring the inference time close to real time. For that, you use AWS Inferentia.

Compiling YOLOv4 to run on AWS Inferentia

The compilation of YOLOv4 uses the TensorFlow-Neuron API tfn.saved_model.compile, working directly with the saved model directory created before. To further reduce the Neuron runtime overhead, two extra arguments are added to the compiler call: no_fuse_ops and minimum_segment_size.

The first argument, no_fuse_ops, partitions the graph prior to casting the FP16 tensors running in the sub-graph back to FP32, as defined in the model script. This allows for operations that run more efficiently on CPU to be skipped while the Neuron compiler runs its automatic smart partitioning. The argument minimum_segment_size sets the minimum number of operations in a sub-graph, to enforce trivial compilable sections to run on CPU. For more information, see Reference: TensorFlow-Neuron Compilation API.

To compile the model, enter the following code:

import shutil
import tensorflow as tf
import tensorflow.neuron as tfn


def no_fuse_condition(op):
    return any(op.name.startswith(pat) for pat in ['reshape', 'lambda_1/Cast', 'lambda_2/Cast', 'lambda_3/Cast'])

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ['serve'], './yolo_v4_coco_saved_model')
    no_fuse_ops = [op.name for op in sess.graph.get_operations() if no_fuse_condition(op)]

shutil.rmtree('./yolo_v4_coco_saved_model_neuron', ignore_errors=True)

result = tfn.saved_model.compile(
                './yolo_v4_coco_saved_model', './yolo_v4_coco_saved_model_neuron',
                # we partition the graph before casting from float16 to float32, to help reduce the output tensor size by 1/2
                no_fuse_ops=no_fuse_ops,
                # to enforce trivial compilable subgraphs to run on CPU
                minimum_segment_size=100,
                batch_size=1,
                dynamic_batch_size=True,
)

print(result)

On an inf1.2xlarge, the compilation takes only a few minutes and outputs the ratio of the graph operations that run on the AWS Inferentia chip. For our model, it’s approximately 79%. As mentioned earlier, to optimize the compiled model for performance, the target of the compilation shouldn’t be to maximize operations on the AWS Inferentia chip, but to balance the use of the available CPUs for efficient combined hardware utilization.

AWS Inferentia is designed to reach peak throughput at small—usually single-digit—batch sizes. When optimizing a specific model for throughput, explore compiling the model with different values of the batch_size argument and test what batch size yields the maximum throughput for your model. In the case of our YOLOv4 model, the best batch size is 1.
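A simple way to run such a sweep is to recompile the saved model for a few candidate batch sizes and time each variant on a subset of the COCO images, as in the following sketch. Directory names and the warm-up/timing loop are illustrative, and you can pass the same no_fuse_ops list computed earlier to match the compilation above.

# Illustrative batch-size sweep: recompile for each candidate batch size and measure throughput.
import glob
import time
import shutil
import numpy as np
import tensorflow as tf
import tensorflow.neuron as tfn

raw_images = [open(path, 'rb').read() for path in sorted(glob.glob('./val2017/*.jpg'))[:64]]

for bs in [1, 2, 4]:
    out_dir = './yolo_v4_coco_saved_model_neuron_bs{}'.format(bs)
    shutil.rmtree(out_dir, ignore_errors=True)
    # optionally add no_fuse_ops=no_fuse_ops from the compilation step above
    tfn.saved_model.compile('./yolo_v4_coco_saved_model', out_dir,
                            minimum_segment_size=100,
                            batch_size=bs, dynamic_batch_size=True)
    predictor = tf.contrib.predictor.from_saved_model(out_dir)
    batches = [raw_images[i:i + bs] for i in range(0, len(raw_images) - bs + 1, bs)]
    predictor({'image': np.array(batches[0], dtype=object)})   # warm-up call
    start = time.time()
    for batch in batches:
        predictor({'image': np.array(batch, dtype=object)})
    fps = bs * len(batches) / (time.time() - start)
    print('batch_size={}: {:.1f} images/sec'.format(bs, fps))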

Replace the model path in the predictor instantiation with tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model_neuron') for a comparison with the previous CPU-only inference. You get similar detection accuracy at a fraction of the inference time: approximately 40 milliseconds.
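If you want to verify the latency difference yourself, a small timing loop over both saved models is enough; the exact numbers depend on your instance, so treat this as an illustrative measurement script rather than the benchmark used later in this post.

# Time single-image predictions on the CPU-only and Neuron-compiled saved models.
import time
import tensorflow as tf

with open('./val2017/000000581781.jpg', 'rb') as f:
    feeds = {'image': [f.read()]}

for name, path in [('cpu', './yolo_v4_coco_saved_model'),
                   ('neuron', './yolo_v4_coco_saved_model_neuron')]:
    predictor = tf.contrib.predictor.from_saved_model(path)
    predictor(feeds)                                    # warm-up, excludes one-time setup cost
    start = time.time()
    for _ in range(20):
        predictor(feeds)
    print('{}: {:.1f} ms per image'.format(name, (time.time() - start) / 20 * 1000))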

Setting up a benchmarking pipeline

To set up a performance measuring pipeline, create a multi-threaded loop running inference on all the COCO images downloaded. The code available in the notebook adapts the original implementation of the eval function. The following adapted version implements a ThreadPoolExecutor to send four parallel prediction calls at a time:

from concurrent import futures

def evaluate(yolo_predictor, images, eval_pre_path, anno_file, eval_batch_size, _clsid2catid):
    batch_im_id_list, batch_im_name_list, batch_img_bytes_list = get_image_as_bytes(images, eval_pre_path)

    # warm up
    yolo_predictor({'image': np.array(batch_img_bytes_list[0], dtype=object)})

    with futures.ThreadPoolExecutor(4) as exe:
        fut_im_list = []
        fut_list = []
        start_time = time.time()
        for batch_im_id, batch_im_name, batch_img_bytes in zip(batch_im_id_list, batch_im_name_list, batch_img_bytes_list):
            if len(batch_img_bytes) != eval_batch_size:
                continue
            fut = exe.submit(yolo_predictor, {'image': np.array(batch_img_bytes, dtype=object)})
            fut_im_list.append((batch_im_id, batch_im_name))
            fut_list.append(fut)
        bbox_list = []
        count = 0
        for (batch_im_id, batch_im_name), fut in zip(fut_im_list, fut_list):
            results = fut.result()
            bbox_list.extend(analyze_bbox(results, batch_im_id, _clsid2catid))
            for _ in batch_im_id:
                count += 1
                if count % 100 == 0:
                    print('Test iter {}'.format(count))
        
        print('==================== Performance Measurement ====================')
        print('Finished inference on {} images in {} seconds'.format(len(images), time.time() - start_time))
        print('=================================================================')
    
    # start evaluation
    box_ap_stats = bbox_eval(anno_file, bbox_list)
    return box_ap_stats

Additional helper functions are used to calculate average precision scores of the deployed model.

Running a performance benchmark on Inferentia

To run the COCO evaluation and benchmark the time to infer over the 5,000 images, run the evaluate function as shown in the following code:

val_coco_root = './val2017'
val_annotate = './annotations/instances_val2017.json'
clsid2catid = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 13, 12: 14, 13: 15, 14: 16,
               15: 17, 16: 18, 17: 19, 18: 20, 19: 21, 20: 22, 21: 23, 22: 24, 23: 25, 24: 27, 25: 28, 26: 31,
               27: 32, 28: 33, 29: 34, 30: 35, 31: 36, 32: 37, 33: 38, 34: 39, 35: 40, 36: 41, 37: 42, 38: 43,
               39: 44, 40: 46, 41: 47, 42: 48, 43: 49, 44: 50, 45: 51, 46: 52, 47: 53, 48: 54, 49: 55, 50: 56,
               51: 57, 52: 58, 53: 59, 54: 60, 55: 61, 56: 62, 57: 63, 58: 64, 59: 65, 60: 67, 61: 70, 62: 72,
               63: 73, 64: 74, 65: 75, 66: 76, 67: 77, 68: 78, 69: 79, 70: 80, 71: 81, 72: 82, 73: 84, 74: 85,
               75: 86, 76: 87, 77: 88, 78: 89, 79: 90}
eval_batch_size = 8

with open(val_annotate, 'r', encoding='utf-8') as f2:
    for line in f2:
        line = line.strip()
        dataset = json.loads(line)
        images = dataset['images']

box_ap = evaluate(yolo_pred, images, val_coco_root, val_annotate, eval_batch_size, clsid2catid)

When the evaluation is complete, you see logs on the screen like the following:

…

Test iter 4500
Test iter 4600
Test iter 4700
Test iter 4800
Test iter 4900
==================== Performance Measurement ====================
Finished inference on 5000 images in 47.50522780418396 seconds
=================================================================

…

Accumulating evaluation results...
DONE (t=6.78s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.487
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.741
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.531
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.546
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.604
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.573
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.601
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.430
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.657
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.744

At 5,000 images processed in 47 seconds, this deployment achieves 106 FPS, 3.5 times faster than the real-time threshold of 30 FPS. The research paper YOLOv4: Optimal Speed and Accuracy of Object Detection lists the results for batch-one performance over the same COCO 2017 dataset running on an NVIDIA Volta GPU, such as the V100. The largest frame rate obtained was 96 FPS, at 41.2% mAP. Our model architecture and deployment achieve a higher mAP, 48.7%, with a higher frame rate.

To have a direct comparison between AWS Inferentia and the NVIDIA Volta and Turing architectures, we replicated the same experiment on two GPU-based instances, g4dn.xlarge and p3.2xlarge, running the exact same model prior to compilation, with no further GPU optimization. This time we achieved 39 FPS and 111 FPS on the g4dn.xlarge and p3.2xlarge, respectively.

A YOLO model deployed in production usually doesn’t see a defined batch of 5,000 images at a time. To measure production-like performance, we set up a prediction-only multi-threaded pipeline that runs inference for extended periods.

For a total of 2 hours, we continually ran 8 parallel prediction calls with a batch of 4 images each, totaling 32 images at a time. To maximize GPU throughput and narrow the performance gap between the Inf1 and G4 instances, we used the TensorFlow XLA compiler. This setup mimics the behavior of a live endpoint running at maximum throughput.

GPU thermal throttling

In contrast to AWS Inferentia chips, GPU throughput is inversely proportional to GPU temperature. GPU temperature can vary on endpoints running for extended periods at high throughput, which leads to FPS and latency fluctuations. This effect is known as thermal throttling. Some production systems cap throughput below the maximum achievable to avoid performance swings over time. The following graph shows the average FPS over 30-second increments for the duration of the test. We observed up to 12% variation in the rolling FPS average on the GPU instance. On AWS Inferentia, this variation is below 3% for a substantially larger FPS average.

During the 2-hour period, we ran inference on over 856,000 images on the inf1.2xlarge instance. On the g4dn.xlarge, the maximum number of inferences achieved was 486,000. That amounts to 76% more images processed over the same amount of time using AWS Inferentia! Latency averages for batch 4 inference are also 60% lower for AWS Inferentia.

Using the total throughput collected during our 2-hour test, we calculated that the price of running 1 million inferences is $1.362 on an inf1.xlarge in the us-east-1 Region. For the g4dn.xlarge, the price is $2.163—a 37% price reduction for the YOLOv4 object detection pipeline on AWS Inferentia.
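The calculation itself is straightforward: multiply the hourly on-demand price by the number of hours needed to process 1 million images at the measured throughput. The values in the snippet below are placeholders, not the measured throughput or current prices.

# Cost per 1 million inferences from measured throughput and an hourly instance price.
def cost_per_million_inferences(images_per_second, hourly_price_usd):
    hours_needed = 1e6 / images_per_second / 3600.0
    return hourly_price_usd * hours_needed

# example with hypothetical numbers; substitute your own throughput and us-east-1 rates
print(cost_per_million_inferences(images_per_second=100.0, hourly_price_usd=0.50))  # ~1.39 USD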

Safely shutting down and cleaning up

On the Amazon EC2 console, choose the instances used to perform the benchmark, and choose Terminate from the Actions drop-down menu. Terminating the instance discards any data stored only on that instance. You can persist the compiled model in an Amazon Simple Storage Service (Amazon S3) bucket so it can be reused later. If you’ve made changes to the code inside the instances, remember to persist those as well.

Conclusion

In this post, you walked through the steps of optimizing a TensorFlow YOLOv4 model to run on AWS Inferentia. You explored AWS Neuron optimizations that yield better model performance with improved average precision, and in a much more cost-effective way. In production, the Neuron compiled model is up to 37% less expensive in the long run, with little throughput and latency fluctuations, when compared to the most optimized GPU instance.

Some of the steps described in this post also apply to other ML model types and frameworks. For more information, see the AWS Neuron SDK GitHub repo.

Learn more about the AWS Inferentia chip and the Amazon EC2 Inf1 instances to get started with running your own custom ML pipelines on AWS Inferentia using the Neuron SDK.


About the Authors

Fabio Nonato de Paula is a Principal Solutions Architect for Autonomous Computing in AWS. He works with large-scale deployments of ML and AI for autonomous and intelligent systems. Fabio is passionate about democratizing access to accelerated computing and distributed ML. Outside of work, you can find Fabio riding his motorcycle on the hills of Livermore valley or reading ComiXology.

Haichen Li is a software development engineer in the AWS Neuron SDK team. He works on integrating machine learning frameworks with the AWS Neuron compiler and runtime systems, as well as developing deep learning models that benefit particularly from the Inferentia hardware.

Samuel Jacob is a senior software engineer in the AWS Neuron team. He works on AWS Neuron runtime to enable high performance inference data paths between AWS Neuron SDK and AWS Inferentia hardware. He also works on tools to analyze and improve AWS Neuron SDK performance. Outside of work, you can catch him playing video games or tinkering with small boards such as RaspberryPi.

Simulation Without Limits: DRIVE Sim Levels Up with NVIDIA Omniverse

The line between the physical and virtual worlds is blurring as autonomous vehicle simulation sharpens with NVIDIA Omniverse, our photorealistic 3D simulation and collaboration platform.

During the GPU Technology Conference keynote, NVIDIA founder and CEO Jensen Huang showcased for the first time NVIDIA DRIVE Sim running on NVIDIA Omniverse. DRIVE Sim leverages the cutting-edge capabilities of the platform for end-to-end, physically accurate autonomous vehicle simulation.

Omniverse was architected from the ground up to support multi-GPU, large-scale, multisensor simulation for autonomous machines. It enables ray-traced, physically accurate, real-time sensor simulation with NVIDIA RTX technology.

The video shows a digital twin of a Mercedes-Benz EQS driving a 17-mile route around a recreated version of the NVIDIA campus in Santa Clara, Calif. It includes Highways 101 and 87 and Interstate 280, with traffic lights, on-ramps, off-ramps and merges as well as changes to the time of day, weather and traffic.

To achieve the real-world replica of the testing loop, the real environment was scanned at 5-cm accuracy and recreated in simulation. The hardware, software, sensors, car displays and human-machine interaction were all implemented in simulation in the exact same way as the real world, enabling bit- and timing-accurate simulation.

Physically Accurate Sensor Simulation 

Autonomous vehicle simulation requires accurate physics and light modeling. This is especially critical for simulating sensors, which requires modeling rays beyond the visible spectrum and accurate timing between the sensor scan and environment changes.

Ray tracing is perfectly suited for this, providing realistic lighting by simulating the physical properties of light. And the Omniverse RTX renderer coupled with NVIDIA RTX GPUs enables ray tracing at real-time frame rates.

The capability to simulate light in real time has significant benefits for autonomous vehicle simulation. In the video, the vehicles show complex reflections of objects in the scene — including those not directly in the frame, just as they would in the real world. This also applies to other reflective surfaces such as wet roadways, reflective signs and buildings.

The Mercedes EQS shows the complexity of reflections enabled with ray tracing, including reflections of objects that are in the scene, but not in the frame.

RTX also enables high-fidelity shadows. Typically in virtual environments, shadows are pre-computed or pre-baked. However, to provide a dynamic environment for simulation, pre-baking isn’t possible. RTX enables high-fidelity shadows to be computed at run-time. In the night parking example from the video, the shadows from the lights are rendered directly instead of being pre-baked. This leads to shadows that appear softer and are much more accurate.

Nighttime parking scenarios show the benefit of ray tracing for complex shadows generated by dynamic light sources.

Universal Scene Description

DRIVE Sim is based on Universal Scene Description, an open framework developed by Pixar to build and collaborate on 3D content for virtual worlds.

USD provides a high level of abstraction to describe scenes in DRIVE Sim. For instance, USD makes it easy to define the state of the vehicle (position, velocity, acceleration) and trigger changes based on its proximity to other entities such as a landmark in the scene.

Also, the framework comes with a rich toolset and is supported by most major content creation tools.

Scalability and Repeatability

Most applications for generating virtual environments are targeted at systems with one or two GPUs, such as PC games. While the timing and latency of such architectures may be good enough for consumer games, designing a repeatable simulator for autonomous vehicles requires a much higher level of precision and performance.

Omniverse enables DRIVE Sim to simultaneously simulate multiple cameras, radars and lidars in real time, supporting sensor configurations from Level 2 assisted driving to Level 4 and Level 5 fully autonomous driving.

Together, these new capabilities brought to life by Omniverse deliver a simulation experience that is virtually indistinguishable from reality.

Watch NVIDIA CEO Jensen Huang recap all the news from GTC: 

It’s not too late to get access to hundreds of live and on-demand talks at GTC. Register now through Oct. 9 using promo code CMB4KN to get 20 percent off.

Do the Robot: Free Online Training, AI Certifications Make It Easy to Learn and Teach Robotics

On land, underwater, in the air — even underground and on other planets — new autonomous machines and the applications that run on them are emerging daily.

Robots are working on construction sites to improve safety, they’re on factory floors to enhance logistics and they’re roaming farm rows to pick weeds and harvest crops.

As AI-powered autonomous machines proliferate, a new generation of students and developers will play a critical role in teaching and training these robots how to behave in the real world.

To help people get started, we’ve announced the availability of free online training and AI-certification programs. Aptly timed with World Teachers’ Day, these resources open up the immense potential of AI and robotics teaching and learning.

And there’s no better way to get hands-on learning and experience than with the new Jetson Nano 2GB Developer Kit, priced at just $59. NVIDIA CEO Jensen Huang announced this ultimate starter AI computer during the GPU Technology Conference on Monday. Incredibly affordable, the Jetson Nano 2GB helps make AI accessible to everyone.

New AI Certification Programs for Teachers and Students

NVIDIA offers two AI certification tracks to educators, students and engineers looking to reskill. Both are part of the NVIDIA Deep Learning Institute:

  • NVIDIA Jetson AI Specialist: This certification can be completed by anyone and recognizes competency in Jetson and AI using a hands-on, project-based assessment. This track is meant for engineers looking to reskill and advanced learners to build on their knowledge.
  • NVIDIA Jetson AI Ambassador: This certification is for educators and leaders at robotics institutions. It recognizes competency in teaching AI on Jetson using a project-based assessment and an interview with the NVIDIA team. This track is ideal for educators or instructors to get fully prepared to teach AI to students.

Additionally, the Duckietown Foundation is offering a free edX course on AI and robotics based on the new NVIDIA Jetson Nano 2GB Developer Kit.

“NVIDIA’s Jetson AI certification materials thoroughly cover the fundamentals with the added advantage of hands-on project-based learning,” said Jack Silberman, Ph.D., lecturer at UC San Diego, Jacobs School of Engineering, Contextual Robotics Institute. “I believe these benefits provide a great foundation for students to prepare for university robotics courses and compete in robotics competitions.”

“We know how important it is to provide all students with opportunities to impact the future of technology,” added Christine Nguyen, STEM curriculum director at the Boys & Girls Club of Western Pennsylvania. “We’re excited to utilize the NVIDIA Jetson AI Specialist certification materials with our students as they work towards being leaders in the fields of AI and robotics.”

“Acquiring new technical skills with a hands-on approach to AI learning becomes critical as AIoT drives the demand for interconnected devices and increasingly complex industrial applications,” said Matthew Tarascio, vice president of Artificial Intelligence at Lockheed Martin. “We’ve used the NVIDIA Jetson platform as part of our ongoing efforts to train and prepare our global workforce for the AI revolution.”

By making it easy to “teach the teachers” with hands-on AI learning and experimentation, Jetson is enabling a new generation to build a smarter, safer AI-enabled future.

Watch NVIDIA CEO Jensen Huang recap autonomous machines news at GTC:

It’s not too late to get access to hundreds of live and on-demand talks at GTC. Register now through Oct. 9 using promo code CMB4KN to get 20 percent off.

Hands-On AI: Duckietown Foundation Offering Free edX Robotics Course Powered by NVIDIA Jetson Nano 2GB

For many, the portal into AI is robotics. And one of the best ways to get good at robotics is to get hands on.

Roll up your sleeves, because this week at NVIDIA’s GPU Technology Conference, the Duckietown Foundation announced that it’s offering a free edX course on AI and robotics using the Duckiebot hardware platform powered by the new NVIDIA Jetson Nano 2GB Developer Kit.

The Duckietown project, which started as an MIT class in 2016, has evolved into an open-source platform for robotics and AI education, research and outreach. The project is coordinated by the Duckietown Foundation, whose mission is to reach and teach a wide audience of students about robotics and AI.

It does this through hands-on learning activities in which students put AI and robotics components together to address modern autonomy challenges for self-driving cars. Solutions are implemented in the Duckietown robotics ecosystem, where the interplay among theory, algorithms and deployment on real robots is witnessed firsthand in a model urban environment.

The Jetson Nano 2GB Developer Kit has the performance and capability to run a diverse set of AI models and frameworks. This makes it the ultimate AI starter computer for learning and creating AI applications.

The new devkit is the latest offering in the NVIDIA Jetson AI at the Edge platform, which ranges from entry-level AI devices to advanced platforms for fully autonomous machines. To help people get started with robotics, NVIDIA also announced the availability of free online training and AI-certification programs.

“The Duckietown educational platform provides a hands-on, scaled down, accessible version of real world autonomous systems,” said Emilio Frazzoli, professor of Dynamic Systems and Control at ETH Zurich and advisor for the Duckietown Foundation. “Integrating NVIDIA’s Jetson Nano power in Duckietown enables unprecedented, affordable access to state-of-the-art compute solutions for learning autonomy.”

Another highlight of the course is the Duckietown Autolab remote infrastructure, which enables remote evaluation of robotic agents developed by learners on Duckiebot robots at home, providing feedback on assignments. This lets the course offer a realistic development flow with evaluation on real hardware.

Duckiebot powered by Jetson Nano 2GB.

Enrollment is now open for the free edX course, called “Self-Driving Cars with Duckietown,” which starts in February. To find out more about the technical specifications of the new NVIDIA-powered Duckiebot, or to pre-order one, check out the Duckietown store.

The AI Driving Olympics

For more advanced students, or for people who just want to witness the fun, Duckietown has created the “AI Driving Olympics” (AI-DO) competition. It focuses on autonomous vehicles with the objective of evaluating the state of the art in embodied AI, by benchmarking novel machine learning approaches to autonomy in a set of fun challenges.

AI-DO is made up of a series of increasingly complex tasks — from simple lane-following to fleet management. For each, competitors use various resources, such as simulation, logs, code templates, baseline implementations, and standardized physical autonomous Duckiebots operating in Duckietown, a formally defined urban environment.

Submissions are evaluated in simulation on the cloud, physically in remote Duckietown Autolabs, and running on actual Duckiebots at the live finals competition.

Teams can participate remotely at any stage of the competition. They just need to send their source code packaged as a Docker image. They can then use Duckietown Autolabs, facilities that allow remote experimentation in reproducible settings.

The next AI-DO race will be at NeurIPS, Dec. 6-12.

Duckietown classes and labs are offered at 80+ universities, including ETH Zürich and Université de Montréal. Curriculum materials for undergraduate and graduate courses are available open source. This includes weekly lecture plans, open source software, and a modular, do-it-yourself hardware smart city environment with autonomous driving car kits.

Watch NVIDIA CEO Jensen Huang recap all the autonomous machines news announced at GTC:

Locomation and Blackshark.ai Innovate in Real and Virtual Dimensions at GTC

The NVIDIA DRIVE ecosystem is going multidimensional.

During the NVIDIA GPU Technology Conference this week, autonomous trucking startup Locomation and simulation company Blackshark.ai announced technological developments powered by NVIDIA DRIVE.

Locomation, a Pittsburgh-based provider of autonomous trucking technology, said it would integrate NVIDIA DRIVE AGX Orin in the upcoming rollout of its platooning system on public roads in 2022.

Innovating in the virtual world, Blackshark.ai detailed its toolset to create buildings and landscape assets for simulation environments on NVIDIA DRIVE Sim.

Together, these announcements mark milestones in the path toward safer, more efficient autonomous transportation.

Shooting for the Platoon

Locomation recently announced its first commercial system, Autonomous Relay Convoy, which allows one driver to pilot a lead truck while a fully autonomous follower truck operates in tandem.

The ARC system will be deployed with Wilson Logistics, which will operate more than 1,000 Locomation-equipped trucks, powered by NVIDIA DRIVE AGX Orin, starting in 2022.

NVIDIA DRIVE AGX Orin is a highly advanced software-defined platform for autonomous vehicles. The system features the new Orin system-on-a-chip, which delivers more than 200 trillion operations per second — nearly 7x the performance of NVIDIA’s previous-generation Xavier SoC.

In August, Locomation and Wilson Logistics successfully completed the first-ever on-road pilot program transporting commercial freight using ARC. Two Locomation trucks, hauling Wilson Logistics trailers and freight, were deployed on a 420-mile long route along I-84 between Portland, Ore., and Nampa, Idaho. This stretch of interstate has some of the most challenging road conditions for truck driving, with curvatures, inclines and wind gusts.

“We’re moving rapidly toward autonomous trucking commercialization, and NVIDIA DRIVE presents a solution for providing a robust, safety-forward platform for our team to work with,” said Çetin Meriçli, CEO and cofounder of Locomation.

Constructing a New Dimension

While Locomation is deploying autonomous vehicles in the real world, Blackshark.ai is making it easier to create building and landscape assets used to enhance the virtual world on a global scale.

The startup has developed a digital twin platform that uses AI and cloud computing to automatically transform satellite data, aerial images or map and sensor data into building, landscape and infrastructure assets that contribute to a semantic photorealistic 3D environment.

During the opening GTC keynote, NVIDIA founder and CEO Jensen Huang showcased the technology on NVIDIA DRIVE Sim. DRIVE Sim uses high-fidelity simulation to create a safe, scalable and cost-effective way to bring self-driving vehicles to our roads.

It taps into the computing horsepower of NVIDIA RTX GPUs to deliver a powerful, scalable, cloud-based computing platform, one capable of generating billions of qualified miles for autonomous vehicle testing.

In the demo video, Blackshark’s AI automatically generated the trees and buildings used to reconstruct the city of San Jose in simulation for an immersive, authentic environment.

These latest announcements from Locomation and Blackshark.ai demonstrate the breadth of the DRIVE ecosystem, spanning the real and virtual worlds to push autonomous innovation further.

Watch NVIDIA CEO Jensen Huang recap all the news from GTC. It’s not too late to get access to hundreds of live and on-demand talks — register now through Oct. 9 using promo code CMB4KN to get 20 percent off.
