December 2020 – Page 9

Talk Stars: Israeli AI Startup Brings Fluency to Natural Language Understanding

Whether talking with banks, cell phone providers or insurance companies, customers often encounter AI-powered voice interfaces to direct their calls to the right department.

But these interfaces typically are limited to understanding certain keywords. Onvego, an Israel-based startup, is working to make these systems understand what you say, no matter how you say it.

Before starting Onvego, the company’s founders created a mobile speech apps platform to assist blind people. Now they’re creating pre-built AI models for such use cases as accepting or requesting payments, scheduling appointments or booking reservations.

Onvego is a member of NVIDIA Inception, a program that accelerates AI and data science startups with go-to-market support, expertise and technology.

The company’s AI enables enterprises to easily build their own conversational interfaces in 10 different languages, with more on the way. Its technology already powers Israel’s toll road payment systems, enabling drivers to pay with their voice.

“Say the customer said, ‘I want to pay my bill.’ The system has to understand what that means,” said Alon Buchnik, CTO and co-founder of Onvego. “Once it does, it sends that information back to the speech machine, where logic is applied.”

The system then walks the driver through the payment process. Onvego’s AI also powers two emergency road services providers in Israel, providing AI-powered answers to drivers in need.

“The speech machine understands exactly what the problem is,” said Buchnik. “It understands if it needs to send a tow truck or just a technician.”

In Search of Ubiquity

Road-related applications are just the tip of the iceberg for Onvego. The company envisions its technology being inside everything from coffee machines to elevators. Along those lines, it’s forged partnerships with GE, GM, Skoda, Amazon and numerous other companies.

For instance, Onvego’s AI is being incorporated into a line of elevators, enabling the manufacturer to provide a conversational voice interface for users.

With the COVID-19 pandemic raging around the globe, Buchnik believes the company’s no-touch technology, such as activating elevators by voice only, can deliver an added benefit by reducing transmission of the virus.

But Onvego’s most-ambitious undertaking may be its contact call center technology. The company has developed an application, powered by NVIDIA GPUs, that’s designed to do the work of an entire call center operation.

It runs as a cloud-based service as well as enterprise on-premises solution that would provide real-time natural language call center support for IoT devices at the network’s edge, even where there’s no internet connectivity.

GPUs at the Core

Buchnik said that while it would be possible for the Onvego call center application to answer 50 simultaneous calls without GPUs, “it would require a huge CPU” to do so. “For the GPU, it’s nothing,” he said.

Onvego also uses a CUDA decoder so developers can access decoding capabilities on the GPU.

Training of the company’s automatic speech recognition models occurs on NVIDIA GPU-powered instances from AWS or Azure, which Onvego acquired through NVIDIA Inception.

Aside from its efforts to expand the use of its technology, Onvego is focused on the standalone container for locations at the edge or completely independent from the network. They plan to run it on an NVIDIA Jetson Nano.

The idea of providing intelligent natural language interfaces to people wherever they’re needed is providing Buchnik and his team with all the motivation they need.

“This is our vision,” he said. “This is where we want to be.”

Buchnik credits the NVIDIA Inception program for providing the company access to the best AI experts, top technical resources and support, and access to a large marketing program with strong positioning in different market verticals.

By using the NVIDIA resources and platforms, Onvego is hoping to promote its intelligent voice solutions to markets and industries that it has not yet reached.

The post Talk Stars: Israeli AI Startup Brings Fluency to Natural Language Understanding appeared first on The Official NVIDIA Blog.

A Model-Based Approach Towards Identifying the Brain’s Learning Algorithms

Introduction

One of the tenets of modern neuroscience is that the brain modifies the
strengths of its synaptic connections (“weights”) during learning in
order to better adapt to its environment. However, the underlying
learning rules (“weight updates”) in the brain are currently unknown.
Many proposals have been suggested, ranging from Hebbian-style
mechanisms that seem biologically plausible but are not very effective
as learning algorithms in that they prescribe purely local changes to
the weights between two neurons that increase only if they activate
together — to backpropagation, which is effective from a learning
perspective by assigning credit to neurons along the entire downstream
path from outputs to inputs, but has numerous biologically implausible
elements.

A major long-term goal of computational neuroscience is to identify
which learning rules actually drive learning in the brain. A further
difficulty is that we do not even have strong ideas for what needs to be
measured in the brain to quantifiably assert that one learning rule is
more consistent with those measurements than another learning rule. So
how might we approach these issues? We take a simulation-based approach,
meaning that experiments are done on artificial neural networks rather
than real brains. We train over a thousand artificial neural networks
across a wide range of possible learning rule types (conceived of as
“optimizers”), system architectures, and tasks, where the ground truth
learning rule is known, and quantify the impact of these choices. Our
work suggests that recording activities from several hundred neurons,
measured semi-regularly during learning, may provide a good basis to
identify learning rules — a testable hypothesis within reach of
current neuroscience tools!

Background: A Plethora of Theories and a Paucity of Evidence

The brain modifies the connections between neurons during learning to
improve behavior; however, the underlying rules that govern these
modifications are unknown. The most famous proposed learning rule is
“Hebbian learning”, also known by the mantra: “neurons that fire
together; wire together”. In this proposal, a synaptic connection
strengthens if one neuron (“pre-synaptic”) consistently sends a signal
to another neuron (“post-synaptic”). The changes prescribed by Hebbian
learning are “local” in that they do not take into account a synapse’s
influence further downstream in the network. This locality makes
learning rather slow even in the cases where additional issues, such as
the weight changes becoming arbitrarily large, are mitigated. Though
there have been many suggested theoretical strategies to deal with this
problem, commonly involving simulations with artificial neural networks
(ANNs), these strategies appear difficult to scale up to solve
large-scale tasks such as ImageNet categorization
[1].

This property of local changes is in stark contrast to backpropagation,
the technique commonly used to optimize artificial neural networks. In
backpropagation, as the name might suggest, an error signal is
propagated backward along the entire downstream path from the outputs of
a model to the inputs of the model. This allows credit to be effectively
assigned to every neuron along the path.

Although backpropagation has long been a standard component of deep
learning, its plausibility as a biological learning rule (i.e. how the
brain modifies the strengths of its synaptic connections) is called into
question for several reasons. Chief among them is that backpropagation
requires perfect symmetry, whereby the backward error-propagating
weights are the transpose of the forward inference weights, for which
there is currently little biological support
[2,
3].

**Avoiding weight symmetry.** Backpropagation naturally couples the
forward and backward weights. This constraint can be relaxed by
uncoupling them, thereby generating a spectrum of learning rule
hypotheses about how the backward weights may be updated.
For more details, see our recent prior work.

Recent approaches, from us and others
[4,
5], introduce approximate
backpropagation strategies that do not require this symmetry, and can
still succeed at large-scale learning as backpropagation does. However,
given the number of proposals, a natural question to ask is how
realistic they are. At the moment, our hypotheses are governed by domain
knowledge that specifies what “can” and “cannot” be biologically
plausible (e.g. “exact weight symmetry is likely not possible” or
“separate forward and backward passes during learning seem
implausible”), as well as characterizations of ANN task performance
under a given learning rule (which is not always directly measurable
from animal behavior). In order to be able to successfully answer this
question, we need to be able to empirically refute hypotheses. In
other words, we would ideally want to know what biological data to
collect in order to claim that one hypothesis is more likely than
another.

More concretely, we can ask: what specific measurements from the brain,
in the form of individual activation patterns over time, synaptic
strengths, or paired-neuron input-output relations, would allow one to
draw quantitative comparisons of whether the observations are more
consistent with one or another specific learning rule? For example,
suppose we record neural responses (“activation patterns”) while an
animal is learning a task. Would these data be sufficient to enable us
to broadly differentiate between learning rule hypotheses, e.g. by
reliably indicating that one learning rule’s changes over time more
closely match the changes measured from real data than those prescribed
by another learning rule?

Some potential observables to measure on which to separate candidate
learning rule hypotheses. (Pyramidal neuron schematic adapted from Figure
4 of [6])

Answering this question turns out to be a substantial challenge, because
it is difficult on purely theoretical grounds to identify which patterns
of neural changes arise from given learning rules, without also knowing
the overall network connectivity and reward target (if any) of the
learning system.

But, there may be a silver lining. While ANNs consist of units that are
highly simplified with respect to biological neurons, recent progress
within the past few years has shown that the internal representations that
emerge in trained deep ANNs often overlap strongly with representations
in the brain, and are in fact quantifiably similar to many
neurophysiological and behavioral observations in animals
[7]. For
instance, task-optimized, deep convolutional neural networks (CNNs) have
emerged as quantitatively accurate models of encoding in primate visual
cortex [8,
9,
10]. This is
due to (1) their cortically-inspired architecture, a cascade of
spatially-tiled linear and nonlinear operations; and (2) their being
optimized to perform certain behaviors that animals must perform to
survive, such as object recognition
[11]. CNNs trained
to recognize objects on ImageNet predict neural responses of primate
visual cortical neurons better than any other model class. Thus, these
models are, at the moment, some of our current best algorithmic
“theories” of the brain — a system that was ultimately not designed by
us, but rather the product of millions of years of evolution. On the
other hand, ANNs are designed by us — so the ground truth learning
rule is known and every unit (artificial “neuron”) can be measured up to
machine precision.

Can we marry what we can measure in neuroscience with what we can
conclude from machine learning, in order to identify what experimentally
measurable observables may be most useful for inferring the underlying
learning rule? If we can’t do this in our models, then it seems very
unlikely to be able to do this in the real brain. But if we can do this
in principle, then we are in a position to generate predictions as to
what data to collect, and whether that is even within reach of current
experimental neuroscience tools.

Methods

We adopt a two-stage “virtual experimental” approach. In the first
stage, we train ANNs with different learning rules, across a variety of
architectures, tasks, and associated hyperparameters. These will serve
as our “model organisms” on which we will subsequently perform idealized
neuroscience measurements. In the second stage, we calculate aggregated
statistics (“measurements”) from each layer of the models as features
from which to train simple classifiers that classify the category that a
given learning rule belongs to (specified below). These classifiers
include the likes of a linear SVM, as well as simple non-linear ones
such as a Random Forest and a 1D convolutional two-layer perceptron.

**Overall approach.** Observable statistics are generated from each
neural network’s layer, through the model training process for each
learning rule. We take a quantitative approach whereby a classifier is
cross-validated and trained on a subset of these trajectories and
evaluated on the remaining data.

Generating a large-scale dataset is crucial to this endeavor, in order
to both emulate a variety of experimental neuroscience scenarios and be
able to derive robust conclusions from them. Thus, in the first stage,
we train ANNs on tasks and architectures that have been shown to explain
variance in neural responses from sensory (visual and auditory)
brain areas [8,
12].
These include supervised tasks across vision and audition, as well as
self-supervised ones. We consider both shallow and deep feedforward
architectures on these tasks, that are of depth comparable to what is
considered reasonable from the standpoint of shallower non-primate (e.g.
mouse
[13]) and
deeper primate sensory systems
[8,
14,
15].

The learning rules, tasks, architectures, and hyperparameters from which
we generate data, comprising over a thousand training experiments in total.

In the second stage, we train classifiers on the observable statistics from these ANNs to predict the learning rules (as specified in the table above) used to train them.
The four learning rules were chosen as they span the space of commonly
used variants of backpropagation (SGDM and Adam), as well as potentially
more biologically-plausible “local” learning rules (Feedback
Alignment (FA) and Information Alignment (IA)) that efficiently
train networks at scale to varying degrees of performance but avoid exact weight
symmetry.

Because the primary aim of this study is to determine the extent that
different learning rules led to different encodings within ANNs, we
begin by defining representative features that can be drawn from the
course of model training. For each layer in a model, we consider three
measurements: weights of the layer, activations from the layer, and
layer-wise activity change of a given layer’s outputs relative to its
inputs. We choose ANN weights to analogize to synaptic strengths in the
brain, activations to analogize to post-synaptic firing rates, and
layer-wise activity changes to analogize to paired measurements that
involve observing the change in post-synaptic activity with respect to
changes induced by pre-synaptic input.

For each measure, we consider three functions applied to it: “identity”,
“absolute value”, and “square”. Finally, for each function of the
weights and activations, we consider seven statistics, and for the
layer-wise activity change observable, we only use the mean statistic
due to computational restrictions. This results in a total of 45
continuous valued observable statistics for each layer, though 24
observable statistics are ultimately used for training the classifiers,
since we remove any statistic that has a divergent value during the
course of model training. We also use a ternary indicator of layer
position in the model hierarchy: “early”, “middle”, or “deep”
(represented as a one-hot categorical variable).

We Can Separate Learning Rules from Aggregate Statistics of the Weights, Activations, or Layer-wise Activity Changes

Across tasks, different learning rules give rise to perceptible
differences in observable statistics.

Already by eye, one can pick up distinctive differences across the
learning rules for each of the training trajectories of these metrics.
Of course, this is not systematic enough to clearly judge one set of
observables versus another, but provides some initial assurance that
these metrics seem to capture some inherent differences in learning
dynamics across rules.

So these initial observations seem promising, but we want to make this
approach more quantitative. Suppose for each layer we concatenate the
trajectories of each observable and the position in the model hierarchy
that this observable came from. Can we generalize well across held-out
examples?

It turns out that the answer is in fact, yes. Across all classes of
observables, the Random Forest attains the highest test accuracy, and
all observable measures perform similarly under this classifier.

**Test set confusion matrices.** Random Forest performs the best and differences in learning rate policy
(Adam vs. SGDM) are more difficult to distinguish.

Looking at confusion matrices on the test set, we see that the Random
Forest hardly mistakes one learning rule from any of the others. And
when the classifiers do make mistakes, they generally tend to confuse
Adam vs. SGDM more so than IA vs. FA, suggesting that they are able to
pick up more on differences (reflected in the observable statistics) due
to high-dimensional direction of the gradient tensor than the magnitude
of the gradient tensor (the latter being directly tied to learning rate
policy).

Adding Back Some Experimental Neuroscience Realism

Up until this point, we have had access to all input types, the full learning trajectory, and noiseless access to all units when making our virtual measurements of ANN observable statistics.
But in a real experiment where someone were to
collect such data from a neural circuit, the situation would be far from
this ideal scenario. We therefore explore experimental realism in
several ways, in order to identify which observable measures are robust
across these scenarios.

Access to only portions of the learning trajectory: subsampling observable trajectories

The results presented thus far were obtained with access to the entire
learning trajectory of each model. Often however, an experimentalist
collects data throughout learning at regularly spaced intervals. We
capture this variability by randomly sampling a fixed number of points
at a fixed temporal spacing for each trajectory, which we refer to as a
“subsample period”.

Sparse subsampling across learning trajectory is most robust to
trajectory undersampling.

We find across observable measures that robustness to undersampling of
the trajectory is largely dependent on the subsample period length. As
the subsample period length increases (in the middle and right-most
columns), the Random Forest classification performance increases
compared to the same number of sampled points for a smaller period
(depicted in the left-most column).

Taken together, these results suggest that data consisting of
measurements collected temporally further apart across the learning
trajectory is more robust to undersampling than data collected closer
together in training time. Furthermore, across individual observable
measures, the weights are overall the most robust to undersampling of
the trajectory, but with enough frequency of samples we can achieve
comparable performance with the activations.

Incomplete and noisy measurements: subsampling units and Gaussian noise before collecting observables

The aggregate statistics computed from the observable measures thus far
have operated under the idealistic assumption of noiseless access to
every unit in the model. However, in most datasets, there is a
significant amount of unit undersampling as well as non-zero measurement
noise. How do these two factors affect learning rule identification, and
in particular, how noise and subsample-robust are particular observable
measures?

Addressing this question would provide insight into the types of
experimental neuroscience paradigms that may be most useful for
identifying learning rules, and predict how certain experimental tools
may fall short for given observables. For instance, optical imaging
techniques can use fluorescent indicators of electrical activities of
neurons to give us simultaneous access to thousands of neurons.
But these techniques can have lower temporal resolution and signal-to-noise than
electrophysiological recordings that more directly measure the
electrical activities of neurons, which in turn may lack the same
coverage.

Activations are the most robust to measurement noise and unit
undersampling. Reported here is Random Forest test set accuracy in
separating IA vs. FA, averaged over 10 train/test splits per random
sampling and simulated measurement noise seed.

To account for these tradeoffs, we model measurement noise as an
additive white Gaussian noise process added to units of ResNet-18
trained on the ImageNet and self-supervised SimCLR tasks. We choose IA
vs. FA since the differences between them are conceptually stark: IA
imposes dynamics on the feedback error weights during learning, whereas
FA keeps them fixed. If there are scenarios of measurement noise and
unit subsampling where we are at chance accuracy for this problem (50%),
then it may establish a strong constraint on learning rule
separability more generally.

Our results suggest that if one makes experimental measurements by
imaging synaptic strengths, it is still crucial that the optical imaging
readout not be very noisy, since even with the amount of units typically
recorded currently (on the order of several hundred to several thousand
synapses), a noisy imaging strategy of synaptic strengths may be
rendered ineffective.

Instead, current electrophysiological techniques that measure the
activities from hundreds of units could form a good set of neural data
to separate learning rules. Recording more units with these techniques
can improve learning rule separability from the activities, but it does
not seem necessary, at least in this setting, to record a majority of
units to perform this separation effectively.

Conclusions

As experimental techniques in neuroscience continue to advance, we will
be able to record data from more neurons with higher temporal
resolution. But even if we had the perfect measurement tools, it is not
clear ahead of time what should be measured in order to identify the
learning rule(s) operative within a given neural circuit, or whether
this is even possible in principle. Our model-based approach
demonstrates that we can identify learning rules solely on the basis of
standard types of experimental neuroscience measurements from the
weights, activations, or layer-wise activity changes, without knowledge
of the architecture or loss target of the learning system.

Additionally, our results suggest the following prescription for the type of
experimental neuroscience data to be collected towards this goal:

Electrophysiological recordings of post-synaptic activities
from a neural circuit on the order of several hundred units, frequently
measured at wider intervals during the course of learning, may provide a
good basis on which to identify learning rules.

We have made our dataset, code, and interactive
tutorial publicly
available so that others can analyze these properties without needing to
train neural networks themselves. Our dataset may also be of interest to
researchers theoretically or empirically investigating learning in deep
neural networks. For further details, check out our NeurIPS 2020
paper.

Acknowledgements

I would like to thank my collaborator Sanjana Srivastava
and advisors Surya Ganguli and Daniel Yamins. I would also like to
thank Jacob Schreiber, Sidd Karamcheti, and Andrey Kurenkov for their
editorial suggestions on this post.

New Amazon SageMaker Neo features to run more models faster and more efficiently on more hardware platforms

Amazon SageMaker Neo enables developers to train machine learning (ML) models once and optimize them to run on any Amazon SageMaker endpoints in the cloud and supported devices at the edge. Since Neo was first announced at re:Invent 2018, we have been continuously working with the Neo-AI open-source communities and several hardware partners to increase the types of ML models Neo can compile, the types of target hardware Neo can compile for, and to add new inference performance optimization techniques.

As of this writing, Neo optimizes models trained in DarkNet, Gluon, Keras, MXNet, PyTorch, TensorFlow, TensorFlow-Lite, ONNX, and XGBoost for inference on Android, iOS, Linux, and Windows machines based on processors from Ambarella, Apple, ARM, Intel, NVIDIA, NXP, Qualcomm, Texas Instruments, and Xilinx. Models optimized by Neo can perform up to 25 times faster with no loss in accuracy.

Over the past few months, Neo has added a number of key new features:

Expanded support for PC and mobile devices
Heterogeneous execution with NVIDIA TensorRT
Bring Your Own Codegen (BYOC) framework
Inference optimized containers
Compilation for dynamic models

In this post, we summarize how these new features allow you to run more models on more hardware platforms both faster and more efficiently.

Expanded support for PC and mobile devices

Earlier in 2020, Neo launched support for Windows on x86 processor-based devices, allowing you to run your models faster and more efficiently on personal computers and other Windows devices. In addition, Neo launched support for Android on ARM-based processors and Qualcomm processors with Hexagon DSP.

Most recently, Apple and AWS partnered to automate model conversion to Core ML format using Neo. As a result, ML app developers can now train models in SageMaker and convert them to Core ML format with a click of button, and deploy the models on iOS and MacOS devices.

Heterogeneous execution with NVIDIA TensorRT

Neo uses the NVIDIA TensorRT acceleration library to increase the speedup of ML models on NVIDIA Jetson devices at the edge and AWS g4dn and p3 instances in the AWS Cloud. The TensorRT library supports a subset of operators commonly used in deep learning models.

Previously, Neo used TensorRT only when the entire computational graph of the model and all its operators could be accelerated by the library. As a result, few models could take advantage of TensorRT acceleration.

Recently, Neo added the capability to partition a model into sub-graphs, in which case a part of the model can be handled by TRT while the other part can be compiled by Apache TVM. To execute the compiled model, Neo runtime uses the heterogeneous execution mechanism to run both parts on the hardware. With this approach, Neo can provide the best available performance for a broader range of frameworks and models.

Bring your own codegen

We also expanded the heterogeneous execution approach to other hardware targets. Neo partnered with chip vendors to use the Bring Your Own Codegen (BYOC) mechanism in TVM to plug in partners’ proprietary toolchains for their ML accelerators, such as Ambarella’s CV Tools and Texas Instruments’ TIDL, with the Neo compilation API.

When you compile, Neo partitions a model so you can run the supported portion supported in the ML accelerators and the rest on the host CPU. With this approach, Neo maximizes the utilization of the ML accelerator on the chip, increases the types of models that you can compile for the chip, and makes it easier for you to take advantage of new ML accelerators from chip vendors.

Inference optimized containers

Like all deep learning compilers, Neo supports a subset of operators and models in a given framework. Before adding this feature, Neo could only compile a model if all the operators from the model were supported by Neo. Now, when you use Neo to compile a MXNet, PyTorch, or TensorFlow model for CPU or GPU inferences in SageMaker hosted endpoints on AWS, Neo partitions the models, compiles a portion to accelerate performance, and leaves the un-compiled part of the model to continue running natively in the framework. You can use Neo’s inference optimized containers to deploy on SageMaker hosted endpoints. As a result, you can optimize any MXNet, PyTorch, and TensorFlow model with Neo for any SageMaker hosted endpoint.

Compilation for dynamic models

Deep learning models contain dynamic features, such as control flow, dynamic operations, dynamic data structures, and dynamic input and output shapes that pose significant challenges to existing deep learning compilers. These models, including some object detection and semantic segmentation models, are becoming increasingly popular. Recently, we added the ability in Neo to compile these dynamic models. You can now use Neo to optimize models with dynamic features, and get up to two times the performance speedup.

Summary

We continually make improvements and add supported hardware endpoints, models, and frameworks to Neo based on your feedback. We encourage you to sign in to the SageMaker console or use the Neo compilation API to compile your trained models for the target hardware of your interest. For more information about Neo, see the following:

About the Authors

Tingwei Huang is a product management leader at AWS AI Service.

Vin Sharma is a Engineering Leader for AWS Deep Learning. He leads the team building Neo, which helps ML models train once and run anywhere in the cloud and at the edge.

Model dynamism Support in Amazon SageMaker Neo

Amazon SageMaker Neo was launched at AWS re:Invent 2018. It made notable performance improvement on models with statically known input and output data shapes, typically image classification models. These models are usually composed of a stack of blocks that contain compute-intensive operators, such as convolution and matrix multiplication. Neo applies a series of optimizations to boost the model’s performance and reduce memory usage. The static feature significantly simplifies the compilation, and you can decide on runtime inference tasks such as memory sizes ahead of time using a dedicated analysis pass. Runtime is just acted as a topological graph walker that invokes each operator sequentially.

However, we have been seeing a growing number of customers requiring more advanced models to fulfill tasks like object detection. These models contain dynamic features, such as control flow, dynamic operations, dynamic data structures, and dynamic input and output shapes. This posts significant challenges to the existing deep learning compiler because they have been mainly confined to static models. To address this problem, existing solutions either use just-in-time compilation to compile and run the dynamic portion (XLA), which causes extra compilation overhead, or convert the dynamic model into a static representation first (TFLite). To meet your requirements, we designed and implemented a suite of techniques ranging from the front-end parser to the backend runtime to handle object detection and segmentation models trained by TensorFlow, PyTorch, and MXNet. In this post, we’ll walk you through how Neo supports object detection and semantic segmentation models. We also compare inference performance improvements for both instance and edge type devices for Neo object detection and segmentation models.

Methodology

This section describes how object detection and semantic segmentation models are supported in Neo. We discuss the following:

How the front end handles popular frameworks differently
How the backend is designed to support dynamism
An example using the AWS Command Line Interface (AWS CLI) to demonstrate how to easy it is to perform inference for an object detection model in Neo

Frontend

The approaches vary for each framework because they handle dynamism, particularly control flow, differently. For example, MXNet doesn’t use any control flow to implement the object detection and segmentation models, which allows us to have a quick one-to-one operator mapping from MXNet to Relay operators. PyTorch has control flow primitives, such as If and Loop, which largely simplifies the conversion because we can create Relay If statements and recursion functions correspondingly.

Among the most popular frameworks, TensorFlow is the most difficult to support because it doesn’t directly employ conditional and looping operators to implement control flow. Instead, low-level data flow primitives, such as Merge, Exit, Switch, NextIteration, and Enter, are used to express complex control flow logic for the better support of parallel and distributed execution. For more information, see Implementation of Control Flow in TensorFlow.

To decompile these primitives into the original control flow operators, we proposed dedicated analysis and pattern matching techniques that have been contributed back to the Apache TVM. For more information, see the RFC Decompile TensorFlow Control Flow Primitives to Relay and Enhance TensorFlow Frontend Control Flow Support.

Backend

The backend compiler has worked well in supporting static models where the input data type and shape for each tensor is known at the compile-time. However, this assumption doesn’t hold for dynamic models, such as TensorFlow SSD, because the data shapes can only be determined at runtime.

To support dynamic data shapes, we introduced a special dimension called Any to represent statically unknown dimensions. For instance, a tensor type could be represented as Tensor[(5, Any), float32], where the second dimension was unknown. Accordingly, we defined some type inference rules to infer the type of the tensor when Any shape is involved.

To get the data shape of a tensor at runtime, we defined shape functions to compute the output shape of the tensor to determine the size of required memory. Based on the categories of the operators, shape functions were classified into three patterns:

Data-independent shapes – Are used for operators whose output shape is only determined by the shapes of the inputs, such as 2D convolution.
Data-dependent shapes – Require the real input value instead of the shape to compute the output shapes. For example, arange needs the value of start, stop, and step to compute the output shape.
Upper bound shapes – Are used to quickly estimate an upper bound shape for the output in order to avoid redundant computation. This is useful because operators, such as Non Maximum Suppression (NMS), involve non-trivial computation to infer the output shape at runtime, and the amount of computation for the shape function could be on par with that of running the operator.

To effectively run the dynamic models, we designed a virtual machine as an execution engine to invoke runtime type inference, handle control flow, and dispatch operator kernels. We compiled the model into a machine-dependent kernel code and machine-independent bytecode. They were then loaded and run by the virtual machine.

Because each instruction works on coarse-grained data, such as tensor, the instructions are compactly organized, meaning the dispatching overhead isn’t a concern. We designed the virtual machine in a register-based manner to simplify the design and allow users to read and modify the code easily. We designed a set of instructions to control running each type of bytecode, such as storage allocation, tensor memory allocation on the storage, control flow, and kernel invocation.

After the virtual machine loads the compiled bytecode and kernels, it interprets the bytecode in a dispatching loop by checking it op-code and invoking appropriate logic. For more information, see Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference.

Performing inference and object detection in Neo

This section provides an example to illustrate how you can compile a Faster R-CNN model from TensorFlow 1.15 and deploy it on an AWS C5 instance using Neo.

Prepare the pre-trained model by downloading it from the TensorFlow Detection Model Zoo and extracting it:

$ wget http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet50_coco_2018_01_28.tar.gz
$ tar -xzf faster_rcnn_resnet50_coco_2018_01_28.tar.gz

Get the frozen protobuf file and upload it to Amazon Simple Storage Service (Amazon S3):

$ tar -czf tf_frcnn.tar.gz faster_rcnn_resnet50_coco_2018_01_28.tar.gz
$ aws s3 cp tf_frcnn.tar.gz s3://<your-bucket>/<your-input-folder>

We can now compile the model using Neo. For this post, we use the AWS CLI. We first create a configuration JSON file where the required information is fed (such as the input size, framework, location of the output artifacts, and target platform that we compile the model for):

Create the configuration file with the following code:

{
    "CompilationJobName": "compile-tf-ssd",
    "RoleArn": "arn:aws:iam::<your-account>:role/service-role/AmazonSageMaker-ExecutionRole-yyyymmddThhmmss",
    "InputConfig": {
        "S3Uri": "s3://<your-bucket>/<your-input-folder>/tf_frcnn.tar.gz",
        "DataInputConfig":  "{'image_tensor': [1,512,512,3]}",
        "Framework": "TENSORFLOW"
    },
    "OutputConfig": {
        "S3OutputLocation": "s3://<your-bucket>/<your-output-folder>",
        "TargetPlatform": {
            "Os": "LINUX",    
            "Arch": "X86_64"
        },
        "CompilerOptions": "{'mcpu': 'skylake-avx512'}"
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 1800}
}

Compile it with SageMaker CLI:

$ aws sagemaker create-compilation-job --cli-input-json file://config.json --region us-west-2

Finally, we’re ready to deploy the compiled model with DLR.

Before the deployment, download the compiled artifacts from the S3 bucket where it was saved to:

$ aws s3 cp s3://<your-bucket>/<output-folder>/output_artifacts.tar.gz tf_frcnn_compiled.tar.gz
$ mkdir compiled_model
$ tar -xzf tf_frcnn_compiled.tar.gz -C compiled_model

Install DLR for inference:
```
$ pip install dlr
```

Perform inference as the following:

if __name__ == "__main__":
    data = cv2.imread("input_image.jpg")
    data = cv2.resize(data, (512, 512), interpolation=cv2.INTER_AREA)
    data = np.expand_dims(data, 0)
    model = dlr.DLRModel('compiled_model', 'cpu', 0)
    result = model.run(data)

Performance comparison

In this section, we compare the performance of the most widely used TF object detection and segmentation models on a variety of EC2 server platforms and NVIDIA Jetson based edge devices. We use the models from the TensorFlow Detection Model Zoo. As discussed earlier, these models show dynamism and are significantly more complex than the static models like ResNet50. We use Neo to compile these models and generate high-performance machine code for a variety of target platforms. Here, we show the performance comparison for these models across many hardware devices against the best baseline available for the hardware platforms.

EC2 C5.9x large server instance

C5 instances are Intel Xeon server instances suitable for compute-intensive deep learning applications. For this comparison, we report the average latency for the TensorFlow baseline and Neo-compiled model. All the reported latency numbers are in milliseconds. We observe that Neo outperforms TensorFlow for all the three models, and by up to 20% for the Mask R-CNN ResNet-50 model.

Model name	TF 1.15.0	Neo	Speedup
ssd_mobilenet_v1_coco	17.96	16.39	1.09579
faster_rcnn_resnet50_coco	152.62	142.3	1.07252
mask_rcnn_resnet50_atrous_coco	391.91	326.44	1.20056

EC2 m6g.8x large server instance

M6 instances are the ARM Graviton server instances suitable for compute-intensive deep learning applications. To get a baseline, we use the TensorFlow packages provided from ARM Tool-Solutions. Our observations are similar to C5 instances. Neo outperforms TensorFlow, and we observe significant speedup for large models like Faster RCNN and MaskRCNN.

Model name	TF 1.15.0	Neo	Speedup
ssd_mobilenet_v1_coco	29.04	28.75	1.01009
faster_rcnn_resnet50_coco	290.64	202.71	1.43377
mask_rcnn_resnet50_atrous_coco	623.98	368.81	1.69187

NVIDIA server instance and edge devices

Finally, we compare the performance of the MobileNet SSD model on NVIDIA Jetson based edge devices—Jetson Xavier and Jetson Nano. MobileNet SSD is a popular object detection model for edge devices. This is because it has low compute and memory requirements, and is suitable for already resource-constrained edge devices. To have a performance baseline, we use the TF-TRT package, where TensorFlow is integrated with NVIDIA TensorRT as the backend. We present the comparison in the following table. We observe that Neo achieves significant speedup for both Xavier and Nano edge devices.

Performance comparison for ssd_mobilenet_v1_coco
Hardware device	TF 1.15	Neo	Speedpup
NVIDIA Jetson Nano	163	140	1.16429
Jetson Xavier	109	56	1.94643

Summary

This post described how Neo supports model dynamism. Multiple techniques were proposed from the front-end parser to backend runtime to enable the model support. We compared the inference performance of Neo object detection and segmentation models against those required by the TensorFlow framework or TensorFlow backed with TensorRT. We observed that Neo obtained speedups for these models on both instances and edge devices.r

This solution doesn’t have any the service API changes, so you can still use the original API to compile new models. All code has been contributed back to the Apache TVM. For more information about compiling a model using Apache TVM, see Compile PyTorch Object Detection Models.

Acknowledgements: We sincerely thank the following engineers and applied scientists who have contributed to the support of dynamic models: Haichen Shen, Wei Chen, Yong Wu, Yao Wang, Animesh Jain, Trevor Morris, Rohan Mukherjee, Ricky Das

About the Author

Zhi Chen is a Senior Software Engineer at AWS AI who leads the deep learning compiler development in Amazon SageMaker Neo. He helps customers deploy the pre-trained deep learning models from different frameworks on various platforms. Zhi obtained his PhD from University of California, Irvine in Computer Science, where he focused on compilers and performance optimization.

Amazon SageMaker Neo makes it easier to get faster inference for more ML models with NVIDIA TensorRT

Amazon SageMaker Neo now uses the NVIDIA TensorRT acceleration library to increase the speedup of machine learning (ML) models on NVIDIA Jetson devices at the edge and AWS g4dn and p3 instances in the AWS Cloud. Neo compiles models from TensorFlow, TFLite, MXNet, PyTorch, ONNX, and DarkNet to make optimal use of NVIDIA GPUs, providing you with the best available performance from the hardware for a broader range of frameworks and models, without the need to learn the nitty-gritty details of deep learning frameworks and acceleration libraries.

The NVIDIA TensorRT library supports a subset of operators commonly used in deep learning models. Previously, Neo used TensorRT only when the entire computational graph of the model and all its operators could be accelerated by the library. As a result, the use of the library was limited mostly to image classification models.

Now, Neo takes advantage of TensorRT for all models, even when a model contains operators that the library doesn’t support. Neo does this by partitioning the model into sub-graphs, in which TRT handles one type of sub-graph and Apache TVM handles the other. Then, at runtime, Neo uses a new heterogeneous execution mechanism to run both types of sub-graphs with the same runtime.

With this approach, Neo automatically takes advantage of TensorRT to accelerate computation-heavy operations, such as convolutions supported by the accelerator library, while generating highly performant CUDA code for all other operations using Apache TVM. As a result, Neo delivers better performance for more models than either NVIDIA TRT or Apache TVM alone.

The Neo team generalized this approach into a mechanism we call Bring Your Own Codegen. It allows us to easily extend this work to new hardware partners, who can bring their own accelerator libraries to take advantage of the wide range of frameworks and models covered by Neo while improving performance to the full extent possible on their hardware.

Performance highlights

The following table summarizes the platform, corresponding framework, model, and latency performance.

Platform	Framework	Model	Latency
Jetson Xavier	TensorFlow	SSD MobilenetV2 COCO	49.71ms
	MXNet	SSD MobileNet 1.0 VOC	19.7ms
	Pytorch	YoloV4	68.76ms
	DarkNet	YoloV3 Tiny	18.98ms
Jetson Nano	TensorFlow	SSD MobilenetV2 COCO	223.69ms
	MXNet	SSD MobileNet 1.0 VOC	131.58ms
	DarkNet	YoloV3 Tiny	41.73ms

Conclusion

We’re very excited to offer this new integration with TensorRT, which allows you to speed up inference for your ML models. To get started with Amazon SageMaker Neo for NVIDIA Jetson or AWS g4dn, p3, and p2 instances, see Amazon SageMaker Neo.

About the Author

Trevor Morris is a Software Engineer at AWS AI working on compiler technology and optimization for machine learning inference. He focuses on improving performance for GPUs, with previous experience at NVIDIA.

Optimizing ML models for iOS and MacOS devices with Amazon SageMaker Neo and Core ML

Core ML is a machine learning (ML) model format created and supported by Apple that compiles, deploys, and runs on Apple devices. Developers who train their models in popular frameworks such as TensorFlow and PyTorch convert models to Core ML format to deploy them on Apple devices.

Recently, Apple and AWS partnered to automate model conversion to Core ML in the cloud using Amazon SageMaker Neo. Neo is an ML model compilation service on AWS that enables you to automatically convert models trained in TensorFlow, PyTorch, MXNet, and other popular frameworks, and optimize them for the target of your choice. With the new automated model conversion to Core ML, Neo now makes it easier to build apps on Apple’s platform to convert models from popular libraries like TensorFlow and PyTorch to Core ML format.

In this post, we show how to set up automatic model conversion, add a model to your app, and deploy and test your new model.

Prerequisites

To get started, you first need to create an AWS account and create an AWS Identity and Access Management (IAM) administrator user and group. For instructions, see Set Up Amazon SageMaker. You will also need Xcode 12 installed on your machine.

Converting models automatically

One of the biggest benefits of using Neo is automating model conversion from a framework format such as TensorFlow or PyTorch to Core ML format by hosting the coremltools library in the cloud. You can do this via the AWS Command Line Interface (AWS CLI), Amazon SageMaker console, or SDK. For more information, see Use Neo to Compile a Model.

You can train your models in SageMaker and convert them to Core ML format with the click of a button. To set up your notebook instance, generate a Core ML model, and create your compilation job on the SageMaker console, complete the following steps:

On the SageMaker console, under Notebook, choose Notebook instances.

Choose Create notebook instance.
For Notebook instance name, enter a name for your notebook.
For Notebook instance type¸ choose your instance (for this post, the default ml.t2.medium should be enough.
For IAM role, choose your role or let AWS create a role for you.

After the notebook instance is created, the status changes to InService.

Open the instance or JupyterLab.

You’re ready to start with your first Core ML model.

Begin your notebook by importing some libraries:

import warnings
 warnings.simplefilter(action='ignore', category=FutureWarning)

 import sagemaker
 from PIL import Image
 import torch
 import torch.nn as nn
 import torchvision
 from torchvision import transforms

Choose the following image (right-click) and save it.

You use this image for testing the model later, and it helps make the segmentation model.

Upload this image in your local directory. If you’re using a SageMaker notebook instance, you can upload the image by choosing Upload.

To use this image, you need to format it so that it works with the segmentation model when testing the model’s output. See the following code:

# Download the sample image. For this I right-clicked and copied and pasted what 
 # was on the website and used it locally.
 input_image = Image.open("dog_and_cat.jpg")

 preprocess = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize(
         mean=[0.485, 0.456, 0.406],
         std=[0.229, 0.224, 0.225],
     ),
 ])
 input_tensor = preprocess(input_image)
 input_batch = input_tensor.unsqueeze(0)

You now need a model to work with. For this post, we use the TorchVision deeplabv3 segmentation model, which is publicly available.

The deeplabv3 model returns a dictionary, but we want only a specific tensor for our output.

To remedy this, wrap the model in a module that extracts the output we want:

class WrappedDeeplabv3Resnet101(nn.Module):

    def __init__(self):
        super(WrappedDeeplabv3Resnet101, self).__init__()
        self.model = torch.hub.load(
            'pytorch/vision:v0.6.0',
            'deeplabv3_resnet101',
            pretrained=True
        ).eval()

    def forward(self, x):
        res = self.model(x)
        # Extract the tensor we want from the output dictionary
        x = res["out"]
        return x

Trace the PyTorch model using a sample input:

traceable_model = WrappedDeeplabv3Resnet101().eval()
 trace = torch.jit.trace(traceable_model, input_batch)

Your model is now ready.

Save your model. The following code saves it with the .pth file extension:

trace.save('DeepLabV3.pth')

Your model artifacts must be in a compressed tarball file format (.tar.gz) for Neo.

Convert your model with the following code:

import tarfile
 with tarfile.open('DeepLabV3.tar.gz', 'w:gz') as f:
     f.add('DeepLabV3.pth')

Upload the model to your Amazon Simple Storage Service (Amazon S3) bucket.

If you don’t have an S3 bucket, you can let SageMaker make one for you by creating a sagemaker.Session(). The bucket has the following format: sagemaker-{region}-{aws-account-id}. Your model must be saved in an S3 bucket in order for Neo to access it. See the following code:

import sagemaker
 sess = sagemaker.Session()
 bucket = sess.default_bucket()

Specify the directory where you want to store the model:

prefix = 'model' # Output directory in your S3 bucket

Upload the model to Amazon S3 and print out the S3 bucket path URI for future reference:

model_path = sess.upload_data(path='DeepLabV3.tar.gz', key_prefix=prefix)
 print(model_path)

On the SageMaker console, under Inference, choose Compilation jobs.

Choose Create compilation job.
In the Job settings section, for Job name, enter a name for your job.
For IAM role, choose a role.

In the Input configuration section, for Location of model artifacts, enter the location of your model. Use the path from the print statement earlier (print(model_path)).
For Data input configuration, enter the shape of the model tensor. For this post, the TorchVision deeplabv3 segmentation model has the shape [1,3,448,448].
For Machine learning framework, choose PyTorch.

In the Output configuration section, select Target device.
For Target device, choose coreml.
For S3 Output location, enter the output location of the compilation job (for this post, /output).
Choose Submit.

You’re redirected to the Compilation jobs page on the SageMaker console. When the compilation job is complete, you see the status COMPLETED.

If you go to your S3 bucket, you can see the output of the compilation job. The output has an .mlmodel file extension.

The output of the Neo service CreateCompilationJob is a model in Core ML format, which you can download from the S3 bucket location to your Mac. You can use this conversion process with any type of model that coremltools supports—from image classification or segmentation, to object detection or question answering text recognition.

Adding the model to your app

Make sure that you have installed Xcode version 12.0 or later. For more information about using Xcode, see the Xcode documentation. To add the converted Core ML model to your app in Xcode, complete the following steps:

Download the model from the S3 bucket location.
Drag the model to the Project Navigator in your Xcode app.

Select any preferred options.
Choose Finish.

Choose the model in the Xcode Project Navigator to see its details, including metadata, predictions, and utilities.

Choose the Predictions tab to see the model’s input and output.

Deploying and testing the model

Xcode automatically generates a Swift model class for the Core ML model, which you can use in your code to pass inputs.

For example, to load the model in your code, use the following model class:

let model = DeepLabV3()

You can now pass through input values using the model class. To test that your app is performing as expected with the model, launch the app in the Xcode simulator and target a specific device. If it works in the Xcode simulator, it works on the device!

Conclusion

Neo has created an easy way to generate Core ML format models from TensorFlow and PyTorch. You’re not required to learn about the coremltools library or how to convert their models in TensorFlow or PyTorch to Core ML format. After you convert your model to Core ML format, you can follow a well-defined path to compile and deploy your model to an iOS device or Mac computer. If you’re already a SageMaker customer, you can train your model in SageMaker and convert it to Core ML format using Neo with the click of a button.

For more information, see the following resources:

About the Author

Lokesh Gupta is a Software Development Manager at AWS AI service.

Speeding up TensorFlow, MXNet, and PyTorch inference with Amazon SageMaker Neo

Various machine learning (ML) optimizations are possible at every stage of the flow during or after training. Model compiling is one optimization that creates a more efficient implementation of a trained model. In 2018, we launched Amazon SageMaker Neo to compile machine learning models for many frameworks and many platforms. We created the ML compiler service so that you don’t need to set up compiler software, such as TVM, XLA, Glow, TensorRT, or OpenVINO, or be concerned with tuning the compiler for best model performance.

Since then, we have updated Neo to support more operators and expand model coverage for TensorFlow, PyTorch, and Apache MXNet (incubating). In October 2020, we made an internal change to allow a model to be partially compiled for CPU and GPU targets. Prior to this change, Neo could only compile a model if all the operators from the model could be compiled. With this change, Neo can figure out which part of the model can be compiled, and generates a model artifact combining the compiled and non-compiled parts. The combined model artifact can be used by SageMaker managed inference endpoints. The non-compiled parts of the model continue running in the framework, while the compiled parts run natively on CPU or GPU. As a result, many more models can see increased inference speeds in SageMaker when they are compiled with Neo.

The interface to model compiling has remained unchanged. This post shows the resulting model performance improvements and the mechanics behind how they work. For a step-by-step tutorial on using Neo to compile a model and deploy in SageMaker managed endpoints, see these notebook examples: Tensorflow mnist, PyTorch VGG19, and MxNet SSD Mobilenet.

Partially compiling a model

In the following example, I took a pre-trained alpha pose model alpha_pose_resnet101_v1b_coco from the GluonCV model zoo and compiled it with Neo. I saved the model from the model zoo into the following two files:

alpha_pose_resnet101_v1b_coco-0000.params
alpha_pose_resnet101_v1b_coco-symbol.json

Then I packed these files into a tar.gz file in an Amazon Simple Storage Service (Amazon S3) bucket and used Neo to compile the model.

Neo compiled the model and created a tar.gz file in an S3 bucket. After downloading and unpacking, I have two files that represent the compiled model (in addition to some other files, which I don’t discuss in detail):

compiled-0000.params
compiled-symbol.json

The compiled-symbol.json file contains all nodes of the model graph and edges between nodes. In this case, the Neo compiler service created five optimized subgraphs in the alpha pose model. Each subgraph is represented by the _tvm_subgraph_op node in the model graph. I can use a simple grep command to discover number of subgraphs:

$ cat compiled-symbol.json |grep _tvm_subgraph_op
      "op": "_tvm_subgraph_op",
      "op": "_tvm_subgraph_op",
      "op": "_tvm_subgraph_op",
      "op": "_tvm_subgraph_op",
      "op": "_tvm_subgraph_op",

Next I use a slightly more complex grep command to show you how many ops of each kind are in this model, which ops are in the subgraphs, and which ops are not. The following code block contains 11 instances of the Activation op in all the subgraphs (line is indented), four instances of the broadcast_like op not in any subgraph (line is not indented), and five instances of the subgraph_op:

$ cat compiled-symbol.json | grep ""op"" | grep -v null | sort | uniq -c
 
     11               "op": "Activation", 
    106               "op": "BatchNorm", 
    107               "op": "Convolution", 
      8               "op": "FullyConnected", 
      1               "op": "Pooling", 
      9               "op": "Reshape", 
      4               "op": "_contrib_AdaptiveAvgPooling2D", 
     33               "op": "elemwise_add", 
      4               "op": "elemwise_mul", 
      8               "op": "expand_dims", 
     99               "op": "relu", 
      3               "op": "transpose", 
      5       "op": "_tvm_subgraph_op", 
      4       "op": "broadcast_like",

In a future update of Neo, we may add support to compile the broadcast_like op, in which case the model is entirely compiled.

You can visualize the compiled model with the graph visualization tool. The following visualization depicts the partially compiled alpha pose model. This shows you the data flow between the subgraphs and ops not compiled (broadcast_like).

Even though I’m showing you an example of the Neo compiled artifacts from the GluonCV model zoo, the same subgraph concept applies to the TensorFlow and PyTorch compiled artifacts as well. The format of compiled artifacts is different in these other frameworks.

The following table shows the measured latency speedup of this partially compiled model compared with a non-compiled model on one CPU and one GPU Amazon Elastic Compute Cloud (Amazon EC2) instance. The speedup is specific to the model and instance type because the performance gain achieved varies with model architecture and platform.

Instance	Speedup
c5.9xl	1.28
g4dn.xl	1.23

Next I deployed the compiled model to SageMaker endpoints using the SageMaker inference container, which is integrated with TVM runtime.

Speedup numbers across common models and frameworks

The following table lists latency speedup that you might see from a few common models in all three frameworks in CPU and GPU instances.

Framework	Model	Instance	Speedup
TensorFlow	resnet50	c5.9xl	2.86
TensorFlow	resnet50	g4dn.xl	1.86
PyTorch	inception v3	c5.4xl	3.03
PyTorch	inception v3	p3.2xl	3.53
MXNet	yolo3	m5.12xl	1.26
MXNet	yolo3	g4dn.xl	1.11

These numbers are only general guidelines, as opposed to performance expectations for your specific model and instance choice. The numbers in the table are measured at the instance level and don’t include time spent on preprocessing and postprocessing. In SageMaker hosting, preprocessing and postprocessing can also take time, and is worth looking into in your overall optimization strategy.

How compiling works

In all frameworks (PyTorch, TensorFlow, and MXNet), we start by analyzing the model. We look at clusters of operators that are compilable, and fuse these into subgraphs. We avoid creating too many subgraphs using heuristics. Running subgraphs has an extra cost of data copy and launch overhead. If all operators are compilable in a model, the entire model is a single subgraph with all the operators.

In all frameworks, we use TVM to compile the subgraphs. On Nvidia GPU instance types (g4, p3, p2), we use the TensorRT integration feature of TVM to further identify operators in the subgraphs that can be compiled by TensorRT, creating subgraphs within a subgraph. A hybrid model running on these GPU instances may use the framework runtime, TVM runtime, and TensorRT runtime.

In some dynamic model cases, we use the relayVM from TVM, which has native support for dynamic tensor shape and control flow operators. This allows fully ahead-of-time compilation for models such as Mask R-CNN. As of this writing, compilers such as XLA or TensorRT use just-in-time to handle dynamic tensor shapes, which incur extra compiling cost whenever a new tensor shape is present when running a model.

At the subgraph level, TVM uses a framework-specific front-end component to convert the subgraph into relay IR (intermediate language). Relay IR is very expressive and can support data types, variables, control flow, function calls, and highly parallelizable compute operations such as matrix multiplication.

From relay IR, TVM does two types of optimizations: graph level and node or tensor level. One kind of graph-level optimization is to fuse two or more nodes together to avoid extra data copy. This is especially useful when GPU is involved because launching a small kernel too many times is very expensive. Another kind of graph-level optimization is to change the way a multi-dimensional array is stored in memory based on the operators involved. An example is that the conv2D operator used in computer vision models prefers the 4-D array sent to it to be in the NCHW format. Yet another optimization is to pre-compute parts of the subgraph at compile time (constant folding). By rewriting the graph in certain ways, TVM can improve the run speed of the model.

Node- or tensor-level optimization is about generating more efficient code for the operator. For example, the most optimal way of doing conv2D depends on the size of the 4-D array in each dimension. TVM can take advantage of this knowledge and generate code based on the hardware attributes of the target device, such as L1 cache size and CPU or GPU instruction scheduling policies.

Summary

Neo can now compile nearly all ML models from TensorFlow, PyTorch, and MXNet frameworks for SageMaker CPU and GPU instances. We continue to tune and optimize Neo. If you have any questions or comments, use the Amazon SageMaker Discussion Forums or send an email to amazon-neo-feedback@amazon.com.

About the Author

Wei Xiao is a Principle Engineer working on the optimization of machine learning systems in the Amazon AWS AI org. Previously he worked on distributed systems and relational databases in Amazon and Microsoft for many years.

Predicting soccer goals in near real time using computer vision

In a soccer game, fans get excited seeing a player sprint down the sideline during a counterattack or when a team is controlling the ball in the 18-yard box because those actions could lead to goals. However, it is difficult for human eyes to fully capture such fast movements, let alone predict goals. With machine learning (ML), we can incorporate more fine-grained information at the pixel level to develop a solution that predicts goals with high confidence before they happen.

Sportradar, a leading real-time sports data provider that collects and analyzes sports data, and the Amazon ML Solutions Lab collaborated to develop a computer vision-based Soccer Goal Predictor to detect exciting moments that lead to goals, thereby increasing fan engagement and helping broadcasters provide viewers an enhanced experience. Most action recognition models are used to identify events when they occur, but Amazon ML Solutions Lab developed a novel computer vision-based Soccer Goal Predictor that can predict future soccer goals 2 seconds in advance of the event.

“We deliberately threw one of the hardest possible computer vision problems at the Amazon ML Solutions Lab team to see what the art of the possible could do, and I am very impressed with the results,” says Ben Burdsall, Group CTO at Sportradar. “The team built a video action recognition model to predict future soccer goals two seconds in advance using Amazon SageMaker and demonstrated its application for match intensity tracking. This has opened doors to many new business opportunities. The implementation costs and latency of this model on our production pipeline using AWS’s infrastructure also look very encouraging. After today, I have no more skepticism about the potential of computer vision in innovating our business.”

The team used Amazon SageMaker notebook instances to create a data processing pipeline that extracted training examples from raw videos and used transfer learning to fine-tune an Inflated 3D Networks (I3D) model. The results have inspired Sportradar’s data science and innovation teams to develop new statistics to embed into their broadcast videos to enhance fan engagement.

This post explains how we applied transfer learning using an I3D model towards goal prediction and used the inference to create an intensity index to quantify the likelihood of a team scoring goals. Additionally, we discuss how we constructed a momentum index to measure the change of velocity during attacks. (Attack is a soccer term used to describe the movement of the team in possession of the ball.) With the intensity index and the momentum index, we can detect whether there is an intense moment (a moment that leads to a goal) in near-real-time using live feeds, and build products to help broadcasters engage fans during broadcasts.

Processing the data and building the model

To capture these intense moments, we translated this objective into a binary classification problem: differentiating activities that lead to goals from those that do not. The samples in the positive class (goals class) are video clips that are 2 seconds away from the goals, and the ones from the negative class are clips in which players are engaged in activities that do not lead to goals (ballsafe class). We generated 1,550 clips from 398 professional soccer matches provided by Sportradar.

A lot of action can happen in a few seconds during soccer matches, so we used short video clips to train the model. For this use case, we extracted 5-second clips. A challenge with video processing is that reading multiple video streams and extracting clips sequentially can be very time-consuming, taking several hours to complete. To speed up the clip-extraction process, we created a data pipeline using multiprocessing in an Amazon SageMaker notebook using ml.c5.18xlarge instance with 72 CPUs to parallelize the I/O-heavy clip extraction process and reduced the clip-extraction time from 12 hours to under 15 minutes.

After data processing, we built a binary classification model using the I3D model from GluonCV’s model zoo. The I3D model uses 3D convolutions to learn spatiotemporal information directly from videos. Given that we did not have a large dataset, we employed transfer learning and fine-tuned the I3D model to get well-performant video models with our own data. For more information about fine-tuning and an I3D model using GluonCV, see Fine-tuning SOTA video models on your own dataset.

Using Amazon SageMaker notebook instances, we first loaded an I3D network pre-trained on the Kinetcis400 dataset into a Jupyter notebook. We then fine-tuned this network on the data from Sportradar to find the best set of parameters, especially those specific to action recognition models (e.g., number of frames, number of segments, stride for frame sampling).

Results

We used recall as our primary metric for model evaluation because we wanted to capture near-100% goals (positive class). The following graphs depict the confusion matrix and the precision-recall curve. It can be seen that it is difficult for the model to differentiate between the two classes when we have near-100% recall. We re-calibrated the predicted probabilities to look at the model performance for achieving 80% and 90% recall for the positive class (sequence leads to a goal) respectively.

The following table shows the precision and recall of the negative class when we fix the recall of the positive class. We can see that our model can differentiate the two classes with the new settings. When we fix the recall of the positive class at 90%, we can capture 68% of the negative class samples, and the precision is 75%.

	At 80% Goal Recall	At 90% Goal Recall
Ballsafe Recall	0.81	0.68
Goal Precision	0.82	0.75

Intensity index and momentum index

After training and validation, we selected the model that gives the best recall on the validation set. We generated inferences over three full games of video using a moving window with the predicted probabilities acting as the intensity index. To measure the change of velocity during attacks, we also generated a momentum index for the current timestamp, using the slope of the linear regression line of predicted probabilities from four previous timestamps. Finally, we used min-max normalization to scale the index between -1 and 1. Therefore, the momentum index effectively measures how the predicted goal probabilities change in the recent few seconds.

The following image illustrates the model inference using a 5-second moving window on a 40-second clip. The areas that are marked red are moments when the predicted scores are signaling intense moments. The first two red bars depict a near-goal situation, a very intense moment in the game. Ultimately, the team scored a goal at the end of that clip during the third high intensity red bar.

The meter on the left side measures the momentum index from -1 to 1, and the match intensity line chart at the bottom is the goal predictions using our model. Because a lot of action can happen in 2 seconds, the model’s high goal probability predictions are still reasonable before the shots were missed.

Watch the full video:

Model performance in production

Sportradar is investing in computer vision both through internal research, development, and external partnerships. To facilitate the rapid transition of computer vision models from the lab to production and running computer vision models at scale, Sportradar has developed a near-real-time computer vision inference pipeline using AWS services. The pipeline helps ensure that the service level agreements and low latency requirements for near-real-time computer vision workloads are met in a cost-effective way by using Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon FSx for Lustre.

When deployed and tested in Sportradar’s computer vision inference pipeline, the latency for generating an inference from the goal prediction model featured in this article was around 200 milliseconds, with a total end-to-end processing latency of around 700 milliseconds.

Summary

Sportradar collaborated with the Amazon ML Solutions Lab to develop a novel computer vision-based Soccer Goal Predictor that predicts future goals with the precision of 75% while keeping the recall at 90%. We used transfer learning to fine-tune an I3D model in Amazon SageMaker to classify attacks that lead to goals as well as activities that don’t lead to goals, and used the model inference to create a momentum index to signal the intense moments in soccer. This model has been tested in Sportradar’s computer vision pipeline, and it can achieve close to real-time inference with low latency in sub-seconds. This approach can be applied to other sports in terms of key events prediction and game intensity measurement using computer vision infrastructure without relying on sensors for data collection.

If you’d like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program.

For more information about what the Amazon ML Solutions Lab is doing in the world of sports, see AWS Sports ML page.

About the Authors

Daliana Zhen Liu a Senior Data Scientist at the Amazon ML Solutions Lab. She builds AI/ML solutions to help customers across various industries to accelerate their business. Previously, she worked at Amazon’s A/B testing platform helping retail teams make better data-driven decisions.

Andrej Bratko leads a team of data scientists and engineers at Sportradar working on machine learning solutions in areas such as sports data collection, sportsbook risk management, odds trading and sports integrity. His team is also responsible for Sportradar’s big data and analytics infrastructure supporting machine learning model development and deployment. He holds a PhD in machine learning from the University of Ljubljana, Slovenia.

Jure Prevc is a Data Scientist at Sportradar working mostly on risk management in sports betting. Besides his primary focus, he has a wide interest in different applications of machine learning and enjoys working on state-of-the-art solutions to solve complex business problems.

Luka Pataky leads the Innovation Team at Sportradar – pioneering new technologies and products. One of the key innovation projects is computer vision, where he leads a team of data and computer vision engineers working on solutions to collect sports data and gather deeper insight into games. His team also works on other projects related to applying emerging technologies to products and processes, as well as projects that look at reinventing the way of how fans engage with sports and data.

Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals, and helps them to accelerate their cloud migration journey, and to solve their ML problems using state-of-the-art solutions and technologies.

Suchitra Sathyanarayana is a manager at Amazon ML Solutions Lab, where she helps AWS customers across different industry verticals accelerate their AI and cloud adoption. She holds a PhD in Computer Vision from Nanyang Technological University, Singapore.

Uros Lipovsek is machine learning engineer with experience in ML, computer vision, data engineering and devops. He is architecting computer vision pipeline at Sportradar and holds B.S. in Economics with focus on Econometrics from University of Ljubljana.

Amazon Scholar George Karypis receives ICDM 10-Year-Highest-Impact award

University of Minnesota professor and Amazon Scholar, together with coauthor, receives recognition for paper that proposes novel approach to algorithm that generates high-quality recommendations for e-commerce products at high speeds.Read More

Incremental learning: Optimizing search relevance at scale using machine learning

Amazon Kendra is releasing incremental learning to automatically improve search relevance and make sure you can continuously find the information you’re looking for, particularly when search patterns and document trends change over time.

Data proliferation is real, and it’s growing. In fact, International Data Corporation (IDC) predicts that 80% of all data will be unstructured by 2025. However, mining data for accurate answers continues to be a challenge for many organizations. In an uncertain business environment there is mounting pressure find relevant information quickly, and use it to sustain and enhance business performance.

Organizations need solutions that deliver accurate answers fast and evolve the process of knowledge discovery from being a painstaking chore that typically results in dead ends, into a delightful experience for customers and employees.

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra has reimagined enterprise search for your websites and applications so employees and customers can easily find what they’re looking for, even when answers could be scattered across multiple locations and content repositories within your organization.

Intelligent search helps you consolidate unstructured data from across your business into a single, secure, and searchable index. However, data ingestion and data source consolidation is just one aspect of upgrading a conventional search solution to a ML-powered intelligent search solution.

A more unique aspect of intelligent search is its ability to deliver relevant answers without the tuning complexities typically needed for keyword-based engines. Amazon Kendra’s deep learning language models and algorithms deliver accuracy out of the box, and automatically tune search relevance on a continuous basis.

Continuous improvement

A large part of this is incremental learning, which now comes built-in to Amazon Kendra. Incremental learning creates a mechanism for Amazon Kendra to observe user activity, search patterns, and user interactions. Amazon Kendra then uses these fundamentally important data points to understand user preferences for various documents and answers so it can take action and optimize search results. For example, the following screenshot shows how users can rate how helpful a search result is.

By transparently capturing user queries along with their preferred answers and documents, Amazon Kendra can learn from these patterns and take actions to improve future search results. For example, if an employee performs a search “What is our company expense policy?” without being specific about what kind of expense policy they’re interested in, they may see a host of varying topics.

Their results could include “airfare policies,” “hotels and accommodation,” or meals and entertainment policy,” and although each topic is technically related to company expense policy, the most commonly sought document may not necessarily be at the top of the list.

However, if it turns out that when employees typically ask that question, they’re searching for content related to “home-office reimbursements,” Amazon Kendra can learn from how users interact with results and adapt its models to re-rank information so “home-office expense policy” content gets promoted at the top of the search results page in future searches.

Scaling lessons learned

Incremental learning autonomously optimizes search results over time without the need to develop, train, and deploy ML models. It tunes future search results quickly in a way that’s data-driven and cost effective.

Consider the process for optimizing search accuracy in non-ML powered solutions, it would require significant effort, machine learning skill, and maintenance, even when using “ML” plugins added on top of legacy engines.

As unstructured data continues to dominate and grow at exponential speeds within the enterprise, implementing an adaptive, intelligent, and nimble enterprise search solution becomes critical to keeping up with the pace of change.

Conclusion

To learn more about how organizations are using intelligent search to boost workforce productivity, accelerate research and development, and enhance customer experiences download our ebook, 7 Reasons Why Your Organization Needs Intelligent Search. For more information about incremental learning, see Submitting feedback, or to learn more about Amazon Kendra visit the website or you can watch this video “What is Intelligent Search?”

About the Authors

Jean-Pierre Dodel leads product management for Amazon Kendra, a new ML-powered enterprise search service from AWS. He brings 15 years of Enterprise Search and ML solutions experience to the team, having worked at Autonomy, HP, and search startups for many years prior to joining Amazon 4 years ago. JP has led the Kendra team from its inception, defining vision, roadmaps, and delivering transformative semantic search capabilities to customers like Dow Jones, Liberty Mutual, 3M, and PwC.

Tom McMahon is a Product Marketing Manager on the AI Services team at AWS. He’s passionate about technology and storytelling and has spent time across a wide-range of industries including healthcare, retail, logistics, and eCommerce. In his spare time he enjoys spending time with family, music, playing golf, and exploring the amazing Pacific northwest and its surrounds.

In Search of Ubiquity

GPUs at the Core

Introduction

Background: A Plethora of Theories and a Paucity of Evidence

Methods

We Can Separate Learning Rules from Aggregate Statistics of the Weights, Activations, or Layer-wise Activity Changes

Adding Back Some Experimental Neuroscience Realism

Access to only portions of the learning trajectory: subsampling observable trajectories

Incomplete and noisy measurements: subsampling units and Gaussian noise before collecting observables

Conclusions

Acknowledgements

Expanded support for PC and mobile devices

Heterogeneous execution with NVIDIA TensorRT

Bring your own codegen

Inference optimized containers

Compilation for dynamic models

Summary

About the Authors

Methodology

Frontend

Backend

Performing inference and object detection in Neo

Performance comparison

EC2 C5.9x large server instance

EC2 m6g.8x large server instance

NVIDIA server instance and edge devices

Summary

About the Author

Performance highlights

Conclusion

About the Author

Prerequisites

Converting models automatically

Adding the model to your app

Deploying and testing the model

Conclusion

About the Author

Partially compiling a model

Speedup numbers across common models and frameworks

How compiling works

Summary

About the Author

Processing the data and building the model

Results

Intensity index and momentum index

Model performance in production

Summary

About the Authors

Continuous improvement

Scaling lessons learned

Conclusion

About the Authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.