Portrait Depth API: Turning a Single Image into a 3D Photo with TensorFlow.js

Posted by Ruofei Du, Yinda Zhang, Ahmed Sabie, Jason Mayes, Google.

A depth map is essentially an image (or image channel) that contains information about the distance of the surfaces of objects in the scene from a given viewpoint (in this case, the camera itself) for every pixel in that image. Depth maps are a fundamental building block for a variety of computer graphics and computer vision applications, such as augmented reality, portrait mode, and 3D reconstruction. Despite recent advances in depth sensing with the ARCore Depth API, the majority of photographs on the web still lack associated depth maps. This, combined with the web community's growing interest in having depth capabilities within JavaScript to enhance existing web apps, such as bringing images to life, applying real-time AR effects to a human face and body, or even reconstructing items for use in VR environments, helped shape the path for what you see today.

Today we are introducing the Depth API, the first depth estimation API from TensorFlow.js. With this new API, we are also introducing the first depth model for portraits, ARPortraitDepth, which estimates a depth map for a single portrait image. To demonstrate one of many possible uses of depth information, we also present a computational photography application, 3D photo, which utilizes the predicted depth and enables a 3D parallax effect on the given portrait image. Try the live demo below to easily make your own social media profile photo 3D, as shown below.


Try out the 3D portrait demo for yourself!

Examples generated from the 3D photo application.

ARPortraitDepth: Single Image Depth Estimation

At the core of the Portrait Depth API is a deep learning model, named ARPortraitDepth, that takes a single color portrait image as the input and produces a depth map. For the sake of computational efficiency, we adopt a light-weight U-Net architecture. As shown below, the encoder gradually downscales the image or feature map resolution by half, and the decoder increases the feature resolution to the same as the input. Deep learning features from the encoder are concatenated to the corresponding layers with the same spatial resolution in the decoder to bring high resolution signals for depth estimation. During training, we force the decoder to produce depth predictions with increasing resolutions at each layer, and add a loss for each of them with the ground truth. This empirically helps the decoder to predict accurate depth by gradually adding details.

Abundant and diverse training data is critical for the machine learning model to achieve overall decent performance, e.g. accuracy and robustness. We synthetically render pairs of color and depth images with various camera configurations, e.g. focal length, camera pose, from 3D digital humans captured by a high quality performance capture system, and run relighting augmentation with High Dynamic Range environment illumination maps to increase the realism and diversity of the color images, e.g. shadows on the face. We also collect real data using mobile phones equipped with a front facing depth sensor, e.g. Google Pixel 4, where the depth quality, as the training ground truth, is not as accurate and complete as that in our synthetic data, but the color images are effective in improving the performance of our model when running on images in the wild.

Single image depth estimation pipeline.

To enhance robustness against background variation, in practice we run an off-the-shelf body segmentation model with MediaPipe and TensorFlow.js before sending the image into the depth estimation neural network.

The portrait depth model could enable a whole host of creative applications oriented around the human body that could drive next generation web apps. We refer readers to ARCore Depth Lab for more inspiration.

For the 3D photo application, we created a high-performance rendering pipeline. It first generates a segmentation mask using the existing TensorFlow.js body segmentation API. Next, we pass the masked portrait into the Portrait Depth API and obtain a depth map on the GPU. Then, we generate a depth mesh in three.js, with vertices arranged in a regular grid and displaced by re-projecting the corresponding depth values (see the figure below for generating the depth mesh). Finally, we apply texture projection to the depth mesh and rotate the camera around the z axis in a circle. Users can download the animations in GIF or WebM format.

Generating the depth mesh from the depth map for the 3D photo application.

Portrait Depth API Installation

The portrait depth API is currently offered as one variant of the new depth API.

To install the API and runtime library, you can either use the <script> tag in your HTML file or use NPM.

Through script tag:

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-core"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-webgl"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-converter"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/body-segmentation"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/depth-estimation"></script>

Through NPM:

yarn add @tensorflow/tfjs-core @tensorflow/tfjs-backend-webgl
yarn add @tensorflow/tfjs-converter
yarn add @tensorflow-models/body-segmentation
yarn add @tensorflow-models/depth-estimation

How you reference the API in your JS code depends on how you installed the library.

If installed through script tag, you can reference the library through the global namespace depthEstimation.

If installed through NPM, you need to import the libraries first:

import '@tensorflow/tfjs-core';
import '@tensorflow/tfjs-backend-webgl';
import '@tensorflow/tfjs-converter';
import '@tensorflow-models/body-segmentation';
import * as depthEstimation from '@tensorflow-models/depth-estimation';

Try it yourself!

First, you need to create an estimator:

const model = depthEstimation.SupportedModels.ARPortraitDepth;

const estimatorConfig = {
  minDepth: 0, // The minimum depth value outputted by the estimator.
  maxDepth: 1, // The maximum depth value outputted by the estimator.
};

const estimator = await depthEstimation.createEstimator(model, estimatorConfig);

Once you have an estimator, you can pass in a video stream, static image, or TensorFlow.js tensors to estimate depth:

const video = document.getElementById('video');
const depthMap = await estimator.estimateDepth(video);

How to use the output?

The depthMap result above contains depth values for each pixel in the image.

The depthMap is an object that stores the underlying depth values. You can then utilize the provided asynchronous conversion functions such as toCanvasImageSource, toArray, and toTensor, depending on the output type you want.

Note that different models have different internal representations of data, so converting from one form to another may be expensive. For efficiency, you can call getUnderlyingType to determine what form the depth map is already in, and keep it in that form for faster results.

The semantics of the depthMap are as follows: the depth map is the same size as the input image. For array and tensor representations, there is one depth value per pixel. For CanvasImageSource, the green and blue channels are always set to 0, whereas the red channel stores the depth value.

See below output snippet for example:

{
  toCanvasImageSource(): ...
  toArray(): ...
  toTensor(): ...
  getUnderlyingType(): ...
}

Browser Performance

Portrait Depth model, TFJS runtime with WebGL backend (FPS):

MacBook M1 Pro 2021: 51 FPS
iPhone 13 Pro: 22 FPS
Pixel 6 Pro: 5 FPS
Desktop PC (Intel i9-10900K, Nvidia GTX 1070 GPU): 47 FPS

Acknowledgements

We would like to acknowledge our colleagues who participated in or sponsored creating Portrait Depth API in TensorFlow.js: Na Li, Xiuxiu Yuan, Rohit Pandey, Abhishek Kar, Sergio Orts Escolano, Christoph Rhemann, Idris Aleem, Sean Fanello, Adarsh Kowdle, Ping Yu, Alex Olwal‎. We would also like to acknowledge the body segmentation model provided by MediaPipe, and The Relightables for high quality synthetic data.


Why Karrot Uses TFX, and How to Improve Productivity on ML Pipeline Development

Posted by Ukjae Jeong, Gyoung-yoon Yoo, and Myeonghyeon Song from Karrot

Karrot (the global service of Danggeun Market in Korea) is a local community service app that connects neighbors based on a secondhand marketplace. Danggeun Market was launched in 2015, and over 23 million people in Korea are using Danggeun Market in their local communities. Currently, Karrot is operated in 440 local communities in four countries: the U.K., the U.S., Canada, and Japan. In our service, scrolling through feeds to find inexpensive and useful items has become a daily pleasure for users. For better user experiences, we’ve been applying several machine learning models such as recommendation models.

We are also working on ways to effectively and efficiently apply ML models. In particular, we’re putting lots of effort into building machine learning pipelines for periodic deployment, rapid experiments, and continuous model improvement.

For the ML pipelines, we’ve been using TFX (TensorFlow Extended) for production. So in this article, we will briefly introduce why we use TFX, and how we utilize TFX to improve productivity.

Machine Learning in Karrot

There are many ML projects inside Karrot, and ML models run inside our services. For example, we use automated models to detect fraud, and there are recommendation models to improve the user experience in our app feed. If you are interested in detailed descriptions of the models, please refer to our team blogs, which are written in Korean.

As we had been using Kubeflow for our ML models, we were able to periodically train, experiment with, and deploy models, but we still had some pain points. When we started to use TFX with Kubeflow last year, TFX pushed this line further and made it easy for the team to work with our production pipelines.

How TFX helps us with production ML

TFX helps build and deploy production ML pipelines easily with an open and extensible design.

TFX, completely open-sourced in 2019, is an end-to-end platform for production ML pipelines. It supports writing ML workflows in component units, which then can be run in multiple environments – Apache Beam, Dataflow, Kubeflow, and Airflow. It also comes with well-written standard components for data ingestion/transformation, training, and deployment.

Standard Components

TFX provides several standard components. If you are looking for components for data ingestion, there is CsvExampleGen for local CSV files, as well as PrestoExampleGen and BigQueryExampleGen, which ingest data directly from Presto and BigQuery; many other sources can be supported with some customization. So you can easily process data from multiple sources just by connecting pre-built components to your TFX pipelines.
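For reference, wiring one of these ingestion components into a pipeline is usually a one-liner. The sketch below is not taken from the Karrot post; it assumes the TFX 1.x API, and the input path and query are placeholders.

from tfx import v1 as tfx

# Ingest from a directory of local CSV files...
example_gen = tfx.components.CsvExampleGen(input_base="data/train_csvs")

# ...or directly from BigQuery (requires the GCP extension; the query is a placeholder).
example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
    query="SELECT * FROM `my_project.my_dataset.my_table`")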

It can also handle large-scale data processing smoothly. Since the Transform component that performs feature engineering is implemented on Apache Beam, you can execute it on GCP Dataflow or another compute cluster in a distributed manner.
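For context (this is not Karrot's actual code), the user-facing part of the Transform component is a preprocessing_fn built from tensorflow_transform analyzers, which Apache Beam then executes in a distributed manner; the feature names below are made up.

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Feature engineering executed by the Transform component on Apache Beam."""
    outputs = {}
    # Scale a numeric feature to zero mean and unit variance.
    outputs["price_scaled"] = tft.scale_to_z_score(inputs["price"])
    # Map a string feature to integer ids using a vocabulary computed over the full dataset.
    outputs["category_id"] = tft.compute_and_apply_vocabulary(inputs["category"])
    return outputs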

Of course, many other convenient components exist and are added constantly.

Component Reusability

To adapt TFX to our product, we needed custom components. TFX has a well-structured component design that enables us to create custom components naturally and easily connect them to existing TFX pipelines. A simple Python function or container can be transformed into a TFX component, or you can write the whole component in exactly the same way as the standard components are written. For more details, check out the custom component guide.
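To illustrate the Python-function route, here is a minimal sketch adapted from the TFX custom component guide; the component name, its logging behavior, and the message parameter are hypothetical and not part of Karrot's internal library.

import logging

from tfx.dsl.component.experimental.annotations import InputArtifact, Parameter
from tfx.dsl.component.experimental.decorators import component
from tfx.types.standard_artifacts import Examples


@component
def ExamplesLogger(examples: InputArtifact[Examples],
                   message: Parameter[str] = "Ingested examples at:"):
    """Logs where the ingested examples landed; purely illustrative."""
    logging.info("%s %s", message, examples.uri)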

To spread these advantages and enhance our productivity, we share custom components that have similar use cases across our ML pipelines as an internal library of Karrot Market.

Various Runners are Supported

TFX is compatible with a variety of environments. It can run locally on your laptop, or on Dataflow, GCP's batch data processing service compatible with Apache Beam. You can visualize the output by manually running each component in a Jupyter Notebook. TFX also supports Kubeflow and Vertex AI, which have recently gained new features as well. Therefore, the pipeline code is written once and can then be run almost anywhere, and we can create development, experiment, and deployment environments at once. For these reasons, the burden of deploying models to production was significantly reduced by using TFX for our services.

Technical lessons

As we set up our ML pipelines with TFX, code quality and our model development experience improved.

However, there were difficulties. We didn't have a uniform project structure or best practices within our team, perhaps because TFX itself is relatively new and we had been using it since before version 1.0. It became harder to understand the code and start contributing. As the pipelines became larger and more complex, it got harder to understand the meaning of custom components, their corresponding config values, and their dependencies. In addition, it was difficult to introduce some of the latest features to the team.

Improving the Development Experience

We decided to create and use a template for TFX pipelines to make it easier to understand each other’s code, implement pipelines with the same pattern, and share know-how with each other. We merged components frequently used in Karrot and put them in a shared library so that ML pipelines can be developed very quickly.

It was expected that the template would accelerate the development of new projects. In addition, as mentioned above, we expected that each project would have a similar structure, making it easier to understand each other’s projects.

So far, we have briefly introduced the template project. Here are some of our considerations to make better use of TFX in this project.

Configuration first

We put configuration first: it should be possible to understand how a pipeline works just by reading its configuration. If specific settings are easy to understand, we can set up various experiments and take them through to A/B testing.

example_gen_config.proto, written in Protocol Buffers (Protobuf), defines the specification of the config; config.pbtxt holds the values, and pipeline.py builds up the pipeline.

// config.pbtxt
example_gen_config {
  big_query_example_gen_config {
    query: "# query for example gen"
  }
  ...
}
...

// example_gen_config.proto
message ExampleGenConfig {
  oneof config {
    BigQueryExampleGenConfig big_query_example_gen_config = 1;
    CsvExampleGenConfig csv_example_gen_config = 2;
  }
  ...
}

// When BigQueryExampleGen is used
message BigQueryExampleGenConfig {
  optional string query = 1;
}

// When CsvExampleGenConfig is used
message CsvExampleGenConfig {
  optional string input_base = 1;
}

# pipeline.py
def create_pipeline(config):
    ...
    example_gen = _create_example_gen(config.example_gen_config)
    ...


def _create_example_gen(config: example_gen_config_pb2.ExampleGenConfig):
    ...

    if config.HasField("big_query_example_gen_config"):
        ...
        return ...

    if config.HasField("csv_example_gen_config"):
        ...
        return ...

    raise ...

All configurations of ExampleGen are determined by a single ExampleGenConfig message. Similarly, all pipeline components depend only on their configs and are created from them. This way, you can understand the structure of the pipeline just by looking at the configuration file. Separating out the part that defines each component also makes customization and understanding the code easier.
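To make the flow concrete, here is a minimal sketch of how a pbtxt file like config.pbtxt could be parsed and handed to create_pipeline() from pipeline.py above; the generated module name (pipeline_config_pb2) and the top-level PipelineConfig message are assumptions for illustration, not Karrot's actual code.

from google.protobuf import text_format

import pipeline_config_pb2  # hypothetical module generated from the pipeline's .proto files


def load_config(path: str):
    # Hypothetical top-level message that wraps example_gen_config and friends.
    config = pipeline_config_pb2.PipelineConfig()
    with open(path) as f:
        text_format.Parse(f.read(), config)
    return config


config = load_config("config.pbtxt")
pipeline = create_pipeline(config)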

For example, let's assume the following situation: to test data transformations later, the Transform component needs to support various data processing methods. If you want to add a data augmentation step to the Transform component, you do so by adding a config for the data augmentation function. In the same way, you can extend the predefined Protobuf specification to support multiple processing methods and make it easy to see which processing method is in use.

Managing Configs with Protobuf

Looking at the example code above, some people may wonder why we use Protobuf as a configuration tool. There are several reasons, and we will compare its advantages with YAML, one of the most common choices for configuration.

First, Protobuf has a robust interface, and validation such as type checking is convenient. There is no need to check whether any field is defined, as Protobuf defines the object structure in advance. In addition, it is useful to support backward/forward compatibility in a project under active development.

Also, you can easily check the pipeline structure. YAML has a hierarchical structure too, but in the case of Hydra, which is often used in the machine learning ecosystem, the stage settings (e.g. production, dev, alpha) are divided across several files, so we think Protobuf has better stability and visibility.

If you use Protobuf as your project setup tool, many of the Protobuf definitions defined in TFX can be reused.

TensorFlow Ecosystem with Bazel

Bazel is a language-independent build system that is easy to extend and supports a variety of languages and tools. From simple projects to large projects using multiple languages and tools, it can be used quickly and concisely in most situations. For more information, please refer to Bazel Vision on the Bazel documentation page.

Using Bazel in a Python project is an uncommon choice, but we used Bazel as the build system of the TFX template project. Briefly, the reasons are as follows.

First of all, it works really well with Protobuf. Because Bazel is a language-independent build system, you can easily tie your Protobuf build artifacts as dependencies with other builds without worry. In addition, the Protocol Buffer repository itself uses Bazel, so it is easy to integrate it into the Bazel-based project.

The second reason is the special environment of the TensorFlow ecosystem. Many projects in the TensorFlow ecosystem use Bazel, and TFX also uses Bazel, so you can easily link builds with other projects (TensorFlow, TFX) using Bazel.

Internal Custom TFX Modules

As mentioned before, we’ve been building an internal library for the custom TFX modules (especially the custom components) that are frequently used across multiple projects. Anyone in Karrot can add their components and share them with the team.

For example, we are using ArgoCD to manage applications (e.g. TF Serving) in Kubernetes clusters, so if someone develops a component for deploying with ArgoCD, we can easily share it via the internal library. The library now contains several custom modules that boost our team's productivity.

We can share our custom features as an internal library largely thanks to the modular structure of TFX. Through this, we easily improved the overall productivity of the team: we can reuse most of the components developed across several projects and develop new projects very quickly.

Conclusion

TFX provides lots of great features to develop production ML pipelines. We’re using TFX on Kubeflow for ML pipelines to develop, experiment, and deploy in a better way, and it brings us many benefits. So we decided to introduce how we are using TFX in this blog post.

To learn more about Karrot, check out our website (Korea, US, and Canada). For the TFX, check out the TFX documentation page.


Video Classification on Edge Devices with TensorFlow Lite and MoViNet

Posted by Dan Kondratyuk, Liangzhe Yuan, Google Research and Khanh LeViet, TensorFlow Developer Relations

We are excited to announce MoViNets (pronounced “movie nets”), a family of new mobile-optimized model architectures for video classification. The models are trained on the Kinetics-600 dataset to recognize 600 different human actions (such as playing trumpet, robot dancing, bowling, and more) and can classify video streams captured on a modern smartphone in real time. You can download the pre-trained TensorFlow Lite models from TensorFlow Hub or try them out using our Android and Raspberry Pi demo apps, as well as fine-tune your own MoViNets with the Colab demo and the code in the TensorFlow Model Garden.

Demo from the TensorFlow Lite video classification reference app

Video classification is a machine learning task that takes video frames as input and predicts a single class from a larger set of classes. Video action recognition is a type of video classification where the set of predicted classes consists of human actions that happened in the frames. Video action recognition is similar to image recognition in that both take input images and output the probabilities of the images belonging to each of the predefined classes. However, a video action recognition model has to look at both the content of each frame and the temporal relationships between adjacent frames to understand the actions in the video. For example, if you look at these still images, it’s difficult to tell what the person is doing.

But if you look at the full video, it becomes clear that the person is performing jumping jacks.

MoViNet Model Architecture

MoViNets are a family of convolutional neural networks which efficiently process video streams, outputting accurate predictions with a fraction of the latency of convolutional video classifiers like 3D ResNets or transformer-based classifiers like ViT.

Frame-based classifiers output predictions on each 2D frame independently, resulting in sub-optimal performance due to their lack of temporal reasoning. On the other hand, 3D video classifiers offer high accuracy predictions by processing all frames in a video clip simultaneously, at a cost of significant memory and latency penalties as the number of input frames increases. MoViNets offer key advantages from both 2D frame-based classifiers and 3D video classifiers while mitigating their disadvantages.

The following figure shows a typical approach to using 3D networks with multi-clip evaluation, where the predictions of multiple overlapping subclips are averaged together. Shorter subclips result in lower latency, but reduce the overall accuracy.

Diagram illustrating Multi-Clip Evaluation for 3D Video Networks

MoViNets take a hybrid approach, using causal convolutions in place of 3D convolutions so that intermediate activations can be cached across frames with a Stream Buffer. The Stream Buffer copies the input activations of all 3D operations, which are output by the model and then fed back into the model with the next clip input.

Diagram illustrating Streaming Evaluation for MoViNets

The result is that MoViNets can receive one frame input at a time, reducing peak memory usage while resulting in no loss of accuracy, with predictions equivalent to inputting all frames at once like a 3D video classifier. MoViNets additionally leverage Neural Architecture Search (NAS) by searching for efficient configurations of models on video datasets (specifically Kinetics 600) across network width, depth, and resolution.
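To make the streaming idea concrete, here is a hedged sketch of running a MoViNet-Stream TFLite model frame by frame while carrying the stream-buffer states forward. The model path and the "image"/"logits" tensor names are assumptions; inspect your model's signature for the exact names.

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="movinet_a0_stream.tflite")  # placeholder path
runner = interpreter.get_signature_runner()

# Initialize every state input to zeros; only the image changes per frame.
states = {
    name: np.zeros(detail["shape"], dtype=detail["dtype"])
    for name, detail in runner.get_input_details().items()
    if name != "image"
}

# Placeholder clip: 8 frames of shape [1, 1, 172, 172, 3] (assumes a float input model).
frames = np.random.rand(8, 1, 1, 172, 172, 3).astype(np.float32)

for frame in frames:
    outputs = runner(**states, image=frame)
    logits = outputs.pop("logits")  # per-frame class scores
    states = outputs                # feed the updated states back in for the next frame
    print(int(np.argmax(logits)))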

The result is a set of action classifiers that can output temporally-stable predictions that smoothly transition based on frame content. Below is an example plot of MoViNet-A2 making predictions on each frame on a video clip of skateboarding. Notice how the initial scene with a small amount of motion has relatively constant predictions, while the next scene with much larger motion causes a dramatic shift in predicted classes.

MoViNets need a few modifications to run effectively on edge devices. We start with MoViNet-A0-Stream, MoViNet-A1-Stream, and MoViNet-A2-Stream, the smaller models that can feasibly run in real time (20 fps or higher). To quantize MoViNet effectively, we apply a few modifications to the model architecture: the hard swish activation is replaced by ReLU6, and the Squeeze-and-Excitation layers of the original architectures are removed, which results in a 3-4 percentage point (p.p.) accuracy drop on Kinetics-600. We then convert the models to TensorFlow Lite and use integer-based post-training quantization (as well as float16 quantization) to reduce the model sizes and make them run faster on mobile CPUs. The integer-based post-training quantization process introduces a further 2-3 p.p. accuracy loss. Compared to the original MoViNets, quantized MoViNets lag behind in accuracy on full 10-second Kinetics 600 clips (5-7 p.p. accuracy reduction in total), but in practice they are able to provide very accurate predictions on everyday human actions, e.g., push ups, dancing, and playing piano. In the future, we plan to train with quantization-aware training to bridge this accuracy gap.
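As a general reference (not the exact MoViNet conversion script), the float16 variant of post-training quantization mentioned above takes only a few lines with the TFLite converter; the SavedModel path and output file name below are placeholders.

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("movinet_a0_stream_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # store weights as float16
tflite_model = converter.convert()

with open("movinet_a0_stream_float16.tflite", "wb") as f:
    f.write(tflite_model)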

A video plotting the top-5 predictions of MoViNet-A2 over time on an example 8-second (25 fps) skateboarding video clip. Create your own plots with this Colab notebook.

We benchmark quantized A0, A1, and A2 on real hardware, and model inference achieves 200, 120, and 60 fps respectively on a Pixel 4 CPU. In practice, due to input pipeline overhead, we see latency closer to 20-60 fps when running on Android with a camera as input.

Model       Quantization   Top-1 Accuracy (%)   Latency (ms, Pixel 4 CPU)   Model Size (MB)   Recommended Input
A0-Stream   int8           65.0                 4.80                        3.1               172 x 172, 5 fps
A1-Stream   int8           70.1                 8.35                        4.5               172 x 172, 5 fps
A2-Stream   int8           72.2                 15.76                       5.1               224 x 224, 5 fps
A0-Stream   float16        71.5                 17.47                       7.6               172 x 172, 5 fps
A1-Stream   float16        76.0                 34.82                       13                172 x 172, 5 fps
A2-Stream   float16        77.5                 76.31                       15                224 x 224, 5 fps

Train a Custom Model

You can train your own video classifier model using the MoViNet codebase in the TensorFlow Model Garden. The Colab notebook walks through the specific steps to fine-tune a pretrained video classifier on another dataset.

Future Steps

We are excited to see on-device online video action recognition powered by MoViNets, which demonstrate highly efficient performance. In the future, we plan to support quantization-aware training for MoViNets to mitigate the quantization accuracy loss. We are also interested in extending MoViNets as the backbone for more on-device video tasks, e.g. video object detection, video object segmentation, visual tracking, pose estimation, and more.

Acknowledgement

We would like to extend a big thanks to Yeqing Li for supporting MoViNets in TensorFlow Model Garden, Boqing Gong, Huisheng Wang, and Ting Liu for project guidance, Lu Wang for code reviews, and the TensorFlow Hub team for hosting our models.


How to migrate from BoostedTrees Estimators to TensorFlow Decision Forests

Posted by Mathieu Guillame-Bert and Josh Gordon for the TensorFlow team

Decision forest models like random forests and gradient boosted trees are often the most effective tools available for working with tabular data. They provide many advantages over neural networks, including being easier to configure, and faster to train. Using trees greatly reduces the amount of code required to prepare your dataset, as they natively handle numeric, categorical, and missing features. And they often give good results out-of-the-box, with interpretable properties.

Although we usually think of TensorFlow as a library to train neural networks, a popular use case at Google is to use TensorFlow to create decision forests.

An animation of a decision tree classifying data.

This article provides a migration guide if you were previously creating tree-based models using tf.estimator.BoostedTrees, which was introduced in 2019. The Estimator API took care of much of the complexity of working with models in production, including distributed training and serialization. However, it is no longer recommended for new code.

If you are starting a new project, we recommend that you use TensorFlow Decision Forests (TF-DF). This library provides state-of-the-art algorithms for training, serving and interpreting decision forest models, with many benefits over the previous approach, notably regarding quality, speed, and ease of use.

To start, here are equivalent examples using the Estimator API and TF-DF to create a boosted tree model.

Previously, this is how you would train a gradient boosted trees model with tf.estimator.BoostedTrees (no longer recommended):

import tensorflow as tf

# Dataset generators
def make_dataset_fn(dataset_path):
    def make_dataset():
        data = ...  # read dataset
        return tf.data.Dataset.from_tensor_slices(...data...).repeat(10).batch(64)
    return make_dataset

# List the possible values for the feature "f_2".
f_2_dictionary = ["NA", "red", "blue", "green"]

# The feature columns define the input features of the model.
feature_columns = [
    tf.feature_column.numeric_column("f_1"),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "f_2",
            f_2_dictionary,
            # A special value "missing" is used to represent missing values.
            default_value=0)
    ),
]

# Configure the estimator
estimator = tf.estimator.BoostedTreesClassifier(
    n_trees=1000,
    feature_columns=feature_columns,
    n_classes=3,
    # Rule of thumb proposed in the BoostedTreesClassifier documentation.
    n_batches_per_layer=max(2, int(len(train_df) / 2 / FLAGS.batch_size)),
)

# Stop the training if the validation loss stops decreasing.
early_stopping_hook = tf.estimator.experimental.stop_if_no_decrease_hook(
    estimator,
    metric_name="loss",
    max_steps_without_decrease=100,
    min_steps=50)

tf.estimator.train_and_evaluate(
    estimator,
    train_spec=tf.estimator.TrainSpec(
        make_dataset_fn(train_path),
        hooks=[
            # Early stopping needs a CheckpointSaverHook.
            tf.train.CheckpointSaverHook(
                checkpoint_dir=input_config.raw.temp_dir, save_steps=500),
            early_stopping_hook,
        ]),
    eval_spec=tf.estimator.EvalSpec(make_dataset_fn(valid_path)))

How to train the same model using TensorFlow Decision Forests

import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Load the datasets
# This code is similar to the estimator example.
def make_dataset(dataset_path):
    data = ...  # read dataset
    return tf.data.Dataset.from_tensor_slices(...data...).batch(64)

train_dataset = make_dataset(train_path)
valid_dataset = make_dataset(valid_path)

# List the input features of the model.
features = [
    tfdf.keras.FeatureUsage("f_1", tfdf.keras.FeatureSemantic.NUMERICAL),
    tfdf.keras.FeatureUsage("f_2", tfdf.keras.FeatureSemantic.CATEGORICAL),
]

model = tfdf.keras.GradientBoostedTreesModel(
    task=tfdf.keras.Task.CLASSIFICATION,
    num_trees=1000,
    features=features,
    exclude_non_specified_features=True)

model.fit(train_dataset, validation_data=valid_dataset)

# Export the model to a SavedModel.
model.save("project/model")

Remarks

  • While not explicit in this example, early stopping is automatically enabled and configured.
  • The dictionary of the “f_2” features is automatically built and optimized (e.g. rare values are merged into an out-of-vocabulary item).
  • The number of classes (3 in this example) is automatically determined from the dataset.
  • The batch size (64 in this example) has no impact on the model training. Larger values are often preferable as they make reading the dataset more efficient.

TF-DF is all about ease of use, and the previous example can be further simplified and improved, as shown next.

How to train a TensorFlow Decision Forests model (recommended solution)

import tensorflow_decision_forests as tfdf
import pandas as pd

# Pandas datasets can be used easily with pd_dataframe_to_tf_dataset.
train_df = pd.read_csv("project/train.csv")

# Convert the Pandas dataframe into a TensorFlow dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="my_label")

model = tfdf.keras.GradientBoostedTreesModel(num_trees=1000)
model.fit(train_ds)

Remarks

  • We did not specify the semantics (e.g. numerical or categorical) of the features. In this case, the semantics will be automatically inferred.
  • We also didn’t list which input features to use. In this case, all the columns (except for the label) will be used. The list and semantics of the input features are visible in the training logs, or with the model inspector API (see the sketch after this list).
  • We did not specify any validation dataset. Each algorithm can extract a validation dataset from the training examples if it needs one. For example, by default, GradientBoostedTreesModel uses 10% of the training data for validation if no validation dataset is provided.
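As a quick illustration of that inspector API (method names are from the TF-DF documentation, applied to the model trained above):

# A minimal sketch of inspecting the trained model; `model` is the
# GradientBoostedTreesModel fitted above.
inspector = model.make_inspector()
print(inspector.features())              # input features and their inferred semantics
print(inspector.evaluation())            # metrics on the internal validation dataset
print(inspector.variable_importances())  # which features the model relied on most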

Now, let’s look at a couple differences between the Estimator API and TF-DF.

Differences between the Estimator API and TF-DF

Type of algorithms

TF-DF is a collection of decision forest algorithms. This includes (but is not limited to) the Gradient Boosted Trees available with the Estimator API. Notably, TF-DF also supports Random Forest (great for noisy datasets) and a CART implementation (great for model interpretation).

In addition, for each of those algorithms, TF-DF includes many variations found in the literature and validated experimentally [1, 2, 3].

Exact vs approximate splits

The TF1 GBT Estimator uses an approximate tree learning algorithm. Informally, the Estimator builds trees by only considering a random subset of examples and a random subset of the conditions at each step.

By default, TF-DF uses an exact tree training algorithm. Informally, TF-DF considers all the training examples and all the possible splits at each step. This is a more common and often better performing solution.

While sometimes faster on larger datasets (>10B examples x features), the Estimator's approximation is often less accurate (as more trees need to be grown to reach the same quality). On a small dataset (<100M examples x features), the form of approximate training implemented in the Estimator can even be slower than exact training.

TF-DF also supports various types of "approximate" tree training. The recommended approach is to use exact training, and optionally test approximate training on large datasets.

Inference

The Estimator runs model inference using the top-down tree routing algorithm. TF-DF uses an extension of the QuickScorer algorithm.

While both algorithms return the exact same results, the top-down algorithm is less efficient because of excessive branch mispredictions and cache misses. TF-DF inference is generally 10x faster on the same model.

For latency-critical applications, TF-DF offers a C++ API. It often provides ~1µs/example/core inference time. This is often a 50x-1000x speed-up over TF SavedModel inference (especially on small batches).

Multi-head models

The Estimator supports multi-head models (a model that outputs multiple predictions). TF-DF (currently) does not support multi-head models directly, however, using the Keras Functional API, multiple TF-DF models trained in parallel can be assembled into a multi-head model.

Learning more

You can learn more about TensorFlow Decision Forests by visiting the website. If you’re new to this library, the beginner example is a good place to start. Experienced TensorFlow users can visit this guide for important details about the difference between using decision forests and neural networks in TensorFlow, including how to configure your training pipeline, and tips on Dataset I/O. You can also see Migrate from Estimator to Keras APIs for more info on migrating from Estimators to Keras in general.


How LinkedIn Personalized Performance for Millions of Members using TensorFlow.js

A guest post by LinkedIn

Mark Pascual, Sr. Staff Engineer

Nitin Pasumarthy, Staff Engineer

Introduction

The Performance team at LinkedIn optimizes the latency to load web and mobile pages. Faster sites improve customer engagement and, eventually, revenue for LinkedIn. This concept is well documented by many other companies who have had similar experiences, but how do you define the optimal trade-off between page load times and engagement?

The relationship between speed and engagement is non-linear. After a point, further reducing the load time of an already fast site may not increase engagement. At LinkedIn we have used this relationship between engagement and speed to selectively customize the features on LinkedIn Lite – a lighter, faster version of LinkedIn, specifically built for mobile web browsers.

To do this, we trained a deep neural network to identify, in real time, whether a request to LinkedIn would result in a fast page load. Based on the performance quality predicted by this model, we change the resolution of all images on a given user's news feed before the resulting webpage is sent to the client. This led to billions of extra Feed Viral Actions taken (+0.23%), millions more Engaged Feed Users (+0.16%), and a significant increase in Sponsored Revenue (+0.76%).

Image Quality Comparison: The image on the left uses 4x more memory than the one on the right, which is less than ideal to send to users on slow network connections or when the device may be low on resources. Prior to using an ML model, we only showed the low resolution image, which was not great for users whose newer devices had capacity for higher quality images.

We described in great detail why many of our performance optimization experiments failed back in 2017 and how we used those learnings to build a Performance Quality Model (PQM) in our Personalizing Performance: Adapting Application in real time to member environments blog.

PQM’s bold goal is to predict various performance metrics (e.g. page load time) of any web / mobile page using both device and network characteristics of end users to empower (web) developers to build impactful application features that are otherwise tricky to implement (like the one we described above).

We are happy to announce that we are open sourcing our first performance quality model, trained on millions of RUM (real user monitoring) data samples from around the world and free to use for your own website performance optimizations! Learn more and get started here.

In the rest of this blog, we will go over how our team of full stack developers deployed this PQM in production at LinkedIn scale. We hope to show that deploying TensorFlow.js ML models today is both easy and beneficial for those working on the Node.js stack.

TensorFlow.js: Model Deployment in Node.js

At the time of our production deployment, LinkedIn’s TensorFlow model deployment machinery was still being developed. Furthermore, using TensorFlow Serving was not yet a feasible option for us. So even though we had a model ready for use, we needed to figure out a way to deploy it.

As LinkedIn is primarily a Java/JVM stack for serving external traffic, it might seem like TensorFlow Java would be ideal, but it was still experimental and didn’t have the API stability guarantees that we require.

We looked at our options and realized that we already use Node.js (behind the JVM) as part of our frontend web serving stack in order to perform server side optimizations when serving HTML pages. The architecture for this is unique in that we use the JVM to manage an external pool of Node.js processes to perform “work,” e.g., the previously mentioned server side optimizations. The “work” can really be anything that Node.js can perform. In our use case, this enables us to use TensorFlow.js in an architecture that was already proven.

We repurposed our frontend stack to use Node.js to deploy our custom model and ended up with great results. In terms of performance, our mixed stack of Java and Node.js easily met our SLAs. The 50th and 90th percentile production latencies as measured (a) from a client (within the datacenter), (b) from on host instrumentation, and (c) in terms of only Node.js performing inference using TensorFlow.js are shown in the table below.

                                  50th Percentile   90th Percentile
From client (within datacenter)   10 ms             12 ms
On host                           8 ms              10 ms
Inference in Node.js              2 ms              3 ms

The resulting architecture is shown in Figure 1 below.

The API request that requires a prediction is received by the JVM server and is routed to our Rest.li infrastructure which in turn routes the request to our performance prediction resource. To handle the request, the PaaS resource performs some feature generation based on the inputs and then makes an RPC call out to the Node.js process for the prediction.

The N Node.js processes are long-lived. They are started upon JVM startup and have already loaded the desired model using tf.node.loadSavedModel(). When a process receives a request for a prediction, it simply takes the input features, calls tf_model.predict(), and returns the result. Here is a simplified version of the Node.js code:

const tf = require('@tensorflow/tfjs-node');

async function main() {
  // Load the model when the process starts so it's always ready.
  const model = await tf.node.loadSavedModel('model_dir');

  function predict(rawInput) {
    return tf.tidy(() => {
      // Prepare the inputs as tensor arrays.
      const x = {};
      for (const feature of Object.keys(rawInput)) {
        x[feature] = tf.tensor([rawInput[feature]], [1, 1]);
      }

      const output = model.predict(x, {});
      const probs = Array.from(output.probabilities.dataSync());
      const classes = Array.from(output.all_class_ids.dataSync());
      const result = Object.fromEntries(classes.map((classId, i) => [classId, probs[i]]));
      return result; // {0: 0.8, 1: 0.15, 2: 0.05} probability of each performance quality
    });
  }

  // Register our 'predict' RPC handler (pseudo-code).
  // `process` is an abstraction of the Node.js side of the communication
  // channel with the JVM.
  process.registerHandler('predict', input => {
    const result = predict(input);
    return Promise.resolve(result);
  });
}

main();


In a homogeneous Node.js stack, Express could replace Rest.li's role and the feature generation pieces would need to be ported to Node.js, but everything else remains the same. As you can see, that architecture is cleaner and requires fewer mental hoops than managing both Java and Node.js in the same stack.

An Unexpected Win

In the architecture we described above, the external processes do not have to be Node.js. The library that we use to manage the external process is pretty straightforward to implement in most technologies. In fact, we could have chosen Python for the external processes as it's popular for this ML use case. So what are the reasons we stuck with Node.js? Well, there are two: (1) we already had a Node.js implementation for the external process infrastructure and would have had to develop a new one for Python, and (2) it turns out that Node.js is also slightly faster at making the predictions, due to the pre/post processing benefitting from JavaScript's JIT compiler.

In order to prove this to ourselves, we took samples (~100k) from our real world prediction inputs and ran them against both Node.js and Python. The test bed was not exactly our production stack (we didn’t include the JVM side), but it was close enough for our purposes. The results are shown below:

Stack     50th percentile   Delta (from Python)
Python    1.872 ms          0%
Node.js   1.713 ms          -8.47%

The results show that Node.js is almost 10% faster at performing inference for our specific model. Of course, performance may vary based on model architectures and the amount of pre and post processing being performed in Node. These results were from our model running on a typical production machine. Results may also vary due to model complexity, machine differences, and so on.

Check out the README in our open source repo to find out how we tested the model in Python and Node.js.

Looking to the future

Our current unique architecture does have some areas for improvement. Probably the biggest opportunity is to address the uniqueness of this multi-stack architecture itself. The mix of Java and Node.js technologies adds cognitive overhead and complexity during design, development, debugging, operations, and maintenance; however, as previously stated, you could move the whole stack to Node.js to simplify matters, so this is a solvable problem.

Another potential area for improvement comes from the currently single-threaded architecture on the Node.js side. Because of this, only a single prediction occurs at a time, so latency sometimes includes some amount of queueing time. This could be worked around by using Node worker threads for parallel execution, which may be considered in future versions of this implementation. In our particular case, however, prediction is very fast even as it stands, so we do not feel the need to explore this right now.

Summary

The availability of TensorFlow.js gave us an easy option to deploy our model to serve production use cases when other options were not quite suitable or available to us. While our unique requirements resulted in using non-standard architecture (the mixture of the JVM and Node.js), TensorFlow.js can be used to an even greater effect in a homogeneous Node.js serving stack resulting in a very clean and performant architecture. With our open source performance quality model, a full stack JS engineer can personalize performance and improve their user engagement and we look forward to seeing how others use our open sourced model to do just that on their own websites.

Acknowledgements

This success would not be possible without the tremendous work by Prasanna Vijayanathan and Niranjan Sharma. A huge thank you to Ritesh Maheshwari and Rahul Kumar for their support and encouragement. Special thanks to Ping Yu (Google) and Jason Mayes (Google) for their continued support and feedback.


Introducing mPOD DxTrack: A low-cost healthcare device using TensorFlow Lite Micro

A guest post by Jeffrey Ly, CEO & Joanna Ashby, CMO of mPOD, Inc.

mPOD is a NIH-funded pre-seed startup headquartered out of Johnson & Johnson’s Innovation (JLABS) in New York City. In this article, we’d like to share with you a hardware device we have developed independently at mPOD leveraging TensorFlow Lite Micro (TFLM) as a core technology, called DxTrack.

mPOD DxTrack leverages TFLM and low cost hardware to enable accurate, rapid and objective interpretation of currently available lateral flow assays (LFAs) in less than 10 seconds. LFAs serve as diagnostic tools because they are low-cost and simple to use without specialized skills or equipment. Most recently popularized by COVID-19 rapid antigen tests, LFAs are also used extensively for pregnancy testing, disease tracking, STD testing, food intolerances, and therapeutic drug monitoring, along with an extensive array of biomarkers, totaling billions of tests sold each year. The mPOD DxTrack can be used with any type of visually read lateral flow assay, demonstrating a healthcare use case for TFLM that can directly impact our everyday lives.

The LFA begins with a sample (nasal swab, saliva, urine, blood, etc.) loaded at (1) in the figure below. Once the sample has flowed to the green conjugate zone (2), it is labeled with a signaling moiety. Through capillary action, the sample continues flowing until it is immobilized at (3). With these LFA tests, two lines indicate a positive result and one line indicates a negative result.

Figure 1. Side (A) & Top (B) view of a lateral flow assay (LFA) sample where at (1) the sample (nasal swab, saliva, urine, blood, etc) is loaded before flowing to the green zone (2), where the target is labeled with a signaling moiety. Through capillary action, the sample will continue flowing until it is immobilized at (3) to form the test line. Excess material is absorbed at (4).
Figure 2. These are the 3 possible classes results for a lateral flow assay (LFA) test.
Figure 3. This is a diagram of the NOWDiagnostics ADEXUSDx lateral flow assay (LFA), designed to collect and run a saliva sample in point-of-care (POC) and over-the-counter (OTC) settings.

When used correctly, these tests are very effective; however, self-testing presents interpretation challenges for the lay user. Significant variability is present between devices, making it difficult to tell whether the test line you see is negative... or a faint positive.

Figure 4. A visualization of how the TinyML model on the mPOD DxTrack interprets and classifies different lateral flow assay (LFA) results.

To address this challenge, we developed mPOD DxTrack, an over-the-counter (OTC) LFA reader that improves the utility of lateral flow assays by enabling rapid and objective readings with a simple, under $5 (Cost-of-Goods) globally-deployable device. The mPOD DxTrack aims to read lateral flow assay tests using ML to accomplish two goals: 1) enable rapid and objective readings of LFAs and 2) streamline digital reporting. Critically, TinyML allows for the software on the mPOD DxTrack to be deployed on low-cost (less-than $5) hardware that can be widely distributed – which is difficult with existing LFA readers which rely on high-cost/high complexity hardware that cost hundreds to thousands of dollars per unit. Ultimately, we believe that TinyML will enable the mPOD DxTrack to catch missed positive test results by removing human bias and increasing confidence in lateral flow device testing, reducing user error, and increasing overall result accuracy.

Figure 5. Assembly view of the mPOD DxTrack with lateral flow assay (LFA) cassette.

Technical Dive

Key Considerations

  • Achieving 99% overall accuracy (99% sensitivity, 99% specificity) for model performance when interpreting live-run LFA strips.
  • Ensuring the model can maintain that level of performance while fitting the hardware constraints.

Model size constraints for TinyML

Deployment of the DxTrack TinyML model on the Pico4ML Dev Kit is constrained by two hardware resources: flash memory and SRAM. The Pico4ML Dev Kit has 2 MB of flash memory to host the .uf2 file and 264 KB of SRAM to accommodate the intermediate arrays (among other things) of the model. Ensuring the model size stays within these bounds is critical because, even if the code successfully compiles, runs on the host machine, and flashes onto the Pico4ML Dev Kit, an oversized model will hang during set-up and never execute the main loop.

Rather than guessing and checking the size of the intermediate arrays (an approach we initially took with little reproducible success), we developed a workflow that quantifies the model's arena size using the interpreter's arena_used_bytes() function. See below, where this function is called during setup:

TfLiteStatus setup_status = ScreenInit(error_reporter);
if (setup_status != kTfLiteOk) {
  while (1) { TF_LITE_REPORT_ERROR(error_reporter, "Set up failed\n"); };
}
arena_size = interpreter->arena_used_bytes();
printf("Arena_Size Used: %zu \n", arena_size);

When printed out, this is what the value from the interpreter function looks like during Pico4ML Dev Kit boot-up:

DEV_Module_Init OK                                                              
Arena_Size Used: 93500
sd_spi_go_low_frequency: Actual frequency: 122070
V2-Version Card
R3/R7: 0x1aa
R3/R7: 0xff8000
R3/R7: 0xc0ff8000
Card Initialized: High Capacity Card
SD card initialized
SDHC/SDXC Card: hc_c_size: 15237
Sectors: 15603712
Capacity: 7619 MB
sd_spi_go_high_frequency: Actual frequency: 12500000

With this value available to us, we are then able to set the appropriate tensor arena size. As you can see from above, the model uses 93500 bytes of SRAM. By setting kTensorArenaSize to just above that amount (99 × 1024 = 101376 bytes), we allocate enough memory to host the model without going over the hardware limit (exceeding it also causes the Pico4ML Dev Kit to freeze).

// An area of memory to use for input, output, and intermediate arrays.
constexpr int kTensorArenaSize = 99 * 1024;
static uint8_t tensor_arena[kTensorArenaSize];

Transforming from Unquantized to Quantized Models

Now that we have a reproducible methodology to quantify the arena size and deploy the model onto the Pico4ML Dev Kit, our next challenge is ensuring that the model achieves the accuracy we require while still fitting within the size constraints of the hardware. For reference, the mPOD DxTrack platform is designed to interpret a 96x96 image. In the original model design, we were able to achieve >99.999% accuracy with our model, but the intermediate layer is 96x96x32 at fp32, which requires over 1 MB of memory – it would never fit in the Pico4ML Dev Kit's 264 KB of SRAM. To meet the size requirement, we needed to take the model from unquantized to quantized; our best option was full int8 quantization. In essence, instead of treating the tensor values as floating point (float32), we map those values to integers (int8). This decreased the model size 4-fold, allowing it to fit onto the Pico4ML Dev Kit. Unfortunately, the rounding error from fp32 to int8 compounded, resulting in dramatically reduced model performance.


To combat this drop in model performance, we examined the effect of two different quantization strategies to improve performance: Post-training quantization (PTQ) and Quantization-aware training (QAT).

Below, we compare 3 different models to understand which quantization strategy is best. For reference:

  • Model 1: 2-layer convolutional network
  • Model 2: 3-layer convolutional network
  • Model 3: 4-layer convolutional network

As we can see, quantization-aware training (QAT) uniformly beats the post-training quantization (PTQ) method, and it became part of our workflow moving forward.
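For readers who want to try this themselves, here is a hedged sketch of quantization-aware training followed by full int8 conversion using the TensorFlow Model Optimization Toolkit; the tiny CNN, the placeholder data, and the file names are illustrative and not mPOD's actual model.

import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder training data: 96x96 RGB crops with 3 classes (negative / positive / invalid).
train_images = np.random.rand(256, 96, 96, 3).astype(np.float32)
train_labels = np.random.randint(0, 3, size=(256,))

base_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(96, 96, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Wrap the model so fake quantization is simulated during training (QAT).
qat_model = tfmot.quantization.keras.quantize_model(base_model)
qat_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
qat_model.fit(train_images, train_labels, epochs=3, batch_size=32)

# Convert to a fully int8 TFLite model suitable for TFLM on a microcontroller.
def representative_dataset():
    for image in train_images[:100]:
        yield [image[np.newaxis, ...]]

converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("dxtrack_model_int8.tflite", "wb") as f:
    f.write(converter.convert())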

What performance can we achieve now?

Tested across over 800 real-world test runs, the mPOD DxTrack preliminarily achieves an overall accuracy of 98.7%. This version of the model is currently being evaluated by the network of manufacturing partners we work closely with. We are also assembling a unique dataset of images as part of a patient-focused data pipeline, learning from each manufacturing partnership and building bespoke models.

Our preliminary work has also helped us correlate model performance with dataset size, so we can reach accuracy high enough for our healthcare application. Per the figure attached, the model needs to be trained on a quality dataset of at least 15,000 images. Our commercial-ready target is likely to require datasets of more than 100,000 images.


To learn more about mPOD Inc, please visit our website at www.mpod.io. If you’re interested in learning more about TinyML, we recommend checking out this book and this course.


Accelerating TensorFlow Lite Micro on Cadence Audio Digital Signal Processors

Posted by Raj Pawate (Cadence) and Advait Jain (Google)

Digital Signal Processors (DSPs) are a key part of any battery-powered device, offering a way to process audio data with very low power consumption. These chips run signal processing algorithms such as audio codecs, noise canceling and beam forming.

Increasingly these DSPs are also being used to run neural networks such as wake-word detection, speech recognition, and noise suppression. A key part of enabling such applications is the ability to execute these neural networks as efficiently as possible.

However, productization paths for machine learning on DSPs can often be ad-hoc. In contrast, speech, audio, and video codecs have worldwide standards bodies such as ITU and 3GPP creating algorithms for compression and decompression addressing several aspects of quality measurement, fixed point arithmetic considerations and interoperability.

TensorFlow Lite Micro (TFLM) is a generic open-sourced inference framework that runs machine learning models on embedded targets, including DSPs. Similarly, Cadence has invested heavily in PPA-optimized hardware-software platforms such as Cadence Tensilica HiFi DSP family for audio and Cadence Tensilica Vision DSP family for vision.

Google and Cadence – A Multi-Year Partnership for Enabling AI at the Edge

This was the genesis of the collaboration between the TFLM team and the Audio DSP teams at Cadence, starting in 2019. The TFLM team is focusing on leveraging the broad TensorFlow framework and developing a smooth path from training to embedded and DSP deployment via an interpreter and reference kernels. Cadence is developing a highly optimized software library, called NeuralNet library (NNLIB), that leverages the SIMD and VLIW capabilities of their low-power HiFi DSPs. This collaboration started with three optimized kernels for one Xtensa DSP, and now encompasses over 50 kernels across a variety of platforms such as HiFi 5, HiFi 4, HiFi 3z, Fusion F1 as well as Vision DSPs such as P6, and includes the ability to offload to an accelerator, if available.

Additionally, we have collaborated to add continuous integration for all the optimized code targeted for the Cadence DSPs. This includes infrastructure that checks that every pull request to the TFLM repository passes all the unit tests for the Tensilica toolchain across various HiFi and Vision P6 cores. As such, we ensure that the combined TFLM and NNLIB open source software is both tightly integrated and has good automated test coverage.

Performance Improvements

Most recently, we have collaborated on adding optimizations for models that are quantized with int16 activations. Specifically in the domain of audio, int16 activations can be critical for the quality of quantized generative models. We expect that these optimized kernels will enable a new class of ML-powered audio signal processing. The table below shows a few operators that are required for implementing a noise suppression neural net. We show a 267x improvement in cycle count for a variant of SEANet, an example noise suppression neural net.

The following table shows the improvement with the optimized kernels relative to the reference implementations as measured with the Xtensa instruction set simulation tool.

Operator           Improvement
Transpose Conv     458x
Conv2D             287x
Sub                39x
Add                24x
Leaky ReLU         18x
Strided_Slice      10x
Pad                6x
Overall Network    267x

How to use these optimizations

All of the code is available in the TFLite Micro GitHub repository.

To use HiFi 3z targeted TFLM optimizations, the following conditions need to be met:

  • the TensorFlow Lite (TFLite) flatbuffer model is quantized with int16 activations and int8 weights (see the conversion sketch after this list)
  • it uses one or more of the operators listed in the table above
  • TFLM is compiled with OPTIMIZED_KERNEL_DIR=xtensa
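For the first condition, a minimal sketch of producing an int16-activation / int8-weight flatbuffer with the TFLite converter might look like the following (the SavedModel path and calibration generator are placeholders):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("noise_suppression_model")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset  # placeholder calibration generator
# Request the 16x8 quantization scheme: int16 activations with int8 weights.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
tflite_model = converter.convert()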

For example, you can run Conv2D kernel integration tests with reference C++ code with:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa TARGET_ARCH=hifi4 XTENSA_CORE= test_integration_tests_seanet_conv

And compare that to the optimized kernels by adding OPTIMIZED_KERNEL_DIR=xtensa:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa TARGET_ARCH=hifi4 OPTIMIZED_KERNEL_DIR=xtensa XTENSA_CORE= test_integration_tests_seanet_conv

Looking Ahead

While the work thus far has been primarily focused on convolutional neural networks, Google and Cadence are also working together to develop an optimized LSTM operator and have released a first example of an LSTM-based key-word recognizer. We expect to expand on this and continue to bring optimized and production-ready implementations of the latest developments in AI/ML to Tensilica Xtensa DSPs.

Acknowledgements

We would like to acknowledge a number of our colleagues who have contributed to making this collaboration successful.

Cadence: Manjunath CP, Bhanu Prakash Bandaru Venkata, Anirban Mandal

Google: Advait Jain, Deiqang Chen, Lawrence Chan, Marco Tagliasacchi, Nat Jeffries, Nick Kreeger, Pete Warden, Rocky Rhodes, Ting Yan, Yunpeng Li, Victor Ungureanu


Highlights from TensorFlow’s 2021 exploreCSR awards

Posted by Josh Gordon, Jocelyn Becker, and Sloan Davis for the TensorFlow team

Increasing the number of students pursuing computer science research is a priority at Google, especially for students from historically marginalized groups in the field. Since 2018, Google’s exploreCSR awards have aided higher education efforts that support students interested in pursuing graduate studies and research careers in computing.

The TensorFlow team is proud to provide additional funding to support this important program. To date, we have awarded more than 20 professors with funding to support their education and outreach work in machine learning.

We’d like to highlight examples of the many (and often, unique) outreach programs the 2021 award recipients have created so far. These range from research experiences with robotics, aquatic vehicles, federated learning, and offline digital libraries to mentored small group workshops on data science and programming skills. They’re sorted alphabetically by university below.

If you’re interested in creating your own programs like these with support from Google, keep an eye on the exploreCSR website for the next round of applications opening in June 2022.

Laura Hosman and Courtney Finkbeiner, Arizona State University

The SolarSPELL initiative at Arizona State University will host a workshop series thanks to support from exploreCSR to encourage students underrepresented in computer science research in their academic journey. The SolarSPELL initiative produces an offline, solar-powered digital library designed to bring educational content to resource-constrained locations that may lack electricity, internet connectivity, and/or traditional libraries.

The exploreCSR workshop series, titled “SolarSPELL exploreCSR: Computing for Good”, involves 6 weeks of sessions using SolarSPELL as a case study for how students can apply machine learning to tackle real-world problems and develop solutions for social good. Students will meet SolarSPELL’s co-director and learn about the history of the SolarSPELL initiative; learn about graduate programs available at ASU; and hear from guest panelists from industry.

A solar-powered, offline digital library.

Aside from the information sessions, students will also gain hands-on experience working in teams and problem solving for real-world topics. The SolarSPELL team will present the students with three different challenges for student teams to develop a proposed solution using machine learning. Students will then be eligible to apply for paid summer fellowship positions with SolarSPELL to develop and integrate one of the proposed machine learning models into SolarSPELL’s technology.

SolarSPELL is a student-driven initiative, so the solutions that the exploreCSR students develop will be implemented in our digital libraries to improve hundreds of library users’ experiences around the world. With libraries in 10 countries in the Pacific Islands and East Africa, and plans to expand to Latin America and the Middle East, these students will have a far-reaching impact.

Daehan Kwak, Kean University

My colleague Xudong Zhang and I created an undergraduate research study group centered on computer vision, with projects underway on student attention detection, mask and social distancing detection, and pill recognition for healthcare scenarios. As one example, a student created a pill detection application using data from the National Library of Medicine pillbox. This can be used, for example, by high-volume distribution pharmacies to be more efficient and accurate, or by retirement homes to verify the pills a resident is taking. We’re pleased to share that the pill recognition project won third place in the Kean Business Plan Competition and was accepted to be presented at Posters on the Hill 2022.

Matthew Roberts, Macquarie University

The School of Computing at Macquarie University is working to lower the barrier to entry for students who are new to experimenting with ML by employing real-world examples. This month, around fifty students will spend the week testing their ideas for solving autonomous aquatic vehicle challenges (for example, navigation) under guidance from Macquarie University researchers. They will develop their ideas in a sophisticated simulation environment, and the best solutions will be ready for deployment to real hardware testing in the water later in the year.

A MacSim simulation of the Sydney Regatta Center (created with VRX); a placeholder machine learning model makes random predictions, ready for the improvements the students come up with.

Accurately simulated sensors like cameras and LIDAR can be subjected to various models, allowing people to experiment with even more sophisticated ideas to solve complex problems. After our first year in exploreCSR, the adjustments we made to our simulator and the workshop will generate new ideas and light a spark for machine learning research early in students’ careers.

Pooyan Fazli, San Francisco State University

60+ students from 10 universities and colleges attended our 2-day virtual exploreCSR workshop. Participants were from San Francisco State University, CSU East Bay, CSU San Marcos, CSU Stanislaus, Foothill College, Northwestern University, San Diego State University, Sonoma State University, UC San Diego, and the University of San Francisco.

We had two invited speakers and two panels on mentorship and career pathways with 10 panelists from Google Research, Stanford, Emory University, Virginia Tech, and the University of Copenhagen.

As part of this workshop, we organized hands-on activities to introduce students to different aspects of AI and its applications for social good, such as with climate change. We also had mini-presentations and discussions on AI fairness, accountability, transparency and ethics in different areas, such as robotics, educational data mining, and impacts on underserved communities.

Following the workshop, selected students will participate in a research project under the guidance of graduate students and faculty during the spring semester. Through the research projects, we have a two-fold aim: to help students develop a sense of belonging in the AI and machine learning research community, and to illuminate a pathway for them to pursue graduate studies in AI/ML that explores the potential of developing responsible AI toward social good.

The research projects will begin with eight weekly meetups and hands-on training on Python programming with open-source publicly available materials. Then, students will engage in applied research projects that focus on AI applications for social good, such as health, environment, safety, education, climate change, and accessibility.

Farzana Rahman, Syracuse University

Earlier this year, the Electrical Engineering and Computer Science department of Syracuse University hosted RESORC (REsearch Exposure in Socially Relevant Computing), an exploreCSR program, for the second time. The program provided research exposure to 78 undergraduate students from SU and nearby institutions, targeting populations historically underrepresented in computing. The goal of these two workshops was to give students an opportunity to learn machine learning using open-source tools and to gain experience with data science workflows, including collecting and labeling data, training a model, and carefully evaluating it. The ML workshops were the most highly rated sessions of the RESORC program.

Erin Hestir and Leigh Bernacchi, University of California, Merced

Since 2019, University of California, Merced has partnered with Merced College and California State University Stanislaus on the Google exploreCSR program ¡Valle! Get Your Start in Tech!, serving 32 Central Valley of California undergraduates in STEM annually to build a sense of belonging, practice professional networking, and develop technical skills. Participants convene on Zoom and in-person this semester. Valle students typically come from historically underrepresented groups, and the program is designed to support their pursuits of computational research, graduate school and computer science related careers. Many have gone on to achieve just that!

This year we added additional training thanks to Google Research to support machine learning applications for social good. This program is open to all Valle participants as well as partner schools, inclusive of graduate and undergraduate students in all STEM fields, and will be taught by creative graduate students in computer science from UC Merced. Each workshop will be taught by a near-peer mentor—a practice that supports mutual success in academics—and the mentor will coach teams to develop ML projects for social good.

The goal of the program is to overcome some of the trepidation scientists and students may have about computational science and machine learning through teamwork, fun and a higher purpose. Students will be able to develop their skills and interest, focusing on ML applications to climate, sustainability, agriculture and food, and diversity in tech and aviation.

Basak Guler, University of California, Riverside

At the University of California, Riverside, we created an undergraduate research study group focused on federated and distributed machine learning. Federated learning has become widely popular in recent years due to its communication efficiency and on-device learning architecture. Our study group meets on a weekly basis, and students learn about the principles of federated and distributed learning, state-of-the-art federated learning algorithms, recent applications from financial services to healthcare, as well as recent challenges and advances in privacy, security, and fairness. Student projects provide opportunities for undergraduate students to be involved in machine learning research, and learn from the experiences of both faculty and graduate students. This program can facilitate their transition from undergraduate to graduate degrees, and prepare them for positions of leadership in industry, government, public service, and academia.

Gonzalo A. Bello, University of Illinois at Chicago

The computer science department is hosting a series of exploreCSR workshops, including exploreCSR: Exploring Data Science Research, to introduce students to data science and machine learning research. These workshops aim to encourage students from historically underrepresented groups to pursue graduate studies and careers in research in the field of computer science. UIC students from all majors were encouraged to apply, including those who haven’t taken any computer science courses. Each semester, 60 students were selected out of more than 120 who applied, and 10 teaching assistants and a professor mentored students. In addition to lectures, students work on hands-on projects together where they explore, visualize, and build models using real-world data from the city of Chicago.

Melanie Moses and Humayra Tasnim, The University of New Mexico

The UNM Google exploreCSR activity for 2021-2022 is a semester-long course called Swarmathon: The Next Generation. The students will learn technical skills like developing machine learning models for object recognition in robots, and soft skills including team building, research skills, and discussions with faculty and external speakers. The UNM exploreCSR program builds on 5 years of training students in a NASA-sponsored robotics competition called the Swarmathon (2014-2019). In 2019/2020 we developed a series of exploreCSR Swarmathon: TNG workshops which included a faculty panel, an industry mentor, an open-source tutorial, and a day-long workshop to enable “Swarmie” robots to classify and automatically retrieve objects.

A glimpse of our robots in action.

This year, in our exploreCSR Swarmathon: TNG course, students will have additional time to actively engage in developing and tuning their own machine learning models to test in the Swarmie robots. They will develop object detection models using convolutional neural networks (CNNs). They will be provided with a dataset of images of objects (shown below) taken from the robot camera and a simple model. The students will further develop the model and training data and then test their models on actual robots in real-time to see how much they can improve object recognition models to classify and retrieve the proper specified objects.

Different shaped cubes for detection.

Students will learn first-hand the reality gap between simulations and real-world experiments. This will encourage them to develop their own mini-research projects to enhance their model performance to resolve that gap. The exploreCSR-funded Swarmathon: TNG course will provide students with the opportunity to actively engage in hands-on robotics research. We hope the experience of defining a research objective, conducting a set of experiments, testing a model, and seeing results play out in our robotics arena will motivate students to attend graduate school and consider research careers.

Swarmie with a cube in its gripper.

Daniel Mejía, The University of Texas at El Paso

We’re building a series of workshops open to undergraduate students of all majors to introduce them to applied machine learning and research topics, starting with foundational concepts in math and a new way of approaching problems through the eyes of a data scientist. These workshops are open to all students, including those who do not have any prior experience. We hope to encourage students to consider pursuing graduate studies, especially those who may not have previously considered it. I believe that the earlier students are exposed, the more likely they are to pursue a graduate degree.

Henry Griffith, The University of Texas at San Antonio

At the University of Texas at San Antonio, we’re creating a portfolio of programs to enhance the persistence of first year Electrical and Computer Engineering students into research computing pathways. By integrating our programming with our Introduction to Electrical and Computer Engineering course, which has a total annual enrollment of approximately 200 students, we have the opportunity to achieve tremendous scale with our efforts. Our programs include an undergraduate research experience, a near-peer mentoring program, and group study projects – all designed to develop students’ professional and technical skills and to accelerate their progression into research opportunities.

John Akers, University of Washington

Our exploreCSR workshop, CSNext, is scheduled to begin this April. It’s a 4-week online program of workshops, seminars, and project work designed to encourage undergraduate students – particularly those from historically underrepresented groups – to consider and successfully apply to graduate schools in computer science. Participants will hear presentations from several University of Washington labs, such as Computer Vision/Graphics (GRAIL), Security and Privacy, and Human-Computer Interaction. There will be presentations on deep learning and on current graduate-level research, a panel discussion from current UW CSE grad students from varying backgrounds, opportunities to meet current graduate students from several UW CSE labs, and participants will be led through small-group exercises learning about active research from graduate student mentors. Participants will also learn about graduate school application processes and resources, led by staff from UW CSE Graduate Student Services.

Learning more

If you’re interested in creating your own programs like these with support from Google, keep an eye on the exploreCSR website for the next round of applications opening in June 2022.


Boost your model’s accuracy using self-supervised learning with TensorFlow Similarity

Posted by Elie Bursztein and Owen Vallis, Google

TensorFlow Similarity now supports key self-supervised learning algorithms to help you boost your model’s accuracy when you don’t have a lot of labeled data.

Basic Self-Supervised Training.

Often when training a new machine learning classifier, we have far more unlabeled data, such as photos, than labeled examples. Self-supervised learning techniques aim to leverage that unlabeled data: a pre-training phase on the unlabeled examples learns useful data representations that boost classifier accuracy. The ability to tap into abundant unlabeled data can significantly improve model accuracy in some cases.

Perhaps the best-known examples of successful self-supervised training are transformer models, such as BERT, which learn meaningful language representations by pre-training on very large quantities of text, e.g., Wikipedia or the web.

Self-supervised learning can be applied to any type of data and at various data scales. For example, if you have only a few hundred labeled images, self-supervised learning can boost your model accuracy by pre-training on a medium-sized dataset such as ImageNet. SimCLR, for instance, trains its representations on the ImageNet ILSVRC-2012 dataset and then evaluates transfer learning performance on 12 other image datasets such as CIFAR, Oxford-IIIT Pets, and Food-101. Self-supervised learning also works at larger scales, where pre-training on billions of examples further improves accuracy, as seen with text and vision transformers.

High level overview of how self-supervised learning works for images.

At its core, self-supervised learning works by contrasting two augmented “views” of the same example. The model objective is to maximize the similarity between these views, learning representations that are useful for downstream tasks such as training a supervised classifier. In practice, after pre-training on a large corpus of unlabeled images, an image classifier is trained by adding a single dense layer with a softmax activation on top of the frozen pre-trained representation and training as usual using a small number of labeled examples.
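As a minimal Keras sketch of that last step (the backbone path, class count, and labeled dataset are assumed placeholders, not code from the notebook):

import tensorflow as tf

# `backbone` is assumed to be the pre-trained (self-supervised) representation model.
backbone = tf.keras.models.load_model("pretrained_backbone")  # hypothetical path
backbone.trainable = False  # freeze the learned representation

classifier = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 classes, e.g. CIFAR10
])
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
# Fine-tune only the dense head on the small labeled set.
classifier.fit(small_labeled_ds, epochs=10)  # placeholder dataset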

Examples of pairs of augmented views on CIFAR10 from the hello world notebook.

TensorFlow Similarity currently provides three key approaches for learning self-supervised representations: SimCLR, SimSiam, and Barlow Twins, all of which work out of the box. TensorFlow Similarity also provides all the necessary components to implement additional forms of unsupervised learning, including callbacks, metrics, and data samplers.

You can start exploring self-supervised learning with the hello world notebook, which demonstrates how to double the accuracy on CIFAR10.


TFRT: A Progress Update

Posted by Mingsheng Hong, TFRT Tech Lead/Manager & Eric Johnson, TFRT Product Manager

Roughly two years ago, we announced an ambitious new Machine Learning (ML) runtime effort called TFRT (short for TensorFlow Runtime). We simultaneously provided a deep dive of the initial technical design and open-sourced its codebase.

Driven by trends in the ML ecosystem – larger and larger models, ML being deployed to more diverse execution environments, and the need to keep up with continued research and modeling innovations – TFRT was started with the following set of goals in mind:

  • Deliver faster and cheaper execution for ML models
  • Enable more flexible deployment
  • Provide more modular and extensible infrastructure to facilitate innovations in ML infra and modeling

In this post, we share our progress to date, the experiences and lessons we’ve learned over the past two years of development, as well as what you can expect going forward.

Progress to Date

The last two years of development have largely been focused on implementing and validating our ambitious ideas by enabling Google’s most important internal workloads for users such as Ads and Search. To date, we have deployed TFRT broadly inside Google on a variety of training and inference workloads, and obtained great results.

Technical Lessons

How have we been able to achieve the above? Here are some interesting technical lessons that we learned, beyond what was in the original design:

First, async support is important for some of the key workloads (e.g. overlapping compute and I/O, and driving heterogeneous devices), while fast sync execution is critical for many other workloads, including small, “embedded” ML models.

We spent a lot of effort designing and refining AsyncValue, a key low-level abstraction in TFRT that allows the host runtime to asynchronously drive devices as well as invoke kernels. This led to improved device utilization due to the ability to overlap more computation and communication across hosts and devices. For example, we were able to successfully run bulk inference of an 80B-parameter model on one TPU chip with good performance by splitting the model into multiple stages and using TFRT to overlap the variable transfer of the next stage with the TPU computation of the current stage.
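To illustrate just the overlap pattern (this is an illustrative Python sketch, not TFRT code; stages, transfer, and compute are hypothetical callables):

from concurrent.futures import ThreadPoolExecutor

def run_pipelined(stages, transfer, compute):
    # Overlap the variable transfer for stage i+1 with the computation of stage i.
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(transfer, stages[0])
        output = None
        for i, stage in enumerate(stages):
            weights = pending.result()  # wait for this stage's variables
            if i + 1 < len(stages):
                pending = io.submit(transfer, stages[i + 1])  # prefetch the next stage
            output = compute(stage, weights, output)  # runs while the prefetch is in flight
        return output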

On the other hand, small CPU models that are embedded in application servers, invoked within the application process instead of via RPC/REST calls, remain critical for some of Google’s business workloads from users like Ads. For these models, the async-first internal design of TFRT initially caused a performance and resource regression. We worked with the Ads team to successfully address it, by extending the TFRT design with a synchronous interpreter, as well as an experimental memory planning optimization, to avoid heap allocation during kernel execution. We are working on productizing this extension.

The diagram below showcases the impact of the resulting TFRT design on a benchmark, as compared to “Current TF”, which ran the old runtime before TFRT’s deployment. The benchmark focuses on executing a tiny CPU model in which a large number of small matmuls execute sequentially. Notably, the optimized execution in TFRT (265 ns) approaches the optimal baseline we set up (204 ns) via hand-written C++ code without any ML runtime overhead.

Second, while faster runtime execution is critical, optimizing the input program to reduce execution complexity is important as well.

Note that while compiler-based graph optimization should be performed whenever possible when the TF SavedModel is saved to disk, there are also important inference-time compiler optimizations that can only be performed with the knowledge that we are in an inference context (e.g. when training variables remain constant).

As we were onboarding ML models onto TFRT, we had the chance to examine some of the models in depth, and identified new ways of rewriting and simplifying the program, before its execution. The simplified program, along with a faster execution of each kernel in the graph program, led to a nice compounding effect in the reduction of the execution latency and resource cost.

For example, in the left-hand-side graph program below, we were able to hoist the scalar normalization computation (e.g. dividing a float value by the max value of its domain), which is identical across all 18 input scalars, above the “concat” op, thereby enabling vectorized execution of the normalization over a concatenated 1D float tensor.
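A toy TensorFlow sketch of the rewrite (the domain maximum and the use of 18 scalar inputs here are illustrative, not taken from the production model):

import numpy as np
import tensorflow as tf

DOMAIN_MAX = 255.0  # illustrative maximum of the scalar domain

# Original program: normalize each of the 18 input scalars separately, then concatenate.
def per_scalar(scalars):
    return tf.concat([s / DOMAIN_MAX for s in scalars], axis=0)

# Rewritten program: concatenate first, then run one vectorized normalization
# over the resulting 1D float tensor.
def vectorized(scalars):
    return tf.concat(scalars, axis=0) / DOMAIN_MAX

scalars = [tf.constant([float(i)]) for i in range(18)]
assert np.allclose(per_scalar(scalars).numpy(), vectorized(scalars).numpy())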

While it is possible to perform this optimization at model training time as well, the compiler+runtime used to produce the trained model did not include this optimization.

We also find it critical to hoist computation from model execution time to load time whenever possible (e.g. const folding).

Third, cost-based execution is not just for SQL queries.

We developed a simple compile-time cost model (analogous to a SQL query optimizer’s cost model) for TF op kernels and applied cost-based optimization to ML model execution (see stream analysis), achieving better load balancing of kernel execution across a set of threadpool threads. In contrast, TF1 has a runtime-based cost model, in which each operation’s runtime cost is profiled and used to guide that operation’s scheduling. In TFRT, we moved the cost analysis to compile time, thus removing that runtime cost. Also, our compiler approach allows the entire computational graph to be analyzed, resulting in scheduling decisions that are optimal at a more global scope.
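As a toy illustration of compile-time, cost-based load balancing (a greedy heuristic of our own for exposition, not TFRT’s actual stream analysis):

import heapq

def assign_kernels(kernel_costs, num_threads):
    # Greedily place each kernel on the currently least-loaded thread,
    # using compile-time cost estimates rather than runtime profiling.
    loads = [(0.0, t) for t in range(num_threads)]
    heapq.heapify(loads)
    assignment = {}
    # Consider the most expensive kernels first (longest-processing-time heuristic).
    for name, cost in sorted(kernel_costs.items(), key=lambda kv: -kv[1]):
        load, thread = heapq.heappop(loads)
        assignment[name] = thread
        heapq.heappush(loads, (load + cost, thread))
    return assignment

print(assign_kernels({"matmul_0": 8.0, "matmul_1": 7.0, "bias_add": 1.0, "relu": 0.5}, num_threads=2))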

See this tech talk for more similarities between data and ML infra.

Looking Ahead

While we’ve certainly made strong progress, especially with respect to our first goal – faster and cheaper execution – we admittedly still have work to do on delivering a more modular design and on enabling more flexible deployments via hardware integration.

In terms of modularity, with the initial integration successes such as JAX’s adoption of TFRT device runtimes (e.g. CPU), we will continue to explore how TFRT could support workloads beyond just TensorFlow. We expect some of the TFRT components will also benefit the PyTorch/XLA workloads going forward.

Moreover, we have successfully integrated CPU and TPU (with upcoming integration into Cloud TPU), the two most important device types at Google for ML computation, and NVIDIA GPU integration is also in progress.

With respect to training workloads, TFRT is being used as a building block for Google’s large-scale distributed training frameworks, which are currently in active development.

As we look to the future, our organization has been exploring TFRT’s integration with Pixel’s hardware SoC devices such as Google Tensor. In addition, due to TFRT’s proven success with Google’s internal workloads, it is also being integrated into new venues such as GCP’s Vertex AI and Waymo.

Special Thanks

The TFRT team has really enjoyed working on this new, ambitious infrastructure project. It has often felt like bootstrapping a new startup. With that in mind, we would like to give a huge shout out to everyone who has advised, contributed to and supported TFRT through this incredible 2-year journey:

(alphabetically) Adi Agrawal, Andrew Bernard, Andrew Leaver, Andy Selle, Ayush Dubey, Bangda Zhou, Bramandia Ramadhana, Catherine Payne, Ce Zheng, Chiachen Chou, Chao Xie, Christina Sorokin, Chuanhao Zhuge, Dan Hurt, Dong Lin, Eugene Zhulenev, Ewa Matejska, Hadi Hashemi, Haoliang Zhang, HanBin Yoon, Haoyu Zhang, Hongmin Fan, Jacques Pienaar, Jeff Dean, Jeremy Lau, Jordan Soyke, Jing Dong, Juanli Shen, Kemal El Moujahid, Kuangyuan Chen, Mehdi Amini, Ning Niu, Peter Gavin, Phil Sun, Pulkit Bhuwalka, Qiao Zhang, Raziel Alvarez, Russell Power, Sanjoy Das, Shengqi Zhu, Smit Hinsu, Tatiana Shpeisman, Tianrun Li, Tim Davis, Tom Black, Victor Akabutu, Vilobh Meshram, Xiao Yu, Xiaodan Song, Yiming Zhang, YC Ling, Youlong Chen, and Zhuoran Liu.

We would like to give special thanks to Chris Lattner for his initial technical leadership in bootstrapping this project, to Martin Wicke for his support of TFRT throughout the first year, and to Alex Zaks for his support of TFRT during the second year and for seeing through its impactful landing for Google’s ML serving workloads.
