How Digitec Galaxus trains and serves millions of personalized newsletters per week with TFX

Posted by Christian Sager (Product Owner, Digitec Galaxus) and Anant Nawalgaria (ML Specialist, Google)

gif from blog titled how Digitec Galaxus trains and serves millions of personalized newsletters per week with TFX showing an animation of a cloud and people

In the retail industry it is important to be able to engage and excite users by serving personalized content on newsletters at scale. It is important to do this in a manner which leverages existing trends, while exploring and unearthing potentially new trends with an even higher user engagement. This project was done as a collaboration between Digitec Galaxus and Google, by designing a system based on Contextual Bandits to personalize newsletters for more than 2 million users every week.

To accomplish this, we leveraged several products in the TensorFlow ecosystem and Google Cloud including TF Agents, TensorFlow Extended (TFX) running on Vertex AI , to build a system that personalizes newsletters in a scalable, modularized and cost effective manner with low latency. In this article, we’ll highlight a few of the pieces, and point you to resources you can use to learn more.

About Digitec Galaxus

Digitec Galaxus AG is the largest online retailer in Switzerland. It offers a wide range of products to its customers, from electronics to clothes. As an online retailer, we naturally make use of recommendation systems, not only on our home or product pages but also in our newsletters. We have multiple recommendation systems in place already for newsletters, and have been extensive early adopters of the Google Cloud recommendations AI. Because we have multiple recommendation systems and very large amounts of data, we are faced with the following complications.

1. Personalization

We have over 12 recommenders that it uses in the newsletters, however we would like to contextualize these by choosing different recommenders (which in turn select the items) for different users. Furthermore, we would like to exploit existing trends as well as experiment with new ones.

2. Latency

We would like to ensure that the ranked list of recommenders can be retrieved with sub 50 ms latency.

3. End-to-end easy to maintain and generalizable/modular architecture

We wanted the solution to be architected using an easy to maintain, platform invariant, complete with all MLops capabilities required to train and use contextual bandits models. It was also important to us that it is built in a modular fashion such that it can be adapted easily to other use cases which have in mind such as recommendations on the homepage, Smartags and more.

Before we get to the details of how we built a machine learning infrastructure capable of dealing with all requirements, we’ll dig a little deeper into how we got here and what problem we’re trying to solve.

Using contextual bandits

Digitec Galaxus has multiple recommendation systems in place already. Because we have multiple recommendation systems, it is sometimes difficult to choose between them in a personalized fashion. Hence we reached out to Google seeking assistance with implementing Contextual Bandit driven recommendations, which personalizes our homepage as well as our newsletter. Because we only send newsletters to registered users, we can incorporate features for every user.

We chose TFAgents to implement the contextual bandit model. Training and serving pipelines were orchestrated by Vertex AI pipelines running TFX, which in turn used TFAgents for the development of the contextual bandit models. Here’s an overview of our approach.

TFAgents to implement the contextual bandit model

Rewarding subscribes, and penalizing unsubscribes

Given some features (context) about the user, and each of the 12 available recommenders, we aim to suggest best recommender (action) which increases the chance (reward) of the user clicking (reward = 1) on at least one of the recommendations by the selected recommender, and minimizes the chance of incurring a click which leads to unsubscribe (reward = -1).

By formulating the problem and reward function in this manner, we hypothesized that the system would optimize for increasing clicks, while still showing relevant (and not click-baity) content to the user in order to sustain the potential increase in performance. This is because the reward functions penalizes an event when a user unsubscribes, which a click-baity content is likely to lead to. The problem was then tackled by using contextual bandits because of the fact that they excel at exploiting trends that work well, as well as exploring and uncovering potentially even better-performing trends.

Serving millions of users every week with low latency

A diagram showing the high-level architecture of the recommendation training and prediction systems on GCP.
A diagram showing the high-level architecture of the recommendation training and prediction systems on GCP.

There’s a lot of detail here, as the architecture shown in the diagram covers three phases of ML development, training, and serving. Here are some of the key pieces.

Model development

Vertex Notebooks are used as data science environments for experimentation and prototyping, in addition to implementing model training and scoring components and pipelines. The source code is version controlled in GitHub. A continuous integration (CI) pipeline is set up to run unit tests, build pipeline components, and store the container images to Cloud Container Registry.

Training

The training pipeline is executed using TFX on Vertex Pipelines. In essence, the pipeline trains the model using new training data extracted from BigQuery, validates the produced model, and stores it in the model registry. In our system, the model registry is curated in Cloud Storage. The training pipeline uses Dataflow for large scale data extraction, validation, processing and model evaluation, as well as Vertex Training for large scale distributed training of the model. In addition, AI Platform Pipelines stores artifacts produced by the various pipeline steps to Cloud Storage, and information about these artifacts is stored in an ML metadata database in Cloud SQL.

Serving

Predictions are produced using a batch prediction pipeline, and stored in Cloud Datastore for consumption. The batch prediction pipeline is made using TFX and runs on Vertex Pipelines. The pipeline uses the most recent model in the model registry to score the serving queries from BigQuery. A Cloud Function is provided as a REST/HTTP endpoint to retrieve predictions from Datastore.

Continuous Training Pipeline

A diagram of the TFX pipeline for the training workflow.
A diagram of the TFX pipeline for the training workflow.

There are many components used in our TFX-based Continuous training workflow, training is currently done on an on-demand basis, but later on it is planned to be executed on a bi-weekly cadence. Here is a little bit of detail on the important ones.

Raw Data

Our data consists of multiple datasets stored in heterogeneous formats across BigQuery tables and other formats, that are then joined in denormalized fashion by the customer into a single BigQuery table for training. To help avoid bias and drift in our model we train the model on a rolling window of 4 weeks cadence with one overlapping week per training cycle. This was a simple design choice as it was very straightforward to implement, as BigQuery has good compatibility as a source with TFX, and also allows the user to do some basic data preprocessing and cleaning during fetching.

BigQueryExampleGen

We first leverage BigQuery by leveraging built-in functions to preprocess the data. By embedding our own specific processes into the query calls made by the ExampleGen component, we were able to avoid building out a separate ETL that would exist outside the scope of a TFX pipeline. This ultimately proved to be a good way to get the model in production more quickly. This preprocessed data is then split into training and eval and converted to tf.Examples via the ExampleGen component.

Transform

This component does the necessary feature engineering and transformations necessary to handle strings, fill in missing values, log-normalize values, setup embeddings etc. The major benefit here is that the resulting transformation is ultimately prepended to the computational graph, so that the exact same code is used for training and serving. The Transform component runs on Cloud Dataflow in production.

Trainer

The Trainer component trains the model using TF-Agents. We leverage parallel training on Vertex Training to speed things up. The model is designed such that the user id passes in from the input to the output unaltered, so that it can be used as part of the downstream serving pipeline. The Trainer component runs on Vertex Training in production.

Evaluator

The Evaluator compares the existing production model to the model received by the Trainer and prepares the metrics required by the validator component to bless the “better” one for use in production. The model gating criteria is based on the AUC scores as well as counterfactual policy evaluation and possibly other metrics in the future. It is easy to implement custom metrics which meet the business requirements owing to the extensibility of the evaluator component. The Evaluator runs on Vertex AI.

Pusher

The Pusher’s primary function is to send the blessed model to our TFServing deployment for production. However, we added functionality to use the custom metrics produced in the Evaluator to determine decisioning criteria to be used in serving, and attach that to the computational graph. The level of abstraction available in TFX components made it easy to make this custom modification. Overall, the modification allows the pipeline to operate without a human in the loop so that we are able to make model updates frequently, while continuing to deliver consistent performance on metrics that are important to our business.

HyperparametersGen

This is a custom TFX component which creates a dictionary with hyperparameters (e.g., batch size, learning rate) and stores the dictionary as an artifact. The hyperparameters are passed as input to the trainer.

ServingModelResolver

This custom component takes a serving policy (which includes exploration) and a corresponding eval policy (without exploration), and resolves which policy will be used for serving.

Pushing_finalizer

This custom component copies the pushed/blessed model from the TFX artifacts directory to a curated destination.

The out-of-box components from TFX provided most of the functionality we require, and it is easy to create some new custom components to make the entire pipeline satisfy our requirements. There are also other components of the pipeline such as StatisticsGen (which also runs on Dataflow).

Batch Prediction Pipeline

A diagram showing the TFX pipeline for the batch prediction workflow.
A diagram showing the TFX pipeline for the batch prediction workflow.

Here are a few of the key pieces of our batch prediction system.

Inference Dataset

Our inference dataset has nearly identical format to the training dataset, except that it is emptied and repopulated with new data daily.

BigQueryExampleGen

Just like for the Training pipeline, we use this component to read data from BigQuery and convert it into tf.Examples.

Model Importer

This component imports the computation graph exported by the Pusher component of the training pipeline. As mentioned above, since it contains the whole computation graph generated by the training pipeline, including feature transformation and the tf.Agents model (including the exploration/exploitation aspect), this is very portable and prevents train/test skew.

BulkInferrer

As the name implies, this component uses the imported computation graph to perform mass inference on the inference dataset. It runs on Cloud Dataflow in production and makes it very easy to scale.

PredictionStorer

This is a custom Python Component which takes the inference results from Bulkinfererrer, post-processes them to format/filter the fields as required, and persists it to Cloud Datastore. This runs on Cloud Dataflow in production as well.

Serving is done via cloud functions which take the user ids as input, and returns the precomputed results for each userId stored in DataStore with sub 50 ms latency.

Extending the work so far

In the few months since implementation of the first version we have been making dozens of improvements to the pipeline, everything from changing the architecture/approach of the original model, to changing the way the model’s results are used in the downstream application to generate newsletters. Moreover, each of these improvements brings new value to us more quickly than we’ve been able to in the past.

Since our initial implementation of this reference architecture, we have released a simple Vertex AI pipeline based github code samples to implementing recommender systems using TF Agents here. By using this template and guide, it will help them build recommender systems using contextual bandits on GCP in a scalable, modularized, low latency and cost effective manner. It’s quite remarkable how many of the existing TFX components that we have in place carry over to new projects, and even more so how drastically we’ve reduced the time it takes to get a model in production. As a result, even the software engineers on our team without much ML expertise feel confident in being able to reuse this architecture and adapt it to more use cases. The data scientists are able to spend more of their time optimizing the parameters and architectures of the models they produce, understanding their impact on the business, and ultimately delivering more value to the users and the business.

Acknowledgements

None of this would have been possible without the joint collaboration of the following Googlers: Khalid Salama, Efi Kokiopoulou, Gábor Bartók and Digitec Galaxus’s team of engineers.

A Google Cloud blog on this project can be found here.

Read More

Using the TensorFlow-Agents Bandits Library for Recommendations

Posted by Gábor Bartók and Efi Kokiopoulou, Google Research

This article assumes you have some prior experience with reinforcement learning and/or multi-armed bandits. If you’re new to the subject, a good starting point is the Bandits Wikipedia entry, or for a bit more technical and in-depth introduction, this book.

In this blog post we introduce the TensorFlow-Agents Bandits library. This library offers a comprehensive list of the most popular bandit algorithms along with a variety of test problems on which the algorithms can be run. The test problems (called bandit environments) include some synthetic environments as well as environments converted from real-life (classification or recommendation) datasets.

One of the latter is the MovieLens environment, which utilizes this dataset. In this blog post, we will guide you through the usage of the TF-Agents Bandits library with the help of the MovieLens Environment.

Multi-Armed Bandits

Multi-Armed Bandits is a machine learning framework in which an agent repeatedly selects actions from a set of actions and collects rewards by interacting with the environment. The goal of the agent is to accumulate as much reward as possible, within a given time horizon. The name “bandit” comes from the illustrative example of finding the best slot machine (one-armed bandit) from a set of machines with different payoffs. The actions are also known as “arms”.

image of slot machines
Image from Wikipedia

There are two more important concepts to be aware of: “context”, and “regret”. In many real life scenarios, it’s not enough to find the best action that on average provides the highest reward: we want to find the best action depending on the situation/context. To extend the bandits framework in this direction, we introduce the notion of “context”. Before the agent has to select an action, it receives the context that provides information about the current round. Then the agent’s goal is to find the policy that selects the highest-rewarding action for the given context.

In bandits literature, the notion of “regret” is very important. The regret can be informally defined as the difference in performance between the optimal policy and the learned policy. Typically the performance is measured in terms of cumulative reward (i.e., sum of rewards across several rounds); otherwise, one may also refer to the “instantaneous regret” which is the regret the agent suffers at a certain round. Bandit algorithms typically come with performance guarantees in terms of upper bound on the regret given a family of bandit problems.

Example: Movie Recommendation

Consider the following scenario. You are tasked with recommending movies to users of a movie streaming service. In every round you receive information about the user. Your task is to choose from a handful of movies for the user with the goal of choosing one that the user will enjoy and give a high rating.

A Recommendation Dataset

For illustration purposes, we will turn the well-known MovieLens dataset into a bandit problem. The dataset consists of ~100K ratings from 943 users on 1682 movies. Our first step to turn this dataset into a contextual bandit problem is to construct the matrix `A` of user/movie ratings, where `A_ij` is the rating of user `i` of movie `j`. Since we have the ratings to a few movies only from each user, one issue with the ratings matrix `A` is that it is very sparse i.e., only a few entries `A_ij` are available; all the other entries are unknown. To address this sparsity issue, we construct a low-rank SVD decomposition `A ~= U*V’` (low-rank matrix decomposition in recommender systems is a popular approach for collaborative filtering, see e.g., Koren et al. 2009). This way, the rows of `U` are context features. Then, the movies to be recommended to the user are the set of actions, represented as rows of `V`. The reward for recommending movie `j` to user `i` can then be calculated as the inner product of the corresponding rows of `U_i` and `V_j`. Therefore, using the low-rank SVD decomposition to compute rewards gives us the ability to approximate the reward even for movies that were not recommended to the users; hence their rating was unknown.

TF-Agents Bandits

Now let’s see how the above problem is modeled and solved with the help of the TF-Agents Bandits library. TF-Agents is a modular library that has building blocks for every aspect of Reinforcement Learning and Bandits. A problem can be expressed in terms of an “environment”. An environment is a class that generates observations (aka contexts), and also outputs a reward after being presented with actions. In the case of the MovieLens environment, an observation is a random row of the matrix `U`, while the reward is given after an algorithm has chosen an action (i.e., row of the matrix `V`, a movie in our case). The implementation of the MovieLens environment can be found here. It’s worth noting here that it is rather simple to implement a bandit environment in TF-Agents. For a walkthrough, we refer the reader to our Bandits Tutorial.

Algorithms

Bandit algorithms in TF-Agents have two main building blocks: “policies” and “agents”. A policy is a function that, given an observation, chooses an action. The agent is responsible for learning a good policy: given examples of (observation, action, reward) tuples, it trains the policy so that it chooses better actions. The TF-Agents Bandits library offers a comprehensive list of the most popular algorithms, including linear methods as well as nonlinear ones (e.g., those with neural network-based value functions). Let’s see how LinUCB tackles the MovieLens problem!

The LinUCB algorithm

In short, the LinUCB algorithm keeps track of running average rewards for all actions, along with confidence intervals around the estimates. In every turn, the algorithm chooses the action that has the highest upper confidence bound on its reward estimate.

In the TF-Agents library, the LinUCB algorithm is built from a LinearBanditPolicy with an “Optimistic Exploration Strategy”, and a LinearBanditAgent responsible for updating the estimates. Note that the exploration strategy can be changed from “Optimistic” to “Sampling”, in which case the algorithm becomes Linear Thompson Sampling.

So let’s see how LinUCB performs on the MovieLens environment! We ran LinUCB on the MovieLens environment (with 100 actions and SVD decomposition rank 20) and we get results on TensorBoard:

(Note that all of the below plots are based on averaging five runs, the shadows show standard deviations. A rolling average smoothing is also applied on the curves.)

Linear Thompson Sampling

Linear Thompson Sampling

As mentioned above, with a slight modification of LinUCB, we get an implementation for Linear Thompson Sampling (LinTS). If we run LinTS on the same problem (implementation here), we get a very similar result to that of LinUCB (see joint graph further down).

NeuralEpsilonGreedy

Let’s compare these results with another agent, say, the NeuralEpsilonGreedy agent. As the name suggests, this agent uses a neural network to estimate the rewards, and adds uniform exploration with probability `epsilon`. This exploration strategy is known as “epsilon-greedy” since the method is greedy most of the time but with probability `epsilon` it explores by picking an action uniformly at random. If we run Neural Epsilon Greedy and put the results from the three algorithms, we get:

NeuralEpsilonGreedy graph

It’s interesting to also look at how often the methods pick suboptimal actions. This is shown below:

SuboptimalArmsMetric

We see that LinUCB and LinTS have very similar performance, which is not very surprising, as they are very similar algorithms. On the other hand, Neural epsilon-Greedy is not doing very well on this problem. After fifty thousand iterations, the metrics are still far away from that of the linear methods. Note, nevertheless, that even the epsilon-Greedy algorithm manages to find the best movie about half the time, out of 100, still not bad!

To be fair, it’s expected that linear algorithms do better than non-linear ones on this problem, as the problem is linear (by the reward calculation construction).

As for the difference between the two linear algorithms, it seems that LinUCB struggles in the beginning a little bit, but in the long run it is slightly (not significantly) better than LinTS.

Recommendation with Arm Features

The MovieLens example above has some shortcomings: its actions are a selection of movies, algorithms have to learn a distinct model for every movie, and it’s also hard to introduce new movies in the system. To this end, we change the environment a little bit: instead of treating every movie as an independent action, we model the movies with features, similarly to users: the rows of `V` will be the movie features. Then the model only has to learn one reward function, whose input is both the user features `u` and the movie features `v`. This way we can have an unlimited number of movies in the system, and we can introduce new movies on the fly. This version of the environment can be found here.

Agents Running on Per Arm Feature Environments

Most of the agents implemented in our library have the functionality of running on environments that have features for its actions (we call these environments “per-arm environments”).

Now let’s see how the different algorithms behave on the per-arm version of the MovieLens environment. We ran the arm-feature versions of the three algorithms: LinUCB, LinTS, and eps-Greedy. The result is quite different from the previous section: Here the linear methods seem to fail to find the relationship between actions and rewards, while the neural approach gives similar results to that of the non-arm feature problem.

RegretMetric
SubOptimalArmsMetric

The neural algorithm still finds the best action ~45% of the time, while the linear algorithms only ~30% of the time.

Your New Bandit Algorithm

If you haven’t found what you are looking for in the list of agents within the library, it’s possible, and not too complicated, to implement your own algorithm. You need to:

  • subclass tf_agents.policies.TFPolicy and
  • subclass tf_agents.agents.TFAgent.

TFPolicy

To define a policy, one needs to implement its private member function _distribution(…). In short, this function takes an observation and outputs a distribution of actions (or simply an action in case of a deterministic policy).

TFAgent

As stated above, an agent is responsible for training the policy. To this end, subclasses of TF-Agents’ TFAgent (sorry) have to implement the private member function _train() (among others, some details are omitted for clarity). This function takes batches of training data, and trains the policy.

Your New Bandit Environment

If you want to test your (new) algorithm and have an idea for an environment, it’s also simple to implement it in TF-Agents. A Bandit environment has two main roles: (i) to generate observations, and (ii) to return a reward after the agent chooses an action. One can easily create an environment class by defining these two functions.

Recap

In this blog post, we introduced the TF-Agents Bandit library and showed how to tackle a recommendation problem with it. If you want to play around with the environments and agents used in this post, you can go directly to this executable to run these agents and more. If you want to explore the library or just want to read more about it, we suggest starting with this tutorial. And if you’re interested in learning more about making recommendations on this MovieLens dataset, you can also check out another great library called TensorFlow Recommenders.

Collaborators

The TF-Agents Bandits library has been built in collaboration with Jesse Berent, Tzu-Kuo Huang, Kishavan Bhola, Sergio Guadarrama‎, Anoop Korattikara, Oscar Ramirez, Eugene Brevdo, and many others from the TF-Agents team.

Read More

Build fast, sparse on-device models with the new TF MOT Pruning API

Posted by Yunlu Li and Artsiom Ablavatski

Introduction

Pruning is one of the core optimization techniques provided in the TensorFlow Model Optimization Toolkit (TF MOT). Not only does it help to significantly reduce model size, but it can also be used to accelerate CPU inference on mobile and web. With modern compute intensive models, the area of pruning as a model optimization technique has drawn significant attention, demonstrating that dense networks can be easily pruned (i.e. a fraction of the weights set to zero) with negligible quality degradation. Today, we are excited to announce a set of updates to TF MOT Pruning API that simplify pruning and enable developers to build sparse models for fast on-device inference.

Updates to TF MOT

TensorFlow has long standing support for neural network pruning via TensorFlow Model Optimization Toolkit (TF MOT) Pruning API. The API, featured in 2019, introduced essential primitives for pruning, and enabled researchers throughout the world with new optimization techniques. Today we are happy to announce experimental updates to the API that further advance model pruning. We are releasing tools that simplify the control of pruning and enable latency reduction for on-device inference.

The TF MOT Pruning API has extensive functionality that provides the user with tools for model manipulation:

  • prune_low_magnitude function applies PruneLowMagnitude wrapper to every layer in the model
  • PruneLowMagnitude wrapper handles low-level pruning logic
  • PruningSchedule controls when pruning is applied
  • PruningSummaries callback logs the pruning progress

These abstractions allow to control almost any aspect of model pruning (i.e. how to prune (PruneLowMagnitude), when to prune (PruningSchedule) and how to track the progress of the pruning (PruningSummaries) with the exception of what to prune, i.e. where PruneLowMagnitude wrapper is applied. We are happy to release an extension of TF MOT PruningPolicy, a class that controls which parts of the model the PruneLowMagnitude wrapper is applied to. The instance of PruningPolicy is used as an argument in the prune_low_magnitude function and provides the following functionalities:

  • Controls where the pruning wrapper should be applied on per-layer basis through the allow_pruning function
  • Checks that the whole model supports pruning via ensure_model_supports_pruning function

PruningPolicy is an abstract interface, and it can have many implementations depending on the particular application. For latency improvements on CPU via XNNPACK, the concrete implementation PruneForLatencyOnXNNPack applies the pruning wrapper only to the parts of the model that can be accelerated via sparse on-device inference while leaving the rest of the network untouched. Such selective pruning allows an application to maintain model quality while targeting parts of the model that can be accelerated by sparsity.

The below example showcases the PruneForLatencyOnXNNPack policy in action on

MobileNetV2 (the full example is available in a recently introduced colab):

import tensorflow as tf
import tensorflow_model_optimization as tfmot
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

# See the implementation of the function below.
model = load_model_for_pruning()

model_for_pruning = prune_low_magnitude(
model, pruning_policy=tfmot.sparsity.keras.PruneForLatencyOnXNNPack())

In order to comply with the constraints of XNNPACK sparse inference the Keras implementation of MobileNetV2 model requires slight modification of the padding for the first convolution operation:

def load_model_for_pruning():
input_tensor = tf.keras.layers.Input((224, 224, 3))
input_tensor = tf.keras.layers.ZeroPadding2D(1)(input_tensor)
model = tf.keras.applications.MobileNetV2(input_tensor=input_tensor)

def clone_fn(layer):
if layer.name == 'Conv1':
# The default padding `SAME` for the first convolution is incompatible
# with XNNPACK sparse inference.
layer.padding = 'valid'
# We ask the model to rebuild since we've changed the padding parameter.
layer.built = False
return layer

return tf.keras.models.clone_model(model, clone_function=clone_fn)

The PruneForLatencyOnXNNPack policy applies the pruning wrapper only to convolutions with 1×1 kernel size since only these layers can be accelerated on CPU by as much as 2x using XNNPACK. The rest of the layers are left untouched allowing the network to recover after quality degradation at the pruning step. Also, the policy verifies that the model is amenable to being pruned by using the ensure_model_supports_pruning method. Once the sparse model has been trained and converted, we recommend using the TensorFlow Lite benchmark utility in debug mode to confirm that the final model is compatible with XNNPack’s sparse inference backend.

We hope that this newly introduced experimental API will be useful in practice and we will continue to improve its stability and flexibility in the future.

Compression and Latency Improvements

Model compression is another major benefit of applying pruning to a model. Using a smart compression format allows efficient storage of model weights which leads to a significant size reduction.

TFLite adopted the TACO format to encode sparse tensors. Compared to widely used formats like CSR and CSC, the TACO format has several advantages:

  1. It supports flexible traversal order to store a tensor in row-major or column-major formats easily.
  2. It supports multi-dimensional sparse tensors like the 4-D filter of a convolution op.
  3. It can represent block structure as the inner dimension of the tensor (example of a 4×4 tensor with 2×2 inner block structure).

We also adapted the format to use flexible data types for the metadata storing the indices of non-zero elements. This reduces the storage overhead for small tensors, or tensors with compact data types like int8_t.

In order to realize size reductions in practice during the model conversion, the tf.lite.Optimize.EXPERIMENTAL_SPARSITY optimization needs to be applied. This optimization handles examining the model for sparse tensors and converting them to an efficient storage format. It also works seamlessly with quantization and you can combine them to achieve more aggressive model compresion. The full example of such a conversion is shown below:

# Remove the pruning wrappers from the model. 
model = tfmot.sparsity.keras.strip_pruning(model)

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# We apply float16 quantization together with sparsity optimization that
# compactly stores pruned weights.
converter.optimizations = [
tf.lite.Optimize.EXPERIMENTAL_SPARSITY, # Enables size reduction optimization.
tf.lite.Optimize.DEFAULT # Enables quantization at conversion.
]
converter.target_spec.supported_types = [tf.float16]
tflite_buffer = converter.convert()

After applying the tf.lite.Optimize.EXPERIMENTAL_SPARSITY optimization together with PruneForLatencyOnXNNPack pruning policy, a ~2x size reduction can be achieved as is demonstrated in Figure 1:

Ablation study of MobileNetV2 model size (float32 and float16 types) with different sparsity levels using PruneForLatencyOnXNNPack pruning policy.
Figure 1. Ablation study of MobileNetV2 model size (float32 and float16 types) with different sparsity levels using PruneForLatencyOnXNNPack pruning policy. Only the 1×1 convolutional layers are pruned and the rest of the layers are left dense.

In addition to size reduction, pruning can provide inference acceleration on CPU via XNNPACK. Using the PruneForLatencyOnXNNPack pruning policy, we’ve conducted an ablation study of CPU inference latency for a MobileNetV2 model on Pixel 4 using TensorFlow Lite benchmark with the use_xnnpack option enabled:

Ablation study of CPU inference speed of MobileNetV2 model with different sparsity levels on a Pixel 4 device.
Figure 2. Ablation study of CPU inference speed of MobileNetV2 model with different sparsity levels on a Pixel 4 device.

This study in Figure 2 demonstrates 1.7x latency improvement when running on mobile devices using XNNPACK. The strategies for training the sparse MobileNetV2 model together with hyperparameters and pre-trained checkpoints are described in Elsen et al.

Pruning techniques & tips

Pruning aware training is a key step in model optimization. Many hyperparameters are involved in training and some of them like the pruning schedule and learning rate can have a dramatic impact on the final quality of the model. Though many strategies have been proposed, a simple yet effective 3-steps strategy (see Table 1) achieves strong performance for the majority of our use cases. The strategy builds on top of the well-proven approach from Zhu & Gupta and produces good results without extensive re-training:

Step

Learning rate

Duration

Notes

1. Pre-training or

using pre-trained weights (optional)

The same as for the regular dense network: starting from high value (possibly with warm-up) and ending with low value

The same as for the regular dense network

Paired with weight decay regularization this step helps the model to push unimportant weights towards 0 for pruning in the next step

2. Iterative pruning

Constant, mean of the learning rate values for the regular training

30 to 100 epochs

Iterative pruning step during which weights become sparse

3. Fine-tuning

The same as at the first stage but without warm up stage

The same as at the first stage

Helps to mitigate quality degradation after the pruning step

3-step schedule for training the sparse model

The strategy inevitably leads to a substantial increase (~3x) in the training time. However, paired with the PolynomialDecay pruning schedule, this 3-step strategy achieves limited or no quality degradation with significantly pruned (>70%) neural networks.

Pruned models in MediaPipe

Together with the updates to the TF MOT Pruning API, we are happy to release pruned models for some of the MediaPipe solutions. The released models include pose and face detectors as well as a pruned hand tracking model. All of these models have been trained with the newly introduced functionality using the 3-steps pruning strategy. Compared with dense baselines the released pruned models have demonstrated significant model size reduction as well as superior performance when running on CPU via XNNPACK. Quality-wise the pruned models achieve similar metrics including in the evaluation on our fairness datasets (see model cards for details). Side-by-side demos of the solutions are shown below:

MediaPipe example showing female waving at camera
MediaPipe example showing person jumping
Figure 3. Comparison of dense (left) and sparse (right) models in the end-to-end examples of face (top) and pose (bottom) detection

Pruning for GPU

While exploiting sparsity on GPUs can be challenging, recent work has made progress in improving the performance of sparse operations on these platforms. There is momentum for adding first-class support for sparse matrices and sparse operations in popular frameworks, and state-of-the-art GPUs have recently added hardware acceleration for some forms of structured sparsity. Going forward, improvements in software and hardware support for sparsity in both training and inference will be a key contributor to progress in the field.

Future directions

TF MOT offers a variety of model optimization methods, many of which have proven to be essential for efficient on-device model inference. We will continue to expand the TF MOT Pruning API with algorithms beyond low magnitude pruning, and also investigate the combination of pruning and quantization techniques to achieve even better results for on-device inference. Stay tuned!

Acknowledgments

Huge thanks to all who worked on this project: Karthik Raveendran, Ethan Kim, Marat Dukhan‎, Trevor Gale, Utku Evci, Erich Elsen, Frank Barchard, Yury Kartynnik‎, Valentin Bazarevsky, Matsvei Zhdanovich, Juhyun Lee, Chuo-Ling Chang, Ming Guang Yong, Jared Duke‎ and Matthias Grundmann.

Read More

Real-World ML with Coral: Manufacturing

Posted by Michael Brooks, Coral

For over 3 years, Coral has been focused on enabling privacy-preserving Edge ML with low-power, high performance products. We’ve released many examples and projects designed to help you quickly accelerate ML for your specific needs. One of the most common requests we get after exploring the Coral models and projects is: How do we move to production?

With this in mind we’re introducing the first of our use-case specific demos. These demos are intended to to take full advantage of the Coral Edge TPU™ with high performance, production-quality code that is easily customizable to meet your ML requirements. In this demo we focus on use cases that are specific to manufacturing; worker safety and quality grading / visual inspection.

Demo Overview

The Coral manufacturing demo targets a x86 or powerful ARM64 system with OpenGL acceleration that processes and displays two simultaneous inputs. The default demo, using the included example videos, looks like this:

two gifs side by side demonstrating the Coral manufacturing demo

The two examples being run are:

  • Worker Safety: Performs generic person detection (powered by COCO-trained SSDLite MobileDet) and then runs a simple algorithm to detect bounding box collisions to see if a person is in an unsafe region.
  • Visual Inspection: Performs apple detection (using the same COCO-trained SSDLite MobileDet from Worker Safety) and then crops the frame to the detected apple and runs a retrained MobileNetV2 that classifies fresh vs rotten apples.

By combining these two examples, we are able to demonstrate multiple Coral features that can enable this processing, including:

  • Co-compilation
  • Cascading models (using the output of one model to feed another)
  • Classification retraining
  • Real time processing of multiple inputs

Creating The Demo

When designing a new ML application, it is critical to ensure that you can meet your latency and accuracy requirements. With the two applications described here, we went through the following process to choose models, train these models, and deploy to the EdgeTPU – this process should be used when beginning any Coral application.

Choosing the Models

When deciding on a model to use, the new Coral Model Page is the best place to start. For this demo, we know that we need a detection model (which will be used for detection of both people and apples) as well as a classification model.

Detection

When picking a detection model from the Detection Model Page, there are four aspects to a model we want to look for:

  1. Training Dataset: In the case of the models page, all of our normal detection models use the COCO dataset. Referring to the labels, we can find both apples and people, so we can use just the one model for both detection tasks.
  2. Latency: We will need to run at least 3 inferences per frame and need this to keep up with our input (30 FPS). This means we need our detection to be as fast as possible. From the models page, we can see two good options: SSD MobileNet v2 (7.4 ms) and MobileDet (8.0 ms). This is the first point where we see the clear advantage of Coral – looking at the benchmarks at the bottom of our x86+USB CTS Output we can see even on a powerful workstation this would be 90 ms and 123 ms respectively.
  3. Accuracy/Precision: We also want as accurate a model as possible. This is evaluated using the primary challenge metric from COCO evaluation metrics. We see here MobileDet (32.8%) clearly outpeforms MobileNet V2 (25.7%).
  4. Size: In order to fully co-compile this detection model with the classification model below, we need to ensure that we can fit both models in the 8MB of cache on the Edge TPU. This means we want as small a model as possible. MobileDet is 5.1 MB vs MobileNet V2 is 6.6 MB.

With the above considerations, we chose SSDLite MobileDet.

Classification

For the fresh-or-rotten apple classification, there are many more options on the Coral Classification Page. What we want to check is the same:

  1. Training Dataset: We’ll be retraining on our new dataset, so this isn’t critical in this application.
  2. Latency: We want the classification to be as fast as possible. Luckily many of the models on our page are extremely fast relative to the 30 FPS frame rate we demand. With this in mind we can eliminate all the Inception models and ResNet-50.
  3. Accuracy: Accuracy for Top-1 and Top-5 is provided. We want to be as accurate as possible for Top-1 (since we are only checking fresh vs rotten) – but still need to consider latency. With this in mind we eliminate MobileNet v1.
  4. Size: As mentioned above, we want to ensure we can fit both the detection and classification models (or as much as possible) so we can easily eliminate the EfficientNet options.

This leaves us with MobileNet v2 and MobileNet v3. We opted for v2 due to existing tutorials on retraining this model.

Retraining Classification

With the model decisions taken care of, now we need to retain the classification model to identify fresh and rotten apples. Coral.ai offers training tutorials in CoLab (uses post-training quantization) and Docker (uses quantization aware training) formats – but we’ve also included the retraining python script in this demo’s repo.

Our Fresh/Rotten data comes from the “Fruits fresh and rotten for classification” dataset – we simply omit everything but apples.

In our script, we first load the standard Keras MobileNetV2 – freezing the first 100 layers and adding a few extra layers at the end:

base_model = tf.keras.applications.MobileNetV2(input_shape=input_shape,
include_top=False,
classifier_activation='softmax',
weights='imagenet')
# Freeze first 100 layers
base_model.trainable = True
for layer in base_model.layers[:100]:
layer.trainable = False
model = tf.keras.Sequential([
base_model,
tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.GlobalAveragePooling2D(),
tf.keras.layers.Dense(units=2, activation='softmax')
])
model.compile(loss='categorical_crossentropy',
optimizer=tf.keras.optimizers.RMSprop(lr=1e-5),
metrics=['accuracy'])
print(model.summary())

Next, with the dataset download into ./dataset we train our model:

train_datagen = ImageDataGenerator(rescale=1./255,
zoom_range=0.3,
rotation_range=50,
width_shift_range=0.2,
height_shift_range=0.2,
shear_range=0.2,
horizontal_flip=True,
fill_mode='nearest')
val_datagen = ImageDataGenerator(rescale=1./255)
dataset_path = './dataset'
train_set_path = os.path.join(dataset_path, 'train')
val_set_path = os.path.join(dataset_path, 'test')
batch_size = 64
train_generator = train_datagen.flow_from_directory(train_set_path,
target_size=input_size,
batch_size=batch_size,
class_mode='categorical')
val_generator = val_datagen.flow_from_directory(val_set_path,
target_size=input_size,
batch_size=batch_size,
class_mode='categorical')
epochs = 15
history = model.fit(train_generator,
steps_per_epoch=train_generator.n // batch_size,
epochs=epochs,
validation_data=val_generator,
validation_steps=val_generator.n // batch_size,
verbose=1)

Note that we’re only using 15 epochs. When retraining on another dataset it is very likely more will be required. With the apple dataset, we can see this model quickly hits very high accuracy numbers:

image of training and validation accuracy and loss

For your own dataset and model more epochs will likely be needed (the script will generate the above plots for validation).

We now have a Keras model that works for our apple quality inspector. In order to run this on a Coral Edge TPU, the model must be quantized and converted to TF Lite. We’ll do this using post-training quantization – quantizing based on a representative dataset after training:

def representative_data_gen():
dataset_list = tf.data.Dataset.list_files('./dataset/test/*/*')
for i in range(100):
image = next(iter(dataset_list))
image = tf.io.read_file(image)
image = tf.io.decode_jpeg(image, channels=3)
image = tf.image.resize(image, input_size)
image = tf.cast(image / 255., tf.float32)
image = tf.expand_dims(image, 0)
yield [image]
model.input.set_shape((1,) + model.input.shape[1:])
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.target_spec.supported_types = [tf.int8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()

The script will then compile the model and evaluate both the Keras and TF Lite models – but we’ll need to take one extra step beyond the script: We must use the Edge TPU Compiler to co-compile the classification model with our detection model.

Co-compiling the models

We now have two quantized TF Lite models: classifier.tflite and the default CPU model for MobileDet taken from the Coral model page. We can compile them together to ensure that they share the same caching token – when either model is requested the parameter data will already be cached. This simply requires passing both models to the compiler:

edgetpu_compiler ssdlite_mobiledet_coco_qat_postprocess.tflite classifier.tflite 
Edge TPU Compiler version 15.0.340273435

Models compiled successfully in 1770 ms.

Input model: ssdlite_mobiledet_coco_qat_postprocess.tflite
Input size: 4.08MiB
Output model: ssdlite_mobiledet_coco_qat_postprocess_edgetpu.tflite
Output size: 5.12MiB
On-chip memory used for caching model parameters: 4.89MiB
On-chip memory remaining for caching model parameters: 2.74MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 125
Operation log: ssdlite_mobiledet_coco_qat_postprocess_edgetpu.log

Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 124
Number of operations that will run on CPU: 1
See the operation log file for individual operation details.

Input model: classifier.tflite
Input size: 3.07MiB
Output model: classifier_edgetpu.tflite
Output size: 3.13MiB
On-chip memory used for caching model parameters: 2.74MiB
On-chip memory remaining for caching model parameters: 0.00B
Off-chip memory used for streaming uncached model parameters: 584.06KiB
Number of Edge TPU subgraphs: 1
Total number of operations: 72
Operation log: classifier_edgetpu.log
See the operation log file for individual operation details.

There are two things to note in this log. First is that we see one operation is run on the CPU for the detection model – this is expected. The TF Lite SSD PostProcess will always run on CPU. Second, we couldn’t quite fit everything on the on-chip memory, the classifier has 584 kB of off-chip memory needed. This is fine – we’ve substantially reduced the amount of IO time needed. Both models will now be in the same folder, but because we co-compiled them they are aware of each other and the cache will persist parameters for both models.

Customizing For Your Application

The demo is optimized and ready for customization and deployment. It can be cross compiled for other architectures (currently it’s only for x86 or ARM64) and statically links libedgetpu to allow this binary to be deployed to many different Linux systems with an Edge TPU.

There are many things that can be done to customize the model to your application:

  • The quickest changes are the inputs, which can be adjusted via the --visual_inspection_input and --worker_safety_input flags. The demo accepts mp4 files and V4L2 camera devices.
  • The worker safety demo can be further improved with more complicated keepout algorithms (including consideration of angle/distance from camera) as well as retraining on overhead data. Currently the demo checks only the bottom of the bounding box, but the flag --safety_check_whole_box can be used to compare to the whole box (for situations like overhead cameras).
  • The apple inspection demonstrates simple quality grading / inspection – this cascaded model approach (using detection to determine bounding boxes and feeding into another model) can be applied to many different uses. By retraining the detection and classification model this can be customized for your application.

Conclusion

The Coral Manufacturing Demo demonstrates how Coral can be used in a production environment to solve multiple ML needs. The Coral accelerator provides a low-cost and low-power way to add enough ML compute to run both tasks in parallel without over-burdening the host. We hope that you can use the Coral Manufacturing Demo as a starting point to bringing Coral intelligence into your own manufacturing environment.

To learn more about ways edge ML can be used to benefit day to day operations across a variety of industries, visit our Industries page. For more information about Coral Products and Partner products with Coral integrated, please visit us at Coral.ai.

Read More

The TensorFlow Developer Certificate turns 1!

Posted by Alina Shinkarsky and Jocelyn Becker on behalf of the TensorFlow Team

The TensorFlow Developer Certificate exam is celebrating its first anniversary with a big milestone: more than 3,000 people have passed the exam! Successful test takers received the official TensorFlow Developer Certificate and badge. They also had the opportunity to showcase their skill set on social networks such as LinkedIn and the TensorFlow Certificate Network, where recruiters are able to seek out entry-level TensorFlow developers.

The TF Certificate program bridges the gap between the demand from companies for data-knowledgeable, production ML-capable engineers — and the students and developers around the world interested in getting a job in ML.

The goal of this program is to provide developers around the world with the opportunity to demonstrate their skills in ML in an increasingly AI-driven global job market. This is a foundational certificate for students, developers, and data scientists who want to demonstrate practical machine learning skills through building and training basic models using TensorFlow.

We’ve followed up with folks who have taken the exam to understand the impact on their professional lives.

Fran shared, “Lost my job due to COVID 1 month before taking the exam, hired by Google in August, I think this cert helped a lot for my resume to be selected for interviews :)“

Fran is now a Conversational AI engineer, and has been working at Google for over 6 months.

photo of a googler wearing a noogler hat

Tia shared, “I was a stay-at-home mom when I started to learn Machine Learning at Google Bangkit Academy back in 2020. Bangkit supported us to take TF Certification and with this certificate, I was able to get back to work after years of hiatus. My current role is Machine Learning Curriculum Developer in Dicoding, an online technology education platform in Indonesia.”

Check out the short video below to hear Tia’s story.

Are you interested in earning the TensorFlow Developer Certificate? Learn more about the TensorFlow Developer Certificate on our website, including information on exam criteria, exam cost, and a stipend application to ensure taking this certificate exam is accessible, regardless of income.

If you’ve taken the exam and have feedback or would like to share your story, we would love to hear from you!

We look forward to growing this community of TensorFlow Developer Certificate recipients, and are immensely thankful to the continued contributions of our open source community!

Note: Participating in the program and/or obtaining this certificate are not endorsements of a participant’s employability nor guarantee of future work performance.

Read More

TensorFlow Hub for Real World Impact

Posted by Luiz GUStavo Martins and Elizabeth Kemp on behalf of the TensorFlow Hub team

As a developer, when you’re faced with a challenge, you might think: “How can I solve this?”, or, “What is the right tool for the job?” For a growing number of situations, machine learning can help! ML can solve many tasks that would be very challenging using classical programming, for example: detecting objects in images, classifying sound events, or understanding text.

But training machine learning models may take a lot of time, use large amounts of data, require deep expertise in the field, and be resource intensive. What if instead of starting from scratch, someone has already solved the same problem you have? Or at least solved a very similar problem that could give you a good starting point? This is where TensorFlow Hub can help you!

TensorFlow Hub is an open repository of pre-trained machine learning models ready for fine-tuning and deployable anywhere, from servers to mobile devices, microcontrollers and browsers.

Developers are using models available from TF Hub to solve real world problems across many domains, and at Google I/O 2021 we highlighted some examples of developers changing the world using models from TensorFlow Hub.

In this article, we’ll cover these use cases as well, with links so you can check them out.

Images

Image classification is one of the most popular use cases for machine learning. The development of this field helped the whole of machine learning by showing very good results and pushing the boundaries of research.

TensorFlow Hub has many models for the image problem domain for tasks like image classification, object detection, image segmentation, pose detection, style transfer and many others.

Many of the available models have a visualizer, like the one below, right on their documentation page, enabling you to try the model without any code or downloading anything.

TFHub makes Transfer Learning simpler and easier to experiment with many state of the art models like MobilenetV3, EfficientNet V2 to find the best one for your data. A real world use case can be seen in this CropNet tutorial to create the best model possible to detect diseases in cassava leaves and deploy it on device for use in the field.

Text

Understanding text has always been a very challenging task for computers because of all the context that is necessary, and the large number of words and phrases. Many state of the art Natural Language Processing (NLP) models are available on TensorFlow Hub and ready for you to use.

One example is the BERT family of models. Using them from TFHub is easier than ever. Aside from the base BERT model, there are more advanced versions and in many languages ready to be used like you can see here in Making BERT Easier with Preprocessing Models From TensorFlow Hub.

One good example is the MuRIL model that is a multilingual BERT model trained on 17 Indian languages used by developers to solve local NLP challenges in India.

An animation of the preprocessing model that makes it easy for you to input text into BERT
An animation of the preprocessing model that makes it easy for you to input text into BERT.

Developers can also use the TF Hub spam detection model for detecting spam comments in online forums. The model is available for TF.js and TFLite for running in the browser and on-device.

Audio

TF Hub has many audio models that you can use on desktop, for on-device inference on mobile devices, or on the web. There are also audio embedding models that can be used with transfer learning which you can adapt to your own data.

gif of dog next to microphone and sound waves

Developers are using audio classification to understand what’s happening on a forest (How ZSL uses ML to classify gunshots to protect wildlife) or inside the ocean (Acoustic Detection of Humpback Whales Using a Convolutional Neural Network) or even closer to us, understanding what is happening in your own home (Important household sounds become more accessible).

Video

Video processing is increasingly important and TensorFlow Hub also has models for this domain like the MoViNet collection that can do video classification or the I3D for action recognition.

gif of video processor in TensorFlow Hub

TFHub also has tutorials for Action recognition, Video Interpolation and Text-to-video retrieval.

Summary

Reusing code is usually better than re-writing it. The same applies to machine learning models. If you can use a pre-trained model for a task, it can save you time, resources, and help you make an impact in the world. TensorFlow Hub has thousands of models available for you to deploy or customize to your task with transfer learning.

If you want to know more about how to use TensorFlow Hub and find great tutorials, check out the documentation at tensorflow.org/hub. To find models for your own real world impact, search on tfhub.dev.

Let us know what you build and also share with the community. You can talk to the team on the TensorFlow Forum and find a community that is eager to help!

Read More

Google demonstrates leading performance in latest MLPerf Benchmarks

Cross posted with the Google Cloud blog by Tao Wang, Software Engineer, Aarush Selvan, Product Manager.

The latest round of MLPerf benchmark results have been released, and Google’s TPU v4 supercomputers demonstrated record-breaking performance at scale. This is a timely milestone since large-scale machine learning training has enabled many of the recent breakthroughs in AI, with the latest models encompassing billions or even trillions of parameters (T5, Meena, GShard, Switch Transformer, and GPT-3).

Google’s TPU v4 Pod was designed, in part, to meet these expansive training needs, and TPU v4 Pods set performance records in four of the six MLPerf benchmarks Google entered using TensorFlow and JAX. These scores are a significant improvement over our winning submission from last year and demonstrate that Google once again has the world’s fastest machine learning supercomputers. These TPU v4 Pods are already widely deployed throughout Google data centers for our internal machine learning workloads and will be available via Google Cloud later this year.

Speedup of Google’s best MLPerf Training v1.0 TPU v4 submission over the fastest non-Google submission in any availability category - in this case, all baseline submissions came from NVIDIA. Comparisons are normalized by overall training time regardless of system size. Taller bars are better.
Figure 1: Speedup of Google’s best MLPerf Training v1.0 TPU v4 submission over the fastest non-Google submission in any availability category – in this case, all baseline submissions came from NVIDIA. Comparisons are normalized by overall training time regardless of system size. Taller bars are better.1

Let’s take a closer look at some of the innovations that delivered these ground-breaking results and what this means for large model training at Google and beyond.

Google’s continued performance leadership

Google’s submissions for the most recent MLPerf demonstrated leading top-line performance (fastest time to reach target quality), setting new performance records in four benchmarks. We achieved this by scaling up to 3,456 of our next-gen TPU v4 ASICs with hundreds of CPU hosts for the multiple benchmarks. We achieved an average of 1.7x improvement in our top-line submissions compared to last year’s results. This means we can now train some of the most common machine learning models in a matter of seconds.

Figure 2: Speedup of Google’s MLPerf Training v1.0 TPU v4 submission over Google’s MLPerf Training v0.7 TPU v3 submission (exception: DLRM results in MLPerf v0.7 were obtained using TPU v4). Comparisons are normalized by overall training time regardless of system size. Taller bars are better. Unet3D not shown since it is a new benchmark for MLPerf v1.0.
Figure 2: Speedup of Google’s MLPerf Training v1.0 TPU v4 submission over Google’s MLPerf Training v0.7 TPU v3 submission (exception: DLRM results in MLPerf v0.7 were obtained using TPU v4). Comparisons are normalized by overall training time regardless of system size. Taller bars are better. Unet3D not shown since it is a new benchmark for MLPerf v1.0.2

We achieved these performance improvements through continued investment in both our hardware and software stacks. Part of the speedup comes from using Google’s fourth-generation TPU ASIC, which offers a significant boost in raw processing power over the previous generation, TPU v3. 4,096 of these TPU v4 chips are networked together to create a TPU v4 Pod, with each pod delivering 1.1 exaflop/s of peak performance.

Figure 3: A visual representation of 1 exaflop/s of computing power. If 10 million laptops were running simultaneously, then all that computing power would almost match the computing power of 1 exaflop/s.

In parallel, we introduced a number of new features into the XLA compiler to improve the performance of any ML model running on TPU v4. One of these features provides the ability to operate two (or potentially more) TPU cores as a single logical device using a shared uniform memory access system. This memory space unification allows the cores to easily share input and output data – allowing for a more performant allocation of work across cores. A second feature improves performance through a fine-grained overlap of compute and communication. Finally, we introduced a technique to automatically transform convolution operations such that space dimensions are converted into additional batch dimensions. This technique improves performance at the low batch sizes that are common at very large scales.

Enabling large model research using carbon-free energy

Though the margin of difference in topline MLPerf benchmarks can be measured in mere seconds, this can translate to many days worth of training time on the state-of-the-art models that comprise billions or trillions of parameters. To give an example, today we can train a 4 trillion parameter dense Transformer with GSPMD on 2048 TPU cores. For context, this is over 20 times larger than the GPT-3 model published by OpenAI last year. We are already using TPU v4 Pods extensively within Google to develop research breakthroughs such as MUM and LaMDA, and improve our core products such as Search, Assistant and Translate. The faster training times from TPUs result in efficiency savings and improved research and development velocity. Many of these TPU v4 Pods will be operating at or near 90% carbon free energy. Furthermore, cloud datacenters can be ~1.4-2X more energy efficient than typical datacenters, and the ML-oriented accelerators – like TPUs – running inside them can be ~2-5X more effective than off-the-shelf systems.

We are also excited to soon offer TPU v4 Pods on Google Cloud, making the world’s fastest machine learning training supercomputers available to customers around the world, and we recently released an all-new Cloud TPU system architecture that provides direct access to TPU host machines, greatly improving the user experience.

Want to learn more?

Read how to get started using TPUs to train your model. We are excited to see how you will expand the machine learning frontier with access to exaflops of TPU computing power!

¹ All results retrieved from www.mlperf.org on June 30, 2021. MLPerf name and logo are trademarks. See www.mlperf.org for more information. Chart uses results 1.0-1067, 1.0-1070, 1.0-1071, 1.0-1072, 1.0-1073, 1.0-1074, 1.0-1075, 1.0-1076, 1.0-1077, 1.0-1088, 1.0-1089, 1.0-1090, 1.0-1091, 1.0-1092.

² All results retrieved from www.mlperf.org on June 30, 2021. MLPerf name and logo are trademarks. See www.mlperf.org for more information. Chart uses results 0.7-65, 0.7-66, 0.7-67, 1.0-1088, 1.0-1090, 1.0-1091, 1.0-1092..

Read More

2021 Request for Proposals: Faculty awards to support machine learning courses, diversity, and inclusion

Posted by Josh Gordon for the TensorFlow team

Google AI and the TensorFlow team have a funding opportunity open to universities. If you’re a faculty member interested in teaching machine learning courses, and/or leading or contributing to diversity initiatives, please read on to learn more. We have parallel goals for these awards, and you may apply for funding with one or both in mind.

  • We want to support you as you lead or contribute to diversity initiatives that widen access to ML education for currently underrepresented groups in computer science. We are especially interested in programs that include a clear focus on cultivating and retaining a “critical mass” of students, and that encourage students to pursue graduate study and/or future careers in ML.
  • We want to support you as you design, develop, and teach undergraduate or graduate level machine learning courses that include examples with open-source libraries. We would like to support courses that teach practical, in-demand skills, and to support courses that prepare students to solve new and challenging problems using ML in applied areas, such as healthcare, journalism, basic science, and others.

TensorFlow logo

We especially welcome proposals that combine these goals, proposals that include cross-institutional collaborations, and proposals that include collaborations between faculty and graduate students. We look forward to hearing your ideas!

To learn more, please see the RFP at goo.gle/tensorflow-rfp. Please note that the submission deadline is July 30th, 2021. For questions, you can reach out to tensorflow-rfp@google.com.

Read More

Meet TensorFlow community leads around the world

Posted by Joana Carrasqueira and Lynette Gaddi, Program Managers at Google

The TensorFlow community keeps growing every day, and includes many thousands of developers, educators, and researchers around the world. If you’d like to get involved with the community, there are many different organizations you can check out.

These include Special Interest Groups (SIGs), TensorFlow User Groups (TFUGs), and Google Developer Groups (GDGs). There are also many Google Developer Experts (GDEs) you can get in touch with. They’re knowledgeable about ML and help others in their community, and are a great point of contact to find future local events.

We spend a lot of time working with community leads, and in this article, we’d like to share some of their stories with you. We had the wonderful opportunity to interview several leads from different areas – including a SIG Lead, a Machine Learning GDE, and two TensorFlow User Group organizers, so you can learn about their background, how they got involved in the community, and how you can too.

TensorFlow branded banner with orange elements

Karl Lessard

TensorFlow SIG Lead for JVM

Montreal, Canada

Image of Karl Lessard

Karl has been working in software engineering and consulting for more than 20 years in various fields, including computer graphics and communications. He is now working full-time at Expedia in Montreal, focusing on delivering solutions for complex linguistic and localization challenges.

What does being a community leader mean to you?

What really matters is that all members of the group enjoy contributing to the project, and making it as fulfilling as possible for them. Because that’s what open-source is to me: a playground for grownups building something useful to the world! Being a community leader comes with a bunch of technical responsibilities, too, but for me that’s the most important thing.

How did you get involved in the TF community?

I started designing a few proposals to enhance the TensorFlow Java client, which at that time was offering minimal support for running model inference on Android devices. My proposed changes were welcomed by Google (special thanks to Asim Shankar), and I’ve submitted multiple pull requests over a couple years since then.

There was increasing general interest in supporting TensorFlow on the JVM following that, and I met the engineering team (and many others from the community) at a TensorFlow Dev summit in 2019 to suggest the idea of starting a group focusing on this topic. That’s how SIG JVM was born.

How do you contribute technically as a SIG Lead?

I still contribute to the design and the code of the project (like in the beginning), but I also review most of the pull requests, plan video calls, and make sure proposed changes are done with respect to the global vision of the project shared by other members, and they are being discussed properly and broadly.

Do you have any advice on how to get involved in the community?

If you can make it to the TensorFlow Dev/Contributor Summit, do it. That’s definitely a good place to meet people sharing the same interests as you. Also, you can get involved in the various discussions related to the topics of your interests. SIGs forums are a good place to start, and you can also get in touch with others on the new TensorFlow Forum. Finally, don’t be shy to make change proposals and/or to submit a few pull requests!

Ruqiya Bin Safi

Google Cloud GDE

Saudi Arabia

image of Ruqiya Bin Safi

Ruqiya Bin Safi is a Software Engineer that is interested in Artificial Intelligence, Machine Learning and Deep Learning as well as Data Science. Ruqiya started learning Machine Learning many years ago. She seeks to spread knowledge about Machine Learning and new technologies.

What does being a community leader mean to you?

It means a responsibility I take on, a goal I’ve accomplished, and a dream I fulfill. It means that I contribute to the development of my community and helping others. To be a community leader means sharing useful knowledge so that everyone can benefit. Leadership is also a give and take: the community gives to me, and I give back to the community. It’s a cooperation. We share similar goals and interests, and vision and mission. And we seek to employ what we’ve learned to develop tools that make the world a better place.

How did you get involved in the TF community?

I’ve always loved learning new technologies and using them to solve problems. I started as a software engineer, and found machine learning increasingly interesting as I studied it, and eventually made it a focus. I got involved with the TF community through a Women Techmakers (WTM) event. I then joined a local WTM community to help others learn about ML, and later applied and was accepted into the GDE program.

How do you contribute as a ML GDE?

I love giving talks, and most of my contributions were as speaker and technical trainer through various tech talks, panel discussions, and workshops. My goal is to help and motivate people to learn more about ML. I also run a deep learning monthly workshop that aims to help participants gain foundational knowledge of common deep learning techniques as well as practical experience in building neural networks with TensorFlow. I also really enjoy mentoring for Google for Startups Accelerator MENA program as well as some hackathons, and I write articles from time to time about machine learning and TensorFlow.

Do you have any advice on how to get involved in the community?

One of the best ways is to try to learn something new, and then share what you have learned with your community through blogging or a local meetup. Love what you do, and as much as you can, find work that you enjoy – and trust in yourself as you become more active and involved.

Armel Yara

TensorFlow User Group, Organizer

Abidja, West Africa

Photo of Armel Yara

Armel is a developer and TensorFlow community leader in Francophone Africa, where he organizes and hosts developer events in multiple languages (including large events like TensorFlow Everywhere SSA), and manages machine learning projects for local business

What does being a community leader mean to you?

Being a community leader means for me sharing experiences, being available for others, and listening to their needs and expectations.

How did you get involved in the TF community?

I get involved in the TF community by sharing the latest news about the TF community on my blog and working on open source projects.

How do you contribute as a TFUG lead?

I organize events and give technical sessions at universities and online.

Do you have any advice on how to get involved in the community?

My advice to become more active and get more involved in the community is to look over the membership expectations, share projects that you have built, and use them to motivate others. Lead by example, let others know how much you enjoy what you do and showcase your work.

Nijat Zeynalov

TensorFlow User Group, Organizer

Azerbaijan

photo of Nijat Zeynalov

Nijat is a Certified TensorFlow Developer and a first year master student at University of Tartu. He’s passionate about data science and machine learning, and organizing events to help others.

What does being a community leader mean to you?

I learned about leadership by running a local community where we aimed to provide free support to anyone interested in coding. In my mind, being a community leader means to help inspire others, and also to foster a community of respect that enables and encourages contributions of others. I strongly believe that leadership can be learned with practice.

How did you get involved in the TF community?

While I was preparing for the TensorFlow Developer certificate, I found a user groups page and I thought – why not set up a local user group for our country. I understood the responsibility of being a community leader, and I contacted a few TensorFlow User Group organizers to learn more about it. Their positive feedback about the overall impression made my decision even easier and motivated me to get started, and it’s been a great experience ever since.

How do you contribute as a community organizer?

We regularly discuss the latest TensorFlow updates in the user group, and we organise “Paper Reading Meetings” where we read and discuss one deep learning paper as a group. This has been a really great way for people to share their knowledge and ask questions. Additionally, in March, as a “TensorFlow User Group – Azerbaijan”, we held the 5-hour long “TensorFlow Everywhere – 2021” event which was the country’s largest machine learning event to date.

It was a pleasure to speak with Karl, Ruqiya, Armel and Nijat (thank you again for your time and contributions!) We hope their stories inspire you to get involved, and take on a leadership role in your local community in the future. If you’d like, you can start a conversation on the TensorFlow Forum and share how you got involved in the TensorFlow Community, and meet others. And check out the top of this post for more links to user and special interest groups.

Read More

Easier object detection on mobile with TensorFlow Lite

Posted by Khanh LeViet, Developer Advocate on behalf of the TensorFlow Lite team

Example of object detection on mobile

At Google I/O this year, we are excited to announce several product updates that simplify training and deployment of object detection models on mobile devices:

  • On-device ML learning pathway: a step-by-step tutorial on how to train and deploy a custom object detection model on mobile devices with no machine learning expertise required.
  • EfficientDet-Lite: a state-of-the-art object detection model architecture optimized for mobile devices.
  • TensorFlow Lite Model Maker for object detection: train custom models in just a few lines of code.
  • TensorFlow Lite Metadata Writer API: simplify metadata creation to generate custom object detection models compatible with TFLite Task Library.

Despite being a very common ML use case, object detection can be one of the most difficult to do. We’ve worked hard to make it easier for you, and in this blog post we’ll show you how to leverage the latest offerings from TensorFlow Lite to build a state-of-the-art mobile object detector using your own domain data.

On-device ML learning pathway: learn how to train and deploy custom TensorFlow Lite object detection model in 12 minutes.

Training a custom object detection model and deploying it to an Android app has become super easy with TensorFlow Lite. We released a learning pathway that teaches you step-by-step how to do it.

In the video, you can learn the steps to build a custom object detector:

  1. Prepare the training data.
  2. Train a custom object detection model using TensorFlow Lite Model Maker.
  3. Deploy the model on your mobile app using TensorFlow Lite Task Library.

There’s also a codelab with source code on GitHub for you to run through the code yourself. Please try it out and let us know your feedback!

EfficientDet-Lite: the state-of-the-art model architecture for object detection on mobile devices

Running machine learning models on mobile devices means we always need to consider the trade-off between model accuracy vs. inference speed and model size. The state-of-the-art mobile-optimized model doesn’t only need to be more accurate, but it also needs to run faster and be smaller. We adapted the neural architecture search technique published in the EfficientDet paper, then optimized the model architecture for running on mobile devices and came up with a novel mobile object detection model family called EfficientDet-Lite.

EfficientDet-Lite has 5 different versions: Lite0 to Lite4. The smaller version runs faster but is not as accurate as the larger version. You can experiment with multiple versions of EfficientNet-Lite and choose the one that is most suitable for your use case.

Model architecture

Size(MB)*

Latency (ms)**

Average Precision***

EfficientDet-Lite0

4.4

37

25.69%

EfficientDet-Lite1

5.8

49

30.55%

EfficientDet-Lite2

7.2

69

33.97%

EfficientDet-Lite3

11.4

116

37.70%

EfficientDet-Lite4

19.9

260

41.96%

SSD MobileNetV2 320×320

6.7

24

20.2%

SSD MobileNetV2 FPNLite 640×640

4.3

191

28.2%

* Size of the integer quantized models.

** Latency measured on Pixel 4 using 4 threads on CPU.

*** Average Precision is the mAP (mean Average Precision) on the COCO 2017 validation dataset.

We have released the EfficientDet-Lite models trained on the COCO dataset to TensorFlow Hub. You also can train EfficientDet-Lite custom models using your own training data with TensorFlow Lite Model Maker.

TensorFlow Lite Model Maker: train a custom object detection using transfer learning in a few lines of code

TensorFlow Lite Model Maker is a Python library that significantly simplifies the process of training a machine learning model using a custom dataset. It leverages transfer learning to enable training high quality models using just a handful of images.

Model Maker accepts datasets in the PASCAL VOC format and the Cloud AutoML’s CSV format. As you can create your own dataset using open-source GUI tools such as LabelImg or makesense.ai, everyone can create training data for Model Maker without writing a single line of code.

Once you have your training data, you can start training a TensorFlow Lite custom object detectors.

# Step 1: Choose the model architecture
spec = model_spec.get('efficientdet_lite2')

# Step 2: Load your training data
train_data, validation_data, test_data = object_detector.DataLoader.from_csv('gs://cloud-ml-data/img/openimage/csv/salads_ml_use.csv')

# Step 3: Train a custom object detector
model = object_detector.create(train_data, model_spec=spec, validation_data=validation_data)

# Step 4: Export the model in the TensorFlow Lite format
model.export(export_dir='.')

# Step 5: Evaluate the TensorFlow Lite model
model.evaluate_tflite('model.tflite', test_data)

Check out this notebook to learn more.

TensorFlow Lite Task Library: deploying object detection models on mobile in a few lines of code

TensorFlow Lite Task Library is a cross-platform library which simplifies TensorFlow Lite model deployments on mobile. Custom object detection models trained with TensorFlow Lite Model Maker can be deployed to an Android app in just a few lines of Kotlin code:

// Step 1: Load the TensorFlow Lite model
val detector = ObjectDetector.createFromFile(context, "model.tflite")

// Step 2: Convert the input Bitmap into a TensorFlow Lite's TensorImage object
val image = TensorImage.fromBitmap(bitmap)

// Step 3: Feed given image to the model and get the detection result
val results = detector.detect(image)

See our documentation to learn more about the customization options in Task Library, including how to configure the minimum detection threshold or the maximum number of detected objects.

TensorFlow Lite Metadata Writer API: simplify deployment of custom models trained with TensorFlow Object Detection API

Task Library relies on the model metadata bundled in the TensorFlow Lite model to execute the preprocessing and postprocessing logic required to run inference using the model. They include how to normalize the input image, or how to map the class id to human readable labels. Models trained using Model Maker have these metadata by default, making them compatible with Task Library. But if you train a TensorFlow Lite object detection model using a training pipeline other than Model Maker, you can add the metadata using TensorFlow Lite Metadata Writer API.

For example, if you train a model using TensorFlow Object Detection API, you can add metadata to the TensorFlow Lite model using this Python code:

LABEL_PATH = 'label_map.txt'
MODEL_PATH = "ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/model.tflite"
SAVE_TO_PATH = "ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/model_with_metadata.tflite"

# Step 1: Specify the preprocessing parameters and label file
writer = object_detector.MetadataWriter.create_for_inference(
writer_utils.load_file(MODEL_PATH), input_norm_mean=[0],
input_norm_std=[255], label_file_paths=[LABEL_PATH])

# Step 2: Export the model with metadata
writer_utils.save_file(writer.populate(), SAVE_TO_PATH)

Here we specify the normalization parameters (input_norm_mean=[0], input_norm_std=[255]) so that the input image will be normalized into the [0..1] range. You need to specify normalization parameters to be the same as in the preprocessing logic used during the model training.

See this notebook for a full tutorial on how to convert models trained with the TensorFlow Object Detection API to TensorFlow Lite and add metadata.

What’s next

Our goal is to make machine learning easier to use for every developer, with or without machine learning expertise. We are working with the Model Garden team to bring more object detection model architectures to Model Maker. We will also continue to work with researchers in Google to make future state-of-the-art object detection models available via Model Maker, shortening the path from cutting-edge research to production for everyone. Stay tuned for more updates!

Read More