Optimizing Peptides in TensorFlow 2

A guest post by Somesh Mohapatra, Rafael Gómez-Bombarelli of MIT

Introduction

A polymer is a material made up of long, repeating chains of molecules, like plastic or rubber. These chains consist of subunits (monomers) that are chemically bonded to one another. The chemical composition and arrangement of the monomers dictate the properties of the polymer. A few examples of polymers in everyday use are water bottles, non-stick Teflon coatings, and adhesives.

Figure 1. Conceptually, you can think of Peptimizer as generating a sequence of amino acids, then predicting a property of the peptide, then optimizing the sequence.

Peptides are short polymer chains made up of amino acids, analogous to words composed of letters. They are widely used for therapeutic applications, such as the delivery of gene therapy by cell-penetrating peptides. Thanks to their modular chemistry, which is amenable to automated synthesis, and their expansive design space, peptides are increasingly preferred over more conventional small-molecule drugs, which are harder to synthesize. However, the vast sequence space (in terms of the amino acid arrangement) acts as an impediment to the design of functional peptides.

Synthetic accessibility, apart from functionality optimization, is a challenge. Peptides and other functional polymers with a precise arrangement of monomers are synthesized using methods such as flow chemistry. The synthesis involves monomer-by-monomer addition to a growing polymer chain. This process necessitates a high reaction yield for every step, thus making the accessibility of longer chains challenging.

Conventional approaches for optimization of functional polymers, such as peptides, in a lab environment involve the heuristic exploration of chemical space by trial-and-error. However, the number of possible polymers rises exponentially as m^n, where m is the number of possible monomers, and n is the polymer length.

As an alternative to doing an experiment in a lab, you can design functional polymers using machine learning. In our work on optimizing cell-penetrating activity and synthetic accessibility, we design peptides using Peptimizer, a machine learning framework based on TensorFlow. Conceptually, you can think of Peptimizer as generating a sequence of amino acids, then predicting a property of the peptide, then optimizing the sequence.

Peptimizer can be used to optimize both functionality (not only cell-penetrating activity) and synthetic accessibility of polymers. We use topological representations of monomers (amino acids) and matrix representations of polymer chains (peptide sequences) to develop interpretable machine learning models, that is, models that can attribute the gain in a property to a specific monomer and/or chemical substructure. The choice of representation and model architecture enables the inference of biochemical design principles, such as monomer composition, sequence length, or net charge of the polymer, using gradient-based attribution methods.

Key challenges for applying machine learning to advance functional peptide design include limited dataset size (usually less than 100 data points), choosing effective representations, and the ability to explain and interpret models.

Here, we use a dataset of peptides received from our experimental collaborators to demonstrate the utility of the codebase.

Optimization of functionality

Based on our work on designing novel and highly efficient cell-penetrating peptides, we present a framework for the discovery of functional polymers (Figure 1). The framework consists of a recurrent neural network generator, convolutional neural network predictor, and genetic algorithm optimizer.

The generator is trained on a dataset of peptide sequences using Teacher Forcing, and enables sampling of novel sequences similar to the ones in the training dataset. The predictor is trained on matrix representations of sequences and experimentally determined biological activity. The optimizer is seeded with sequences sampled from the generator. It optimizes an objective function that involves the predicted activity and other parameters such as length and arginine content. The outcome is a list of optimized sequences with high predicted activity, which may be validated in wet-lab experiments.
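For concreteness, here is a minimal sketch of what such a sequence-to-activity predictor can look like in Keras. This is not the actual Peptimizer code; the one-hot encoding, dimensions, and architecture below are illustrative assumptions.

import numpy as np
import tensorflow as tf

MAX_LEN, N_AA = 30, 20  # assumed maximum peptide length and amino acid alphabet size

def build_predictor():
    # 1D-CNN mapping a (MAX_LEN x N_AA) peptide matrix to a scalar predicted activity.
    inputs = tf.keras.Input(shape=(MAX_LEN, N_AA))
    x = tf.keras.layers.Conv1D(64, 5, activation="relu", padding="same")(inputs)
    x = tf.keras.layers.Conv1D(64, 5, activation="relu", padding="same")(x)
    x = tf.keras.layers.GlobalMaxPooling1D()(x)
    x = tf.keras.layers.Dense(32, activation="relu")(x)
    return tf.keras.Model(inputs, tf.keras.layers.Dense(1)(x))

predictor = build_predictor()
predictor.compile(optimizer="adam", loss="mse")

# Toy stand-ins for (sequence matrix, measured activity) pairs.
x_train = np.random.rand(64, MAX_LEN, N_AA).astype("float32")
y_train = np.random.rand(64, 1).astype("float32")
predictor.fit(x_train, y_train, epochs=2, verbose=0)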

Each of these components can be accessed from the tutorial notebook to train on a custom dataset. The scripts for the individual components have been designed in a modular fashion and can be modified with relative ease.

Optimization of synthetic accessibility

Apart from functionality optimization, Peptimizer allows for the optimization of synthetic accessibility of a wild-type sequence (Figure 2). The framework consists of a multi-modal convolutional neural network predictor and a brute force optimizer. The predictor is trained over experimental synthesis parameters such as pre-synthesized chain, incoming monomer, temperature, flow rate, and catalysts. The optimizer evaluates single-point mutants of the wild-type sequence for higher theoretical yield.

The choice of a brute force optimizer for optimization of synthetic accessibility is based on the linearly growing sequence space (m x n) for the variations of the wild-type sequence. This sequence space is relatively small in comparison to the exponentially growing sequence space (m^n) encountered in optimization of functionality.
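To make the difference concrete, the sketch below enumerates every single-point mutant of a hypothetical wild-type peptide, which is the roughly m x n search space a brute-force optimizer walks through; the sequence and helper names here are illustrative, not taken from the paper.

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")   # m = 20 monomers
wild_type = "GRKKRRQRRR"                     # hypothetical 10-residue wild type

def single_point_mutants(sequence, alphabet):
    # Yield every sequence differing from `sequence` at exactly one position.
    for i, original in enumerate(sequence):
        for monomer in alphabet:
            if monomer != original:
                yield sequence[:i] + monomer + sequence[i + 1:]

mutants = list(single_point_mutants(wild_type, AMINO_ACIDS))
print(len(mutants))                          # 10 * 19 = 190 candidates to evaluate
print(len(AMINO_ACIDS) ** len(wild_type))    # 20**10, roughly 10^13 full sequences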

This framework may be adapted for other stepwise chemical reaction platforms with in-line monitoring by specifying the different input and output variables and respective data types. It can be accessed using a tutorial notebook.

Figure 2. Outline of synthetic accessibility optimization.

Interpretability of models

A key feature of Peptimizer is the gradient-based attribution for the interpretation of model predictions (Figure 3). Taking the gradient of the predicted activity with respect to the input sequence representation, we visualize both positive and negative activations for each input feature. Fingerprint indices corresponding to substructures that contribute positively to the activity have higher activation in the heatmap. This activation heatmap is averaged along the topological fingerprints axis to find key substructures or chemical motifs that contribute positively or negatively to the predicted activity. Averaging over the monomer position axis, we obtain the relative contribution of each monomer to the predicted functionality of the polymer. These visualizations provide in-depth insight into sequence-activity relationships and add to the contemporary understanding of biochemical design principles.

Figure 3. (left) Positive gradient activation heatmap, and (right) activated chemical substructure, for functional peptide sequence.
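As a rough sketch of how such an attribution can be computed with TensorFlow (this mirrors the idea, not the exact Peptimizer implementation), the predicted activity can be differentiated with respect to the input matrix using tf.GradientTape:

import tensorflow as tf

def attribution_map(predictor, sequence_matrix):
    # Gradient of the scalar predicted activity w.r.t. each input feature.
    x = tf.convert_to_tensor(sequence_matrix[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        predicted_activity = predictor(x)
    grads = tape.gradient(predicted_activity, x)[0]  # (positions, fingerprint bits)
    # Collapsing one axis or the other gives per-substructure or per-monomer scores.
    substructure_scores = tf.reduce_mean(grads, axis=0)
    monomer_scores = tf.reduce_mean(grads, axis=1)
    return grads, substructure_scores, monomer_scores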

Outlook

Optimization of functional polymers using Peptimizer can inform experimental strategies and lead to significant savings in terms of time and costs. We believe that the tutorial notebooks will help bench scientists in chemistry, materials science, and the broader field of sequence design to run machine learning models over custom datasets, such as Khazana. In addition, the attribution methods will provide insights into the high-dimensional sequence-activity relationships and elucidation of design principles.

Experimental collaboration

This work was done in collaboration with the lab of Bradley Pentelute (Department of Chemistry, MIT). The collaborators for the optimization of functionality and synthetic accessibility were Carly Schissel and Dr. Nina Hartrampf, respectively. We thank them for providing the dataset, experimental validation, and the discussion during the development of the models.

Acknowledgment

We would like to acknowledge the support of Thiru Palanisamy and Josh Gordon at Google for their help with the blog post collaboration and for providing active feedback.

Even Faster Mobile GPU Inference with OpenCL

Posted by Juhyun Lee and Raman Sarokin, Software Engineers

While the TensorFlow Lite (TFLite) GPU team continuously improves the existing OpenGL-based mobile GPU inference engine, we also keep investigating other technologies. One of those experiments turned out to be quite successful, and we are excited to announce the official launch of our OpenCL-based mobile GPU inference engine for Android, which offers up to ~2x speedup over our existing OpenGL backend on reasonably sized neural networks that have enough workload for the GPU.

Figure 1. Duo’s AR effects are powered by our OpenCL backend.

Improvements over the OpenGL Backend

Historically, OpenGL is an API designed for rendering vector graphics. Compute shaders were added with OpenGL ES 3.1, but its backward-compatible API design decisions limited us from reaching the full potential of the GPU. OpenCL, on the other hand, was designed for computation with various accelerators from the beginning and is thus more relevant to our domain of mobile GPU inference. Therefore, we looked into an OpenCL-based inference engine, and it brings quite a lot of features that let us optimize our mobile GPU inference engine.

Performance Profiling: Optimizing the OpenCL backend was much easier than OpenGL, because OpenCL offers good profiling features and Adreno supports them well. With these profiling APIs, we are able to measure the performance of each kernel dispatch very precisely.

Optimized Workgroup Sizes: We have observed that the performance of TFLite GPU on Qualcomm Adreno GPUs is very sensitive to workgroup sizes; picking the right workgroup size can boost performance, whereas picking the wrong one can degrade it by an equal amount. Unfortunately, picking the right workgroup size is not trivial for complex kernels with complicated memory access patterns. With the help of the aforementioned performance profiling features in OpenCL, we were able to implement an optimizer for workgroup sizes, which resulted in up to a 50% speedup over the average.

Native 16-bit Precision Floating Point (FP16): OpenCL supports FP16 natively and requires the accelerator to specify the data type's availability. Because it is part of the official spec, even some of the older GPUs, e.g. the Adreno 305 from 2012, can operate at their full capabilities. OpenGL, on the other hand, relies on hints which vendors can choose to ignore in their implementations, leading to no performance guarantees.

Constant Memory: OpenCL has a concept of constant memory. Qualcomm added a physical memory with properties that make it ideal for use with OpenCL's constant memory. This turned out to be very efficient for certain special cases, e.g. very thin layers at the beginning or the end of the neural network. OpenCL on Adreno is able to greatly outperform OpenGL by combining this physical constant memory with the aforementioned native FP16 support.

Performance Evaluation

Below, we show the performance of TFLite on the CPU (single-threaded on a big core), on the GPU using our existing OpenGL backend, and on the GPU using our new OpenCL backend. Figure 2 and Figure 3 depict the performance of the inference engine on select Android devices with OpenCL on a couple of well-known neural networks, MNASNet 1.3 and SSD MobileNet v3 (large), respectively. Each group of three bars should be read independently; it shows the relative speedup among the TFLite backends on a single device. Our new OpenCL backend is roughly twice as fast as the OpenGL backend overall, and does particularly well on Adreno devices (annotated with SD), as we have tuned the workgroup sizes with Adreno's performance profilers mentioned earlier. Also, the difference between Figure 2 and Figure 3 shows that OpenCL performs even better on larger networks.

Figure 2. Inference latency of MNASNet 1.3 on select Android devices with OpenCL.
Figure 3. Inference latency of SSD MobileNet v3 (large) on select Android devices with OpenCL.

Seamless Integration through the GPU Delegate

One major hurdle in employing the OpenCL inference engine is that OpenCL is not a part of the standard Android distribution. While major Android vendors include OpenCL in their system libraries, it is possible that OpenCL is not available to some users. For these devices, one needs to fall back to the OpenGL backend, which is available on every Android device.

To make developers' lives easier, we have added a couple of modifications to the TFLite GPU delegate. We first check the availability of OpenCL at runtime. If it is available, we employ the new OpenCL backend, as it is much faster than the OpenGL backend; if it is unavailable or couldn't be loaded, we fall back to the existing OpenGL backend. In fact, the OpenCL backend has been in the TensorFlow repository since mid-2019 and is seamlessly integrated through the TFLite GPU delegate v2, so you might already be using it through the delegate's fallback mechanism.

Acknowledgements

Andrei Kulik, Matthias Grundman, Jared Duke, Sarah Sirajuddin, and special thanks to Sachin Joglekar for his contributions to this blog post.

Creating Sounds Of India: An on device, AI powered, musical experience built with TensorFlow

Posted by Anusha Ramesh, Product Manager TFX, David Zats, Software Engineer TFX, Ping Yu, Software Engineer TensorFlow.js, Lamtharn (Hanoi) Hantrakul, AI Resident Magenta

Introduction

Sounds of India is a unique and fun interactive musical experience launching for India’s 74th Independence Day, inspired by Indian tradition and powered by machine learning. When users throughout India (and around the world) sing the Indian National Anthem into the microphone of their mobile devices, machine learning models transform their voices into a range of classical Indian musical instruments live in the browser. The entire process of creating this experience took only 12 weeks, showing how rapidly developers can take models from research to production at scale using the TensorFlow Ecosystem.

The Research: Magenta’s Differentiable Digital Signal Processing (DDSP)

Magenta is an open source research project within Google AI exploring the role of machine learning in the creative process. Differentiable Digital Signal Processing, or DDSP, is a new open source library fusing modern machine learning with interpretable signal processing. Instead of training a pure deep learning model like WaveNet to render waveforms sample-by-sample, we can train lightweight models that output time-varying control signals into these differentiable DSP modules (hence the extra "D" in DDSP), which synthesize the final sound. Both recurrent and convolutional models incorporating DDSP in TensorFlow Keras layers can efficiently generate audio 1,000 times faster than their larger autoregressive counterparts, with a 100x reduction in model parameters and training data requirements. One particularly fun application of DDSP is Tone Transfer, which transforms sounds into musical instruments: a DDSP model is first trained on 15 minutes of a target instrument, say a saxophone; you can then sing a melody and the trained DDSP model will re-render it as that saxophone. For Sounds of India, we applied this technology to three classical Indian instruments: the Bansuri, the Shehnai, and the Sarangi.
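To make the "lightweight network plus differentiable synthesizer" idea more tangible, here is a conceptual sketch in TensorFlow. It is not the Magenta DDSP library API: the model, feature shapes, and the toy additive synthesizer are assumptions, and controls are computed at sample rate here for simplicity (the real library upsamples frame-rate controls).

import math
import tensorflow as tf

SAMPLE_RATE = 16000
N_HARMONICS = 16

# Small network mapping per-sample (f0, loudness) features to harmonic amplitudes.
controller = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 2)),
    tf.keras.layers.GRU(64, return_sequences=True),
    tf.keras.layers.Dense(N_HARMONICS, activation="softplus"),
])

def harmonic_synth(f0_hz, amplitudes):
    # Differentiable additive synth: a weighted sum of harmonics of the fundamental f0.
    harmonic_numbers = tf.range(1, N_HARMONICS + 1, dtype=tf.float32)
    phase = 2.0 * math.pi * tf.cumsum(f0_hz, axis=1) / SAMPLE_RATE
    phases = phase[..., None] * harmonic_numbers        # (batch, samples, harmonics)
    return tf.reduce_sum(amplitudes * tf.sin(phases), axis=-1)

features = tf.random.uniform((1, 1000, 2))              # toy (f0, loudness) input
f0 = 200.0 + 200.0 * features[..., 0]                   # toy f0 trajectory in Hz
audio = harmonic_synth(f0, controller(features))        # gradients flow end to end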

Train with TFX, deploy to the browser with TensorFlow.js

TFX

TensorFlow Extended (TFX) is an end-to-end platform for production ML, which includes preparing data, training, validating, and deploying models in production environments. TFX was used to train the models responsible for transforming the user’s voice to one of the instruments, and these models were then converted to TensorFlow.js for deployment on a standard Web browser.

Deploying to the browser provides a seamless experience for users to interact with the machine learning model: simply click a hyperlink and load the page just like any other website. No complicated installs necessary. By executing client side in the browser, we are able to perform inference right at the source of the sensor data, minimising latency and reducing server costs associated with large graphics cards, CPU, and memory. Moreover, given the application uses your voice as input, user privacy is quite important. Since the entire end-to-end experience happens client-side and in the browser, absolutely no sensor or microphone data is sent to the server side.

Browser-based machine learning models are often optimized to be as small as possible to minimize bandwidth used. In this case, the ideal hyperparameters for each musical instrument can also vary drastically. We leveraged TFX to perform large-scale training and tuning over hundreds of models to determine the smallest size for each instrument. As a result, we were able to dramatically reduce their memory footprints. For example, the Bansuri instrument model had a reduction in its on-disk size of ~20x without a noticeable impact on sound quality.

TFX also empowered us to perform rapid iteration over different model architectures (GRU, CNN), different types of inputs (loudness, RMS energy), and varying musical instrument data sources. Each time, we were able to quickly and effectively run the TFX pipeline to produce a new model with the desired characteristics.

TensorFlow.js

Creating a TensorFlow.js DDSP model was uniquely challenging because of the need to hit tight performance and model quality targets. We needed the model to be highly efficient at performing tone transfer so that it could effectively run on mobile devices. At the same time, any degradation in model quality would quickly lead to audio distortions and a poor user experience.

We started by exploring a wide range of TensorFlow.js backends and model architectures. The WebGL backend is the most optimized, while the WebAssembly backend works well on low end phones. Given the computational requirements of DDSP, we settled on a Convnet-based DDSP model and leveraged the WebGL backend.

To minimize the model download time, we studied the topology of the model, and compressed a large set of constant tensors with Fill/ZeroLike ops, which reduced the size from 10MB to 300KB.

We also focused on three key areas to make the TensorFlow.js model ready for production scale deployment on devices: inference performance, memory footprint, and numerical stability.

Inference Performance Optimization
DDSP models contain both a neural network and a signal synthesizer. The synthesizer part has many signal processing ops that require large amounts of computation. To improve performance on mobile devices, we rewrote several kernels with special WebGL shaders to fully utilize the GPU. For example, a parallel version of the cumulative summation op reduced inference time by 90%.
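As an illustration of why a parallel formulation helps (this is a conceptual NumPy sketch, not the actual TensorFlow.js WebGL kernel), a Hillis-Steele scan computes a cumulative sum in O(log n) dependent steps, each of which a GPU can apply to all elements at once:

import numpy as np

def parallel_cumsum(x):
    x = np.asarray(x, dtype=np.float64).copy()
    offset = 1
    while offset < len(x):
        shifted = np.concatenate([np.zeros(offset), x[:-offset]])
        x = x + shifted          # one step; every element updates independently
        offset *= 2
    return x

a = np.random.rand(1000)
assert np.allclose(parallel_cumsum(a), np.cumsum(a))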

Reduce memory footprint
Our goal is to be able to run the model on as many types of mobile devices as possible. Since many phones have very limited GPU memory, we need to make sure that the model has a minimal memory footprint. We achieve this by disposing of intermediate tensors and adding a new flag to allow early disposal of GPU textures. Through these approaches, we were able to reduce memory usage by 60%.

Numerical stability
The DDSP model requires very high numerical precision in order to generate beautiful music. This is quite different from typical classification models, where a certain level of precision loss does not affect the final classifications. The DDSP models used in this experience are generative models, and any loss in precision or discontinuity in the audio output is easily picked up by our sensitive ears. We encountered numerical stability problems with float16 WebGL textures, so we rewrote some of the key ops to reduce overflow and underflow of the outputs. For example, in the cumulative summation op, we make sure the accumulation is done within the shader at full float precision, and apply a modulo calculation to avoid overflow before we write the output to a float16 texture.
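The NumPy toy example below illustrates the precision problem and the modulo fix; it is only an analogy for the shader logic, not the shader itself. A long cumulative phase sum stored in float16 becomes too coarse to be useful, while wrapping the phase to [0, 2*pi) before the cast keeps it accurate.

import numpy as np

phase = np.cumsum(np.full(200_000, 0.05))            # grows to ~1e4 in float64

naive_fp16 = phase.astype(np.float16)                # large values, coarse float16 spacing
wrapped_fp16 = np.mod(phase, 2 * np.pi).astype(np.float16)

print(np.abs(np.sin(naive_fp16.astype(np.float64)) - np.sin(phase)).max())    # large error
print(np.abs(np.sin(wrapped_fp16.astype(np.float64)) - np.sin(phase)).max())  # small error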

Try it yourself!

You can try out the experience on your mobile phone at g.co/SoundsofIndia – and please share your results with us if you wish. We would love to see what you create with your voice.

If you are excited about how machine learning can augment creativity and innovation, you can learn more about Magenta through the team's blog and contribute to their open source GitHub repository, or check out #MadeWithTFJS for even more examples of browser-based machine learning from the TensorFlow.js community. If you are interested in training and deploying models at production scale using ML best practices, check out the TensorFlow Extended blog.

Acknowledgements

This project wouldn’t have been possible without the incredible effort of Miguel de Andrés-Clavera, Yiling Liu, Aditya Mirchandani, KC Chung, Alap Bharadwaj, Kiattiyot (Boon) Panichprecha, Pittayathorn (Kim) Nomrak, Phatchara (Lek) Pongsakorntorn, Nattadet Chinthanathatset, Hieu Dang, Ann Yuan, Sandeep Gupta, Chong Li, Edwin Toh, Jesse Engel and additional help from Michelle Carney, Nida Zada, Doug Eck, Hannes Widsomer and Greg Mikels. Huge thanks to Tris Warkentin and Mitch Trott for their tremendous support.

TensorFlow Model Optimization Toolkit — Weight Clustering API

A guest post by Mohamed Nour Abouelseoud, and Anton Kachatkou at Arm

We are excited to introduce a weight clustering API, proposed and contributed by Arm, to the TensorFlow Model Optimization Toolkit.

Weight clustering is a technique to reduce the storage and transfer size of your model by replacing many unique parameter values with a smaller number of unique values. This benefit applies to all deployments. Along with framework and hardware-specific support, such as in the Arm Ethos-N and Ethos-U machine learning processors, weight clustering can additionally improve memory footprint and inference speed.

This work is part of the toolkit’s roadmap to support the development of smaller and faster ML models. You can see previous posts on post-training quantization, quantization-aware training, and sparsity for more background on the toolkit and what it can do.

Arm and the TensorFlow team have been collaborating in this space to improve deployment to mobile and IoT devices.

What is weight clustering?

Increasingly, deep learning applications are moving into more resource-constrained environments, from smartphones to agricultural sensors and medical instruments. This shift has led to efforts to design smaller and more efficient model architectures, as well as an increased emphasis on model optimization techniques such as pruning and quantization.

Weight clustering is an optimization algorithm to reduce the storage and network transfer size of your model. The idea in a nutshell is explained in the diagram below.

Here’s an explanation of the diagram. Imagine, for example, that a layer in your model contains a 4×4 matrix of weights (represented by the “weight matrix” above). Each weight is stored using a float32 value. When you save the model, you are storing 16 unique float32 values to disk.

Weight clustering reduces the size of your model by replacing similar weights in a layer with the same value. These values are found by running a clustering algorithm over the model’s trained weights. The user can specify the number of clusters (in this case, 4). This step is shown in “Get centroids” in the diagram above, and the 4 centroid values are shown in the “Centroid” table. Each centroid value has an index (0-3).

Next, each weight in the weight matrix is replaced with its centroid’s index. This step is shown in “Assign indices”. Now, instead of storing the original weight matrix, the weight clustering algorithm can store the modified matrix shown in “Pull indices” (containing the index of the centroid values), and the centroid values themselves.

In this case, we have reduced the size from 16 unique floats to 4 floats plus 16 two-bit indices. The savings increase with larger matrix sizes.

Note that even if we still stored 16 floats, they now have just 4 distinct values. Common compression tools (like zip) can now take advantage of the redundancy in the data to achieve higher compression.
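The NumPy sketch below walks through the same idea numerically; it is only an illustration of the diagram, not the toolkit's implementation. A 4x4 float32 weight matrix is reduced to 4 centroids plus sixteen 2-bit indices.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)
flat = weights.reshape(-1)

n_clusters = 4
centroids = np.linspace(flat.min(), flat.max(), n_clusters)  # linear initialization
for _ in range(20):                                          # simple k-means stand-in
    indices = np.argmin(np.abs(flat[:, None] - centroids), axis=1)
    for k in range(n_clusters):
        if np.any(indices == k):
            centroids[k] = flat[indices == k].mean()

clustered = centroids[indices].reshape(weights.shape)  # weights rebuilt from 4 values
print(np.unique(clustered).size)                       # only 4 distinct values remain
print(indices.reshape(4, 4))                           # the 2-bit index stored per weight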

The technical implementation of clustering is derived from Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. See the paper for additional details on the gradient update and weight retrieval.

Clustering is available through a simple Keras API, in which any Keras model (or layer) can be wrapped and fine-tuned. See usage examples below.

Advantages of weight clustering

Weight clustering has an immediate advantage in reducing model storage and transfer size across serialization formats, as a model with shared parameters has a much higher compression rate than one without. This is similar to a sparse (pruned) model, except that the compression benefit is achieved through reducing the number of unique weights, while pruning achieves it through setting weights below a certain threshold to zero. Once a Keras model is clustered, the benefit of the reduced size is available by passing it through any common compression tool.
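One simple way to see that benefit is to compare the compressed size of a model before and after clustering; the helper below is a sketch along the lines of the toolkit's documentation, not part of the TFMOT API itself.

import os
import tempfile
import zipfile

def gzipped_size_bytes(keras_model):
    # Save the model, compress it, and return the compressed size in bytes.
    _, keras_file = tempfile.mkstemp(".h5")
    keras_model.save(keras_file, include_optimizer=False)
    _, zipped_file = tempfile.mkstemp(".zip")
    with zipfile.ZipFile(zipped_file, "w", compression=zipfile.ZIP_DEFLATED) as f:
        f.write(keras_file)
    return os.path.getsize(zipped_file)

# e.g. compare gzipped_size_bytes(original_model) with the clustered model
# (after strip_clustering, shown in the API example below).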

To further unlock the improvements in memory usage and speed at inference time associated with clustering, specialized runtime or compiler software and dedicated machine learning hardware are required. Examples include the Arm ML Ethos-N driver stack for the Ethos-N processor and the Ethos-U Vela compiler for the Ethos-U processor. Both examples currently require quantizing and converting optimized Keras models to TensorFlow Lite first.

Clustering can be done on its own or as part of a cascaded Deep Compression optimization pipeline to achieve further size reduction and inference speedups.

Compression and accuracy results

Experiments were run on several popular models, demonstrating compression benefits of weight clustering. More aggressive optimizations can be applied, but at the cost of accuracy. Though the table below includes measurements for TensorFlow Lite models, similar benefits are observed for other serialization formats such as SavedModel.

The table below demonstrates how clustering was configured to achieve the results. Some models were more prone to accuracy degradation from aggressive clustering, in which case selective clustering was used on layers that are more robust to optimization.

Clustering a model

The clustering API is available in the TensorFlow Model Optimization Toolkit starting from release v0.4.0. To cluster a model, it needs to be fully trained first before passing it to the clustering API. A snippet of full model clustering is shown below.

import tensorflow_model_optimization as tfmot

cluster_weights = tfmot.clustering.keras.cluster_weights

# Start from a fully trained Keras model (get_pretrained_model is a placeholder).
pretrained_model = get_pretrained_model()

clustering_params = {
    'number_of_clusters': 32,
    'cluster_centroids_init': tfmot.clustering.keras.CentroidInitialization.LINEAR
}

# Wrap the whole model for clustering.
clustered_model = cluster_weights(pretrained_model, **clustering_params)

# Fine-tune
clustered_model.fit(...)

# Prepare model for serving by removing training-only variables.
model_for_serving = tfmot.clustering.keras.strip_clustering(clustered_model)

...

To cluster select layers in a model, you can apply the same clustering method to those layers when constructing a model.

clustered_model = tf.keras.Sequential([
    Dense(...),
    cluster_weights(
        Dense(...,
              kernel_initializer=pretrained_weights,
              bias_initializer=pretrained_bias),
        **clustering_params),
    Dense(...),
])

When selectively clustering a layer, it still needs to have been fully trained; therefore, we use the layer’s kernel_initializer parameter to initialize the weights. Using tf.keras.models.clone_model is another option.

Documentation

To learn more about how to use the API, you can try this simple end-to-end clustering example colab to start. A more comprehensive guide with additional tips can be found here.

Acknowledgments

The feature and results presented in this post are the work of many people including the Arm ML Tooling team and our collaborators in Google’s TensorFlow Model Optimization Toolkit team.

From Arm – Anton Kachatkou, Aron Virginas-Tar, Ruomei Yan, Konstantin Sofeikov, Saoirse Stewart, Peng Sun, Elena Zhelezina, Gergely Nagy, Les Bell, Matteo Martincigh, Grant Watson, Diego Russo, Benjamin Klimczak, Thibaut Goetghebuer-Planchon.

From Google – Alan Chiao, Raziel Alvarez

Layerwise learning for Quantum Neural Networks

Posted by Andrea Skolik, Volkswagen AG and Leiden University

In early March, Google released TensorFlow Quantum (TFQ) together with the University of Waterloo and Volkswagen AG. TensorFlow Quantum is a software framework for quantum machine learning (QML) which allows researchers to jointly use functionality from Cirq and TensorFlow. Both Cirq and TFQ are aimed at simulating the noisy intermediate-scale quantum (NISQ) devices that are currently available; these devices are still at an experimental stage and therefore come without error correction and suffer from noisy outputs.

In this article, we introduce a training strategy that addresses vanishing gradients in quantum neural networks (QNNs), and makes better use of the resources provided by a NISQ device. If you’d like to play with the code for this example yourself, check out the notebook on layerwise learning in the TFQ research repository, where we train a QNN on a simulated quantum computer!

Quantum Neural Networks

Training a QNN is not that much different from training a classical neural network, except that instead of optimizing network weights, we optimize the parameters of a quantum circuit. A quantum circuit looks like the following:

Simplified QNN for a classification task with four qubits

The circuit is read from left to right, and each horizontal line corresponds to one qubit in the register of the quantum computer, each initialized in the zero state. The boxes denote parametrized operations (or “gates”) on qubits which are executed sequentially. In this case we have three different types of operations, X, Y, and Z. Vertical lines denote two-qubit gates, which can be used to generate entanglement in the QNN – one of the resources that lets quantum computers outperform their classical counterparts. We denote one layer as one operation on each qubit, followed by a sequence of gates that connect pairs of qubits to generate entanglement.

The figure above shows a simplified QNN for learning classification of MNIST digits.
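A minimal Cirq sketch of one such layer, with a parametrized single-qubit rotation on every qubit followed by a ladder of two-qubit entangling gates, might look as follows; the specific gate choices here are illustrative, not the exact circuit from the paper.

import cirq
import sympy

qubits = cirq.GridQubit.rect(1, 4)                  # a register of four qubits

def layer(layer_index):
    params = sympy.symbols(f"p{layer_index}_0:4")   # one trainable parameter per qubit
    ops = [cirq.rx(p).on(q) for p, q in zip(params, qubits)]
    ops += [cirq.CZ(q0, q1) for q0, q1 in zip(qubits, qubits[1:])]
    return ops

circuit = cirq.Circuit(layer(0) + layer(1))         # stack two layers
print(circuit)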

First, we have to encode the data set into quantum states. We do this by using a data encoding layer, marked orange in the figure above. In this case, we transform our input data into a vector, and use the vector values as parameters d for the data encoding layers’ operations. Based on this input, we execute the part of the circuit marked in blue, which represents the trainable gates of our QNN, denoted by p.

The last operation in the quantum circuit is a measurement. During computation, the quantum device performs operations on superpositions of classical bitstrings. When we perform a readout on the circuit, the superposition state collapses to one classical bitstring, which is the output of the computation that we get. This so-called collapse of the quantum state is probabilistic; to get a deterministic outcome, we average over multiple measurement outcomes.

In the above picture, marked in green, we perform measurements on the third qubit and use these to predict labels for our MNIST examples. We compare this to the true data label and compute gradients of a loss function just like in a classical NN. These types of QNNs are called “hybrid quantum-classical”, as the parameter optimization is handled by a classical computer, using e.g. the Adam optimizer.

Vanishing gradients, aka barren plateaus

It turns out that QNNs also suffer from vanishing gradients, just like classical NNs. Since the reason for vanishing gradients in QNNs is fundamentally different from classical NNs, a new term has been adopted for them: barren plateaus. Covering all details of this important phenomenon is out of the scope of this article, so we refer the interested reader to the paper that first introduced barren plateaus in QNN training landscapes or this tutorial on barren plateaus on the TFQ site for a hands-on example.

In short, barren plateaus occur when quantum circuits are initialized randomly – in the circuit illustrated above this means picking operations and their parameters at random. This is a fundamental problem for training parametrized quantum circuits, and gets worse as the number of qubits and the number of layers in a circuit grows, as we can see in the figure below.

Variance of gradients decays as a function of the number of qubits and layers in a random circuit

For the algorithm we introduce below, the key thing to understand here is that the more layers we add to a circuit, the smaller the variance in gradients will get. On the other hand, similarly to classical NNs, the QNN’s representational capacity also increases with its depth. The problem here is that in addition, the optimization landscape flattens in many places as we increase the circuit’s size, so it gets harder to find even a local minimum.

Remember that for QNNs, outputs are estimated from taking the average over a number of measurements. The smaller the quantity we want to estimate, the more measurements we will need to get an accurate result. If these quantities are much smaller compared to the effects caused by measurement uncertainty or hardware noise, they can’t be reliably determined and the circuit optimization will basically turn into a random walk.

To successfully train a QNN, we have to avoid random initialization of the parameters, and also have to stop the QNN from randomizing during training as its gradients get smaller, for example when it approaches a local minimum. For this, we can either limit the architecture of the QNN (e.g. by picking certain gate configurations, which requires tuning the architecture to the task at hand), or control the updates to parameters such that they won’t become random.

Layerwise learning

In our paper Layerwise learning for quantum neural networks, which is joint work by the Volkswagen Data:Lab (Andrea Skolik, Patrick van der Smagt, Martin Leib) and Google AI Quantum (Jarrod R. McClean, Masoud Mohseni), we introduce an approach to avoid initialization on a plateau as well as the network ending up on a plateau during training. Let's look at an example of layerwise learning (LL) in action, on the learning task of binary classification of MNIST digits. First, we need to define the structure of the layers we want to stack. As we make no assumptions about the learning task at hand, we choose the same layout for our layers as in the figure above: one layer consists of a randomly chosen parametrized gate on each qubit, with parameters initialized to zero, and two-qubit gates that connect pairs of qubits to enable the generation of entanglement.

We designate a number of start layers, in this case only one, which will always stay active during training, and specify the number of epochs to train each set of layers. Two other hyperparameters are the number of new layers added in each step, and the maximum number of layers trained at once. Here we choose a configuration where we add two layers in each step and freeze the parameters of all previous layers, except the start layer, such that we only train three layers in each step. We train each set of layers for 10 epochs, and repeat this procedure ten times until our circuit consists of 21 layers overall. By doing this, we utilize the fact that shallow circuits produce larger gradients compared to deeper ones, and thereby avoid initializing on a plateau.

This provides us with a good starting point in the optimization landscape to continue training larger contiguous sets of layers. As another hyperparameter, we define the percentage of layers we train together in the second phase of the algorithm. Here, we choose to split the circuit in half, and alternatingly train both parts, where the parameters of the inactive parts are always frozen. We call one training sequence where all partitions have been trained once a sweep, and we perform sweeps over this circuit until the loss converges. When the full set of parameters is always trained, which we will refer to as “complete depth learning” (CDL), one bad update step can affect the whole circuit and lead it into a random configuration and therefore a barren plateau, from which it cannot escape anymore.
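The two phases can be summarized in plain Python as the schematic below; the training calls are left as comments because they depend on the model, and the hyperparameter values simply mirror the ones described above (see the notebook in the TFQ research repository for the real implementation).

N_START_LAYERS = 1        # layers that always stay trainable
LAYERS_PER_STEP = 2       # new layers added in each step
MAX_ACTIVE = 3            # layers trained at once in phase one
TOTAL_LAYERS = 21

# Phase one: grow the circuit, training only the start layer plus the newest layers.
active, frozen = list(range(N_START_LAYERS)), []
while len(active) + len(frozen) < TOTAL_LAYERS:
    next_layer = len(active) + len(frozen)
    active += list(range(next_layer, next_layer + LAYERS_PER_STEP))
    while len(active) > MAX_ACTIVE:
        frozen.append(active.pop(1))   # freeze older layers, keep the start layer active
    print("phase one, training layers:", active)
    # train(circuit, trainable_layers=active, epochs=10)

# Phase two: alternately train each half of the full circuit ("sweeps") until convergence.
halves = [list(range(TOTAL_LAYERS // 2)), list(range(TOTAL_LAYERS // 2, TOTAL_LAYERS))]
# for sweep in range(num_sweeps):
#     for part in halves:
#         train(circuit, trainable_layers=part, epochs=10)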

Let’s compare our training strategy to CDL, which is one of the standard techniques used to train QNNs. To get a fair comparison, we use exactly the same circuit architecture as the one generated by the LL strategy before, but now update all parameters simultaneously in each step. To give CDL a chance to train, we initialize the parameters with zeros instead of randomly. As we don’t have access to a real quantum computer yet, we simulate the probabilistic outputs of the QNN, and choose a relatively low value for the number of measurements that we use to estimate each prediction the QNN makes – which is 10 in this case. Assuming a 10 kHz sampling rate on a real quantum computer, we can estimate the experimental wall-clock time of our training runs as shown below:

Comparison of layerwise- and complete depth learning with different learning rates η. We trained 100 circuits for each configuration, and averaged over those that achieved a final test error lower than 0.5 (number of succeeding runs in legend).

With this small number of measurements, we can investigate the effects of the different gradient magnitudes of the LL and CDL approaches: if gradient values are larger, we get more information out of 10 measurements than for smaller values. The less information we have to perform our parameter updates, the higher the variance in the loss, and the risk to perform an erroneous update that will randomize the updated parameters and lead the QNN onto a plateau. This variance can be lowered by choosing a smaller learning rate, so we compare LL and CDL strategies with different learning rates in the figure above.

Notably, the test error of CDL runs increases with the runtime, which might look like overfitting at first. However, each curve in this figure is averaged over many runs, and what is actually happening here is that more and more CDL runs randomize during training, unable to recover. In the legend we show that a much larger fraction of LL runs achieved a classification error on the test set lower than 0.5 compared to CDL, and also did it in less time.

In summary, layerwise learning increases the probability of successfully training a QNN with overall better generalization error in less training time, which is especially valuable on NISQ devices. For more details on the implementation and theory of layerwise learning, check out our recent paper!

If you’d like to learn more about quantum computing and quantum machine learning in general, there are some additional resources below:

The Future of Machine Learning is Tiny and Bright

Posted by Josh Gordon, Developer Advocate

A new HarvardX TinyML course on edX.org

Prof. Vijay Janapa Reddi of Harvard, the TensorFlow Lite Micro team, and the edX online learning platform are sharing a series of short TinyML courses this fall that you can audit for free, or sign up for and receive a certificate. In this article, I’ll share a bit about TinyML, what you can do with it, and the upcoming HarvardX program.

About TinyML

TinyML is one of the fastest-growing areas of Deep Learning. In a nutshell, it’s an emerging field of study that explores the types of models you can run on small, low-power devices like microcontrollers.

TinyML sits at the intersection of embedded ML applications, algorithms, hardware, and software. The goal is to enable low-latency inference on edge devices that typically consume only a few milliwatts of battery power. By comparison, a desktop CPU would consume about 100 watts (thousands of times more!). Such extremely reduced power draw enables TinyML devices to operate unplugged on batteries for weeks, months, and possibly even years — all while running always-on ML applications at the edge/endpoint.

TinyML powering a simple speech recognizer. Learn how to build your own here.

Although most of us are new to TinyML, it may surprise you to learn that TinyML has served in production ML systems for years. You may have already experienced the benefits of TinyML when you say “OK Google” to wake up an Android device. That’s powered by an always-on, low-power keyword spotter, not dissimilar in principle from the one you can learn to build here.

The difference now is that TinyML is becoming rapidly more accessible, thanks in part to TensorFlow Lite Micro and educational resources like this upcoming HarvardX course.

TinyML unlocks many applications for embedded ML developers, especially when combined with sensors like accelerometers, microphones, and cameras. It is already proving useful in areas such as wildlife tracking for conservation and detecting crop diseases for agricultural needs, as well as predicting wildfires.

TinyML can also be fun! You can develop smart game controllers, such as controlling a T-Rex dinosaur using a neural-network-based motion controller, or enable a variety of other games. Using the same ML principles and technical chops, you could then imagine collecting accelerometer data in a car to detect various scenarios (such as a wobbly tire) and alert the driver.

Chrome’s T-Rex dinosaur controlled using TensorFlow Lite for Microcontrollers.

Fun and games aside, as with any ML application—and especially when you are working with sensor data—it’s essential to familiarize yourself with Responsible AI. TinyML can support a variety of private ML applications because inference can take place entirely at the edge (data never needs to leave the device). In fact, many tiny devices have no internet connection at all.

More About the Short Courses

The HarvardX course is designed to be widely accessible to developers. You will learn what TinyML is, how it can be applied in the real world, and how to get started.

The courses begin with ML basics, including how to collect data, how to train basic models (think: linear regression), and so on. Next, they introduce deep learning basics (think: MNIST), then TinyML models for computer vision, and how to deploy them using TensorFlow Lite for Microcontrollers. Along the way, the courses cover case studies and important papers, and increasingly advanced applications.

In one workflow, you’ll build a TensorFlow model using Python in Colab (as always), then convert it to run in C on a microcontroller. The course will show how to optimize the ML models for severely resource-constrained devices (e.g., those with less than 100 KB of storage). And it includes various case studies that examine the challenges of deploying TinyML “into the wild.”
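As a sketch of that Colab-to-microcontroller step, the snippet below converts a tiny placeholder Keras model to a quantized TensorFlow Lite flatbuffer using the standard TensorFlow Lite APIs; the model itself stands in for whatever you train in the course.

import tensorflow as tf

# Placeholder model standing in for whatever you trained in Colab.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# On the command line, `xxd -i model.tflite > model_data.cc` emits the C array
# that gets compiled into the TensorFlow Lite for Microcontrollers application.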

Take TinyML Home

We’re excited to work closely with Arduino and HarvardX to make this experience possible.

Arduino is preparing a TinyML kit, especially for the course.

An off-the-shelf TinyML kit from Arduino will be available to edX learners for purchase. It includes an Arm Cortex-M4 microcontroller with onboard sensors, a camera and a breadboard with wires—everything needed to unlock the initial suite of TinyML application capabilities, such as image, sound and gesture detection. Students will have the opportunity to invent the future.

We’ll feature the best student projects from the course right here on the TensorFlow blog.

We’re excited to see what you’ll create!

Sign up here.

Train your TensorFlow model on Google Cloud using TensorFlow Cloud

Posted by Jonah Kohn and Pavithra Vijay, Software Engineers at Google

TensorFlow Cloud is a Python package that provides APIs for a seamless transition from debugging and training your TensorFlow code in a local environment to distributed training in Google Cloud. It simplifies the process of training models on the cloud into a single, simple function call, requiring minimal setup and almost zero changes to your model. TensorFlow Cloud handles cloud-specific tasks such as creating VM instances and distribution strategies for your models automatically. This article demonstrates common use cases for TensorFlow Cloud, and a few best practices.

We will walk through classifying dog breed images provided by the stanford_dogs dataset. To make this easy, we will use transfer learning with ResNet50 trained on ImageNet weights. You can find the code from this post in the TensorFlow Cloud repository.

Setup

Install TensorFlow Cloud using pip install tensorflow_cloud. Let’s start the Python script for our classification task by adding the required imports.

import datetime
import os

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_cloud as tfc
import tensorflow_datasets as tfds

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Model

Google Cloud Configuration

TensorFlow Cloud runs your training job on Google Cloud using AI Platform services behind the scenes. If you are new to GCP, then please follow the setup steps in this section to create and configure your first Google Cloud Project. If you’re new to using the Cloud, first-time setup and configuration will involve a little learning and work. The good news is that after the setup, you won’t need to make any changes to your TensorFlow code to run it on the cloud!

  1. Create a GCP Project
  2. Enable AI Platform Services
  3. Create a Service Account
  4. Download an authorization key
  5. Create a Google Cloud Storage Bucket

GCP Project

A Google Cloud project includes a collection of cloud resources such as a set of users, a set of APIs, billing, authentication, and monitoring. To create a project, follow this guide. Run the commands in this section on your terminal.

export PROJECT_ID=<your-project-id>
gcloud config set project $PROJECT_ID

AI Platform Services

Please make sure to enable AI Platform Services for your GCP project by entering your project ID in this drop-down menu.

Service Account and Key

Create a service account for your new GCP project. A service account is an account used by an application or a virtual machine instance, and is used by Cloud applications to make authorized API calls.

export SA_NAME=<your-sa-name>
gcloud iam service-accounts create $SA_NAME
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com \
    --role 'roles/editor'

Next, we will need an authentication key for the service account. This authentication key is a means to ensure that only those authorized to work on your project will use your GCP resources. Create an authentication key as follows:

gcloud iam service-accounts keys create ~/key.json --iam-account $SA_NAME@$PROJECT_ID.iam.gserviceaccount.com

Create the GOOGLE_APPLICATION_CREDENTIALS environment variable.

export GOOGLE_APPLICATION_CREDENTIALS=~/key.json

Cloud Storage Bucket

If you already have a designated storage bucket, enter your bucket name as shown below. Otherwise, create a Google Cloud storage bucket following this guide. TensorFlow Cloud uses Google Cloud Build for building and publishing a docker image, as well as for storing auxiliary data such as model checkpoints and training logs.

GCP_BUCKET = "your-bucket-name"

Keras Model Creation

The model creation workflow for TensorFlow Cloud is identical to building and training a TF Keras model locally.

Resources

We’ll begin by loading the stanford_dogs dataset for categorizing dog breeds. This is available as part of the tensorflow-datasets package. If you have a large dataset, we recommend that you host it on GCS for better performance.

(ds_train, ds_test), metadata = tfds.load(
    "stanford_dogs",
    split=["train", "test"],
    shuffle_files=True,
    with_info=True,
    as_supervised=True,
)

NUM_CLASSES = metadata.features["label"].num_classes

Let’s visualize the dataset:

print("Number of training samples: %d" % tf.data.experimental.cardinality(ds_train))
print("Number of test samples: %d" % tf.data.experimental.cardinality(ds_test))
print("Number of classes: %d" % NUM_CLASSES)

Number of training samples: 12000
Number of test samples: 8580
Number of classes: 120

plt.figure(figsize=(10, 10))
for i, (image, label) in enumerate(ds_train.take(9)):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image)
    plt.title(int(label))
    plt.axis("off")

Preprocessing

We will resize and batch the data.

IMG_SIZE = 224
BATCH_SIZE = 64
BUFFER_SIZE = 2

size = (IMG_SIZE, IMG_SIZE)
ds_train = ds_train.map(lambda image, label: (tf.image.resize(image, size), label))
ds_test = ds_test.map(lambda image, label: (tf.image.resize(image, size), label))

def input_preprocess(image, label):
    image = tf.keras.applications.resnet50.preprocess_input(image)
    return image, label

Configure the input pipeline for performance

Now we will configure the input pipeline for performance. Note that we are using parallel calls and prefetching so that I/O doesn’t become blocking while your model is training. You can learn more about configuring input pipelines for performance in this guide.

ds_train = ds_train.map(
    input_preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE
)

ds_train = ds_train.batch(batch_size=BATCH_SIZE, drop_remainder=True)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

ds_test = ds_test.map(input_preprocess)
ds_test = ds_test.batch(batch_size=BATCH_SIZE, drop_remainder=True)

Build the model

We will be loading ResNet50 with weights trained on ImageNet, while using include_top=False in order to reshape the model for our task.

inputs = tf.keras.layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
base_model = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_tensor=inputs
)
x = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES)(x)

model = tf.keras.Model(inputs, outputs)

We will freeze all layers in the base model at their current weights, allowing the additional layers we added to be trained.

base_model.trainable = False

Keras Callbacks can be used easily on TensorFlow Cloud as long as the storage destination is within your Cloud Storage Bucket. For this example, we will use the ModelCheckpoint callback to save the model at various stages of training, Tensorboard callback to visualize the model and its progress, and the Early Stopping callback to automatically determine the optimal number of epochs for training.

MODEL_PATH = "resnet-dogs"
checkpoint_path = os.path.join("gs://", GCP_BUCKET, MODEL_PATH, "save_at_{epoch}")
tensorboard_path = os.path.join(
    "gs://", GCP_BUCKET, "logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
)
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(checkpoint_path),
    tf.keras.callbacks.TensorBoard(log_dir=tensorboard_path, histogram_freq=1),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
]

Compile the model

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

Debug the model locally

We’ll train the model in a local environment first in order to ensure that the code works properly before sending the job to GCP. We will use tfc.remote() to determine whether the code should be executed locally or on the cloud. Choosing a smaller number of epochs than intended for the full training job will help verify that the model is working properly without overloading your local machine.

if tfc.remote():
    epochs = 500
    train_data = ds_train
    test_data = ds_test
else:
    epochs = 1
    train_data = ds_train.take(5)
    test_data = ds_test.take(5)
    callbacks = None

model.fit(
    train_data, epochs=epochs, callbacks=callbacks, validation_data=test_data, verbose=2
)

if tfc.remote():
    SAVE_PATH = os.path.join("gs://", GCP_BUCKET, MODEL_PATH)
    model.save(SAVE_PATH)

Model Training on Google Cloud

To train on GCP, populate the example code with your GCP project settings, then simply call tfc.run() from within your code. The API is simple with intelligent defaults for all the parameters. Again, we don’t need to worry about cloud specific tasks such as creating VM instances and distribution strategies when using TensorFlow Cloud. In order, the API will:

  • Make your python script/notebook cloud and distribution ready.
  • Convert it into a docker image with required dependencies.
  • Run the training job on a GCP cluster.
  • Stream relevant logs and store checkpoints.

The run() API provides significant flexibility, such as the ability to specify a custom cluster configuration or a custom docker image. For a full list of parameters that can be used to call run(), see the TensorFlow Cloud readme.

Create a requirements.txt file with a list of the Python packages that your model depends on. By default, TensorFlow Cloud includes TensorFlow and its dependencies as part of the default docker image, so there’s no need to include these. Please create requirements.txt in the same directory as your Python file. The requirements.txt contents for this example are:

tensorflow-datasets
matplotlib

By default, the run API takes care of wrapping your model code in a TensorFlow distribution strategy based on the cluster configuration you have provided. In this example, we are using a single-node multi-GPU configuration, so your model code will be wrapped in a TensorFlow `MirroredStrategy` instance automatically.

Call run() to begin training on the cloud. Once your job has been submitted, you will be provided a link to the cloud job. To monitor the training logs, follow the link and select ‘View logs’ to view the training progress information.

tfc.run(
    requirements_txt="requirements.txt",
    distribution_strategy="auto",
    chief_config=tfc.MachineConfig(
        cpu_cores=8,
        memory=30,
        accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4,
        accelerator_count=2,
    ),
    docker_image_bucket_name=GCP_BUCKET,
)

Visualize the model using TensorBoard

Here, we load the TensorBoard logs from our GCS bucket to evaluate model performance and history.

tensorboard dev upload --logdir "gs://your-bucket-name/logs" --name "ResNet Dogs"

Evaluate the model

After training, we can load the model that’s been stored in our GCS bucket, and evaluate its performance.

if tfc.remote():
    model = tf.keras.models.load_model(SAVE_PATH)
    model.evaluate(test_data)

Next steps

This article introduced TensorFlow Cloud, a Python package that simplifies the process of training models on the cloud using multiple GPUs/TPUs into a single function call, with zero code changes to your model. You can find the complete code from this article, along with many other examples, in the TensorFlow Cloud repository.

TensorFlow 2 MLPerf submissions demonstrate best-in-class performance on Google Cloud

Posted by Pankaj Kanwar, Peter Brandt, and Zongwei Zhou from the TensorFlow Team

MLPerf, the industry standard for measuring machine learning performance, has released the latest benchmark results from the MLPerf Training v0.7 round. We’re happy to share that Google’s submissions demonstrate leading top-line performance (fastest time to reach target quality), with the ability to scale up to 4,000+ accelerators and the flexibility of the TensorFlow 2 developer experience on Google Cloud.

In this blog post, we’ll explore the TensorFlow 2 MLPerf submissions, which showcase how enterprises can run valuable workloads that MLPerf represents on cutting-edge ML accelerators in Google Cloud, including widely deployed generations of GPUs and Cloud TPUs. Our accompanying blog post highlights our record-setting large-scale training results.

TensorFlow 2: designed for performance and usability

At the TensorFlow Developer Summit earlier this year, we highlighted that TensorFlow 2 would emphasize usability and real-world performance. When competing to win benchmarks, engineers have often relied on low-level API calls and hardware-specific code that may not be practical in everyday enterprise settings. With TensorFlow 2, we aim to provide high performance out of the box with more straightforward code, avoiding the significant issues that low-level optimizations can cause with respect to code reusability, code health, and engineering productivity.

Time to convergence (in minutes) using Google Cloud VMs with 8 NVIDIA V100 GPUs from Google’s MLPerf Training v0.7 Closed submission in the “Available” category.

TensorFlow’s Keras APIs (see this collection of guides) offer usability and portability across a wide array of hardware architectures. For example, model developers can use the Keras mixed precision API and Distribution Strategy API to enable the same codebase to run on multiple hardware platforms with minimal friction. Google’s MLPerf submissions in the Available-in-Cloud category were implemented using these APIs. These submissions demonstrate that near-identical TensorFlow code written using high level Keras APIs can deliver high performance across the two leading widely-available ML accelerator platforms in the industry: NVIDIA’s V100 GPUs and Google’s Cloud TPU v3 Pods.
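As an illustration (not code from the MLPerf submissions themselves), combining these two APIs might look like the following sketch; recent TensorFlow releases expose mixed precision as tf.keras.mixed_precision.set_global_policy, while TF 2.3-era code used the experimental namespace:

import tensorflow as tf

# Enable mixed precision globally. (In TF 2.3 this was
# tf.keras.mixed_precision.experimental.set_policy("mixed_float16").)
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Pick a distribution strategy; MirroredStrategy targets the local GPUs, while
# TPUStrategy would target a Cloud TPU with the same model code.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        # Keep the output layer in float32 for numerical stability.
        tf.keras.layers.Dense(10, dtype="float32"),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

Swapping MirroredStrategy for MultiWorkerMirroredStrategy or TPUStrategy is the main change needed to move the same training code across hardware platforms.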

Note: All results shown in the charts are retrieved from www.mlperf.org on July 29, 2020. MLPerf name and logo are trademarks. See www.mlperf.org for more information. Results shown: 0.7-1 and 0.7-2.

Time to convergence (in minutes) using Google Cloud TPU v3 Pod slices containing 16 TPU chips from Google’s MLPerf Training v0.7 Closed submission in the “Available” category.

Looking under the hood: performance enhancements with XLA

Google’s submissions on GPUs and on Cloud TPU Pods leverage the XLA compiler to optimize TensorFlow performance. XLA is a core part of the TPU compiler stack, and it can optionally be enabled for GPU. XLA is a graph-based just-in-time compiler that performs a variety of different types of whole-program optimizations, including extensive fusion of ML operations.

Operator fusion reduces the memory capacity and bandwidth requirements for ML models. Furthermore, fusion reduces the launch overhead of operations, particularly on GPUs. Overall, XLA optimizations are general, portable, interoperate well with cuDNN and cuBLAS libraries, and can often provide a compelling alternative to writing low-level kernels by hand.

Google’s TensorFlow 2 submissions in the Available-in-Cloud category use the @tf.function API introduced in TensorFlow 2.0. The @tf.function API offers a simple way to enable XLA selectively, providing fine-grained control over exactly which functions will be compiled.
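For instance, a minimal sketch of selective XLA compilation with @tf.function (the flag is jit_compile=True in recent TensorFlow releases; versions around 2.0–2.4 used experimental_compile=True):

import tensorflow as tf

# Only this function is compiled with XLA; the rest of the program runs on the
# standard TensorFlow runtime. (Use experimental_compile=True on TF 2.0-2.4.)
@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([8, 128])
w = tf.random.normal([128, 64])
b = tf.zeros([64])
y = dense_relu(x, w, b)  # The first call triggers XLA compilation.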

The performance improvements delivered by XLA are impressive: on a Google Cloud VM with 8 Volta V100 GPUs attached (each with 16 GB of GPU memory), XLA boosts BERT training throughput from 23.1 sequences per second to 168 sequences per second, a ~7x improvement. It also increases the runnable batch size per GPU by 5X, and the reduced memory usage enables advanced training techniques such as gradient accumulation.

Impact of enabling XLA (in minutes) on the BERT model using 8 V100 GPUs on Google Cloud as demonstrated by Google’s MLPerf Training 0.7 Closed submission compared to unverified MLPerf results on the same system with optimization(s) disabled.

State-of-the-art accelerators on Google Cloud

Google Cloud is the only public-cloud platform that provides access to both state-of-the-art GPUs and Cloud TPUs, giving AI researchers and data scientists the freedom to choose the right hardware for every task.

Cutting-edge models such as BERT, which are extensively used within Google and industry-wide for a variety of natural language processing tasks, can now be trained on Google Cloud leveraging the same infrastructure that is used for training internal workloads within Google. Using Google Cloud, you can train BERT for 3 million sequences on a Cloud TPU v3 Pod slice with 16 TPU chips in under an hour at a total cost of under $32.

Conclusion

Google’s MLPerf 0.7 Training submissions showcase the performance, usability, and portability of TensorFlow 2 across state-of-the-art ML accelerator hardware. Get started today with the usability and power of TensorFlow 2 on Google Cloud GPUs, Google Cloud TPUs, and TensorFlow Enterprise with Google Cloud Deep Learning VMs.

Acknowledgements

The MLPerf submission on GPUs is the result of a close collaboration with NVIDIA. We’d like to thank all engineers at NVIDIA who helped us with this submission.

What’s new in TensorFlow 2.3?

Posted by Josh Gordon for the TensorFlow team

TensorFlow 2.3 has been released! The focus of this release is on new tools to make it easier for you to load and preprocess data, and to solve input-pipeline bottlenecks, whether you’re working on one machine, or many.

  • tf.data adds two mechanisms to solve input pipeline bottlenecks and improve resource utilization. For advanced users, the new service API provides a way to improve training speed when the host attached to a training device can’t keep up with the data consumption needs of your model. It allows you to offload input preprocessing to a CPU cluster of data-processing workers that run alongside your training job, increasing accelerator utilization. A second new feature is the tf.data snapshot API, which allows you to persist the output of your input preprocessing pipeline to disk, so you can reuse it on a different training run. This enables you to trade storage space to free up additional CPU time.
  • The TF Profiler adds two new tools as well: a memory profiler to visualize your model’s memory usage over time, and a Python tracer that allows you to trace Python function calls in your model. You can read more about these below (and if you’re new to the TF Profiler, be sure to check out this article).
  • TensorFlow 2.3 adds experimental support for the new Keras Preprocessing Layers API. These layers allow you to package your preprocessing logic inside your model for easier deployment – so you can ship a model that takes raw strings, images, or rows from a table as input. There are also new user-friendly utilities that allow you to easily create a tf.data.Dataset from a directory of images or text files on disk, in a few lines of code.
The new memory profiler

New features in tf.data

tf.data.service

Modern accelerators (GPUs, TPUs) are incredibly fast. To avoid performance bottlenecks, it’s important to ensure that your data loading and preprocessing pipeline is fast enough to provide data to the accelerator when it’s needed. For example, imagine your GPU can classify 200 examples/second, but your data input pipeline can only load 100 examples/second from disk. In this case, your GPU would be idle (waiting for data) 50% of the time. And, that’s assuming your input-pipeline is already overlapped with GPU computation (if not, your GPU would be waiting for data 66% of the time).
In this scenario, you can double training speed by using the tf.data.experimental.service to generate 200 examples/second, by distributing data loading and preprocessing to a cluster you run alongside your training job. The tf.data service has a dispatcher-worker architecture, with one dispatcher and many workers. You can find documentation on setting up a cluster here, and you can find a complete example here that shows you how to deploy a cluster using Google Kubernetes Engine.
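To make the dispatcher-worker setup concrete, here is a minimal in-process sketch, suitable only for local experimentation; it follows the current tf.data.experimental.service API (DispatchServer, WorkerServer, WorkerConfig), whose constructor details have shifted slightly since TF 2.3:

import tensorflow as tf

# Start an in-process dispatcher and a single worker. In production these run as
# separate jobs (for example, on a Kubernetes cluster), and workers register with
# the dispatcher's address.
dispatcher = tf.data.experimental.service.DispatchServer()
dispatcher_address = dispatcher.target.split("://")[1]
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher_address))

service_address = dispatcher.target  # e.g. "grpc://localhost:<port>"

The service_address produced here is what you would pass as the service argument to the distribute transformation shown next.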
Once you have a tf.data.service running, you can add distributed dataset processing to your existing tf.data pipelines using the distribute transformation:

ds = your_dataset()
ds = ds.apply(tf.data.experimental.service.distribute(
    processing_mode="parallel_epochs", service=service_address))

Now, when you iterate over the dataset, data processing will happen using the tf.data service, instead of on your local machine.
Distributing your input pipeline is a powerful feature, but if you’re working on a single machine, tf.data has tools to help you improve input pipeline performance as well. Be sure to check out the cache and prefetch transformations – which can greatly speed up your pipeline in a single line of code.

tf.data snapshot

The tf.data.experimental.snapshot API allows you to persist the output of your preprocessing pipeline to disk, so you can materialize the preprocessed data on a different training run. This is useful for trading off storage space on disk to free up more valuable CPU and accelerator time.
For example, suppose you have a dataset that does expensive preprocessing (perhaps you are manipulating images with cropping or rotation). After developing your input pipeline to load and preprocess data:

dataset = create_input_pipeline()

You can snapshot the results to a directory by applying the snapshot transformation:

dataset = dataset.apply(tf.data.experimental.snapshot("/snapshot_dir"))

The snapshot will be created on disk when you iterate over the dataset for the first time. Subsequent iterations will read from snapshot_dir instead of recomputing dataset elements.
Snapshot computes a fingerprint of your dataset so it can detect changes to your input pipeline, and recompute outdated snapshots automatically. For example, if you modify a Dataset.map transformation or add additional images to a source directory, the fingerprint will change, causing the snapshot to be recomputed. Note that snapshot cannot detect changes to an existing file, though. Check out the documentation to learn more.

New features in the TF Profiler

The TF Profiler (introduced in TF 2.2) makes it easier to spot performance bottlenecks. It can help you identify when an application is input-bound, and can provide suggestions for what can be done to fix it. You can learn more about this workflow in the Analyze tf.data performance with the TF Profiler guide.
In TF 2.3, the Profiler has a few new capabilities and several usability improvements.

  • The new Memory Profiler enables you to monitor memory usage during training. If a training job runs out of memory, you can pinpoint when the peak memory usage occurred and which ops consumed the most memory. If you collect a profile, the Memory Profiler tool appears in the Profiler dashboard with no extra work.
  • The new Python Tracer helps trace the Python call stack to provide additional insight on what is being executed in your program. It appears in the Profiler’s Trace Viewer. It can be enabled in programmatic mode using the ProfilerOptions or in sampling mode through the TensorBoard “capture profile” UI (you can find more information about these modes in this guide); a short programmatic-mode sketch follows this list.
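As a minimal sketch of programmatic mode with the Python tracer enabled (field defaults for tf.profiler.experimental.ProfilerOptions may differ between releases):

import tensorflow as tf

# Turn on the Python tracer for this profiling session (python_tracer_level=1).
options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=2,
    python_tracer_level=1,
    device_tracer_level=1)

tf.profiler.experimental.start("logs/profile", options=options)
# ... run a few training steps here ...
tf.profiler.experimental.stop()

The resulting profile, including the Python call stack in the Trace Viewer, can then be inspected from the Profiler dashboard in TensorBoard.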

New Keras data loading utilities

In TF 2.3, Keras adds new user-friendly utilities (image_dataset_from_directory and text_dataset_from_directory) to make it easy for you to create a tf.data.Dataset from a directory of images or text files on disk, in just one function call. For example, if your directory structure is:

flowers_photos/
  daisy/
  dandelion/
  roses/
  sunflowers/
  tulips/

You can use image_dataset_from_directory to create a tf.data.Dataset that yields batches of images from the subdirectories and labels:

img_height, img_width = 180, 180  # example image size

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "flowers_photos",
    validation_split=0.2,
    subset="training",
    seed=0,
    image_size=(img_height, img_width),
    batch_size=32)

If you’re starting a new project, we recommend using image_dataset_from_directory over the legacy ImageDataGenerator. Note this utility doesn’t perform data augmentation (this is meant to be done using the new preprocessing layers, described below). You can find a complete example of loading images with this utility (as well as how to write a similar input-pipeline from scratch with tf.data) here.

Performance tip

After creating a tf.data.Dataset (either from scratch, or using image_dataset_from_directory) remember to configure it for performance to ensure I/O doesn’t become a bottleneck when training a model. You can use a one-liner for this. With this line of code:

train_ds = train_ds.cache().prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

You create a dataset that caches images in memory (once they’re loaded off disk during the first training epoch), and overlaps preprocessing work on the CPU with training work on the GPU. If your dataset is too large to fit into memory, you can also use .cache(filename) to automatically create an efficient on-disk cache, which is faster to read than many small files.
You can learn more in the Better performance with the tf.data API guide.

New Keras preprocessing layers

In TF 2.3, Keras also adds new experimental preprocessing layers that can simplify deployment by allowing you to include your preprocessing logic as layers inside your model, so they are saved just like other layers when you export your model.

  • Using the new TextVectorization layer, for example, you can develop a text classification model that accepts raw strings as input (without having to re-implement any of the logic for tokenization, standardization, vectorization, or padding server-side); a minimal sketch follows this list.
  • You can also use resizing, rescaling, and normalization layers to develop an image classification model that accepts any size of image as input, and that automatically normalizes pixel values to the expected range. And, you can use new data augmentation layers (like RandomRotation) to speed up your input-pipeline by running data augmentation on the GPU.
  • For structured data, you can use layers like StringLookup to encode categorical features, so you can develop a model that takes a row from a table as input. You can check out this RFC to learn more.
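As an illustrative sketch of the first bullet (assuming the TF 2.3 experimental namespace, tf.keras.layers.experimental.preprocessing; in later releases TextVectorization moved to tf.keras.layers), here is a tiny model that accepts raw strings:

import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# A few raw training strings to adapt the vectorizer on (stand-in data).
train_texts = tf.data.Dataset.from_tensor_slices(
    ["this movie was great", "terrible plot", "loved it", "not my thing"])

vectorizer = TextVectorization(max_tokens=1000, output_sequence_length=16)
vectorizer.adapt(train_texts.batch(2))

# Including the layer in the model means tokenization and vectorization are
# exported along with the weights, so the model accepts raw strings at serving time.
inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = vectorizer(inputs)
x = tf.keras.layers.Embedding(1000, 16)(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")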

The best way to learn how to use these new layers is to try the new text classification from scratch, image classification from scratch, and structured data classification from scratch examples on keras.io.
Note that all of these layers can either be included inside your model, or can be applied to your tf.data input-pipeline via the map transformation. You can find an example here.
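A hedged sketch of that second route, applying a Rescaling preprocessing layer inside a tf.data pipeline via map (the stand-in dataset here replaces one built with image_dataset_from_directory as shown above):

import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import Rescaling

rescale = Rescaling(1.0 / 255)  # map pixel values into [0, 1]

# Stand-in (image, label) data; in practice this would come from
# image_dataset_from_directory or a similar loader.
images = tf.zeros([32, 180, 180, 3])
labels = tf.zeros([32], dtype=tf.int32)
train_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8)

# Apply the preprocessing layer in the pipeline rather than inside the model.
train_ds = train_ds.map(lambda x, y: (rescale(x), y),
                        num_parallel_calls=tf.data.experimental.AUTOTUNE)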
Please keep in mind, these new preprocessing layers are experimental in TF 2.3. We’re happy with the design (and anticipate they will be made non-experimental in 2.4) but realize we might not have gotten everything right on this iteration. Your feedback is very welcome. Please file an issue on GitHub to let us know how we can better support your use case.

Next steps

Check out the release notes for more information. To stay up to date, you can read the TensorFlow blog, follow twitter.com/tensorflow, or subscribe to youtube.com/tensorflow. If you’ve built something you’d like to share, please submit it for our Community Spotlight at goo.gle/TFCS. For feedback, please file an issue on GitHub. Thank you!

Accelerating TensorFlow Lite with XNNPACK Integration

Posted by Marat Dukhan, Google Research

Leveraging the CPU for ML inference yields the widest reach across the space of edge devices. Consequently, improving neural network inference performance on CPUs has been among the top requests to the TensorFlow Lite team. We listened and are excited to bring you, on average, 2.3X faster floating-point inference through the integration of the XNNPACK library into TensorFlow Lite.

To achieve this speedup, the XNNPACK library provides highly optimized implementations of floating-point neural network operators. It launched earlier this year in the WebAssembly backend of TensorFlow.js, and with this release we are introducing additional optimizations tailored to TensorFlow Lite use-cases:

  • To deliver the greatest performance to TensorFlow Lite users on mobile devices, all operators were optimized for ARM NEON. The most critical ones (convolution, depthwise convolution, transposed convolution, fully connected) were tuned in assembly for commonly used ARM cores in mobile phones, e.g. Cortex-A53/A73 in Pixel 2 and Cortex-A55/A75 in Pixel 3.
  • For TensorFlow Lite users on x86-64 devices, XNNPACK added optimizations for SSE2, SSE4, AVX, AVX2, and AVX512 instruction sets.
  • Rather than executing TensorFlow Lite operators one-by-one, XNNPACK looks at the whole computational graph and optimizes it through operator fusion. For example, convolution with explicit padding is represented in TensorFlow Lite via a combination of a PAD operator and a CONV_2D operator with VALID padding mode. XNNPACK detects this combination of operators and fuses the two into a single convolution operator with explicitly specified padding; a small illustration of how such a graph arises follows this list.
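For example, a hedged sketch of a Keras model whose converted TensorFlow Lite graph typically contains the PAD + CONV_2D pattern described above (the exact lowering depends on the converter version):

import tensorflow as tf

# Explicit zero padding followed by a VALID convolution usually lowers to a PAD op
# plus a CONV_2D op in TensorFlow Lite; XNNPACK can fuse the pair back into a
# single padded convolution at runtime.
model = tf.keras.Sequential([
    tf.keras.layers.ZeroPadding2D(padding=1, input_shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(16, kernel_size=3, padding="valid"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()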

The XNNPACK backend for CPU joins the family of TensorFlow Lite accelerated inference engines for mobile GPUs, Android’s Neural Network API, Hexagon DSPs, Edge TPUs, and the Apple Neural Engine. It provides a strong baseline that can be used on all mobile devices, desktop systems, and Raspberry Pi boards.
With the TensorFlow 2.3 release, the XNNPACK backend is included with the pre-built TensorFlow Lite binaries for Android and iOS, and can be enabled with a one-line code change. The XNNPACK backend is also supported in Windows, macOS, and Linux builds of TensorFlow Lite, where it is enabled via a build-time opt-in mechanism. Following wider testing and community feedback, we plan to enable it by default on all platforms in an upcoming release.

Performance Improvements

XNNPACK-accelerated inference in TensorFlow Lite has already been used in Google products in production, and we observed significant speedups across a wide variety of neural network architectures and mobile processors. The XNNPACK backend boosted background segmentation in Pixel 3a Playground by 5X and delivered 2X speedup on neural network models in Augmented Faces API in ARCore.

We found that TensorFlow Lite benefits the most from the XNNPACK backend on small neural network models and low-end mobile phones. Below, we present benchmarks on nine public models covering common computer vision tasks:

  1. MobileNet v2 image classification [download]
  2. MobileNet v3-Small image classification [download]
  3. DeepLab v3 segmentation [download]
  4. BlazeFace face detection [download]
  5. SSDLite 2D object detection [download]
  6. Objectron 3D object detection [download]
  7. Face Mesh landmarks [download]
  8. MediaPipe Hands landmarks [download]
  9. KNIFT local feature descriptor [download]
Single-threaded inference speedup with TensorFlow Lite with the XNNPACK backend compared to the default backend across 5 mobile phones. Higher numbers are better.
Single-threaded inference speedup with TensorFlow Lite with the XNNPACK backend compared to the default backend across 5 desktop, laptop, and embedded devices. Higher numbers are better.

How Can I Use It?

The XNNPACK backend is already included in pre-built TensorFlow Lite 2.3 binaries, but requires an explicit runtime opt-in to enable it. We’re working to enable it by default in a future release.

Opt-in to XNNPACK backend on Android/Java

Pre-built TensorFlow Lite 2.3 Android archives (AAR) already include XNNPACK, and it takes only a single line of code to enable it in the Interpreter.Options object:

Interpreter.Options interpreterOptions = new Interpreter.Options();
interpreterOptions.setUseXNNPACK(true);
Interpreter interpreter = new Interpreter(model, interpreterOptions);

Opt-in to XNNPACK backend on iOS/Swift

Pre-built TensorFlow Lite 2.3 CocoaPods for iOS similarly include XNNPACK, along with a mechanism to enable it in the InterpreterOptions class:

var options = InterpreterOptions()
options.isXNNPackEnabled = true
var interpreter = try Interpreter(modelPath: "model/path", options: options)

Opt-in to XNNPACK backend on iOS/Objective-C

On iOS, XNNPACK inference can also be enabled from Objective-C via a new property in the TFLInterpreterOptions class:

TFLInterpreterOptions *options = [[TFLInterpreterOptions alloc] init];
options.useXNNPACK = YES;
NSError *error;
TFLInterpreter *interpreter =
    [[TFLInterpreter alloc] initWithModelPath:@"model/path"
                                      options:options
                                        error:&error];

Opt-in to XNNPACK backend on Windows, Linux, and Mac

XNNPACK backend on Windows, Linux, and Mac is enabled via a build-time opt-in mechanism. When building TensorFlow Lite with Bazel, simply add --define tflite_with_xnnpack=true, and the TensorFlow Lite interpreter will use the XNNPACK backend by default.

Try out XNNPACK with your TensorFlow Lite model

You can use the TensorFlow Lite benchmark tool to measure your TensorFlow Lite model’s performance with XNNPACK. You only need to enable XNNPACK with the --use_xnnpack=true flag as shown below, even if the benchmark tool was built without the --define tflite_with_xnnpack=true Bazel option.

adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
    --use_xnnpack=true \
    --num_threads=4

Which Operations Are Accelerated?

The XNNPACK backend currently supports a subset of floating-point TensorFlow Lite operators (see documentation for details and limitations). XNNPACK supports both 32-bit floating-point models and models using 16-bit floating-point quantization for weights, but not models with fixed-point quantization in weights or activations. However, you do not have to constrain your model to the operators supported by XNNPACK: any unsupported operators will transparently fall back to the default implementation in TensorFlow Lite.

Future Work

This is just the first version of the XNNPACK backend. Along with community feedback, we intend to add the following improvements:

  • Integration of the Fast Sparse ConvNets algorithms
  • Half-precision inference on recent ARM processors
  • Quantized inference in fixed-point representation

We encourage you to leave your thoughts and comments on our GitHub and StackOverflow pages.

Acknowledgements

We would like to thank Frank Barchard, Chao Mei, Erich Elsen, Yunlu Li, Jared Duke, Artsiom Ablavatski, Juhyun Lee, Andrei Kulik, Matthias Grundmann, Sameer Agarwal, Ming Guang Yong, Lawrence Chan, and Sarah Sirajuddin.