Profiling XNNPACK with TFLite

Posted by Alan Kelly, Software Engineer

We are happy to share that detailed profiling information for XNNPACK is now available in TensorFlow 2.9.1 and later. XNNPACK is a highly optimized library of floating-point neural network inference operators for ARM, WebAssembly, and x86 platforms, and it is the default TensorFlow Lite CPU inference engine for floating-point models.

The most common and expensive neural network operators, such as fully connected layers and convolutions, are executed by XNNPACK so that you get the best performance possible from your model. Historically, the profiler measured the runtime of the entire delegated section of the graph, so the runtime of all delegated operators was accumulated into a single result, making it difficult to identify which individual operators were slow.

Previous TFLite profiling results when XNNPACK was used. The runtime of all delegated operators was accumulated in one row.

TensorFlow Lite 2.9.1 and later report a per-operator profile even for the section of the graph that is delegated to XNNPACK, so you no longer need to choose between fast inference and detailed performance information. The operator name, data layout (NHWC, for example), data type (FP32), and microkernel type (if applicable) are shown.

New detailed per-operator profiling information is now shown. The operator name, data layout, data type and microkernel type are visible.
You now get lots of helpful information, such as each operator's runtime and the percentage of the total runtime it accounts for. The runtime of each node is given in the order in which the nodes were executed, and the most expensive operators are also listed separately.
The most expensive operators are listed. In this example, you can see that a deconvolution accounted for 33.91% of the total runtime.

XNNPACK can also perform inference in half-precision (16-bit) floating-point format if the hardware supports these operations natively, IEEE FP16 inference is supported for every floating-point operator in the model, and the model’s `reduced_precision_support` metadata indicates that it is compatible with FP16 inference. FP16 inference can also be forced. More information is available here. If half precision has been used, then F16 will be present in the Name column:

FP16 inference has been used.

Here, unsigned quantized inference has been used (QU8).

QU8 indicates that unsigned quantized inference has been used.

Finally, in this example sparse inference has been used. Sparse operators require the data layout to change from NHWC to NCHW, as this is more efficient for sparse kernels. This can be seen in the operator name.

SPMM microkernel indicates that the operator is evaluated via SParse matrix-dense Matrix Multiplication. Note that sparse inference uses the NCHW layout (vs. the typical NHWC) for the operators.

Note that when some operators are delegated to XNNPACK and others are not (because XNNPACK does not support them), two sets of profile information are shown. The next step in this project is to merge the profile information from XNNPACK operators and TensorFlow Lite into one profile.

Next Steps

You can learn more about performance measurement and profiling in TensorFlow Lite by visiting this guide. Thanks for reading!

Read More

Adding Quantization-aware Training and Pruning to the TensorFlow Model Garden

Posted by Jaehong Kim, Rino Lee, and Fan Yang, Software Engineers

The TensorFlow model optimization toolkit (TFMOT) provides modern optimization techniques such as quantization aware training (QAT) and pruning. Since the introduction of TFMOT, we have been continuously improving its usability and coverage. Today, we are excited to announce that we are extending the TFMOT model coverage to popular computer vision models in the TensorFlow Model Garden.

To do so, we added 8-bit QAT API support for subclassed models and custom layers, and Pruning API support. You can use these new features in the model garden, and when developing your own models as well. With this, we have showcased applying QAT and pruning to several canonical computer vision models, while accelerating the model development cycle significantly.

In this article, we describe the technical challenges we encountered in applying QAT and pruning to subclassed models and custom layers, and we show the results to demonstrate the benefits of these optimization techniques.

New support for Model Garden models

Quantization

We resolved a few technical challenges to support subclassed models and simplified the process of applying the QAT API. All of the new changes are handled inside TFMOT and Model Garden, so users do not need to know the technical details. The final user-facing API for applying QAT to a computer vision model in Model Garden is straightforward: with a few configuration changes, and little to no code change, you can enable QAT to fine-tune a pre-trained model and obtain a deployable on-device model in just a few hours. Below we discuss those challenges and how we addressed them.

The previous QAT API assumed that the model contained only built-in layers. To support nested functional models, we apply the QAT method to different parts of the model individually. For example, consider applying QAT to an image classification model (M) in the Model Garden that consists of two sub-modules: the backbone network (B) and the classification head (C). Here B is a nested model within M, C is a layer, and both B and C contain only built-in layers. Instead of directly quantizing the entire classification model M, we quantize the backbone B and the classification head C individually. First, we apply QAT to the backbone B only. Then we connect the quantized backbone B to its corresponding classification head C to form a new classification model, and annotate C to be quantized. Finally, we quantize the entire new model, which effectively applies QAT to the annotated classification head C. A simplified sketch of the selective-quantization step is shown below.
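
To make the selective-quantization step concrete, here is a minimal sketch using the public TFMOT Keras API: only a stand-in classification head is annotated, and quantize_apply quantizes just the annotated layers. The layer shapes are hypothetical, and the real Model Garden recipe additionally quantizes the backbone first and nests it, as described above.

import tensorflow as tf
import tensorflow_model_optimization as tfmot

annotate = tfmot.quantization.keras.quantize_annotate_layer

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(inputs)  # stand-in backbone B
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = annotate(tf.keras.layers.Dense(10))(x)              # classification head C
annotated = tf.keras.Model(inputs, outputs)

# Only the annotated head is quantized; the other layers are left untouched.
qat_model = tfmot.quantization.keras.quantize_apply(annotated)
qat_model.summary()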

When the backbone network also contains custom layers rather than built-in layers, we add quantized versions of those custom layers first. For example, if the backbone network (B) or the classification head (C) of the classification model (M) also contain a custom layer called MyLayer, we create its QAT counterpart called MyLayerQuantized and wrap any built-in layers within it by a quantize wrapper API. We do this recursively if there are any nested custom layers, until all built-in layers are properly wrapped.

The remaining step after applying quantization is loading the weights from the original model, because the QAT-applied model contains additional quantization parameters and therefore more variables. Our current solution is variable-name filtering: we added logic that loads the weights from the original model into the matching, filtered variables of the QAT-applied model, which enables fine-tuning from pre-trained checkpoints. A rough sketch of this idea follows.
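
The snippet below is a rough sketch of that idea, not the actual TFMOT or Model Garden implementation; in particular, the "quant_" prefix and the name-matching rule are assumptions made purely for illustration.

def restore_pretrained_weights(original_model, qat_model):
    """Copy pre-trained weights into a QAT-applied model by matching names."""
    pretrained = {v.name: v for v in original_model.variables}
    for var in qat_model.variables:
        # Strip the wrapper prefix added by QAT (assumed here to be "quant_");
        # quantizer variables such as min/max ranges have no pre-trained
        # counterpart and are simply skipped.
        key = var.name.replace("quant_", "")
        if key in pretrained:
            var.assign(pretrained[key])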

Pruning

Along with QAT, we provide two Model Garden models with pruning support, which is another in-training model optimization technique in TFMOT. Pruning sparsifies (forces a fixed portion of the elements to zero) the given model’s weights during training for computation and storage efficiency.

Users can easily set pruning parameters in the Model Garden configs. For better pruned-model quality, starting pruning from a pre-trained dense model and carefully tuning the pruning schedule over training steps are well-known techniques; both are available in the Model Garden pruning configs.

This work also provides an example of nested functional layer support in pruning. The approach we used here, based on get_prunable_weights(), is also applicable to any other Keras model with custom layers; a sketch is shown below.
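
For reference, here is a hedged sketch of what that looks like for a hypothetical custom layer: it implements the PrunableLayer interface so that prune_low_magnitude knows which weights to sparsify.

import tensorflow as tf
import tensorflow_model_optimization as tfmot

class MyDense(tf.keras.layers.Layer, tfmot.sparsity.keras.PrunableLayer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        self.kernel = self.add_weight(
            "kernel", shape=(int(input_shape[-1]), self.units))
        self.bias = self.add_weight(
            "bias", shape=(self.units,), initializer="zeros")

    def call(self, inputs):
        return tf.matmul(inputs, self.kernel) + self.bias

    def get_config(self):
        return dict(super().get_config(), units=self.units)

    def get_prunable_weights(self):
        # Only the kernel is sparsified; the bias stays dense.
        return [self.kernel]

# Wrap a toy model containing the custom layer with the pruning API.
model = tf.keras.Sequential([tf.keras.Input(shape=(16,)), MyDense(8)])
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=1000))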

With the two provided Model Garden pruning configs, users can quickly apply pruning to ResNet-50 and MobileNetV2 models for image classification. Understanding the practical usage of the Pruning API and monitoring the pruning process with TensorBoard are additional takeaways of this work.

Examples and Results

We support two tasks, image classification and semantic segmentation. Specifically, for QAT in image classification, we support the common MobileNet family, including MobileNetV2, MobileNetV3 (large), Multi-Hardware MobileNet (AVG), and ResNet (through quantization of common building blocks such as InvertedBottleneckBlockQuantized and BottleneckBlockQuantized). For QAT in semantic segmentation, we support the MobileNetV2 backbone with DeepLab V3/V3+. For pruning in image classification, we support MobileNetV2 and ResNet. Please refer to the QAT and pruning documentation for more details.

Create QAT Models using Model Garden

Using QAT with Model Garden is simple and straightforward. First, we train a floating-point model following the standard process of training models using Model Garden. After training converges, we take the best checkpoint as our starting point to apply QAT, analogous to a fine-tuning stage. After this stage we obtain a model that is more quantization friendly, which can then be converted to a TFLite model for on-device deployment, as sketched below.
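
As a rough sketch, assuming qat_model is the QAT fine-tuned Keras model from the previous step, the conversion looks like this:

import tensorflow as tf

# Convert the QAT model to a quantized TFLite model for on-device deployment.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_bytes = converter.convert()

with open("model_qat_int8.tflite", "wb") as f:
    f.write(tflite_bytes)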

For image classification, we evaluate the top-1 accuracy on the ImageNet validation set. As shown in Table 1, the QAT model consistently outperforms the PTQ model by a large margin while achieving comparable latency. Notably, on models where PTQ fails to produce reasonable results (MobileNetV3), QAT still generates a strong quantized model with a negligible accuracy drop.

Table 1. Accuracy and latency comparison of supported models for ImageNet classification. Latency is measured on a Samsung Galaxy S21 using 1-thread CPU. FP32 refers to the unquantized floating point TFLite model. PTQ INT8 refers to full integer post-training quantization. QAT INT8 refers to the quantized QAT model.

| model | resolution | Top-1 accuracy | TFLite Top-1 accuracy (FP32) | TFLite Top-1 accuracy (PTQ INT8) | TFLite Top-1 accuracy (QAT INT8) | TFLite latency (FP32, ms/img) | TFLite latency (PTQ INT8, ms/img) | TFLite latency (QAT INT8, ms/img) |
|---|---|---|---|---|---|---|---|---|
| ResNet50 | 224×224 | 76.7 | 76.7 | 76.4 | 77.2 | 184.01 | 48.73 | 64.49 |
| MobileNet V2 | 224×224 | 72.8 | 72.8 | 72.4 | 72.8 | 16.74 | 6.85 | 6.84 |
| MobileNet V3 Large | 224×224 | 75.1 | 75.1 | 34.5* | 74.4 | 13.32 | 6.43 | 6.85 |
| MobileNet Multi-HW AVG | 224×224 | 75.3 | 75.2 | 73.5 | 75.1 | 20.97 | 7.73 | 7.73 |

* PTQ fails to quantize MobileNet V3 properly due to hard-swish activation, thus leading to low accuracy.

We have a similar observation for semantic segmentation: PTQ introduces a 1.3 mIoU drop compared to the FP32 model, while the QAT model reduces the drop to just 0.7 mIoU and maintains comparable latency. On average, we expect QAT to introduce only about a 0.5-point top-1 accuracy drop for image classification and less than a 1-point mIoU drop for semantic segmentation.

Table 2. Accuracy and latency comparison of a MobileNet v2 + DeepLab v3 on Pascal VOC segmentation. Latency is measured on a Samsung Galaxy S21 using 1-thread CPU. FP32 refers to the unquantized floating point TFLite model. PTQ INT8 refers to full integer post-training quantization. QAT INT8 refers to the quantized QAT model.

| model | resolution | mIoU | TFLite mIoU (FP32) | TFLite mIoU (PTQ INT8) | TFLite mIoU (QAT INT8) | TFLite latency (FP32, ms/img) | TFLite latency (PTQ INT8, ms/img) | TFLite latency (QAT INT8, ms/img) |
|---|---|---|---|---|---|---|---|---|
| MobileNet v2 + DeepLab v3 | 512×512 | 75.27 | 75.30 | 73.95 | 74.68 | 136.60 | 60.94 | 55.53 |

Pruning Models in Model Garden

We support ResNet50 and MobileNet V2 for image classification. Pre-trained dense models for each task are generated using the Model Garden training configs, and the pruned model can be converted to a TFLite model. By simply setting a sparsity flag during TFLite conversion, you get the benefit of a smaller model through the sparse data format, as sketched below.
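
Here is a hedged sketch of that conversion, assuming pruned_model is a Keras model trained with prune_low_magnitude (for example, the output of the earlier pruning sketch):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Remove the pruning wrappers before conversion.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

# The experimental sparsity optimization lets TFLite store weights in a sparse format.
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
tflite_bytes = converter.convert()

with open("pruned_model.tflite", "wb") as f:
    f.write(tflite_bytes)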

For image classification, we again evaluate the top-1 accuracy on the ImageNet validation set, as well as the size of the converted TFLite models. As the sparsity level increases, the model size becomes more compact but accuracy degrades. The accuracy impact at high sparsity is more severe for parameter-efficient models such as MobileNetV2.

Table 3. Accuracy and model size comparison of ResNet-50 and MobileNet v2 for ImageNet classification. Model size is measured by the disk usage of the saved TFLite models. Dense refers to the unpruned TFLite model, and 50% sparsity refers to the TFLite model in which 50% of the elements of every prunable layer's weights have been pruned.

| Model | Resolution | Top-1 accuracy (Dense) | Top-1 accuracy (50% sparsity) | Top-1 accuracy (80% sparsity) | TFLite model size (Dense) | TFLite model size (50% sparsity) | TFLite model size (80% sparsity) |
|---|---|---|---|---|---|---|---|
| MobileNet V2 | 224×224 | 72.768% | 71.334% | 61.378% | 13.36 Mb | 9.74 Mb | 4.00 Mb |
| ResNet50 | 224×224 | 76.704% | 76.61% | 75.508% | 97.44 Mb | 70.34 Mb | 28.35 Mb |

Conclusions

We have presented an extension to TFMOT that offers QAT and pruning support for computer vision models in Model Garden. We highlighted its ease of use and the excellent trade-offs it achieves, maintaining accuracy while keeping latency low or the model size small.

While we believe this is a simple and user-friendly way to enable QAT and pruning, we know it is just the beginning of our work to streamline the experience and provide even better usability.

Currently, the supported tasks are limited to image classification and semantic segmentation. We will continue to add support for other tasks, such as object detection and instance segmentation, add more models, such as transformer-based models, and improve the usability of the TFMOT and Model Garden APIs. Thanks for your interest in this work.

Acknowledgements

We would like to thank everyone who contributed to this work, including Model Garden, Model Optimization, and our collaborators from Research. Special thanks to David Rim (emeritus), Ethan Kim (emeritus) from the Model Optimization team; Abdullah Rashwan, Xianzhi Du, Yeqing Li, Jaeyoun Kim, Jing Li from the Model Garden team; Yuqi Li from the on-device ML team.

Read More

Memory-efficient inference with XNNPack weights cache

Posted by Zhi An Ng and Marat Dukhan, Google

XNNPack is the default TensorFlow Lite CPU inference engine for floating-point models, and delivers meaningful speedups across mobile, desktop, and Web platforms. One of the optimizations employed in XNNPack is repacking the static weights of the Convolution, Depthwise Convolution, Transposed Convolution, and Fully Connected operators into an internal layout optimized for inference computations. During inference, the repacked weights are accessed in a sequential pattern that is friendly to the processors’ pipelines.

The inference latency reduction comes at a cost: repacking essentially creates an extra copy of the weights inside XNNPack. When the TensorFlow Lite model is memory-mapped, the operating system eventually releases the original copy of the weights and makes the overhead disappear. However, some use-cases require creating multiple copies of a TensorFlow Lite interpreter, each with its own XNNPack delegate, for the same model. As the XNNPack delegates belonging to different TensorFlow Lite interpreters are unaware of each other, every one of them creates its own copy of repacked weights, and the memory overhead grows linearly with the number of delegate instances. Furthermore, since the original weights in the model are static, the repacked weights in XNNPack are also the same across all instances, making these copies wasteful and unnecessary.

Weights cache is a mechanism that allows multiple instances of the XNNPack delegate accelerating the same model to optimize their memory usage for repacked weights. With a weights cache, all instances use the same underlying repacked weights, resulting in a constant memory usage, no matter how many interpreter instances are created. Moreover, elimination of duplicates due to weights cache may improve performance through increased efficiency of a processor’s cache hierarchy. Note: the weights cache is an opt-in feature available only via the C++ API.

The chart below shows the high water mark memory usage (vertical axis) of creating multiple instances (horizontal axis). It compares the baseline, which does not use weights cache, with using weights cache with soft finalization. The peak memory usage when using weights cache grows much slower with respect to the number of instances created. For this example, using weights cache allows you to double the number of instances created with the same peak memory budget.

The weights cache object is created by the TfLiteXNNPackDelegateWeightsCacheCreate function and passed to the XNNPack delegate via the delegate options. The XNNPack delegate will then use the weights cache to store repacked weights. Importantly, the weights cache must be finalized before any inference invocation.

// Example demonstrating how to create and finalize a weights cache.
std::unique_ptr<tflite::Interpreter> interpreter;
TfLiteXNNPackDelegateWeightsCache* weights_cache =
    TfLiteXNNPackDelegateWeightsCacheCreate();
TfLiteXNNPackDelegateOptions xnnpack_options =
    TfLiteXNNPackDelegateOptionsDefault();
xnnpack_options.weights_cache = weights_cache;
TfLiteDelegate* delegate =
    TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
  // Handle the failure: the delegate could not be applied.
}
// At this point the static weights have been packed and written into
// weights_cache.
TfLiteXNNPackDelegateWeightsCacheFinalizeHard(weights_cache);

// Calls to interpreter->Invoke and interpreter->AllocateTensors must
// be made here, between finalization and deletion of the cache.
// After the hard finalization any attempts to create a new XNNPack
// delegate instance using the same weights cache object will fail.

TfLiteXNNPackDelegateWeightsCacheDelete(weights_cache);

There are two ways to finalize a weights cache, and in the example above we use TfLiteXNNPackDelegateWeightsCacheFinalizeHard which performs hard finalization. The hard finalization has the least memory overhead, as it will trim the memory used by the weights cache to the absolute minimum. However, no new delegates can be created with this weights cache object after the hard finalization – the number of XNNPack delegate instances using this cache is fixed in advance. The other kind of finalization is a soft finalization. Soft finalization has higher memory overhead, as it leaves sufficient space in the weights cache for some internal bookkeeping. The advantage of the soft finalization is that the same weights cache can be used to create new XNNPack delegate instances, provided that the delegate instances use exactly the same model. This is useful if the number of delegate instances is not fixed or known beforehand.

// Example demonstrating soft finalization and creating multiple
// XNNPack delegate instances using the same weights cache.
std::unique_ptr<tflite::Interpreter> interpreter;
TfLiteXNNPackDelegateWeightsCache* weights_cache =
    TfLiteXNNPackDelegateWeightsCacheCreate();
TfLiteXNNPackDelegateOptions xnnpack_options =
    TfLiteXNNPackDelegateOptionsDefault();
xnnpack_options.weights_cache = weights_cache;
TfLiteDelegate* delegate =
    TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
  // Handle the failure: the delegate could not be applied.
}
// Static weights have been packed and written into weights_cache.
TfLiteXNNPackDelegateWeightsCacheFinalizeSoft(weights_cache);

// Calls to interpreter->Invoke and interpreter->AllocateTensors can
// be made here, between finalization and deletion of the cache.
// Notably, new XNNPack delegate instances using the same cache can
// still be created, so long as they are used for the same model.

std::unique_ptr<tflite::Interpreter> new_interpreter;
TfLiteDelegate* new_delegate =
    TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (new_interpreter->ModifyGraphWithDelegate(new_delegate) != kTfLiteOk) {
  // Handle the failure: the delegate could not be applied.
}
// The repacked weights inside the weights cache are reused, so there is
// no growth in memory usage.

// Calls to new_interpreter->Invoke and
// new_interpreter->AllocateTensors can be made here.
// More interpreters with XNNPack delegates can be created as needed.

TfLiteXNNPackDelegateWeightsCacheDelete(weights_cache);

Next steps

With the weights cache, using XNNPack for batch inference will reduce memory usage, leading to better performance. Read more about how to use weights cache with XNNPack at the README and report any issues at XNNPack’s GitHub page.

To stay up to date, you can read the TensorFlow blog, follow twitter.com/tensorflow, or subscribe to youtube.com/tensorflow. If you’ve built something you’d like to share, please submit it for our Community Spotlight at goo.gle/TFCS. For feedback, please file an issue on GitHub or post to the TensorFlow Forum. Thank you!

Read More

New documentation on tensorflow.org

Posted by the TensorFlow team

As Google I/O took place, we published a lot of exciting new docs on tensorflow.org, including updates to model parallelism and model remediation, TensorFlow Lite, and the TensorFlow Model Garden. Let’s take a look at what new things you can learn about!

Counterfactual Logit Pairing

The Responsible AI team added a new model remediation technique as part of their Model Remediation library. The TensorFlow Model Remediation library provides training-time techniques to intervene on the model such as changing the model itself by introducing or altering model objectives. Originally, model remediation launched with its first technique, MinDiff, which minimizes the difference in performance between two slices of data.

New at I/O is Counterfactual Logit Pairing (CLP). This is a technique that seeks to ensure that a model’s prediction doesn’t change when a sensitive attribute referenced in an example is either removed or replaced. For example, in a toxicity classifier, examples such as “I am a man” and “I am a lesbian” should receive the same prediction and not be classified as toxic.

Check out the basic tutorial, the Keras tutorial, and the API reference.

Model parallelism: DTensor

DTensor provides a global programming model that allows developers to operate on tensors globally while managing distribution across devices. DTensor distributes the program and tensors according to the sharding directives through a procedure called Single program, multiple data (SPMD) expansion.

By decoupling the overall application from sharding directives, DTensor enables running the same application on a single device, multiple devices, or even multiple clients, while preserving its global semantics. If you remember Mesh TensorFlow from TF1, DTensor can address the same issue that Mesh addressed: training models that may be larger than a single core.

With TensorFlow 2.9, we made DTensor, which had previously been available only in nightly builds, visible on tensorflow.org. Although DTensor is experimental, you’re welcome to try it out. Check out the DTensor Guide, the DTensor Keras Tutorial, and the API reference. A minimal sketch of the API is shown below.
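
As a taste of the API, here is a minimal sketch based on the DTensor guide. It assumes TensorFlow 2.9 or later and creates two virtual CPU devices so the example can run on a single machine; the mesh and layout names are arbitrary.

import tensorflow as tf
from tensorflow.experimental import dtensor

# Expose two virtual CPU devices so the sketch runs on one machine.
cpu = tf.config.list_physical_devices("CPU")[0]
tf.config.set_logical_device_configuration(
    cpu, [tf.config.LogicalDeviceConfiguration()] * 2)

# A 1-D mesh with a dimension named "x" spanning the two devices.
mesh = dtensor.create_mesh([("x", 2)], devices=["CPU:0", "CPU:1"])

# Shard the first tensor axis across the "x" mesh dimension; replicate the second.
layout = dtensor.Layout(["x", dtensor.UNSHARDED], mesh)
sharded = dtensor.call_with_layout(tf.ones, layout, shape=(4, 4))
print(dtensor.fetch_layout(sharded))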

New in TensorFlow Lite

We made some big changes to the TensorFlow Lite site, including to the getting started docs.

Developer Journeys

First off, we now organize the developer journeys by platform (Android, iOS, and other edge devices) to make it easier to get started with your platform. Android gained a new learning roadmap and quickstart. We also earlier added a guide to the new beta for TensorFlow Lite in Google Play services. These quickstarts include examples in both Kotlin and Java, and upgrade our example code to CameraX, as recommended by our colleagues in Android developer relations!

If you want to immediately run an Android sample, one can now be imported directly from Android studio. When starting a new project, choose: New Project > Import Sample… and look for Artificial Intelligence > TensorFlow Lite in Play Services image classification example application. This is the sample that can help you find your mug…or other objects:

Model Maker

The TensorFlow Lite Model Maker library simplifies the process of training a TensorFlow Lite model using custom datasets. It uses transfer learning to reduce the amount of training data required and reduce training time, and comes pre-built with seven common tasks including image classification, object detection, and text search.

We added a new tutorial for text search. This type of model lets you take a text query and search for the most related entries in a text dataset, such as a database of web pages. On mobile, you might use this for auto reply or semantic document search.

We also published the full Python library reference.

TF Lite model page

Finding the right model for your use case can sometimes be confusing. We’ve written more guidance on how to choose the right model for your task, and what to consider to make that decision. You can also find links to models for common use cases.

Model Garden: State of the art models ready to go

The TensorFlow Model Garden provides implementations of many state-of-the-art machine learning (ML) models for vision and natural language processing (NLP), as well as workflow tools to let you quickly configure and run those models on standard datasets. The Model Garden covers both vision and text tasks, and a flexible training loop library called Orbit. Models come with pre-built configs to train to state-of-the-art, as well as many useful specialized ops.

We’re just getting started documenting all the great things you can do with the Model Garden. Your first stops should be the overview, lists of available models, and the image classification tutorial.

Other exciting things!

Don’t miss the crown-of-thorns starfish detector! Find your own COTS on real images from the Great Barrier reef. See the video, read the blog post, and try out the model in Colab yourself.

Also, there is a new tutorial on TensorFlow compression, which does lossy compression using neural networks. This example uses something like an autoencoder to compress and decompress MNIST.

And, of course, don’t miss all the great I/O talks you can watch on YouTube. Thank you!

Read More

OCR in the browser using TensorFlow.js

A guest post by Charles Gaillard, Mindee

Introduction

Optical Character Recognition (OCR) refers to technologies capable of capturing text elements from images or documents and converting them into a machine-readable text format. If you want to learn more on that topic, this article is a good introduction.

At Mindee, we have developed an open-source Python-based OCR called DocTR; however, we also wanted to deploy it in the browser to ensure that it is accessible to all developers, especially as roughly 70% of developers choose to use JavaScript.

We managed to achieve this using the TensorFlow.js API, which resulted in a web demo that you can now try for yourself using images of your own.

The demo interface with a picture of 2 receipts being parsed by the OCR: 89 words were found here

This demo is designed to be very simple to use and to run quickly on most computers, so we provide a single pretrained model that was trained with a small (512 x 512) input size to save memory. Images are resized to squares, so the model generalizes well to most documents with an aspect ratio close to 1: cards, smaller receipts, tickets, A4 pages, etc. For rectangles with a very high aspect ratio, segmentation results might not be as good because we don’t preserve the aspect ratio (with padding) at the text detection step. The model is optimized to work on documents with a significant word size (for example receipts, cards, etc). Keep in mind that these models have been designed to offer performance while running in the browser, so performance might not be optimal on documents where the writing is very small relative to the size of the document, or on images with a very high aspect ratio.

Dive into the architecture

OCR models can be divided into 2 parts: A detection model and a text recognition model. In DocTR, the detection model is a CNN (convolutional neural network) which segments the input image to find text areas, then text boxes are cropped around each detected word and sent to a recognition model. The second model is a convolutional recurrent neural network (CRNN), which extracts features from word-images and then decodes the sequence of letters on the image with recurrent layers (LSTM).

Global architecture of the OCR model used in this Demo

Detection model

We have different architectures implemented in DocTR, but we chose a very light one for use on the client side as device hardware can change from person to person. Here we used a mobilenetV2 backbone with a DB (Differentiable Binarization) head. The implementation details can be found in the DocTR Github. We trained this model with an input size of (512, 512, 3) to decrease latency and memory usage. We have a private dataset composed of 130,000 annotated documents that was used to train this model.

Recognition model

The recognition model we used is also our lighter architecture: a CRNN (convolutional recurrent neural network) with a mobilenetV2 backbone. More information on this architecture can be found here. It is basically composed of the first half of the mobilenetV2 layers to extract features, followed by 2 bi-LSTMs to decode the visual features as character sequences (words). It uses the CTC loss, introduced by Alex Graves, to decode sequences efficiently. We use an input size of (32, 128, 3) for word images in this model, with padding to preserve the aspect ratio of the crops. It is trained on our private dataset, composed of 11 million text boxes extracted from different documents. This dataset has a wide variety of fonts, since it is composed of documents that come from many different data sources. We used data augmentation so that the model generalizes well across different fonts, backgrounds, and renderings. It should also give decent results on handwritten text as long as it is human-readable.

Model conversion & code implementation

As our model was originally implemented in TensorFlow with Python, it had to be converted to run in the web browser at scale. To do this, we exported a TensorFlow SavedModel for each trained Python model and used the tensorflowjs_converter command-line tool to quickly convert the saved models to the TensorFlow.js JSON format required for execution in the browser. A rough sketch of this flow is shown below.
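
The snippet below is a hypothetical sketch of that flow, with a toy Keras model standing in for the DocTR detection and recognition models; the paths are illustrative and not the exact DocTR export code.

import tensorflow as tf

# A toy model standing in for a trained DocTR model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(512, 512, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
])

# Export it as a TensorFlow SavedModel.
tf.saved_model.save(model, "export/saved_model")

# The SavedModel is then converted for the browser with the CLI, for example:
#   pip install tensorflowjs
#   tensorflowjs_converter --input_format=tf_saved_model export/saved_model web_model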

The resulting converted models were then integrated into our React.js front end application that powered the user interface of the demo. More precisely, we used MUI to design the components of the interface for our in-house front-end SDK react-mindee-js (which provides computer vision tools) and OpenCV.js for the detection model post processing. This post processing step took the raw binarized segmentation map and converted it to a list of polygons with OpenCV.js functions. We could then crop those boxes from the source image to finally obtain word images ready to be sent to the recognition model.

Speed & performance

We had to manage the tradeoff between speed and performance efficiently. OCR models are quite slow because you have 2 tasks (text areas segmentation + words recognition) that can’t be parallelized, so we had to use lightweight models to ensure speedy execution on most devices.

On a modern computer with an RTX 2060 and a 9th-gen i7, the detection task takes around 750 milliseconds per image, and the recognition model takes around 170 milliseconds per batch of 32 crops (words) with the WebGL backend, as benchmarked with the TensorFlow.js benchmarking tool.

Wrapping up the 2 models and the vision operations (detection post-processing), the end-to-end OCR runs in less than 2 seconds on small documents (fewer than 100 words), and prediction can take a few seconds more on very dense documents with a lot of words.

A screenshot of the demo interface with a very dense old A4 document being parsed by the OCR: 738 words are identified.

Conclusion

This demo, powered by TensorFlow.js, is a way to give almost everyone access to an online, relatively fast and robust document OCR, and it is one of the first of its kind to run entirely in the browser.

Since we execute the model on the client side, exact performance will vary depending on the hardware of the device it runs on. However, the goal here is to demonstrate that even complex, state-of-the-art deep learning models can be deployed in the browser and run efficiently on almost every machine. This can be very useful, especially for potentially sensitive document information that you do not want to send to the cloud for analysis.

We are excited to offer this solution for all to use, and keen to follow the future of the Web ML industry, where things will no doubt get faster with time as new web standards like WebGPU become mainstream and enabled by default on modern web browsers.

Read More

5 steps to go from a notebook to a deployed model

Posted by Nikita Namjoshi, Google Cloud Developer Advocate

When you start working on a new machine learning problem, I’m guessing the first environment you use is a notebook. Maybe you like running Jupyter in a local environment, using a Kaggle Kernel, or my personal favorite, Colab. With tools like these, creating and experimenting with machine learning is becoming increasingly accessible. But while experimentation in notebooks is great, it’s easy to hit a wall when it comes time to elevate your experiments up to production scale. Suddenly, your concerns are more than just getting the highest accuracy score.

What if you have a long running job, want to do distributed training, or host a model for online predictions? Or maybe your use case requires more granular permissions around security and data privacy. What is your data going to look like at serving time, how will you handle code changes, and how will you monitor the performance of your model over time?

Making production applications or training large models requires additional tooling to help you scale beyond just code in a notebook, and using a cloud service provider can help. But that process can feel a bit daunting. Take a look at the full list of Google Cloud products, and you might be completely unsure where to start.

So to make your journey a little easier, I’ll show you a fast path from experimental notebook code to a deployed model in the cloud.

The code used in this sample can be found here. This notebook trains an image classification model on the TF Flowers dataset. You’ll see how to deploy this model in the cloud and get predictions on a new flower image via a REST endpoint.

Note that you’ll need a Google Cloud project with billing enabled to follow this tutorial. If you’ve never used Google Cloud before, you can follow these instructions to set up a project and get $300 in free credits to experiment with.

Here are the five steps you’ll take:

  1. Create a Vertex AI Workbench managed notebook
  2. Upload .ipynb file
  3. Launch notebook execution
  4. Deploy model
  5. Get predictions

Create a Vertex AI Workbench managed notebook

To train and deploy the model, you’ll use Vertex AI, which is Google Cloud’s managed machine learning platform. Vertex AI contains lots of different products that help you across the entire lifecycle of an ML workflow. You’ll use a few of these products today, starting with Workbench, which is the managed notebook offering.

Under the Vertex AI section of the cloud console, select “Workbench”. Note that if this is the first time you’re using Vertex AI in a project, you’ll be prompted to enable the Vertex API and the Notebooks API. So be sure to click the button in the UI to do so.

Next, select MANAGED NOTEBOOKS, and then NEW NOTEBOOK.

Under Advanced Settings you can customize your notebook by specifying the machine type and location, adding GPUs, providing custom containers, and enabling terminal access. For now, keep the default settings and just provide a name for your notebook. Then click CREATE.

You’ll know your notebook is ready when you see the OPEN JUPYTERLAB text turn blue. The first time you open the notebook, you’ll be prompted to authenticate and you can follow the steps in the UI to do so.

When you open the JupyterLab instance, you’ll see a few different notebook options. Vertex AI Workbench provides different kernels (TensorFlow, R, XGBoost, etc), which are managed environments preinstalled with common libraries for data science. If you need to add additional libraries to a kernel, you can use pip install from a notebook cell, just like you would in Colab.

Step one is complete! You’ve created your managed JupyterLab environment.

Upload .ipynb file

Now it’s time to get our TensorFlow code into Google Cloud. If you’ve been working in a different environment (Colab, local, etc), you can upload any code artifacts you need to your Vertex AI Workbench managed notebook, and you can even integrate with GitHub. In the future, you can do all of your development right in Workbench, but for now let’s assume you’ve been using Colab.

Colab notebooks can be exported as .ipynb files.

You can upload the file to Workbench by clicking the “upload files” icon.

When you open the notebook in Workbench, you’ll be prompted to select the kernel, which is the environment where your notebook is run. There are a few different kernels you can choose from, but since this code sample uses TensorFlow, you’ll want to select the TensorFlow 2 kernel.

After you select the kernel, any cells you execute in your notebook will run in this managed TensorFlow environment. For example, if you execute the import cell, you’ll see that you can import TensorFlow, TensorFlow Datasets, and NumPy. This is because all of these libraries are included in the Vertex AI Workbench TensorFlow 2 kernel. Unsurprisingly, if you try to execute that same notebook cell in the XGBoost kernel, you’ll see an error message since TensorFlow is not installed there.

Launch a notebook execution

While we could run the rest of the notebook cells manually, for models that take a long time to train, a notebook isn’t always the most convenient option. And if you’re building an application with ML, it’s unlikely that you’ll only need to train your model once. Over time, you’ll want to retrain your model to make sure it stays fresh and keeps producing valuable results.

Manually executing the cells of your notebook might be the right option when you’re getting started with a new machine learning problem. But when you want to automate experimentation at a large scale, or retrain models for a production application, a managed ML training option will make things much easier.

The quickest way to launch a training job is through the notebook execution feature, which will run the notebook cell by cell on the Vertex AI managed training service.

When you launch the training job, it’s going to run on a machine you won’t have access to after the job completes. So you don’t want to save the TensorFlow model artifacts to a local path. Instead, you’ll want to save to Cloud Storage, which is Google Cloud’s object storage, meaning you can store images, csv files, txt files, saved model artifacts. Just about anything.

Cloud Storage has the concept of a “bucket”, which is what holds your data. You can create buckets via the UI (or programmatically, as sketched below). Everything you store in Cloud Storage must be contained in a bucket, and within a bucket you can create folders to organize your data.
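
If you prefer to create the bucket programmatically rather than in the UI, here is a hedged sketch using the google-cloud-storage client; the bucket name and location are placeholders, and bucket names must be globally unique.

from google.cloud import storage

# Uses the project and credentials already configured in the environment.
client = storage.Client()
bucket = client.create_bucket("nikita-flower-demo-bucket", location="us-central1")
print(f"Created bucket gs://{bucket.name}")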

Each file in Cloud Storage has a path, just like a file on your local filesystem. Except that Cloud Storage paths always start with gs://

You’ll want to update your training code so that you’re saving to a Cloud Storage bucket instead of a local path.

For example, here I’ve updated the last cell of the notebook, which was model.save('model_ouput'). Instead of saving locally, I’m now saving the artifacts to a bucket called nikita-flower-demo-bucket that I’ve created in my project, as shown below.
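
A minimal sketch of that change, assuming model is the trained Keras model from the notebook and the bucket already exists:

# Save the TensorFlow model artifacts to Cloud Storage instead of a local path.
# `model` is the trained Keras model; replace the bucket name with your own.
model.save("gs://nikita-flower-demo-bucket/model_output")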

Now we’re ready to launch the execution.

Select the Execute button, give your execution a name, then add a GPU. Under Environment, select the TensorFlow 2.7 GPU image. This container comes preinstalled with TensorFlow and many other data science libraries.

Then click SUBMIT.

You can track the status of your training job in the EXECUTIONS tab. The notebook and the output of each cell will be visible under VIEW RESULT when the job finishes and is stored in a GCS bucket. This means you can always tie a model run back to the code that was executed.

When the training completes you’ll be able to see the TensorFlow saved model artifacts in your bucket.

Deploy to endpoint

Now you know how to quickly launch serverless training jobs on Google Cloud. But ML is not just about training. What’s the point of all this effort if we don’t actually use the model to do something, right?

Just like with training, we could execute predictions directly from our notebook by calling model.predict. But when we want to get predictions for lots of data, or get low latency predictions on the fly, we’re going to need something more powerful than a notebook.

Back in your Vertex AI Workbench managed notebook, you can paste the code below in a cell, which will use the Vertex AI Python SDK to deploy the model you just trained to the Vertex AI Prediction service. Deploying the model to an endpoint associates the saved model artifacts with physical resources for low latency predictions.

First, import the Vertex AI Python SDK.

from google.cloud import aiplatform

Then, upload your model to the Vertex AI Model Registry. You’ll need to give your model a name, and provide a serving container image, which is the environment where your predictions will run. Vertex AI provides pre-built containers for serving, and in this example we’re using the TensorFlow 2.8 image.

You’ll also need to replace artifact_uri with the path to the bucket where you stored your saved model artifacts. For me, that was “nikita-flower-demo-bucket”. You’ll also need to replace project with your project ID.

my_model = aiplatform.Model.upload(
    display_name='flower-model',
    artifact_uri='gs://{YOUR_BUCKET}',
    serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest',
    project={YOUR_PROJECT})

Then deploy the model to an endpoint. I’m using default values for now, but if you’d like to learn more about traffic splitting, and autoscaling, be sure to check out the docs. Note that if your use case does not require low latency predictions, you don’t need to deploy the model to an endpoint and can use the batch prediction feature instead.

endpoint = my_model.deploy(
    deployed_model_display_name='my-endpoint',
    traffic_split={"0": 100},
    machine_type="n1-standard-4",
    accelerator_count=0,
    min_replica_count=1,
    max_replica_count=1,
)

Once the deployment has completed, you can see your model and endpoint in the console

Get predictions

Now that this model is deployed to an endpoint, you can hit it like any other REST endpoint. This means you can integrate your model into a downstream application and get predictions from it.

For now, let’s just test it out directly within Workbench.

First, open a new TensorFlow notebook.

In the notebook, import the Vertex AI Python SDK.

from google.cloud import aiplatform

Then, create your endpoint, replacing project_number and endpoint_id.

endpoint = aiplatform.Endpoint(
    endpoint_name="projects/{project_number}/locations/us-central1/endpoints/{endpoint_id}")

You can find your endpoint_id in the Endpoints section of the cloud Console.

You can find your Project Number on the home page of the console. Note that this is different from the Project ID.

When you send a request to an online prediction server, the request is received by an HTTP server. The HTTP server extracts the prediction request from the HTTP request content body. The extracted prediction request is forwarded to the serving function. The basic format for online prediction is a list of data instances. These can be either plain lists of values or members of a JSON object, depending on how you configured your inputs in your training application.

To test the endpoint, I first uploaded an image of a flower to my workbench instance.

The code below opens and resizes the image with PIL, and converts it into a numpy array.

import numpy as np
from PIL import Image

IMAGE_PATH = 'test_image.jpg'

im = Image.open(IMAGE_PATH)
im = im.resize((150, 150))

Then, we convert our numpy data to type float32 and to a list. We convert to a list because numpy data is not JSON serializable, so we can’t send it in the body of our request. Note that we don’t need to scale the data by 255 because that step was included as part of our model architecture using tf.keras.layers.Rescaling(1./255). To avoid having to resize our image, we could have added tf.keras.layers.Resizing to our model instead of making it part of the tf.data pipeline, as sketched after the conversion code below.

# convert to float32 list
x_test = [np.asarray(im).astype(np.float32).tolist()]
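
For completeness, here is a hedged sketch of the alternative mentioned above: baking the resizing (alongside the existing rescaling) into the model itself so clients could send images of any size. The 150×150 target and 1/255 scale mirror this example; the rest of the classifier is omitted.

import tensorflow as tf

# Preprocessing layers built into the model graph itself.
inputs = tf.keras.Input(shape=(None, None, 3))
x = tf.keras.layers.Resizing(150, 150)(inputs)
x = tf.keras.layers.Rescaling(1.0 / 255)(x)
# ... the rest of the flower classification layers would follow here.
preprocessing_stub = tf.keras.Model(inputs, x)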

Then, we call predict:

endpoint.predict(instances=x_test).predictions

The result you get is the output of the model, which is a softmax layer with 5 units. Looks like class at index 2 (tulips) scored the highest.

[[0.0, 0.0, 1.0, 0.0, 0.0]]

Tip: to save costs, be sure to undeploy your endpoint if you’re not planning to use it! You can undeploy by going to the Endpoints section of the console, selecting the endpoint, and then selecting the Undeploy model from endpoint option. You can always redeploy in the future if needed.

For more realistic examples, you’ll probably want to directly send the image itself to the endpoint, instead of loading it in NumPy first. If you’d like to see an example, check out this notebook.

What’s Next

You now know how to get from notebook experimentation to deployment in the cloud. With this framework in mind, I hope you start thinking about how you can build new ML applications with notebooks and Vertex AI.

If you’re interested in learning even more about how to use Google Cloud to get your TensorFlow models into production, be sure to register for the upcoming Google Cloud Applied ML Summit. This virtual event is scheduled for 9th June and brings together the world’s leading professional machine learning engineers and data scientists. Connect with other ML engineers and data scientists and discover new ways to speed up experimentation, quickly get into production, scale and manage models, and automate pipelines to deliver impact. Reserve your seat today!

Read More

Real-time SKU detection in the browser using TensorFlow.js

Posted by Hugo Zanini, Data Product Manager

Last year, I published an article on how to train custom object detection in the browser using TensorFlow.js. This received lots of interest from developers all over the world, who tried to apply the solution to their personal or business projects. While answering readers’ questions on my first article, I noticed a few difficulties in adapting the solution to large datasets and in deploying the resulting model in production using the new version of TensorFlow.js.

Therefore, the goal of this article is to share a solution for a well-known problem in the consumer packaged goods (CPG) industry: real-time and offline SKU detection using TensorFlow.js.

Offline SKU detection running in real time on a smartphone using TensorFlow.js

The problem

Items consumed frequently by consumers (foods, beverages, household products, etc) require an extensive routine of replenishment and placement of those products at their point of sale (supermarkets, convenience stores, etc).

Over the past few years, researchers have shown repeatedly that about two-thirds of purchase decisions are made after customers enter the store. One of the biggest challenges for consumer goods companies is to guarantee the availability and correct placement of their product in-stores.

At stores, teams organize the shelves based on marketing strategies, and manage the level of products in the stores. The people working on these activities may count the number of SKUs of each brand in a store to estimate product stocks and market share, and help to shape marketing strategies.

These estimations, though, are very time-consuming. Taking a photo and using an algorithm to count the SKUs on the shelves to calculate a brand’s market share could be a good solution.

To use an approach like that, the detection should run in real-time such that as soon as you point a phone camera to the shelf, the algorithm recognizes the brands and calculates the market shares. And, as the internet inside the stores is generally limited, the detection should work offline.

Example workflow

This post is going to show how to implement the real-time and offline image recognition solution to identify generic SKUs using the SKU110K dataset and the MobileNetV2 network.

Due to the lack of a public dataset with labeled SKUs of different brands, we’re going to create a generic algorithm, but all the instructions can be applied in a multiclass problem.

As with every machine learning flow, the project will be divided into four steps, as follows:

Object Detection Model Production Pipeline

Preparing the data

The first step to training a good model is to gather good data. As mentioned before, this solution is going to use a dataset of SKUs in different scenarios. The purpose of SKU110K was to create a benchmark for models capable of recognizing objects in densely packed scenes.

The dataset is provided in the Pascal VOC format and has to be converted to tf.record. The script to do the conversion is available here and the tf.record version of the dataset is also available in my project repository. As mentioned before, SKU110K is a large and very challenging dataset to work with. It contains many objects, often looking similar or even identical, positioned in close proximity.

Dataset characteristics (Gist link)

To work with this dataset, the neural network chosen has to be very effective in recognizing patterns and be small enough to run in real-time in TensorFlow.js.

Choosing the model

There are a variety of neural networks capable of solving the SKU detection problem. But, the architectures that easily achieve a high level of precision are very dense and don’t have reasonable inference times when converted to TensorFlow.js to run in real-time.

Because of that, the approach here is going to be to focus on optimizing a mid-level neural network to achieve reasonable precision on densely packed scenes while running inference in real time. Analyzing the TensorFlow 2.0 Detection Model Zoo, the challenge will be to try to solve the problem using the lightest single-shot model available: SSD MobileNet v2 320×320, which seems to fit the required criteria. The architecture is proven to be able to recognize up to 90 classes and can be trained to identify different SKUs.

Training the model

With a good dataset and the model selected, it’s time to think about the training process. TensorFlow 2.0 provides an Object Detection API that makes it easy to construct, train, and deploy object detection models. In this project, we’re going to use this API and train the model using a Google Colaboratory Notebook. The remainder of this section explains how to set up the environment, the model selection, and training. If you want to jump straight to the Colab Notebook, click here.

Setting up the environment

Create a new Google Colab notebook and select a GPU as the hardware accelerator:

Runtime > Change runtime type > Hardware accelerator: GPU

Clone, install, and test the TensorFlow Object Detection API:

Gist link

Next, download and extract the dataset using the following commands:

Gist link

Setting up the training pipeline

We’re ready to configure the training pipeline. TensorFlow 2.0 provides pre-trained weights for the SSD Mobilenet v2 320×320 on the COCO 2017 Dataset, and they are going to be downloaded using the following commands:

Gist link

The downloaded weights were pre-trained on the COCO 2017 Dataset, but the focus here is to train the model to recognize one class so these weights are going to be used only to initialize the network — this technique is known as transfer learning, and it’s commonly used to speed up the learning process.

The last step is to set up the hyperparameters on the configuration file that is going to be used during the training. Choosing the best hyperparameters is a task that requires some experimentation and, consequently, computational resources.

I took a standard configuration of MobileNetV2 parameters from the TensorFlow Models Config Repository and performed a sequence of experiments (thanks Google Developers for the free resources) to optimize the model to work with densely packed scenes on the SKU110K dataset. Download the configuration and check the parameters using the code below.

Gist link

Gist link

With the parameters set, start the training by executing the following command:

Gist link

To identify how well the training is going, we use the loss value. Loss is a number indicating how bad the model’s prediction was on the training samples. If the model’s prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples (Descending into ML: Training and Loss | Machine Learning Crash Course).

The training process was monitored through TensorBoard and took around 22 hours to finish on a 60 GB machine using an NVIDIA Tesla P4. The final losses can be checked below.

Total training loss

Validate the model

Now let’s evaluate the trained model using the test data:

Gist link

The evaluation was done across 2740 images and provides three metrics based on the COCO detection evaluation metrics: precision, recall, and loss (Classification: Precision and Recall | Machine Learning Crash Course). The same metrics are available via TensorBoard and can be analyzed more easily there:

%load_ext tensorboard
%tensorboard --logdir '/content/training/'

You can then explore all training and evaluation metrics.

Main evaluation metrics

Exporting the model

Now that the training is validated, it’s time to export the model. We’re going to convert the training checkpoints to a protobuf (pb) file. This file is going to have the graph definition and the weights of the model.

Gist link

As we’re going to deploy the model using TensorFlow.js and Google Colab has a maximum lifetime limit of 12 hours, let’s download the trained weights and save them locally. When running the command files.download("/content/saved_model.zip"), the Colab will prompt the file download automatically.

Gist link

Deploying the model

The model is going to be deployed in a way that anyone can open a PC or mobile camera and perform inference in real-time through a web browser. To do that, we’re going to convert the saved model to the TensorFlow.js layers format, load the model in a JavaScript application and make everything available on CodeSandbox.

Converting the model

At this point, you should have something similar to this structure saved locally:

├── inference-graph
│   └── saved_model
│       ├── assets
│       ├── saved_model.pb
│       └── variables
│           ├── variables.data-00000-of-00001
│           └── variables.index

Before we start, let’s create an isolated Python environment to work in an empty workspace and avoid any library conflict. Install virtualenv and then open a terminal in the inference-graph folder and create and activate a new virtual environment:

virtualenv -p python3 venv
source venv/bin/activate

Install the TensorFlow.js converter:

pip install tensorflowjs[wizard]

Start the conversion wizard:

tensorflowjs_wizard

Now, the tool will guide you through the conversion, providing explanations for each choice you need to make. The image below shows all the choices that were made to convert the model. Most of them are the standard ones, but options like the shard sizes and compression can be changed according to your needs.

To enable the browser to cache the weights automatically, it’s recommended to split them into shard files of around 4MB. To guarantee that the conversion is going to work, don’t skip the op validation either: not all TensorFlow operations are supported, so some models can be incompatible with TensorFlow.js. See this list for which ops are currently supported on the various backends that TensorFlow.js executes on, such as WebGL, WebAssembly, or plain JavaScript.

Model conversion using TensorFlow.js Converter (Full resolution image here)

If everything works well, you’re going to have the model converted to the TensorFlow.js layers format in the web_model directory. The folder contains a model.json file and a set of sharded weight files in binary format. The model.json file has both the model topology (aka “architecture” or “graph”: a description of the layers and how they are connected) and a manifest of the weight files (Lin, Tsung-Yi, et al). The web_model folder currently contains the files shown below:

└ web_model
├── group1-shard1of5.bin
├── group1-shard2of5.bin
├── group1-shard3of5.bin
├── group1-shard4of5.bin
├── group1-shard5of5.bin
└── model.json

Configuring the application

The model is ready to be loaded in JavaScript. I’ve created an application to perform inference directly from the browser. Let’s clone the repository to figure out how to use the converted model in real-time. This is the project structure:

├── models
│ ├── group1-shard1of5.bin
│ ├── group1-shard2of5.bin
│ ├── group1-shard3of5.bin
│ ├── group1-shard4of5.bin
│ ├── group1-shard5of5.bin
│ └── model.json
├── package.json
├── package-lock.json
├── public
│ └── index.html
├── README.MD
└── src
  ├── index.js
  └── styles.css

For the sake of simplicity, I have already provided a converted SKU-detector model in the models folder. However, let’s put the web_model generated in the previous section in the models folder and test it.

Next, install the http-server:

npm install http-server -g

Go to the models folder and run the command below to make the model available at http://127.0.0.1:8080. This is a good choice when you want to keep the model weights in a safe place and control who can request inferences from it. The -c1 parameter disables caching, and the --cors flag enables cross-origin resource sharing, allowing the hosted files to be used by client-side JavaScript on a given domain.

http-server -c1 --cors .

Alternatively, you can upload the model files somewhere else, even on a different domain if needed. In my case, I chose my own GitHub repo and referenced the model.json folder URL in the load_model function as shown below:

async function load_model() {
  // It's possible to load the model locally or from a repo.
  // Load from localhost locally:
  const model = await loadGraphModel("http://127.0.0.1:8080/model.json");
  // Or load from another domain using a folder that contains model.json.
  // const model = await loadGraphModel("https://github.com/hugozanini/realtime-sku-detection/tree/web");
  return model;
}

This is a good option because it gives more flexibility to the application and makes it easier to run on public web servers.

Pick one of the methods to load the model files in the load_model function (lines 10–15 in the file src/index.js).

When loading the model, TensorFlow.js will perform the following requests:

GET /model.json
GET /group1-shard1of5.bin
GET /group1-shard2of5.bin
GET /group1-shard3of5.bin
GET /group1-shard4of5.bin
GET /group1-shard5of5.bin

Publishing in CodeSandbox

CodeSandbox is a simple tool for creating web apps where we can upload the code and make the application available for everyone on the web. By uploading the model files to a GitHub repo and referencing them in the load_model function, we can simply log into CodeSandbox, click on New project > Import from GitHub, and select the app repository.

Wait a few minutes for the packages to install and your app will be available at a public URL that you can share with others. Click on Show > In a new window and a tab will open with a live preview. Copy this URL and paste it into any web browser (PC or mobile) and your object detection app will be ready to run. A ready-to-use project can be found here as well if you prefer.

Conclusion

Besides the precision, an interesting part of these experiments is the inference time: everything runs in real time in the browser via JavaScript. SKU detection models that run in the browser, even offline and with few computational resources, are a must for many consumer packaged goods companies, as well as for other industries.

Enabling a machine learning solution to run on the client side is a key step to guarantee that models are used effectively at the point of interaction, with minimal latency, and that they solve problems where they happen: right in the user’s hand.

Deep learning should not be costly, nor limited to research; it should serve real-world use cases, and JavaScript is great for production deployments. I hope this article serves as a basis for new projects involving computer vision and TensorFlow, and for an easier flow between Python and JavaScript.

If you have any questions or suggestions you can reach me on Twitter.

Thanks for reading!

Acknowledgments

I’d like to thank the Google Developers Group, for providing all the computational resources for training the models, and the authors of the SKU 110K Dataset, for creating and open-sourcing the dataset used in this project.

Read More

What’s new in TensorFlow 2.9?

Posted by Goldie Gadde and Douglas Yarrington for the TensorFlow team

TensorFlow 2.9 has been released! Highlights include performance improvements with oneDNN, and the release of DTensor, a new API for model distribution that can be used to seamlessly move from data parallelism to model parallelism.

We’ve also made improvements to the core library, including Eigen and tf.function unification, deterministic behavior, and new support for Windows’ WSL2. Finally, we’re releasing new experimental APIs for tf.function retracing and Keras Optimizers. Let’s take a look at these new and improved features.

Improved CPU performance: oneDNN by default

We have worked with Intel to integrate the oneDNN performance library with TensorFlow to achieve top performance on Intel CPUs. Since TensorFlow 2.5, TensorFlow has had experimental support for oneDNN, which could provide up to a 4x performance improvement. In TensorFlow 2.9, we are turning on oneDNN optimizations by default on Linux x86 packages and for CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, AMX, and others, which are found on Intel Cascade Lake and newer CPUs.

Users running TensorFlow with oneDNN optimizations enabled might observe slightly different numerical results from when the optimizations are off. This is because floating-point round-off approaches and order differ, and can create slight errors. If this causes issues for you, turn the optimizations off by setting TF_ENABLE_ONEDNN_OPTS=0 before running your TensorFlow programs. To enable or re-enable them, set TF_ENABLE_ONEDNN_OPTS=1 before running your TensorFlow program. To verify that the optimizations are on, look for a message beginning with "oneDNN custom operations are on" in your program log. We welcome feedback on GitHub and the TensorFlow Forum.
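If you prefer to control this from Python rather than your shell, one minimal approach (a sketch) is to set the variable before TensorFlow is imported, since the flag is read when the library initializes:

# Set the flag before importing TensorFlow; "0" disables the oneDNN optimizations, "1" enables them.
import os
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"

import tensorflow as tf
# When the optimizations are active, the program log contains "oneDNN custom operations are on".
print(tf.__version__)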

Model parallelism with DTensor

DTensor is a new TensorFlow API for distributed model processing that allows models to seamlessly move from data parallelism to single program multiple data (SPMD) based model parallelism, including spatial partitioning. This means you have tools to easily train models where the model weights or inputs are so large they don’t fit on a single device. (If you are familiar with Mesh TensorFlow in TF1, DTensor serves a similar purpose.)

DTensor is designed with the following principles at its core:

  • A device-agnostic API: This allows the same model code to be used on CPU, GPU, or TPU, including models partitioned across device types.
  • Multi-client execution: Removes the coordinator and leaves each task to drive its locally attached devices, allowing a model to scale with no impact on startup time.
  • A global perspective vs. per-replica: Traditionally with TensorFlow, distributed model code is written around replicas, but with DTensor, model code is written from the global perspective and per replica code is generated and run by the DTensor runtime. Among other things, this means no uncertainty about whether batch normalization is happening at the global level or the per replica level.

We have developed several introductory tutorials on DTensor, from DTensor concepts to training DTensor ML models with Keras.
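To give a flavor of the API, here is a minimal sketch in the spirit of the DTensor concepts tutorial: it splits one physical CPU into eight logical devices, builds a 1D mesh, and creates a tensor whose first axis is sharded across that mesh. Treat the exact calls as illustrative rather than definitive and see the tutorials for the full picture.

import tensorflow as tf
from tensorflow.experimental import dtensor

# Split the single physical CPU into 8 logical devices so the example runs anywhere.
phys = tf.config.list_physical_devices("CPU")
tf.config.set_logical_device_configuration(
    phys[0], [tf.config.LogicalDeviceConfiguration()] * 8)

# A 1D mesh named "batch" over the 8 logical CPU devices.
mesh = dtensor.create_mesh([("batch", 8)], devices=[f"CPU:{i}" for i in range(8)])

# Shard the first tensor axis over "batch"; keep the second axis replicated.
layout = dtensor.Layout(["batch", dtensor.UNSHARDED], mesh)
ones = dtensor.call_with_layout(tf.ones, layout, shape=(8, 4))
print(dtensor.fetch_layout(ones))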

TraceType for tf.function

We have revamped the way tf.function retraces to make it simpler, predictable, and configurable.

All arguments of tf.function are assigned a tf.types.experimental.TraceType. Custom user classes can declare a TraceType using the Tracing Protocol (tf.types.experimental.SupportsTracingProtocol).

The TraceType system makes it easy to understand retracing rules. For example, subtyping rules indicate what type of arguments can be used with particular function traces. Subtyping also explains how different specific shapes are joined into a generic shape that is their supertype, to reduce the number of traces for a function.

To learn more, see the new APIs for tf.types.experimental.TraceType, tf.types.experimental.SupportsTracingProtocol, and the reduce_retracing parameter of tf.function.
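As a small illustration of reduce_retracing (a sketch; the Python-side counter is just a convenient way to observe tracing), calling the function with several different shapes produces fewer traces than calls because the shapes are generalized to a common supertype:

import tensorflow as tf

trace_count = 0

@tf.function(reduce_retracing=True)
def double(x):
    # The Python body only runs when tf.function traces, so this counts traces.
    global trace_count
    trace_count += 1
    return x * 2

double(tf.constant([1.0]))
double(tf.constant([1.0, 2.0]))
double(tf.constant([1.0, 2.0, 3.0]))
print(trace_count)  # fewer traces than the three calls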

Support for WSL2

The Windows Subsystem for Linux lets developers run a Linux environment directly on Windows, without the overhead of a traditional virtual machine or dual boot setup. TensorFlow now supports WSL2 out of the box, including GPU acceleration. Please see the documentation for more details about the requirements and how to install WSL2 on Windows.

Deterministic behavior

The API tf.config.experimental.enable_op_determinism makes TensorFlow ops deterministic.

Determinism means that if you run an op multiple times with the same inputs, the op returns the exact same outputs every time. This is useful for debugging models, and if you train your model from scratch several times with determinism, your model weights will be the same every time. Normally, many ops are non-deterministic due to the use of threads within ops which can add floating-point numbers in a nondeterministic order.

TensorFlow 2.8 introduced an API to make ops deterministic, and TensorFlow 2.9 improved determinism performance in tf.data in some cases. If you want your TensorFlow models to run deterministically, just add the following to the start of your program:

tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()

The first line sets the random seed for Python, NumPy, and TensorFlow, which is necessary for determinism. The second line makes each TensorFlow op deterministic. Note that determinism in general comes at the expense of lower performance and so your model may run slower when op determinism is enabled.
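As a quick sanity check (a minimal sketch, not an official recipe), you can run the same op twice on identical inputs and confirm that the outputs match exactly:

import tensorflow as tf

tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()

x = tf.random.normal((1024, 1024))
# With op determinism enabled, repeated runs on identical inputs match bit for bit.
a = tf.matmul(x, x).numpy()
b = tf.matmul(x, x).numpy()
assert (a == b).all()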

Optimized Training with Keras

In TensorFlow 2.9, we are releasing a new experimental version of the Keras Optimizer API, tf.keras.optimizers.experimental. The API provides a more unified and expanded catalog of built-in optimizers which can be more easily customized and extended.

In a future release, tf.keras.optimizers.experimental.Optimizer (and subclasses) will replace tf.keras.optimizers.Optimizer (and subclasses), which means that workflows using the legacy Keras optimizer will automatically switch to the new optimizer. The current (legacy) tf.keras.optimizers.* API will still be accessible via tf.keras.optimizers.legacy.*, such as tf.keras.optimizers.legacy.Adam.
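For example, a minimal sketch of opting in to the new optimizer today (the model and hyperparameters are placeholders):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])

# Opt in to the new experimental optimizer explicitly.
opt = tf.keras.optimizers.experimental.Adam(learning_rate=1e-3)
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")
# Once the switch lands, the old implementation stays available as
# tf.keras.optimizers.legacy.Adam if you need to pin the legacy behavior.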

Here are some highlights of the new optimizer class:

  • Incrementally faster training for some models.
  • Easier to write customized optimizers.
  • Built-in support for moving average of model weights (“Polyak averaging”).

Most users will not need to take any action. But if you have an advanced workflow that falls into one of the following cases, please make the corresponding changes:

Use Case 1: You implement a customized optimizer based on the Keras optimizer

If this applies to you, please first check whether it is possible to change your dependency to tf.keras.optimizers.experimental.Optimizer. If for any reason you decide to stay with the old optimizer (we discourage it), you can change your optimizer to tf.keras.optimizers.legacy.Optimizer to avoid being automatically switched to the new optimizer in a later TensorFlow version.

Use Case 2: Your work depends on third-party Keras-based optimizers (such as tensorflow_addons)

Your work should run successfully as long as the library continues to support the specific optimizer. However, if the library maintainers do not take action to accommodate the Keras optimizer change, your work will error out. So please stay tuned for the third-party library’s announcements, and file a bug with the Keras team if your work breaks due to an optimizer malfunction.

Use Case 3: Your work is based on TF1

First of all, please try migrating to TF2. It is worth it, and may be easier than you think! If for any reason migration is not going to happen soon, please replace your tf.keras.optimizers.XXX with tf.keras.optimizers.legacy.XXX to avoid being automatically switched to the new optimizer.

Use Case 4: Your work has customized gradient aggregation logic

Usually this means you are doing gradient aggregation outside the optimizer and calling apply_gradients() with experimental_aggregate_gradients=False. We changed the argument name, so please change your optimizer to tf.keras.optimizers.experimental.Optimizer and set skip_gradients_aggregation=True instead. If it errors out after making this change, please file a bug with the Keras team.
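A rough sketch of what that change looks like; this is a single-device toy example, so the custom aggregation is only stood in for by an ordinary gradient computation, and the model and data are placeholders:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.experimental.SGD(learning_rate=0.1)

x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
grads = tape.gradient(loss, model.trainable_variables)

# Assume grads were already aggregated across replicas by your own logic,
# so tell the new optimizer not to aggregate them again
# (this replaces experimental_aggregate_gradients=False on the legacy optimizer).
optimizer.apply_gradients(zip(grads, model.trainable_variables),
                          skip_gradients_aggregation=True)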

Use Case 5: Your work has direct calls to deprecated optimizer public APIs

Please check whether your method call has a match here, and if so, change your optimizer to tf.keras.optimizers.experimental.Optimizer. If for any reason you want to keep using the old optimizer, change it to tf.keras.optimizers.legacy.Optimizer.

Next steps

Check out the release notes for more information. To stay up to date, you can read the TensorFlow blog, follow twitter.com/tensorflow, or subscribe to youtube.com/tensorflow. If you’ve built something you’d like to share, please submit it for our Community Spotlight at goo.gle/TFCS. For feedback, please file an issue on GitHub or post to the TensorFlow Forum. Thank you!

Read More

TensorFlow Lite for education and makers

Posted by Scott Main, AIY Projects and Coral

Back in 2017, we began AIY Projects to make do-it-yourself artificial intelligence projects accessible to anybody. Our first project was the AIY Voice Kit, which allows you to build your own intelligent device that responds to voice commands. Then we released the AIY Vision Kit, which can recognize objects seen by its camera using on-device TensorFlow models. We were amazed by the projects people built with these kits and thrilled to see educational programs use them to introduce young engineers to the possibilities of computer science and machine learning (ML). So I’m excited to continue our mission to bring machine learning to everyone with the more powerful and more customizable AIY Maker Kit.

Making ML accessible to all

The Voice Kit and Vision Kit are a lot of fun to put together and they include great programs that demonstrate the possibilities of ML on a small device. However, they don’t provide the tools or procedures to help beginners achieve their own ML project ideas. When we released those kits in 2017, it was actually quite difficult to train an ML model, and getting a model to run on a device like a Raspberry Pi was even more challenging. Nowadays, if you have some experience with ML and know where to look for help, it’s not so surprising that you can train an object detection model in your web browser in less than an hour, or that you can run a pose detection model on a battery-powered device. But if you don’t have any experience, it can be difficult to discover the latest ML tools, let alone get started with them.

We intend to solve that with the Maker Kit. With this kit, we’re not offering any new hardware or ML tools; we’re offering a simplified workflow and a series of tutorials that use the latest tools to train TensorFlow Lite models and execute them on small devices. So it’s all existing technology, but better packaged so beginners can stop searching and start building incredible things right away.

Simplified tools for success

The material we’ve collected and created for the Maker Kit offers an end-to-end experience that’s ideal for educational programs and users who just want to make something with ML as fast as possible.

The hardware setup requires a Raspberry Pi, a Pi Camera, a USB microphone, and a Coral USB Accelerator so you can execute advanced vision models at high speed on the Coral Edge TPU. If you want your hardware in a case, we offer two DIY options: a 3D-printed case design or a cardboard case you can build using materials at home.

Once it’s booted up with our Maker Kit system image, just run some of our code examples and follow our coding tutorials. You’ll quickly discover how easy it is to accomplish amazing things with ML that were recently considered accessible only to experts, including object detection, pose classification, and speech recognition.

Our code examples use some pre-trained models and you can get more models that are accelerated on the Edge TPU from the Coral models library. However, training your own models allows you to explore all new project ideas. So the Maker Kit also offers step-by-step tutorials that show you how to collect your own datasets and train your own vision and audio models.

Last but not least, we want you to spend nearly all your time writing the code that’s unique to your project. So we created a Python library that reduces the amount of code needed to perform an inference down to a tiny part of your project. For example, this is how you can run an object detection model and draw labeled bounding boxes on a live camera feed:

from aiymakerkit import vision
from aiymakerkit import utils
import models

detector = vision.Detector(models.OBJECT_DETECTION_MODEL)
labels = utils.read_labels_from_metadata(models.OBJECT_DETECTION_MODEL)

for frame in vision.get_frames():
    objects = detector.get_objects(frame, threshold=0.4)
    vision.draw_objects(frame, objects, labels)

Our intent is to hide the code you don’t absolutely need. You still have access to structured inference results and program flow, but without any boilerplate code to handle the model.

This aiymakerkit library is built upon TensorFlow Lite and it’s available on GitHub, so we invite you to explore the innards and extend the Maker Kit API for your projects.

Getting started

We created the Maker Kit to be fully customizable for your projects. So rather than provide all the materials in a box with a predetermined design, we designed it with hardware that’s already available in stores (listed on our website) and with optional instructions to build your own case.

To get started, visit our website at g.co/aiy/maker, gather the required materials, flash our system image, and follow our programming tutorials to start exploring the possibilities. With this head start toward building smart applications that run entirely on an embedded system, we can’t wait to see what you will create.

Read More

AI and Machine Learning @ I/O Recap

Posted by TensorFlow Team

Google I/O 2022 was a major milestone in the evolution of AI and Machine Learning for developers. We’re really excited about the potential for developers using our technologies and Machine Learning to build intelligent solutions, and we believe that 2022 is the year when AI and ML become part of every developer’s toolbox.

At the I/O keynotes we showed our fully open source ecosystem that takes you from end to end with Machine Learning. There are developer tools for managing data, training your models, and deploying them to a variety of surfaces from global scale cloud all the way down to tiny microcontrollers…and of course ways to monitor and maintain your systems with MLOps. All of this comes with a common set of accelerated hardware for training and inference, along with open source tooling for responsible AI end-to-end.

You can get a tour through this ecosystem in the keynote “AI and Machine Learning updates for Developers”.

Responsible AI review processes: From a developer’s point of view

We can all agree that responsible and ethical AI is important, but when you want to build responsibly, you need tooling. We could, and will, create a whole video series about these tools, but the great content to watch right now is the talk on the Responsible AI review process. Googlers who worked on projects like the Covid-19 public forecasts or the Celebrity Recognition APIs will take you step-by-step through their thought process and how the tools lined up to help them build more responsibly and thoughtfully. You’ll also learn about some of the new releases in Responsible AI tools, such as the Counterfactual Logit Pairing library.

Adding machine learning to your developer toolbox

If you’re just getting started on your journey and you want ML to be a part of your toolbox, you probably have a million questions. Follow a developer’s journey through the best offerings, from a turnkey API that can solve basic problems fast, to custom models that can be tuned and deployed.

TensorFlow.js: From prototype to production, what’s new in 2022?

If you’re a web developer, there’s a whole bunch of new updates, from the announcement of a new set of courses that will take you from first principles through a deep dive of what’s possible, to lots of new models available to web devs. These include a selfie depth estimation model that can be used for cool things like a 3D effect in your pictures without needing any kind of extra sensor. You’ll also see 3D pose estimation that runs at a high FPS to get real-time results, allowing you to do things like having a fully animated character follow your body motion. All in the browser!

Deploy a custom ML model to mobile

If you want to build better mobile apps with AI and Machine Learning, you probably need to understand the ins and outs of getting models to execute on Android or iOS devices, including shrinking them and optimizing them to be power friendly. Supercharge your model with new releases from the TensorFlow Lite team that let you quantize, debug, and accelerate your model on CPU or delegated GPUs, and a whole lot more.

Further on the edge with Coral Dev Board Micro

Speaking of acceleration, this year at I/O we introduced the Coral Dev Board Micro. This is a new microcontroller-class device with an on-board Edge TPU that’s powerful enough to run multiple models in tandem. The Coral team has also updated their catalog of pre-trained models, which now includes over 40 models available for you to use on embedded systems out of the box!

Tips and tricks for distributed large model training

On the other side of the spectrum, if you want to train large models, you’ll need to understand how to shard training and data across multiple processors or cores. We’ve released lots of new guidance and updates for model and data parallelism. You can learn all about them in this talk, including lessons learned from Google researchers in building language models.

Easier data preprocessing with Keras

Of course, not all data is big data, and if you’re not building giant models, you still need to be able to manage your data. Often this is where devs write the most code for ML, so we want to highlight some ways of making this easier, in particular with Keras. Keras’s new preprocessing layers not only make vectorization and augmentation much easier, but also allow for precomputation to make your training more efficient by reducing idle time. Learn about data preprocessing from the creator of Keras!

An introduction to MLOps with TFX

Finally, let’s not forget MLOps and TFX, the open source, end-to-end pipeline management tool. Check out the talk from Robert Crowe, who will help you understand everything from why you need MLOps to managing your process through managing change. You’ll see the component model in TFX and get an introduction to the new TFX-Addons community that’s focused on building new components. Check it all out in this talk!

I/O wasn’t just about new releases and talks! If you are inspired by any of what you saw, we also have workshops and learning paths you can dig into to learn in more detail.

Full playlist to all AI/ML talks and workshops.

That’s it for this roundup of AI and ML at Google I/O 2022. We hope you’ve enjoyed it, and we’d love to hear your feedback when you explore the content. Please drop by the TensorFlow Forum and let us know what you think!

Read More