Optimizing style transfer to run on mobile with TFLite

Optimizing style transfer to run on mobile with TFLite

Posted by Khanh LeViet and Luiz Gustavo Martins, Developer Advocates

Neural style transfer is an optimization technique used to take two images, a content image (such as a building) and a style image (such as artwork by an iconic painter), and blend them together so the output image looks like the content image “painted” in the style of the reference image. Today, we are excited to share a pre-trained style transfer TensorFlow Lite model that is optimized for mobile, and an Android and an iOS sample app that uses the model to stylize any images.

In this article, we will walk you through the journey of optimizing the large TensorFlow model for mobile deployment, and how to use it efficiently in a mobile app with TensorFlow Lite. We hope that you can use our pre-trained style transfer model or leverage our insights for your use cases.

Background

An example of style transfer

Style transfer was first published in A Neural Algorithm of Artistic Style. The original technique, however, was computationally expensive and it can take several seconds to stylize an image even on high-end GPUs. Subsequent work by several authors (for example) showed how to speed up style transfer.

After evaluating several model architectures, we decided to start with a pre-trained arbitrary style transfer model from Magenta for our sample app. The model can take any content and style image as input, then use a feedforward neural network to generate a stylized output image. This model allows much faster style transfer compared to the technique in Gatys’s paper, but it is still quite large (44 MB) and slow (2340 ms on Pixel 4 CPU). Therefore, we need to optimize the model to make it suitable to use in mobile applications. This article shares our experience doing so, with resources you can take advantage of in your work.

Optimizing the model architecture

The structure of our style transfer model

The Magenta’s arbitrary style transfer model consists of two subnetworks:

  • Style prediction network: converts the style image to a style embedding vector.
  • Style transform network: applies the style embedding vector on the content image to generate a stylized image.

Magenta’s style prediction network has an InceptionV3 backbone, so we replaced it with a MobileNetV2 backbone, which is optimized for mobile. The style transform network consists of several convolution layers. We applied the width multiplier idea from MobileNet, scaling down the number of output channels of all convolution layers by a factor of 4.
Then, we had to decide how to train our model. We experimented with multiple options: training the mobile model from scratch or distilling from the pre-trained Magenta’s model. We found that fixing the weights of MobileNetV2 while optimizing other parameters from scratch gave the best result.
We were able to achieve a similar level of style and content loss, while significantly shrinking and speeding up the model.

* Benchmarked on Pixel 4 CPU using TensorFlow Lite with 2 threads, April 2020.
* See this paper for more details about the definition of loss function used in this style transfer model

Quantization

Once we have settled on the model architecture, we continue to shrink our mobile model further with quantization using the TensorFlow Model Optimization Toolkit. This is an important technique that is applicable for most mobile deployment of TensorFlow models, as it can shrink the model size up to 4X and speed up model inference with insignificant quality trade-off.
Among the quantization options available that TensorFlow provides, we decided to use post-training integer quantization because it has the right balance of simplicity and model quality. We only needed to provide a small portion of our training dataset when converting the TensorFlow model to TensorFlow Lite.
After quantization, our model is more than an order smaller and faster than the original model, while maintaining the same level of style and content loss.

* Benchmarked on Pixel 4 CPU using TensorFlow Lite with 2 threads, April 2020.

Deployment to mobile

We implemented an Android app to demonstrate how to use the style transfer model. The app takes a style image, a content image, and outputs an image that mixes the style and content of the input images.
We use the phone’s camera to capture the content images with the Camera2 API and provide a set of famous paintings to be used as style images. As mentioned above, there are two steps to apply a style to a content image. Firstly, we extract the style as an array floats using the style prediction network. Then we apply this style to the content image using the style transform network.
In order to achieve the best performance on both CPU and GPU, we created two sets of TensorFlow Lite models optimized for each chip. We use the int8 quantized model for CPU inference, and float16 quantized model for GPU inference. GPU generally achieves better performance than CPU but it currently only supports float models, which are larger than int8 quantized models. Here is how the int8 and the float16 model perform.

* Benchmarked on Pixel 4 using TensorFlow Lite, April 2020.

Another possible performance gain is to cache the results of the style prediction network if you only plan to support a fixed set of style images in your mobile app. This will make your app smaller as you do not need to include the style prediction network, which accounts for 91% of the total network size. This is the main reason why the process is splitted into two models instead of only one.
The sample can be found on GitHub and the main class applying style is the StyleTransferModelExecutor.
It is important that we do not run style transfer on the UI thread as it is computational expensive. We instead use the ViewModel class from AndroidX and a Coroutine to run it on a dedicated background thread and easily update the view. Besides, when running a model using GPU delegate, TF Lite interpreter initialization, GPU delegate initialization and inference have to run all on the same thread.

Style transfer in production

The Google Arts & Culture app recently added Art Transfer that uses TensorFlow Lite to run style transfer on-device. The model used is very similar to the one above but prioritizes quality over speed and model size. Try it out if you are interested in seeing style transfer in production.

Your Turn

If you want to add Style Transfer to your own app, you can start by downloading the mobile sample. Both model versions, the float16 (predict network, transform network) and the int8 quantized version (predict network, transform network), are available on TensorFlow Hub. We can’t wait to see what you can create! Don’t forget to share with us your creations.

Resources

Running machine learning models on-device has the benefits of keeping the users data private while enabling features with low latency.
In this post, we have shown that directly converting a TensorFlow model to TensorFlow Lite might be just the first step. To achieve good performance, developers should optimize their model with quantization, and find the right trade-off between model quality, model size, and inference time.
We used the resources below to create our model. They might be also applicable to your on-device machine learning use cases.

  • Magenta model repository
    Magenta is an open source project powered by TensorFlow. It uses machine learning to make music and art. There are many models that can be converted to TensorFlow Lite, including this style transfer model.
  • TensorFlow Model Optimization Toolkit
    Model Optimization Toolkit provides multiple methods to optimize your model, including quantization and pruning.
  • TensorFlow Lite delegates
    TensorFlow Lite can leverage many different types of hardware accelerator available on devices, including GPUs and DSPs, to speed up model inference.

Read More

Introducing the new TensorFlow Profiler

Introducing the new TensorFlow Profiler

Posted by Anirudh Sriram, Technical Writer, and Gal Oshri, Product Manager

Performance is a key consideration of successful ML research and production solutions. Faster model training leads to faster iterations and reduced overhead. It is sometimes an essential requirement to make a particular ML solution feasible.

However, it is not always clear what should be optimized. Is there an issue with a specific operation (op), or the input pipeline?

To help answer this, we have developed an extensive set of tools for TensorFlow performance profiling. Beyond the ability to capture and investigate numerous aspects of a profile, the tools offer guidance on how to resolve performance bottlenecks (e.g. input-bound programs).

These tools are used by low-level experts improving TensorFlow’s infrastructure, as well as engineers in Google’s most popular products to optimize their model performance. We want to enable the broader community to take advantage of the tools used at Google for performance profiling. That is why we recently open sourced the new TensorFlow Profiler.

TensorFlow Profiler overview page

What is the TensorFlow Profiler?

The TensorFlow Profiler (or the Profiler) provides a set of tools that you can use to measure the training performance and resource consumption of your TensorFlow models. This new version of the Profiler is integrated into TensorBoard, and builds upon existing capabilities such as the Trace Viewer.

The Profiler has the following new profiling tools available:

  • Overview Page: Provides a top-level view of model performance and recommendations to optimize performance
  • Input Pipeline Analyzer: Analyzes your model’s data input pipeline for bottlenecks and recommends improvements to improve performance
  • TensorFlow Stats: Displays performance statistics for every TensorFlow operation executed during the profiling session
  • GPU Kernel Stats: Displays performance statistics and the originating operation for every GPU accelerated kernel

Check out the Profiler guide in the TensorFlow documentation to learn more about these tools.

Getting started

The best way to get started with the Profiler is to follow the Colab tutorial here. We will cover a few of the important steps and insights in the blog post. First, we install the Profiler plugin for TensorBoard:

pip install -U tensorboard_plugin_profile

This adds the full Profiler capabilities to our TensorBoard installation. Next, we ensure that our model training captures a profile. In this case, we will use the TensorBoard callback in Keras:

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir = logs,
profile_batch = '500,510')

We can choose which batches to profile with the profile_batch parameter. This enables us to choose the number of steps to capture (recommended to be no more than 10). It also helps us skip the first few batches to avoid inaccuracy due to initialization overhead. There are other methods for capturing a profile, described here. We now start TensorBoard with the following command:

tensorboard --logdir {log directory}    # in terminal
%tensorboard --logdir {log directory} # in Colab

After clicking on Profile, we see the overview page: This immediately gives us an indication of our program’s performance. Besides a useful summary, we see a recommendation telling us that our program is input-bound (meaning our accelerator is wasting time waiting for input). This is a really common problem.
By following the instructions in the tutorial, we can bring our average step time from ~30ms to ~3ms. That’s a 10x improvement! While this is a toy example, it is common to hear from engineers and researchers at Google that they managed to improve their performance by significant factors.

Recommendations

Performance optimization is an iterative process and can sometimes be frustrating as it is tricky to pinpoint the exact location of the bottlenecks in your program. Not only can the Profiler tell you where your program has bottlenecks, it can often also tell you what you can do to resolve them and make your code execute faster. Following the recommendations provided can shorten the overall time taken to optimize your program.
When you open TensorBoard to view the profiling results, the Overview page provides code optimization recommendations below the Step time graph. One of the most common reasons for slow code execution is an improperly configured data input pipeline. Leverage the capabilities of the Input pipeline analyzer to effectively identify and eliminate bottlenecks in your data input pipeline. Read the best practices section of the Profiler guide to learn more about other strategies you can employ to get optimal performance.

More resources

Check out these resources to learn more:

What’s next for the TensorFlow Profiler?

In addition to addressing feedback, we are expanding the profiler’s capabilities. A few areas we are currently working on:

  • Memory Profiler: View memory usage over time and the associated op/training step.
  • Keras Analysis: Enable linking the information in the profiler to Keras. This enables, for example, identifying which Keras layers correspond to the ops shown in the trace viewer.
  • Multiworker GPU Analysis: Enable profiling multiple GPU workers and aggregate the results. Analyze the hotspot and the communication across workers.

We are excited to continue bringing the tools used at Google to improve ML performance to the broader community. If there are specific capabilities that would help you the most, or to report a bug, feel free to open an issue here!

Read More

How TensorFlow Lite helps you from prototype to product

How TensorFlow Lite helps you from prototype to product

Posted by Khanh LeViet, Developer Advocate

TensorFlow Lite is the official framework to run inference with TensorFlow models on edge devices. TensorFlow Lite is deployed on more than 4 billions edge devices worldwide, supporting Android, iOS, Linux-based IoT devices and microcontrollers.

Since first launch in late 2017, we have been improving TensorFlow Lite to make it robust while keeping it easy to use for all developers – from the machine learning experts to the mobile developers who just started learning about machine learning.

In this blog, we will highlight recent launches that made it easier for you to go from prototyping an on-device use case to deploying in production.
If you prefer a video format, check out this talk from TensorFlow DevSummit 2020.

Prototype: jump-start with state-of-the-art models

As machine learning is a very fast-moving field, it is very important to be able to know what is possible with current technologies before investing resources into building a feature. We have a repository of pretrained models and sample applications that implement the models so that you can try out TensorFlow Lite models on real devices without writing any code. Then, you can quickly integrate the models into your application to prototype and test how your user experiences will be like before spending time on training your own model.

We have published several new pretrained models, including a question & answer model and a style transfer model.

We are also committed to bringing more state-of-the-art models from research teams to TensorFlow Lite. Recently we have enabled 3 new model architectures: EfficientNet-Lite (paper), MobileBERT (paper) and ALBERT-Lite (paper).

  • EfficientNet-Lite is a novel image classification model that achieves state-of-the-art accuracy with an order of magnitude of fewer computations and parameters. It is optimized for TensorFlow Lite, supporting quantization with negligible accuracy loss and fully supported by the GPU delegate for faster inference. Find out more in our blog post.
    Benchmark on Pixel 4 CPU, 4 Threads, March 2020
  • MobileBERT is an optimized version of the popular BERT (paper) model that achieved state-of-the-art accuracy on a range of NLP tasks, including question and answer, natural language inference and others. MobileBERT is about 4x faster and smaller than BERT but retains similar accuracy.
  • ALBERT is another light-weight version of the BERT that was optimized for model size while retaining the same accuracy. ALBERT-Lite is the TensorFlow Lite compatible version of ALBERT, which is 6x smaller than BERT, or 1.5x smaller than MobileBERT, while the latency is on par with BERT.
Benchmark on Pixel 4 CPU, 4 Threads, March 2020
Model hyper parameters: Sequence length 128, Vocab size 30K

Develop model: without ML expertise, create models for your dataset

When bringing state-of-the-art research models to TensorFlow Lite, we also want to make it easier for you to customize these models to your own use cases. We are excited to announce TensorFlow Lite Model Maker, an easy-to-use tool to adapt state-of-the-art machine learning models to your dataset with transfer learning. It wraps the complex machine learning concepts with an intuitive API, so that everyone can get started without any machine learning expertise. You can train a state-of-the-art image classification with only 4 lines of code:

data = ImageClassifierDataLoader.from_folder('flower_photos/')
model = image_classifier.create(data)
loss, accuracy = model.evaluate()
model.export('flower_classifier.tflite', 'flower_label.txt', with_metadata=True)

Model Maker supports many state-of-the-art models that are available on TensorFlow Hub, including the EfficientNet-Lite models. If you want to get higher accuracy, you can switch to a different model architecture by changing just one line of code while keeping the rest of your training pipeline.

# EfficinetNet-Lite2.
model = image_classifier.create(data, efficientnet_lite2_spec)

# ResNet 50.
model = image_classifier.create(data, resnet_50_spec)

Model Maker currently supports two use cases: image classification (tutorial) and text classification (tutorial), with more computer vision and NLP use cases coming soon.

Develop model: attach metadata for seamless model exchange

The TensorFlow Lite file format has always had the input/output tensor shape in its metadata. This works well when the model creator is also the app developer. However, as the on-device machine learning ecosystem grows, these tasks are increasingly performed by different teams within an organization or even between organizations. To facilitate these model knowledge exchanges, we have added new fields in the metadata. They fall into two broad categories:

  1. Machine-readable parameters – e.g. normalization parameters such as mean and standard deviation, category label files. These parameters can be read by other systems so wrapper code can be generated. You can see an example of this in the next section.
  2. Human-readable parameters – e.g. model description, model license. This can provide the app developer using the model crucial information on how to use the model correctly – are there strengths or weaknesses they should be aware of? Also, fields like licenses can be critical in deciding whether a model can be used. Having this attached to the model significantly reduces the barrier of adoption.

To supercharge this effort, models created by TensorFlow Lite Model Maker and image related TensorFlow Lite models on TensorFlow Hub already have metadata attached to it. If you are creating your own model, you can attach metadata to make sharing models easier.

# Creates model info.
model_meta = _metadata_fb.ModelMetadataT()
model_meta.name = "MobileNetV1 image classifier"
model_meta.description = ("Identify the most prominent object in the "
"image from a set of 1,001 categories such as "
"trees, animals, food, vehicles, person etc.")
model_meta.version = "v1"
model_meta.author = "TensorFlow"
model_meta.license = ("Apache License. Version 2.0 "
"http://www.apache.org/licenses/LICENSE-2.0.")
# Describe input and output tensors
# ...

# Writing the metadata to your model
b = flatbuffers.Builder(0)
b.Finish(
model_meta.Pack(b),
_metadata.MetadataPopulator.METADATA_FILE_IDENTIFIER)
metadata_buf = b.Output()
populator = _metadata.MetadataPopulator.with_model_file(model_file)
populator.load_metadata_buffer(metadata_buf)
populator.load_associated_files(["your_path_to_label_file"])
populator.populate()

For a complete example of how we populate the metadata for MobileNet v1, please refer to this guide.

Develop app: automatically generate code from model

Instead of copy and pasting error-prone boilerplate code to transform typed objects such as Bitmap to ByteArray to feed to TensorFlow Lite interpreter, a code generator can generate the wrapper code ready for integration using the machine-readable parts of the metadata.
You can use our first code generator build for Android to generate model wrappers. We are also working on integrating this tool into Android Studio.

Develop app: discover performance with the benchmark and profiling tools

Once a model is created, we would like to check how it performs on mobile devices. TensorFlow Lite provides benchmark tools to measure model performance of models. We have added support for running benchmarks with all runtime options, including running models on GPU or other supported hardware accelerators, specifying the number of threads and more. You can also get inference latency breakdown to the granularity of a single operation to identify the most time consuming operations and optimize your model inference.
After integrating a model to your application, you may encounter other performance issues so that you may resort to platform-provided performance profiling tools. For example, on Android, one could investigate performance issues via various tracing tools. We have launched a TensorFlow Lite performance tracing module on Android that helps to poke into TensorFlow Lite internals. It is installed by default in our nightly release. With tracing, one may find whether there is resource contention during inference. Please refer to our documentation to learn more about how to use the module in the context of the Android benchmark tool.
We will continue working on improving TensorFlow Lite performance tooling to make it more intuitive and more helpful to measure and tune TensorFlow Lite performance on various devices.

Deploy: easily scale to multiple platforms

Nowadays, most applications need to support multiple platforms. That’s why we built TensorFlow Lite to work seamlessly across platforms: Android, iOS, Raspberry Pi, and other Linux-based IoT devices. All TensorFlow Lite models will just work out-of-the-box on any officially supported platforms, so that you can focus on creating good models instead of worrying about how to adapt your models to different platforms.
Each platform has its own hardware accelerator that can be used to speed up model inference. TensorFlow Lite has already supported running models on NNAPI for Android, GPU for both iOS and Android. We are excited to add more hardware accelerators:

  • On Android, we have added support for Qualcomm Hexagon DSP which is available on millions of devices. This enables developers to leverage the DSP on older Android devices below Android 8.1 where Android NN API is unavailable.
  • On iOS, we have launched CoreML delegate to allow running TensorFlow Lite models on Apple’s Neural Engine.

Besides, we continued to improve performance on existing supported platforms as you can see from the graph below comparing the performance between May 2019 and February 2020. You only need to upgrade to the latest version of TensorFlow Lite library to benefit from these improvements.

Pixel 4 – Single Threaded CPU, February 2020

Future work

Over the coming months, we will work on supporting more use cases and improving developer experiences:

  • Continuously release up-to-date state-of-the-art on-device models, including better support for BERT-family models for NLP tasks and new vision models.
  • Publish new tutorials and examples demonstrating more use cases, including how to use C/C++ APIs for inference on mobile.
  • Enhance Model Maker to support more tasks including object detection and several NLP tasks. We will add BERT support for NLP tasks, such as question and answer. This will empower developers without machine learning expertise to build state-of-the-art NLP models through transfer learning.
  • Expand the metadata and codegen tools to support more use cases, including object detection and more NLP tasks.
  • Launch more platform integration for even easier end-to-end experience, including better integration with Android Studio and TensorFlow Hub.

Feedback

We are committed to continue improving TensorFlow Lite and looking forward to seeing what you have built with TensorFlow Lite, as well as hearing your feedback. Share your use cases with us directly or on Twitter with hashtags #TFLite and #PoweredByTF. To report bugs and issues, please reach out to us on GitHub.

Acknowledgements

Thanks to Amy Jang, Andrew Selle, Arno Eigenwillig‎, Arun Venkatesan‎, Cédric Deltheil, Chao Mei, Christiaan Prins, Denny Zhou, Denis Brulé, Elizabeth Kemp, Hoi Lam, Jared Duke, Jordan Grimstad, Juho Ha, Jungshik Jang‎, Justin Hong, Hongkun Yu, Karim Nosseir, Khanh LeViet, Lawrence Chan, Lei Yu, Lu Wang‎, Luiz Gustavo Martins‎, Maxime Brénon, Mia Roh, Mike Liang, Mingxing Tan, Renjie Liu‎, Sachin Joglekar, Sarah Sirajuddin, Sebastian Goodman, Shiyu Hu, Shuangfeng Li‎, Sijia Ma, Tei Jeong, Tian Lin, Tim Davis, Vojtech Bardiovsky, Wei Wei, Wouter van Oortmerssen, Xiaodan Song, Xunkai Zhang‎, YoungSeok Yoon‎, Yuqi Li‎‎, Yi Zhou, Zhenzhong Lan, Zhiqing Sun and more.Read More

Introducing TensorFlow Videos for a Global Audience: Spanish

Introducing TensorFlow Videos for a Global Audience: Spanish

Posted by the TensorFlow Team

When the TensorFlow YouTube channel launched in 2018, we had a vision to inform and inspire developers around the world about what was possible with Machine Learning. With series like Coding TensorFlow showing how you can use it, and Made with TensorFlow showing inspirational stories about what people have done with TensorFlow and much more, the channel has grown greatly. But we learned an important lesson: it’s a global phenomenon, and to reach the world effectively, we should provide some of our best content in multiple languages with native speakers presenting. Check out the popular Zero to Hero series in Spanish!

Machine Learning with TensorFlow: Zero to Hero

Parece que uno no puede abrir un navegador, periódico o libro sin ver algo relacionado a Machine Learning o AI. Hay mucha información y mucha publicidad. Con eso en mente, Laurence Moroney, del equipo de TensorFlow, quiso producir una serie de cuatro videos, desde la perspectiva del desarrollador, sobre lo que realmente es el aprendizaje automático. Se basa en su popular conferencia de Google IO 2019, y se titula “Machine Learning: From Zero to Hero with TensorFlow”

Aquí está el primer video donde aprenderás que machine learning representa un nuevo paradigma en la programación, donde en lugar de programar reglas explícitas en un lenguaje como Java o C ++, se construye un sistema que está capacitado en datos para inferir las reglas en sí. Pero, ¿cómo se ve realmente ML? Aquí podrás ver un ejemplo básico de Hello World de cómo construir un modelo de ML, presentando ideas que aplicaremos en episodios posteriores a un reto más interesante: computer vision.

En el segundo video aprenderás acerca de computer vision al enseñarle a una computadora a ver y reconocer diferentes objetos. También puedes practicar y ver el ejemplo tú mismo aquí: https://goo.gle/34cHkDk

En el tercer vídeo discutimos las redes neuronales convolucionales y por qué son tan poderosas en aplicaciones de computer vision. Una convolución es un filtro que pasa sobre una imagen, la procesa e identifica características que muestran una similitud en la imagen. ¡En este video verás cómo funcionan, procesando una imagen para ver si puedes extraer características de ella! ¡También puedes probar aquí un codelab!: http://bit.ly/2lGoC5f

Aquí está el cuarto y último video donde aprenderás cómo construir un clasificador de imágenes para piedra, papel y tijeras. En el episodio uno, hablamos del juego de piedra, papel y tijeras; y discutimos lo difícil que podría ser escribir código para detectarlos y clasificarlos. A medida que los episodios han progresado hacia el machine learning, hemos aprendido cómo construir redes neuronales desde la detección de patrones en píxeles sin procesar, hasta la clasificación de ellos, hasta la detección de características mediante convoluciones. En este episodio, ponemos en práctica toda la información de las primeras tres partes de la serie. Cuaderno Colab: http://bit.ly/2lXXdw5. Conjunto de datos de piedra, papel y tijera: http://bit.ly/2kbV92O

¡Esperamos que disfrutes de esta serie y déjanos saber si deseas ver más!
Read More

How-to deploy TensorFlow 2 Models on Cloud AI Platform

How-to deploy TensorFlow 2 Models on Cloud AI Platform

Posted by Sara Robinson, Developer Advocate

Google Cloud’s AI Platform recently added support for deploying TensorFlow 2 models. This lets you scalably serve predictions to end users without having to manage your own infrastructure. In this post, I’ll walk you through the process of deploying two different types of TF2 models to AI Platform and use them to generate predictions with the AI Platform Prediction API. I’ll include one example for an image classifier and another for structured data. We’ll start by adding code to existing TensorFlow tutorials, and finish with models deployed on AI Platform.
In addition to cloud-based deployment options, TensorFlow also includes open source tools for deploying models, like TensorFlow Serving, which you can run on your own infrastructure. Here, our focus is on using a managed service.

AI Platform supports both autoscaling and manual scaling options. Autoscaling means your model infrastructure will scale to zero when no one is calling your model endpoint so that you aren’t charged when your model isn’t in use. If usage increases, AI Platform will automatically add resources to meet demand. Manual scaling lets you specify the number of nodes you’d like to keep running at all times, which can reduce cold start latency on your model.

The focus here will be on the deployment and prediction processes. AI Platform includes a variety of tools for custom model development, including infrastructure for training and hosted notebooks. When we refer to AI Platform in this post, we’re talking specifically about AI Platform Prediction, a service for deploying and serving custom ML models. In this post, we’ll build on existing tutorials in the TensorFlow docs by adding code to deploy your model to Google Cloud and get predictions.

In order to deploy your models, you’ll need a Google Cloud project with billing activated (you can also use the Google Cloud Platform Free Tier). If you don’t have a project yet, follow the instructions here to create one. Once you’ve created a project, enable the AI Platform API.

Deploying a TF2 image model to AI Platform

To show you how to deploy a TensorFlow 2 model on AI Platform, I’ll be using the model trained in this tutorial from the TF documentation. This trains a model on the Fashion MNIST dataset, which classifies images of articles of clothing into 10 different categories. Start by running through that whole notebook. You can click on the “Run in Google Colab” button at the top of the page to get started. Make sure to save your own copy of the notebook so you don’t lose your progress.

We’ll be using the probability_model created at the end of this notebook, since it outputs classifications in a more human-readable format. The output of probability_model is a 10-element softmax array with the probabilities that the given image belongs to each class. Since it’s a softmax array, all of the elements add up to 1. The highest-confidence classification will be the item of clothing corresponding with the index with the highest value.

In order to connect to your Cloud project, you will next need to authenticate your Colab notebook. Inside the notebook you opened for the Fashion MNIST tutorial, create a code cell:

from google.colab import auth
auth.authenticate_user()

Then run the following, replacing “your-project-id-here” with the ID of the Cloud project you created:

CLOUD_PROJECT = 'your-project-id-here'
BUCKET = 'gs://' + CLOUD_PROJECT + '-tf2-models'

For the next few code snippets, we’ll be using gcloud: the Google Cloud CLI along with gsutil, the CLI for interacting with Google Cloud Storage. Run the line below to configure gcloud with the project you created:

!gcloud config set project $CLOUD_PROJECT

In the next step, we’ll create a Cloud Storage bucket and print our GCS bucket URL. This will be used to store your saved model. You only need to run this cell once:

!gsutil mb $BUCKET
print(BUCKET)

Cloud AI Platform expects our model in TensorFlow 2 SavedModel format. To export our model in this format to the bucket we just created, we can run the following command. The model.save() method accepts a GCS bucket URL. We’ll save our model assets into a fashion-mnist subdirectory:

probability_model.save(BUCKET + '/fashion-mnist', save_format='tf')

To verify that this exported to your storage bucket correctly, navigate to your bucket in the Cloud Console (visit storage -> browser). You should see something like this:
Cloud console
With that we’re ready to deploy the model to AI Platform. In AI Platform, a model resource contains different versions of your model. Model names must be unique within a project. We’ll start by creating a model:

MODEL = 'fashion_mnist'
!gcloud ai-platform models create $MODEL --regions=us-central1

Once this runs, you should see the model in the Models section of the AI Platform Cloud Console:

It has no versions yet, so we’ll create one by pointing AI Platform at the SavedModel assets we uploaded to Google Cloud Storage. Models in AI Platform can have many versions. Versioning can help you ensure that you don’t break users who are dependent on a specific version of your model when you publish a new version. Depending on your use case, you can also serve different model versions to a subset of your users, for example, to run an experiment.

You can create a version either through the Cloud Console UI, gcloud, or the AI Platform API. Let’s deploy our first version with gcloud. First, save some variables that we’ll reference in our deploy command:

VERSION = 'v1'
MODEL_DIR = BUCKET + '/fashion-mnist'

Finally, run this gcloud command to deploy the model:

!gcloud ai-platform versions create $VERSION 
--model $MODEL
--origin $MODEL_DIR
--runtime-version=2.1
--framework='tensorflow'
--python-version=3.7

This command may take a minute to complete. When your model version is ready, you should see the following in the Cloud Console:

Getting predictions on a deployed image classification model

Now comes the fun part, getting predictions on our deployed model! You can do this with gcloud, the AI Platform API, or directly in the UI. Here we’ll use the API. We’ll use this predict method from the AI Platform docs:

import googleapiclient.discovery

def predict_json(project, model, instances, version=None):

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(project, model)

if version is not None:
name += '/versions/{}'.format(version)

response = service.projects().predict(
name=name,
body={'instances': instances}
).execute()

if 'error' in response:
raise RuntimeError(response['error'])

return response['predictions']

We’ll start by sending two test images to our model for prediction. To do that, we’ll convert these images from our test set to lists (so it’s valid JSON) and send them to the method we’ve defined above along with our project and model:

test_predictions = predict_json(CLOUD_PROJECT, MODEL, test_images[:2].tolist())

In the response, you should see a JSON object with softmax as the key, and a 10-element softmax probability list as the value. We can get the predicted class of the first test image by running:

np.argmax(test_predictions[0]['softmax'])

Our model predicts class 9 for this image with 98% confidence. If we look at the beginning of the notebook, we’ll see that 9 corresponds with ankle boot. Let’s plot the image to verify our model predicted correctly. Looks good!

plt.figure()
plt.imshow(test_images[0])
plt.colorbar()
plt.grid(False)
plt.show()

Deploying TensorFlow 2 models with structured data

Now that you know how to deploy an image model, we’ll look at another common model type – a model trained on structured data. Using the same approach as the previous section, we’ll use this tutorial from the TensorFlow docs as a starting point and build upon it for deployment and prediction. This is a binary classification model that predicts whether a patient has heart disease.

To start, make a copy of the tutorial in Colab and run through the cells. Note that this model takes Keras feature columns as input and has two different types of features: numerical and categorical. You can see this by printing out the value of feature_columns. This is the input format our model is expecting, which will come in handy after we deploy it. In addition to sending features as tensors, we can also send them to our deployed model as lists. Note that this model has a mix of numerical and categorical features. One of the categorical features (thal) should be passed in as a string; the rest are either integers or floats.

Following the same process as above, let’s export our model and save it to the same Cloud Storage bucket in a hd-prediction subdirectory:

model.save(BUCKET + '/hd-prediction', save_format='tf')

Verify that the model assets were uploaded to your bucket. Since we showed how to deploy models with gcloud in the previous section, here we’ll use the Cloud Console. Start by selecting New Model in the Models section of AI Platform in the Cloud Console:

Then follow these steps (you can see a demo in the following GIF, and you can read about them in the text below).

Head over to the models section of your Cloud console. Then select the New model button and give your model a name, like hd_prediction and select Create.

Once your model resource has been created, select New version. Give it a name (like v1), then select the most recent Python version (3.7 at the time of this writing). Under frameworks select TensorFlow with Framework version 2.1 and ML runtime version 2.1. In Model URL, enter the Cloud Storage URL where you uploaded your TF SavedModel earlier. This should be equivalent to BUCKET + '/hd-prediction' if you followed the steps above. Then select Save, and when your model is finished deploying you’ll see a green checkmark next to the version name in your console.

To format our data for prediction, we’ll send each test instance as JSON objects with keys being the name of our features and the values being a list with each feature value. Here’s the code we’ll use to format the first two examples from our test set for prediction:

# First remove the label column
test = test.pop('target')

caip_instances = []
test_vals = test[:2].values

for i in test_vals:
example_dict = {k: [v] for k,v in zip(test.columns, i)}
caip_instances.append(example_dict)

Here’s what the resulting array of caip_instances looks like:

[{'age': [60],
'ca': [2],
'chol': [293],
'cp': [4],
'exang': [0],
'fbs': [0],
'oldpeak': [1.2],
'restecg': [2],
'sex': [1],
'slope': [2],
'thal': ['reversible'],
'thalach': [170],
'trestbps': [140]},
...]

We can now call the same predict_json method we defined above, passing it our new model and test instances:

test_predictions = predict_json(CLOUD_PROJECT, 'hd_prediction', caip_instances)

Your response will look something like the following (exact numbers will vary):

[{'output_1': [-1.4717596769332886]}, {'output_1': [-0.2714746594429016]}]

Note that if you’d like to change the name of the output tensor (currently output_1), you can add a name parameter when you define your Keras model in the tutorial above:

layers.Dense(1, name='prediction_probability')

In addition to making predictions with the API, you can also make prediction requests with gcloud. All of the prediction requests we’ve made so far have used online prediction, but AI Platform also supports batch prediction for large offline jobs. To create a batch prediction job, you can make a JSON file of your test instances and kick off the job with gcloud. You can read more about batch prediction here.

What’s next?

You’ve now learned how to deploy two types of TensorFlow 2 models to Cloud AI Platform for scalable prediction. The models we’ve deployed here all use autoscaling, which means they’ll scale down to 0 so you’re only paying when your model is in use. Note that AI Platform also supports manual scaling, which lets you specify the number of nodes you’d like to leave running.

If you’d like to learn more about what we did here, check out the following resources:

I’d love to hear your thoughts on this post. If you’ve got any feedback or topics you’d like to see covered in the future, find me on Twitter at @SRobTweets.
Read More

How Airbus Detects Anomalies in ISS Telemetry Data Using TFX

How Airbus Detects Anomalies in ISS Telemetry Data Using TFX

A guest post by Philipp Grashorn, Jonas Hansen and Marcel Rummens from Airbus

The International Space Station and it’s different modules. Airbus designed and built the Columbus module in 2008.

Airbus provides several services for the operation of the Columbus module and its payloads on the International Space Station (ISS). Columbus was launched in 2008 and is one of the main laboratories onboard the ISS. To ensure the health of the crew as well as hundreds of systems onboard the Columbus module, engineers have to keep track of many telemetry datastreams, which are constantly beamed to earth.

The operations team at the Columbus Control Center, in collaboration with Airbus, keeps track of thousands of parameters, monitored in 24/7 shifts. If an operator detects an anomaly, he or she creates an anomaly report which is resolved by Airbus system experts. The team at Airbus created the ISS Analytics project to automate part of the workflow of detecting anomalies.

Previous, manual workflow

Detecting Anomalies

The Columbus module consists of several subsystems, each of which is composed of multiple components, resulting in about 17,000 unique telemetry parameters. As each subsystem is highly specialized, it made sense to train a separate model for each subsystem.

Lambda Architecture

In order to detect anomalies within the real time telemetry data stream, the models are trained on about 10 years worth of historical data, which is constantly streamed to earth and stored in a specialized database. On average, the data is streamed in a frequency of one hertz. Simply looking at the data of the last 10 years results in over 5 trillion data points, (10y * 365d * 24h * 60min * 60s * 17K params).

A problem of this magnitude requires big data technologies and a level of computational power which is typically only found in the cloud. As of now a public cloud was adopted, however as more sensitive systems are integrated in the future, the project has to be migrated to the Airbus Private Cloud for security purposes.

To tackle this anomaly detection problem, a lambda architecture was designed which is composed of two parts: the speed and the batch layer.

High Level architecture of ISS Analytics

The batch layer consists only of the learning pipeline, fed with historical time series data which is queried from an on-premise database. Using an on-premise Spark cluster, the data is sanitized and prepared for the upload to GCP. TFX on Kubeflow is used to train an LSTM Autoencoder (details in the next section) and deploy it using TF-Serving.

The speed layer is responsible for monitoring the real-time telemetry stream, which is received using multiple ground stations on earth. The monitoring process uses the deployed TensorFlow model to detect anomalies and compare them against a database of previously detected anomalies, simplifying the root cause analysis and decreasing the time to resolution. In case the neural network detects an anomaly, a reporting service is triggered which consolidates all important information related to the potential anomaly. A notification service then creates an abstract and informs the responsible experts.

Training an Autoencoder to Detect Anomalies

As mentioned above, each model is trained on a subset of telemetry parameters. The objective of the model is to represent the nominal state of the subsystem. If the model is able to reconstruct observations of nominal states with a high accuracy, it will have difficulties reconstructing observations of states which deviate from the nominal state. Thus, the reconstruction error of the model is used as an indicator for anomalies during inference, as well as part of the cost function in training. Details of this practice can be found here and here.

The anomaly detection approach outlined above was implemented using a special type of artificial neural network called an Autoencoder. An Autoencoder can be divided into two parts: the encoder and the decoder. The encoder is a mapping from the input space into a lower dimensional latent space. The decoder is a mapping from the latent space into the reconstruction space with a dimensionality equal to the input space.

While the encoder generates a compressed representation of the input, the decoder generates a representation as close as possible to the original input, using the latent vector from the encoder. Dimensionality reduction acts as a funnel which enables the autoencoder to ignore signal noise.

The difference between the input and the reconstruction is called reconstruction error and is calculated as the root-mean-square error. The reconstruction error, as mentioned above, is minimized in the training step and acts as an indicator for anomalies during inference (e.g., an anomaly would have high reconstruction error).

Example Architecture of an Autoencoder

LSTM for sequences

The Autoencoder uses LSTMs to process sequences and capture temporal information. Each observation is represented as a tensor with shape [number_of_features,number_of_timesteps_per_sequence]. The data is prepared using TFT’s scale_to_0_1 and vocabulary functions. Each LSTM layer of the encoder is followed by an instance of tf.keras.layers.Dropout to increase the robustness against noise.

Model Architecture of ISS Analytics (Red circles represent dropout)

Using TFX

The developed solution contains many but not all of the TensorFlow Extended (TFX) components. However it is planned to research and integrate additional components included with the TFX suite in the future.

The library that is most used in this solution is tf.Transform, which processes the raw telemetry data and converts it into a format compatible with the Autoencoder model. The preprocessing steps are defined in the preprocessing_fn() function and executed on Apache Beam. The resulting transformation graph is stored hermetically within the graph of the trained model. This ensures that the raw data is always processed using the same function, independent of the environment it is deployed in. This way the data fed into the model is consistent.

The sequence-based approach which was outlined in an earlier section posed some challenges. The input_fn() of model training reads the data, preprocessed in the preceding tf.Transform step and applies a windowing function to create sequences. This step is necessary because the data is stored as time steps without any sequence information. Afterwards, it creates batches of size sequence_length * batch_size and converts the whole dataset into a sparse tensor for the input layer of the Autoencoder (tf.contrib.feature_column.sequence_input_layer()expects sparse tensors).

The serving_input_fn() on the other hand receives already sequenced data from upstream systems (data-stream from the ISS). But this data is not yet preprocessed and therefore the tf.Transform step has to be applied. This step is preceded and followed by reshaping calls, in order to temporarily remove the sequence-dimension of the tensor for the preprocessing_fn().

Orchestration for all parts of the machine learning pipeline (transform, train, evaluate) was done with Kubeflow Pipelines. This toolkit simplifies and accelerates the process of training models, experimenting with different architectures and optimizing hyperparameters. By leveraging the benefits of Kubernetes on GCP, it is very convenient to run multiple experiments in parallel. In combination with the Kubeflow UI, one can analyze the hyperparameters and results of these runs in a well-structured form. For a more detailed analysis of specific models and runs, TensorBoard was used to examine learning curves and neural network topologies.

The last step in this TFX use case is to connect the batch and the speed layer by deploying the trained model with TensorFlow Serving. This turned out to be the most important component of TFX, actually bringing the whole machine learning system into production. Its support for features like basic monitoring, a standardized API, effortless rollover and A/B testing, have been crucial for this project.

With the modular design of TFX pipelines, it was possible to train separate models for many subsystems of the Columbus module, without any major modifications. Serving these models as independent services on Kubernetes allows scaling the solution, in order to apply anomaly detection to multiple subsystems in parallel.

Utilizing TFX on Kubeflow brought many benefits to the project. Its flexible nature allows a seamless transition between different environments and will help the upcoming migration to the Airbus Private Cloud. In addition, the work done by this project can be repurposed to other products without any major rework, utilizing the development of generic and reusable TFX components.

Combining all these features the system is now capable of analysing large amounts of telemetry parameters, detecting anomalies and triggering the required steps for a faster and smarter resolution.

The partially automated workflow after the ISS Analytics project

To learn more about Airbus checkout out the Airbus website or dive deeper into the Airbus Space Infrastructure. To learn more about TFX check out the TFX website, join the TFX discussion group, dive into other posts in the TFX blog, or watch the TFX playlist on YouTube.

Read More

Quantization Aware Training with TensorFlow Model Optimization Toolkit - Performance with Accuracy

Quantization Aware Training with TensorFlow Model Optimization Toolkit – Performance with Accuracy

Posted by the TensorFlow Model Optimization team

We are excited to release the Quantization Aware Training (QAT) API as part of the TensorFlow Model Optimization Toolkit. QAT enables you to train and deploy models with the performance and size benefits of quantization, while retaining close to their original accuracy. This work is part of our roadmap to support the development of smaller and faster ML models. For more background, you can see previous posts on post-training quantization, float16 quantization and sparsity.

Quantization is lossy

Quantization is the process of transforming an ML model into an equivalent representation that uses parameters and computations at a lower precision. This improves the model’s execution performance and efficiency. For example, TensorFlow Lite 8-bit integer quantization results in models that are up to 4x smaller in size, 1.5x-4x faster in computations, and lower power consumption on CPUs. Additionally, it allows model execution on specialized neural accelerators, such as Edge TPU in Coral, which often has a restricted set of data types.

However, the process of going from higher to lower precision is lossy in nature. As seen in the image below, quantization squeezes a small range of floating-point values into a fixed number of information buckets.

Small range of float32 values mapped to int8 is a lossy conversion since int8 only has 255 information channels

This leads to information loss. The parameters (or weights) of a model can now only take a small set of values and the minute differences between them are lost. For example, all values in range [2.0, 2.3] may now be represented in one single bucket. This is similar to rounding errors when fractional values are represented as integers.

There are also other sources of loss. When these lossy numbers are used in several multiply-add computations, these losses accumulate. Further, int8 values, which accumulate into int32 integers, need to be rescaled back to int8 values for the next computation, thus introducing more computational error.

Quantization Aware Training

The core idea is that QAT simulates low-precision inference-time computation in the forward pass of the training process. This work is credited to the original innovations by Skirmantas Kligys in the Google Mobile Vision team. This introduces the quantization error as noise during the training and as part of the overall loss, which the optimization algorithm tries to minimize. Hence, the model learns parameters that are more robust to quantization.

If training is not an option, please check out post-training quantization, which works as part of TensorFlow Lite model conversion. QAT is also useful for researchers and hardware designers who may want to experiment with various quantization strategies (beyond what is supported by TensorFlow Lite) and / or simulate how quantization affects accuracy for different hardware backends.

QAT-trained models have comparable accuracy to floating-point

QAT accuracy numbers tableIn the table above, QAT accuracy numbers were trained with the default TensorFlow Lite configuration and contrasted with the floating-point baseline and post-training quantized models.

Emulating low-precision computation

The training graph itself operates in floating-point (e.g. float32), but it has to emulate low-precision computation, which is fixed-point (e.g. int8 in the case of TensorFlow Lite). To do so, we insert special operations into the graph (tensorflow::ops::FakeQuantWithMinMaxVars) that convert the floating-point tensors into low-precision values and then convert the low-precision values back into floating-point. This ensures that losses from quantization are introduced in the computation and that further computations emulate low-precision. In order to do so, we ensure that the losses from quantization are introduced in the tensor and, since each value in the floating-point tensor now maps 1:1 to a low-precision value, any further computation with similarly mapped tensors won’t introduce any further loss and mimics low-precision computations exactly.

Placing the quantization emulation operations

The quantization emulation operations need to be placed in the training graph such that they are consistent with the way that the quantized graph will be computed. This means that, for our API to be able to execute in TensorFlow Lite, we needed to follow the TensorFlow Lite quantization spec precisely.

The ‘wt quant’ and ‘act quant’ ops introduce losses in the forward pass of the model to simulate actual quantization loss during inference. Note how there is no Quant operation between Conv and ReLU6. This is because ReLUs get fused in TensorFlow Lite.

The API, built upon the Keras layers and model abstractions, hides the complexities mentioned above, so that you can quantize your entire model with a few lines of code.

Logging computation statistics

Aside from emulating the reduced precision computation, the API is also responsible for recording the necessary statistics to quantize the trained model. As an example, this allows you to take a model trained with the API and convert it to a quantized integer-only TensorFlow Lite model.

How to use the API with only few lines of code

The QAT API provides a simple and highly flexible way to quantize your TensorFlow Keras model. It makes it really easy to train with “quantization awareness” for an entire model or only parts of it, then export it for deployment withTensorFlow Lite.

Quantize the entire Keras model

import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
...
])
# Quantize the entire model.
quantized_model = tfmot.quantization.keras.quantize_model(model)

# Continue with training as usual.
quantized_model.compile(...)
quantized_model.fit(...)

Quantize part(s) of a Keras model

import tensorflow_model_optimization as tfmot
quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer

model = tf.keras.Sequential([
...
# Only annotated layers will be quantized.
quantize_annotate_layer(Conv2D()),
quantize_annotate_layer(ReLU()),
Dense(),
...
])

# Quantize the model.
quantized_model = tfmot.quantization.keras.quantize_apply(model)

By default, our API is configured to work with the quantized execution support available in TensorFlow Lite. A detailed Colab with an end-to-end training example is located here.

The API is quite flexible and capable of handling far more complicated use cases. For example, it allows you to control quantization precisely within a layer, create custom quantization algorithms, and handle any custom layers that you may have written.

To learn more about how to use the API, please try this Colab. These sections of the Colab provide examples of how users can experiment with different quantization algorithms using the API. You can also check out this recent talk from the TensorFlow Developer Summit.

We are very excited to see how the QAT API further enables TensorFlow users to push the boundaries of efficient execution in their TensorFlow Lite-powered products as well as how it opens the door to researching new quantization algorithms and further developing new hardware platforms with different levels of precision.

If you want to learn more, check out this video from the TensorFlow DevSummit which introduces the Model Optimization Toolkit and explains QAT.

Acknowledgements

Thanks to Pulkit Bhuwalka, Alan Chiao, Suharsh Sivakumar, Raziel Alvarez, Feng Liu, Lawrence Chan, Skirmantas Kligys, Yunlu Li, Khanh LeViet, Billy Lambert, Mark Daoust, Tim Davis, Sarah Sirajuddin, and François CholletRead More

Upcoming changes to TensorFlow.js

Upcoming changes to TensorFlow.js

Posted by Yannick Assogba, Software Engineer, Google Research

As TensorFlow.js is used more and more in production environments, our team recognizes the need for the community to be able to produce small, production optimized bundles for browsers that use TensorFlow.js. We have been laying out the groundwork for this and want to share our upcoming plans with you.

TensorFlow.js updatesOne primary goal we have for upcoming releases of TensorFlow.js is to make it more modular and more tree shakeable, while preserving ease of use for beginners. To that end, we are planning two major version releases to move us in that direction. We are releasing this work over two major versions in order to maintain semver as we make breaking changes.

TensorFlow.js 2.0

In TensorFlow.js 2.x, the only breaking change will be moving the CPU and WebGL backends from tfjs-core into their own NPM packages (tfjs-backend-cpu and tfjs-backend-webgl respectively). While today these are included by default, we want to make tfjs-core as modular and lean as possible.

What does this mean for me as a user?

If you are using the union package (i.e. @tensorflow/tfjs), you should see no impact to your code. If you are using @tensorflow/tfjs-core directly, you will need to import a package for each backend you want to use.

What benefit do I get?

If you are using @tensorflow/tfjs-core directly, you will now have the option of omitting any backend you do not want to use in your application. For example, if you only want the WebGL backend, you will be able to get modest savings by dropping the CPU backend. You will also be able to lazily load the CPU backend as a fallback if your build tooling/app supports that.

TensorFlow.js 3.0

In this release, we will have fully modularized all our ops and kernels (backend specific implementations of the math behind an operation). This will allow tree shakers in bundlers like WebPack, Rollup, and Closure Compiler to do better dead-code elimination and produce smaller bundles.

We will move to a dynamic gradient and kernel registration scheme as well as provide tooling to aid in creating custom bundles that only contain kernels for a given model or TensorFlow.js program.

We will also start shipping ES2017 bundles by default. Users who need to deploy to browsers that only support earlier versions can transpile down to their desired target.

What does this mean for me as a user?

If you are using the union package (i.e. @tensorflow/tfjs), we anticipate the changes will be minimal. In order to support ease of use in getting started with tfjs, we want the default use of the union package to remain close to what it is today.

For users who want smaller production oriented bundles, you will need to change your code to take advantage of ES2015 modules to import only the ops (and other functionality) you would like to end up in your bundle.

In addition, we will provide command-line tooling to enable builds that only load and register the kernels used by the models/programs you are deploying.

What benefit do I get?

Production oriented users will be able to opt into writing code that results in smaller more optimized builds. Other users will still be able to use the union package pretty much as is, but will not get the advantage of the smallest builds possible.

Dynamic gradient and kernel registration will make it easier to implement custom kernels and gradients for researchers and other advanced users.

FAQ

When will this be ready?

We plan to release TensorFlow.js 2.0 this month. We do not have a release date for Tensorflow 3.0 yet because of the magnitude of the change. Since we need to touch almost every file in tfjs-core, we are also taking the opportunity to clean up technical debt where we can.

Should I upgrade to TensorFlow.js 2.x or just wait for 3.x?

We recommend that you upgrade to TensorFlow 2.x if you are actively developing a TensorFlow.js project. It should be a relatively painless upgrade, and any future bug fixes will be on this release train. We do not yet have a release date for TensorFlow.js 3.x.

How do I migrate my app to 2.x or 3.x? Will there be a tutorial to follow?

As we release these versions, we will publish full release notes with instructions on how to upgrade. Separately, with the launch of 3.x, we will publish a guide on making production builds.

How much will I have to change my code to get smaller builds?

We’ll have more details as we get closer to the release of 3.x, but at a high level, we want to take advantage of the ES2015 module system to let you control what code gets into your bundle.

In general, you will need to do things like import {max, div, mul, depthToSpace} from @tensorflow/tjfs (rather than import * as tf from @tensorflow/tfjs) in order for our tooling to determine which kernels to register from the backends you have selected for deployment. We are even working on making the chaining API on the Tensor class opt-in when targeting production builds.

Will this make TensorFlow.js harder to use?

We do not want to make the barrier to entry higher for using TensorFlow.js so we are designing this in a way that only production oriented users need to do extra work to get optimized builds. For end-users developing applications using the union package (@tensorflow/tfjs) from either a hosted script or from NPM in concert with our collection of pre-trained models we expect there will be no changes as a result of these updatesRead More

TensorFlow Lite Core ML delegate enables faster inference on iPhones and iPads

TensorFlow Lite Core ML delegate enables faster inference on iPhones and iPads

Posted by Tei Jeong and Karim Nosseir, Software Engineers

TensorFlow Lite offers options to delegate part of the model inference, or the entire model inference, to accelerators, such as the GPU, DSP, and/or NPU for efficient mobile inference. On Android, you can choose from several delegates: NNAPI, GPU, and the recently added Hexagon delegate. Previously, with Apple’s mobile devices — iPhones and iPads — the only option was the GPU delegate.

When Apple released its machine learning framework Core ML and Neural Engine (a neural processing unit (NPU) in Apple’s Bionic SoC) this allowed TensorFlow Lite to leverage Apple’s hardware.

Neural processing units (NPUs), similar to Google’s Edge TPU and Apple’s Neural Engine, are specialized hardware accelerators designed to accelerate machine learning applications. These chips are designed to speed up model inference on mobile or edge devices and use less power than running inference on the CPU or GPU.

Today, we are excited to announce a new TensorFlow Lite delegate that uses Apple’s Core ML API to run floating-point models faster on iPhones and iPads with the Neural Engine. We are able to see performance gains up to 14x (see details below) for models like MobileNet and Inception V3.

TFLite Core
Figure 1: High-level overview of how a delegate works at runtime. Supported portions of the graph run on the accelerator, while other operations run on the CPU via TensorFlow Lite kernels.

Which devices are supported?

This delegate runs on iOS devices with iOS version 12 or later (including iPadOS). However, to get true performance benefits, it should run on devices with Apple A12 SoC or later (for example, iPhone XS). For older iPhones, you should use the TensorFlow lite GPU delegate to get faster performance.

Which models are supported?

With this initial launch, 32-bit floating point models are supported. A few examples of supported models include, but not limited to, image classification, object detection, object segmentation, and pose estimation models. The delegate supports many compute-heavy ops such as convolutions, though there are certain constraints for some ops. These constraints are checked before delegation in runtime so that unsupported ops automatically fall back to the CPU. The complete list of ops and corresponding restrictions (if any) are in the delegate’s documentation.

Impacts on performance

We tested the delegate with two common float models, MobileNet V2 and Inception V3. Benchmarks were conducted on the iPhone 8+ (A11 SoC), iPhone XS (A12 SoC) and iPhone 11 Pro (A13 SoC), and tested for three delegate options: CPU only (no delegate), GPU, and Core ML delegate. As mentioned before, you can see the accelerated performance on models with A12 SoC and later, but on iPhone 8+ — where Neural Engine is not available for third parties — there is no observed performance gain when using the Core ML delegate with small models. For larger models, performance is similar to GPU delegate.

In addition to model inference latency, we also measured startup latency. Note that accelerated speed comes at a tradeoff with delayed startup. For the Core ML delegate, startup latency increases along with the model size. For example, on smaller models like MobileNet, we observed a startup latency of 200-400ms. On the other hand, for larger models, like Inception V3, the startup latency could be 2-4 seconds. We are working on reducing the startup latency. The delegate also has an impact on the binary size. Using the Core ML delegate may increase the binary size by up to 1MB.

Models

  • MobileNet V2 (1.0, 224, float) [download] : Image Classification
    • Small model. Entire graph runs on Core ML.
  • Inception V3 [download] : Image Classification
    • Large model. Entire graph runs on Core ML.

Devices

  • iPhone 8+ (Apple A11, iOS 13.2.3)
  • iPhone XS (Apple A12, iOS 13.2.3)
  • iPhone 11 Pro (Apple A13, iOS 13.2.3)

Latencies and speed-ups observed for MobileNet V2

Figure 2: Latencies and speed-ups observed for MobileNet V2. All versions use a floating-point model. CPU Baseline denotes two-threaded TensorFlow Lite kernels.
* GPU: Core ML uses CPU and GPU for inference. NPU: Core ML uses CPU and GPU, and NPU(Neural Engine) for inference.

Latencies and speed-ups observed for Inception V3

Figure 3: Latencies and speed-ups observed for Inception V3. All versions use a floating-point model. CPU Baseline denotes two-threaded TensorFlow Lite kernels.
* GPU: Core ML uses CPU and GPU for inference. NPU: Core ML uses CPU and GPU, and NPU(Neural Engine) for inference.

How do I use it?

You simply have to make a call on the TensorFlow Lite Interpreter with an instance of the new delegate. For a detailed explanation, please read the full documentation. You can use either the Swift API (example below) or the C++ API (shown in the documentation) to invoke the TensorFlow Lite delegate during inference.

Swift example

This is how invoking this delegate would look from a typical Swift application. All you have to do is create a new Core ML delegate and pass it to the original interpreter initialization code.

let coreMLDelegate = CoreMLDelegate()
let interpreter = try Interpreter(modelPath: modelPath,
delegates: [coreMLDelegate])

Future work

Over the coming months, we will improve upon the existing delegate with more op coverage and additional optimizations. Support for models trained with post-training float16 quantization is on the roadmap. This will allow acceleration of models with about half the model size and small accuracy loss.
Support for post-training weight quantization (also called dynamic quantization) is also on the roadmap.

Feedback

This was a common feature request that we got from our developers. We are excited to release this and are looking forward to hearing your thoughts. Share your use case directly or on Twitter with hashtags #TFLite and #PoweredByTF. For bugs or issues, please reach out to us on GitHub.

Acknowledgements

Thank you to Tei Jeong, Karim Nosseir, Sachin Joglekar, Jared Duke, Wei Wei, Khanh LeViet.
Note: Core ML, Neural Engine and Bionic SoCs (A12, A13) are products of Apple Inc.Read More