How Rasa Open Source Gained Layers of Flexibility with TensorFlow 2.x

A guest post by Vincent D. Warmerdam and Vladimir Vlasov, Rasa

Rasa logo with bird

At Rasa, we are building infrastructure for conversational AI, used by developers to build chat- and voice-based assistants. Rasa Open Source, our cornerstone product offering, provides a framework for NLU (Natural Language Understanding) and dialogue management. On the NLU side, we offer models for intent classification and entity detection, built with TensorFlow 2.x.

In this article, we would like to discuss the benefits of migrating to the latest version of TensorFlow and also give insight into how some of the Rasa internals work.

A Typical Rasa Project Setup

When you’re building a virtual assistant with Rasa Open Source, you’ll usually begin by defining stories, which represent conversations users might have with your agent. These stories serve as training data, and you can configure them as YAML files. If we pretend that we’re making an assistant that allows you to buy pizzas online, then we might have stories in our configuration that look like this:

yaml
version: "2.0"

stories:

- story: happy path
  steps:
  - intent: greet
  - action: utter_greet
  - intent: mood_great
  - action: utter_happy

- story: purchase path
  steps:
  - intent: greet
  - action: utter_greet
  - intent: purchase
    entities:
    - product: "pizza"
  - action: confirm_purchase
  - intent: affirm
  - action: confirm_availability

These stories consist of intents and actions. Actions can be simple text replies, or they can trigger custom Python code (that checks a database, for instance). To define training data for each intent, you supply the assistant with example user messages, which might look something like:

yaml
version: "2.0"

nlu:
- intent: greet
  examples: |
    - hey
    - hello
    - hi
    - hello there
    - good morning

- intent: purchase
  examples: |
    - i’d like to buy a [veggie pizza](product) for [tomorrow](date_ref)
    - i want to order a [pizza pepperoni](product)
    - i’d want to buy a [pizza](product) and a [cola](product)
    - ...

When you train an assistant using Rasa you’ll supply configuration files like those shown above. You can be very expressive in the types of conversations your agent can handle. Intents and actions are like lego bricks and can be combined expressively to cover many conversational paths. Once these files are defined they are combined to create a training dataset that the agent will learn from.
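To make the bracket annotation in the NLU examples concrete, here is a minimal sketch of how a line such as `[veggie pizza](product)` could be split into plain text plus entity spans. The regular expression and the `parse_annotated_example` helper are illustrative assumptions, not Rasa's actual parser.

```python
import re
from typing import Dict, List, Tuple

# Matches "[entity text](entity_type)"; a simplification of the annotation syntax.
ENTITY_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_annotated_example(example: str) -> Tuple[str, List[Dict[str, object]]]:
    """Return the plain text and the entity spans found in an annotated example."""
    plain_parts: List[str] = []
    entities: List[Dict[str, object]] = []
    cursor = 0      # position in the annotated string
    plain_len = 0   # length of the plain text built so far

    for match in ENTITY_PATTERN.finditer(example):
        before = example[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)

        value, entity_type = match.group(1), match.group(2)
        entities.append({
            "entity": entity_type,
            "value": value,
            "start": plain_len,
            "end": plain_len + len(value),
        })
        plain_parts.append(value)
        plain_len += len(value)
        cursor = match.end()

    plain_parts.append(example[cursor:])
    return "".join(plain_parts), entities


text, entities = parse_annotated_example(
    "i want to order a [pizza pepperoni](product)"
)
print(text)
print(entities)
```

The character offsets refer to the plain text, which is the form a model would actually see at prediction time.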

Rasa allows users to build custom machine learning pipelines to fit their datasets. That means you can incorporate your own (pre-trained) models for natural language understanding if you’d like. But Rasa also provides models, written in TensorFlow, that are specialized for these tasks.

Specific Model Requirements

You may have noticed that our examples include not just intents but also entities. When a user is interested in making a purchase, they (usually) also say what they’re interested in buying. This information needs to be detected when the user provides it. It’d be a bad experience if we needed to supply the user with a form to retrieve this information.

intent greet vs intent purchase

If you take a step back and think about what kind of model could work well here, you’ll soon recognize that it’s not a standard task. It’s not just that we have numerous labels at each utterance; we have multiple *types* of labels too. That means that we need models that have two outputs.

Types of labels texts to tokens

Rasa Open Source offers a model that can detect both intents and entities, called DIET. It uses a transformer architecture that allows the system to learn from the interaction between intents and entities. Because it needs to handle these two tasks at once, the typical machine learning pattern won’t work:

model.fit(X, y).predict(X)

You need a different abstraction.
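As a rough illustration of why a single `y` is not enough, the sketch below uses a deliberately naive, count-based stand-in for a model with two outputs. It is not DIET and involves no neural network; the `Example`, `Prediction`, and `ToyJointModel` names are hypothetical. The point is only the shape of the data: one intent label per utterance plus one tag per token.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    tokens: List[str]
    intent: str             # one label per utterance
    entity_tags: List[str]  # one label per token ("O" means no entity)


@dataclass
class Prediction:
    intent: str
    entity_tags: List[str]


class ToyJointModel:
    """Count-based stand-in for a model with two outputs (not DIET itself)."""

    def __init__(self) -> None:
        self.intent_votes: Dict[str, Counter] = defaultdict(Counter)
        self.tag_votes: Dict[str, Counter] = defaultdict(Counter)
        self.default_intent = ""

    def fit(self, examples: List[Example]) -> "ToyJointModel":
        intent_counts: Counter = Counter()
        for ex in examples:
            intent_counts[ex.intent] += 1
            for token, tag in zip(ex.tokens, ex.entity_tags):
                self.intent_votes[token][ex.intent] += 1
                self.tag_votes[token][tag] += 1
        self.default_intent = intent_counts.most_common(1)[0][0]
        return self

    def predict(self, tokens: List[str]) -> Prediction:
        # Output 1: a single intent, chosen by summing per-token votes.
        votes: Counter = Counter()
        for token in tokens:
            if token in self.intent_votes:
                votes.update(self.intent_votes[token])
        intent = votes.most_common(1)[0][0] if votes else self.default_intent

        # Output 2: one tag per token, "O" for tokens never seen in training.
        tags = [
            self.tag_votes[token].most_common(1)[0][0]
            if token in self.tag_votes else "O"
            for token in tokens
        ]
        return Prediction(intent=intent, entity_tags=tags)


train = [
    Example(["hello", "there"], "greet", ["O", "O"]),
    Example(["buy", "a", "pizza"], "purchase", ["O", "O", "product"]),
    Example(["order", "a", "cola"], "purchase", ["O", "O", "product"]),
]
model = ToyJointModel().fit(train)
prediction = model.predict(["buy", "a", "cola"])
print(prediction)
```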

Abstraction

This is where TensorFlow 2.x has made an improvement to the Rasa codebase. It is now much easier to customize TensorFlow classes. In particular, we’ve made a custom abstraction on top of Keras to suit our needs. One example of this is Rasa’s own internal `RasaModel`. We’ve added the base class’s signature below. The full implementation can be found here.

from typing import List, Optional, Text, Union

import tensorflow as tf
# Rasa's own training data container (import path as of Rasa 2.x).
from rasa.utils.tensorflow.model_data import RasaModelData


class RasaModel(tf.keras.models.Model):

    def __init__(
        self,
        random_seed: Optional[int] = None,
        tensorboard_log_dir: Optional[Text] = None,
        tensorboard_log_level: Optional[Text] = "epoch",
        **kwargs,
    ) -> None:
        ...

    def fit(
        self,
        model_data: RasaModelData,
        epochs: int,
        batch_size: Union[List[int], int],
        evaluate_on_num_examples: int,
        evaluate_every_num_epochs: int,
        batch_strategy: Text,
        silent: bool = False,
        eager: bool = False,
    ) -> None:
        ...

This object is customized to allow us to pass in our own `RasaModelData` object. The benefit is that we can keep all the existing features that the Keras model object offers while we can override a few specific methods to suit our needs. We can run the model with our preferred data format while maintaining manual control over “eager mode,” which helps us debug.

These Keras objects are now a central API in TensorFlow 2.x, which made it very easy for us to integrate and customize.

Training Loop

To give another impression of how the code became simpler, let’s look at the training loop inside the Rasa model.

Python Pseudo-Code for TensorFlow 1.8

We’ve got a part of the code used for our old training loop listed below (see here for the full implementation). Note that it is using `session.run` to calculate the loss as well as the accuracy.

def train_tf_dataset(
    train_init_op: "tf.Operation",
    eval_init_op: "tf.Operation",
    batch_size_in: "tf.Tensor",
    loss: "tf.Tensor",
    acc: "tf.Tensor",
    train_op: "tf.Tensor",
    session: "tf.Session",
    epochs: int,
    batch_size: Union[List[int], int],
    evaluate_on_num_examples: int,
    evaluate_every_num_epochs: int,
):
    # Initialize all graph variables before training starts.
    session.run(tf.global_variables_initializer())
    pbar = tqdm(range(epochs), desc="Epochs", disable=is_logging_disabled())

    for ep in pbar:
        ep_batch_size = linearly_increasing_batch_size(ep, batch_size, epochs)
        session.run(train_init_op, feed_dict={batch_size_in: ep_batch_size})

        ep_train_loss = 0
        ep_train_acc = 0
        batches_per_epoch = 0
        while True:
            try:
                _, batch_train_loss, batch_train_acc = session.run(
                    [train_op, loss, acc])
                batches_per_epoch += 1
                ep_train_loss += batch_train_loss
                ep_train_acc += batch_train_acc

            except tf.errors.OutOfRangeError:
                # The dataset iterator is exhausted, so the epoch is over.
                break

The train_tf_dataset function requires a lot of tensors as input. In TensorFlow 1.8, you need to keep track of these tensors because they contain all the operations you intend to run. In practice, this can lead to cumbersome code because it is hard to separate concerns.
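The loop above calls `linearly_increasing_batch_size`, whose body is not shown. A minimal sketch of what such a helper might do follows, assuming that an integer `batch_size` means a constant size and a two-element list means "grow linearly from the first value to the second over the epochs". This is a reconstruction for illustration; the actual Rasa helper may differ in detail.

```python
from typing import List, Union


def linearly_increasing_batch_size(
    epoch: int, batch_size: Union[List[int], int], epochs: int
) -> int:
    """Interpolate the batch size for `epoch` between a start and an end value."""
    if not isinstance(batch_size, list):
        # A plain integer means the batch size never changes.
        return int(batch_size)

    start, end = batch_size
    if epochs > 1:
        return int(start + epoch * (end - start) / (epochs - 1))
    return int(start)


schedule = [linearly_increasing_batch_size(ep, [64, 256], 5) for ep in range(5)]
print(schedule)
```

Starting with small batches and ending with large ones is a design choice: early epochs get noisier, more frequent updates, while later epochs get smoother gradient estimates.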

Python Pseudo-Code for TensorFlow 2.x

In TensorFlow 2, all of this has been made much easier because of the Keras abstraction. We can inherit from a Keras class that allows us to compartmentalize the code much better. Here is the `train` method from Rasa’s DIET classifier (see here for the full implementation).

def train(
    self,
    training_data: TrainingData,
    config: Optional[RasaNLUModelConfig] = None,
    **kwargs: Any,
) -> None:
    """Train the embedding intent classifier on a data set."""

    model_data = self.preprocess_train_data(training_data)

    self.model = self.model_class()(
        config=self.component_config,
    )

    self.model.fit(
        model_data,
        self.component_config[EPOCHS],
        self.component_config[BATCH_SIZES],
        self.component_config[EVAL_NUM_EXAMPLES],
        self.component_config[EVAL_NUM_EPOCHS],
        self.component_config[BATCH_STRATEGY],
    )

The object-oriented style of programming from Keras allows us to customize more. We’re able to implement our own `self.model.fit` in such a way that we don’t need to worry about the `session` anymore. We don’t even need to keep track of the tensors because the Keras API abstracts everything away for you.

If you’re interested in the full code, you can find the old loop here and the new loop here.

An Extra Layer of Features

It’s not just the Keras models where we apply this abstraction; we’ve also developed some neural network layers using a similar technique.

We’ve implemented a few custom layers ourselves. For example, we’ve got a layer called `DenseWithSparseWeights`. It behaves just like a dense layer, but we drop many weights beforehand to make it more sparse. Again we only need to inherit from the right class (tf.keras.layers.Dense) to create it.
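The idea of dropping weights beforehand can be sketched in plain Python, without Keras: draw a fixed 0/1 mask once, then multiply the kernel by it on every forward pass. The helper names and the use of a random mask with a `sparsity` fraction are assumptions made for illustration and do not mirror the real layer's code.

```python
import random
from typing import List

Matrix = List[List[float]]


def make_sparsity_mask(rows: int, cols: int, sparsity: float, seed: int = 0) -> Matrix:
    """Return a fixed 0/1 mask in which roughly `sparsity` of the entries are 0."""
    rng = random.Random(seed)
    return [
        [0.0 if rng.random() < sparsity else 1.0 for _ in range(cols)]
        for _ in range(rows)
    ]


def sparse_dense_forward(
    inputs: List[float], kernel: Matrix, mask: Matrix, bias: List[float]
) -> List[float]:
    """Compute inputs @ (kernel * mask) + bias for a single input vector."""
    outputs = []
    for j in range(len(bias)):
        total = bias[j]
        for i, x in enumerate(inputs):
            # Masked-out weights contribute nothing, whatever their value.
            total += x * kernel[i][j] * mask[i][j]
        outputs.append(total)
    return outputs


kernel = [[1.0, 2.0], [3.0, 4.0]]
bias = [0.5, 0.5]
dense_mask = make_sparsity_mask(2, 2, sparsity=0.0)   # keeps every weight
empty_mask = make_sparsity_mask(2, 2, sparsity=1.0)   # drops every weight

print(sparse_dense_forward([1.0, 1.0], kernel, dense_mask, bias))
print(sparse_dense_forward([1.0, 1.0], kernel, empty_mask, bias))
```

Because the mask is created once and reused, the same connections stay dropped for the lifetime of the layer.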

normal dense vs sparse dense model

We’ve grown so fond of customizing that we’ve even implemented a loss function as a layer. This made a lot of sense for us, considering that losses can get complex in NLP. Many NLP tasks will require you to sample such that you also have labels of negative examples during training. You may also need to mask tokens during the process. We’re also interested in recording the similarity loss as well as the label accuracy. By just making our own layer, we are building components for re-use, and it is easy to maintain as well.

custom layer

Lessons Learned

Discovering this opportunity for customization made a massive difference for Rasa. We like to design our algorithms to be flexible and applicable in many circumstances, and we were happy to learn that the underlying technology stack allowed us to do so. We do have some advice for folks who are working on their TensorFlow migration:

  1. Start by thinking about what “lego bricks” you need in your application. This mental design step will make it much easier to recognize how you can leverage existing Keras/TensorFlow objects for your use-case.
  2. It can be tempting to try to immerse yourself by going for a deep dive immediately. Instead, it may help to start from a working example and drill down from there. TensorFlow is not an average Python package, and the internals can get complex. The Python code that you interact with needs to interact with C++ to keep the tensor operations performant. Once the code works, you’re in a much better place to start tuning and optimizing with the performance features of the new TensorFlow version.

How The Trevor Project assesses LGBTQ youth suicide risk with TensorFlow

Posted by Wilson Lee (Machine Learning Engineering Manager at The Trevor Project), Dan Fichter (Head of AI & Engineering at The Trevor Project), Amber Zhang, and Nick Hamatake (Software Engineers at Google)

Introduction

The Trevor Project’s mission is to end suicide among LGBTQ youth. In addition to offering free crisis services through our original phone lifeline (started in 1998), we’ve since expanded to a digital platform, including SMS and web browser-based chat. Unfortunately, there are high-volume times when there are more youth reaching out on the digital platform than there are counselors, and youth have to wait for a counselor to become available. Ideally, youth would be connected with counselors based on their relative risk of attempting suicide, so that those who are at imminent risk of harm would be connected earlier.

As part of the Google AI Impact Challenge, Google.org provided us with a $1.5M grant, Cloud credits, and a Google.org Fellowship, a team of ML, product, and UX specialists who worked full-time pro bono with The Trevor Project for 6 months. The Googlers joined forces with The Trevor Project’s in-house engineering team to apply Natural Language Processing to the crisis contact intake process. As a result, Trevor is now able to connect youth with the help they need faster. And our work together is continuing with the support of a new $1.2M grant as well as a new cohort of Google.org Fellows, who are at The Trevor Project through December helping expand ML solutions.

ML Problem Framing

We framed the problem as a binary text classification problem. The inputs are answers to questions on the intake form that youth complete when they reach out:

  • Have you attempted suicide before? Yes / No
  • Do you have thoughts of suicide? Yes / No
  • How upset are you? [multiple choice]
  • What’s going on? [free text input]
AI Impact Challenge Product Demo

The output is a binary classification: whether to place the youth in the standard queue or a priority queue. As counselors become available, they connect with youth from the priority queue before youth from the standard queue.
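The queueing rule described above can be sketched with two first-in, first-out queues. The `IntakeQueues` class and its method names are hypothetical and only illustrate the ordering; they say nothing about how the production system is built.

```python
from collections import deque
from typing import Deque, Optional


class IntakeQueues:
    """Two FIFO queues; the priority queue is always served first."""

    def __init__(self) -> None:
        self.priority: Deque[str] = deque()
        self.standard: Deque[str] = deque()

    def add_contact(self, contact_id: str, is_priority: bool) -> None:
        queue = self.priority if is_priority else self.standard
        queue.append(contact_id)

    def next_contact(self) -> Optional[str]:
        """Return the contact the next available counselor should connect with."""
        if self.priority:
            return self.priority.popleft()
        if self.standard:
            return self.standard.popleft()
        return None


queues = IntakeQueues()
queues.add_contact("contact-a", is_priority=False)
queues.add_contact("contact-b", is_priority=True)
queues.add_contact("contact-c", is_priority=False)

served = [queues.next_contact() for _ in range(4)]
print(served)
```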

Data

Once a youth connects with a counselor, the counselor performs a clinical risk assessment and records the result. The risk assessment result can be mapped to whether the youth should have been placed in the standard queue or the priority queue. The full transcript of the (digital) conversation is also logged, as are the answers to the intake questions. Thus, the dataset used for training consisted of a mixture of free-form text, binary / multiple-choice features, and human-provided labels.

Fortunately, there are relatively few youth classified as high-risk compared to standard-risk. This resulted in a significant class imbalance which had to be accounted for during training. Another major challenge was low signal-to-noise ratio in the dataset. Different youth could provide very similar responses on the intake form and then be given opposite classifications by counselors after completing in-depth conversations. Various methods of dealing with these issues are detailed later.
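One common way to account for class imbalance is to weight each class inversely to its frequency, so that the rare class contributes as much to the loss as the common one. The sketch below shows that arithmetic; it is a generic technique, and the post does not say whether this exact weighting was used.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Illustrative 9:1 imbalance; the real class ratio is not given in the post.
labels = ["standard"] * 9 + ["priority"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, the summed weight of each class is the same, which is the sense in which the classes are "balanced".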

Because of the extremely sensitive nature of the dataset, special measures were taken to limit its storage, access, and processing. We automatically scrubbed personally identifiable information (PII), such as names and locations, and replaced it with placeholder strings such as “[PERSON_NAME]” or “[LOCATION]”. This means the models were not trained using PII. Access to the scrubbed dataset was limited to the small group of people working on the project, and the data and model were kept strictly within Trevor’s systems and are not accessible to Google.
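The placeholder substitution can be illustrated with a toy scrubber that replaces terms from a known list. This is only a sketch of the replacement step: the `scrub` helper and the term lists are hypothetical, and detecting PII in real text requires far more than a fixed list.

```python
import re
from typing import Dict, Iterable


def scrub(text: str, known_terms: Dict[str, Iterable[str]]) -> str:
    """Replace every known term with its placeholder, matching whole words only."""
    for placeholder, terms in known_terms.items():
        for term in terms:
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)
            # A function replacement avoids any special handling of the placeholder.
            text = pattern.sub(lambda _match, p=placeholder: p, text)
    return text


# Hypothetical example values, not taken from any real dataset.
known = {
    "[PERSON_NAME]": ["Alex"],
    "[LOCATION]": ["Springfield"],
}
scrubbed = scrub("My name is Alex and I live in Springfield.", known)
print(scrubbed)
```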

Metrics

For a binary classification task, we would usually optimize for metrics like precision and recall, or derived measures like F1 score or AUC. For crisis contact classification, however, the metric we need to optimize most for is how long a high-risk youth (one who should be classified into the priority queue) has to wait before connecting with a counselor. To estimate this, we built a queue simulation system that can predict average wait times given a historical snapshot of class balance, quantitative flow of contacts over time, number of counselors available, and the precision and recall of the prediction model.

The simulation was too slow to run during the update step of gradient descent, so we optimized first for proxy metrics such as precision at 80% recall, and precision at 90% recall. We then ran simulations at all points on the precision-recall curve of the resulting model to determine the optimal spot on the curve for minimizing wait time for high-risk youth.
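A proxy metric such as precision at 80% recall can be computed directly from model scores and labels: try every score as a decision threshold, keep the thresholds that reach the required recall, and report the best precision among them. The sketch below uses made-up scores and assumes at least one positive label.

```python
from typing import List, Sequence, Tuple


def precision_recall_points(
    scores: Sequence[float], labels: Sequence[int]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) using each distinct score as a threshold."""
    total_positives = sum(labels)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((threshold, tp / (tp + fp), tp / total_positives))
    return points


def precision_at_recall(
    scores: Sequence[float], labels: Sequence[int], min_recall: float
) -> float:
    """Best precision among thresholds whose recall is at least `min_recall`."""
    return max(
        precision
        for _, precision, recall in precision_recall_points(scores, labels)
        if recall >= min_recall
    )


# Made-up scores for ten contacts; label 1 marks a high-risk contact.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
print(precision_at_recall(scores, labels, min_recall=0.8))
```

The list returned by `precision_recall_points` is also what a queue simulation would iterate over when searching for the operating point with the shortest wait.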

It was also critical to quantify the fairness of the model with respect to the diverse range of demographic and intersectional groups that reach out to Trevor. For each finalized model, we computed false positive and false negative rates broken out by over 20 demographic categories, including intersectionality. We made sure that no demographic group was favored or disfavored by the model more often than the previous system.
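Breaking error rates out by group is mostly bookkeeping. The sketch below computes false positive and false negative rates per group from parallel lists of labels, predictions, and group names; the function name and the toy data are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Sequence


def error_rates_by_group(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Compute false positive and false negative rates separately for each group."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        if truth == 1:
            c["pos"] += 1
            if pred == 0:
                c["fn"] += 1
        else:
            c["neg"] += 1
            if pred == 1:
                c["fp"] += 1

    rates = {}
    for group, c in counts.items():
        rates[group] = {
            # A rate is undefined (NaN) when a group has no examples of that class.
            "false_positive_rate": c["fp"] / c["neg"] if c["neg"] else float("nan"),
            "false_negative_rate": c["fn"] / c["pos"] if c["pos"] else float("nan"),
        }
    return rates


# Toy data with two hypothetical groups "a" and "b".
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "b", "b", "b"]
rates = error_rates_by_group(y_true, y_pred, groups)
print(rates)
```

Comparing these per-group rates between a candidate model and the previous system is one way to check that no group ends up worse off.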

Model Selection

We experimented with bi-LSTM and transformer-based models, as they have been shown to provide state-of-the-art results across a broad range of textual tasks. We tried embedding the textual inputs using GloVe, ELMo, and Universal Sentence Encoder. For transformer-based models, we tried a single-layer transformer network and ALBERT (many transformer layers pre-trained with unlabeled text from the web).

We selected ALBERT for several reasons. It showed the best performance at the high-recall end of the curve where we were most interested. ALBERT allowed us not only to take advantage of massive amounts of pre-training, but also to leverage some of our own unlabeled data to do further pretraining (more on this later). Since ALBERT shares weights between its transformer layers, the model is cheaper to deploy (important for a non-profit organization) and less prone to overfitting (important given the noisiness of our data).
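The effect of sharing weights between layers can be estimated with back-of-the-envelope arithmetic. The sketch below counts only the large weight matrices of a Transformer encoder layer (biases, layer norms, and embeddings are ignored), and the layer count and hidden size are illustrative values rather than the configuration used in this project.

```python
def transformer_layer_params(hidden_size: int, ffn_multiplier: int = 4) -> int:
    """Approximate weight count of one encoder layer (large matrices only)."""
    # Query, key, value, and output projections: four hidden x hidden matrices.
    attention = 4 * hidden_size * hidden_size
    # Feed-forward block: hidden -> ffn and ffn -> hidden.
    feed_forward = 2 * hidden_size * (ffn_multiplier * hidden_size)
    return attention + feed_forward


def encoder_params(num_layers: int, hidden_size: int, share_layers: bool) -> int:
    """Total encoder weights, with or without cross-layer sharing."""
    per_layer = transformer_layer_params(hidden_size)
    return per_layer if share_layers else num_layers * per_layer


unshared = encoder_params(num_layers=12, hidden_size=768, share_layers=False)
shared = encoder_params(num_layers=12, hidden_size=768, share_layers=True)
print(unshared, shared, unshared // shared)
```

Sharing shrinks the number of stored weights by a factor equal to the number of layers, but every layer is still executed, so the computation per example does not shrink with it.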

Training

We trained in a three-step process:

  1. Pre-training: ALBERT is already pre-trained with a large amount of data from the web. We simply loaded a pre-trained model using TF Hub.

    Instructions available here for loading a pre-trained model for text classification.

  2. Further pre-training: Since ALBERT’s language model is based on generic Web data, we pre-trained it further using our own in-domain, unlabeled data. This included anonymized text from chat transcripts as well as from forum posts on TrevorSpace, The Trevor Project’s safe-space social networking site for LGBTQ youth. Although the unlabeled data is not labeled for suicide risk, it comes from real youth in our target demographics and is therefore linguistically closer to our labeled dataset than ALBERT’s generic web corpora are. We found that this increased model performance significantly.

    Instructions available here for checkpoint management strategies.

  3. Fine-tuning: We fine-tuned the model using our hand-labeled training data. We initially used ALBERT just to encode the textual response to “What’s going on” and used one-hot vectors to encode the responses to the binary and multiple-choice questions. We then tried converting everything to text and using ALBERT to encode everything. Specifically, instead of encoding the Yes / No answer to a question like “Do you have thoughts of suicide?” as a one-hot vector, we prepended something like “[ counselor] Do you have thoughts of suicide? [ youth] No” to the textual response to “What’s going on?” This yielded significant improvements in performance.

    Instructions available here for encoding with BERT tokenizer.
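The "convert everything to text" idea from the fine-tuning step can be sketched as simple string assembly. The `build_model_input` helper is hypothetical; the speaker markers follow the example quoted above, and the exact template used in the project is not specified in the post.

```python
from typing import List, Tuple


def build_model_input(answered_questions: List[Tuple[str, str]], free_text: str) -> str:
    """Prepend each structured question and answer, as text, to the free-text response."""
    turns = [
        f"[ counselor] {question} [ youth] {answer}"
        for question, answer in answered_questions
    ]
    return " ".join(turns + [free_text])


model_input = build_model_input(
    [
        ("Have you attempted suicide before?", "No"),
        ("Do you have thoughts of suicide?", "No"),
    ],
    "I had a really hard week at school.",
)
print(model_input)
```

The resulting single string is what would then be passed to the tokenizer, so the encoder sees the structured answers in the same way it sees the free text.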

Optimization

We did some coarse parameter selection (learning rate and batch size) using manual trials. We also used Keras Tuner to refine the parameter space further. Because Keras Tuner is model-agnostic, we were able to use a similar tuning script for each of our model classes. For the LSTM-based models, we also used Keras Tuner to decide which kind of embeddings to use.

Normally, we would train with as large a batch size as would fit on a GPU, but in this case we found better performance with fairly small batch sizes (~8 examples). We theorize that this is because the data has so much noise that it tends to regularize itself. This self-regularization effect is more pronounced in small batches.

Instructions available here for setting up hyperparameter trials.

Conclusion

We trained a text-based model to prioritize at-risk youth seeking crisis services. The model outperformed a baseline classifier that only used responses from several multiple-choice intake questions as features. The NLP model was also shown to have less bias than the baseline model. Some of the highest-impact ingredients of the final model were 1) using in-domain unlabeled data to further pretrain an off-the-shelf ALBERT model, 2) encoding multiple-choice responses as full text, which is in turn encoded by ALBERT, and 3) tuning hyperparameters using intuition about our specific dataset in addition to standard search methods.

Despite the success of the model, there are some limitations. The intake questions that produced our dataset were not extremely well-correlated with the results of the expert risk assessments that made up our training labels. This resulted in a low signal-to-noise ratio in our training dataset. More non-ML work could be done in the future to elicit more high-signal responses from youth in the intake process.

We’d like to acknowledge all of the teams and individuals who contributed to this project: Google.org and the Google.org Fellows, The Trevor Project’s entire engineering and data science team, as well as many hours of review and input from Trevor’s crisis service and clinical staff.

You can support our work by donating at TheTrevorProject.org/Donate. Your life-saving gift can help us expand our advocacy efforts, train a record number of crisis counselors, and provide all of our crisis services 24/7.

What’s new in TensorFlow 2.4?

Posted by Goldie Gadde and Nikita Namjoshi for the TensorFlow Team

TF 2.4 is here! With increased support for distributed training and mixed precision, a new NumPy frontend, and tools for monitoring and diagnosing bottlenecks, this release is all about new features and enhancements for performance and scaling.

New Features in tf.distribute

Parameter Server Strategy

In 2.4, the tf.distribute module introduces experimental support for asynchronous training of models with ParameterServerStrategy and custom training loops. Like MultiWorkerMirroredStrategy, ParameterServerStrategy is a multi-worker data parallelism strategy; however, the gradient updates are asynchronous.

A parameter server training cluster consists of workers and parameter servers. Variables are created on parameter servers and then read and updated by workers during each step. The reading and updating of variables happens independently across the workers without any synchronization. Because the workers do not depend on one another, this strategy has the benefit of worker fault tolerance and is useful if you use preemptible VMs.

To get started with this strategy, check out the Parameter Server Training tutorial. This tutorial shows you how to set up ParameterServerStrategy and define a training step, and explains how to use the ClusterCoordinator class to dispatch the execution of training steps to remote workers.

Multi Worker Mirrored Strategy

MultiWorkerMirroredStrategy has moved out of experimental and is now part of the stable API. Like its single worker counterpart, MirroredStrategy, MultiWorkerMirroredStrategy implements distributed training with synchronous data parallelism. However, as the name suggests, with MultiWorkerMirroredStrategy you can train across multiple machines, each with potentially multiple GPUs.

In synchronous training, each worker computes the forward and backward passes on different slices of the input data, and the gradients are aggregated before updating the model. For this aggregation, known as an all-reduce, MultiWorkerMirroredStrategy uses CollectiveOps to keep variables in sync. A collective op is a single op in the TensorFlow graph that can automatically choose an all-reduce algorithm in the TensorFlow runtime according to hardware, network topology, and tensor sizes.
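The aggregation step can be pictured with a plain-Python stand-in: every worker contributes a gradient vector, the vectors are combined element-wise, and every replica applies the same update. This sketch is not how CollectiveOps are implemented, and the choice of averaging (rather than summing) is an assumption made for the illustration.

```python
from typing import List


def all_reduce_mean(worker_gradients: List[List[float]]) -> List[float]:
    """Average the gradient vectors from all workers element by element."""
    num_workers = len(worker_gradients)
    return [sum(values) / num_workers for values in zip(*worker_gradients)]


def synchronous_step(
    weights: List[float], worker_gradients: List[List[float]], learning_rate: float
) -> List[float]:
    """Apply one SGD update using the aggregated gradient."""
    averaged = all_reduce_mean(worker_gradients)
    return [w - learning_rate * g for w, g in zip(weights, averaged)]


# Two workers, each with gradients computed on its own slice of the batch.
gradients = [[1.0, 2.0], [3.0, 4.0]]
new_weights = synchronous_step([1.0, 1.0], gradients, learning_rate=0.5)
print(all_reduce_mean(gradients), new_weights)
```

Because every worker applies the identical aggregated gradient, the model copies stay in sync after each step.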

Graph of TF GPUs and CPU

To get started with MultiWorkerMirroredStrategy, check out the Multi-worker training with Keras tutorial, which has been updated with details on dataset sharding, saving/loading models trained with a distribution strategy, and failure recovery with the BackupAndRestore callback.

If you are new to distributed training and want to learn how to get started, or you’re interested in distributed training on GCP, see this blog post for an introduction to the key concepts and steps.

Updates in Keras

Mixed Precision

In TensorFlow 2.4, the Keras mixed precision API has moved out of experimental and is now a stable API. Most TensorFlow models use the float32 dtype; however, there are lower-precision types such as float16 that use less memory. Mixed precision is the use of 16-bit and 32-bit floating point types in the same model for faster training. This API can improve model performance by 3x on GPUs and 60% on TPUs.

To make use of the mixed precision API, you must use Keras layers and optimizers, but it’s not necessary to use other Keras classes such as models or losses. If you’re curious to learn how to take advantage of this API for better performance, check out the Mixed Precision tutorial.
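One reason mixed precision keeps some values in float32 is that float16 cannot represent small changes to large numbers. The sketch below uses only the standard library's `struct` module to round values to half precision; the weight and update values are arbitrary illustrations.

```python
import struct


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


weight = 2048.0
update = 0.5

# In float16, representable values near 2048 are 2 apart, so the update is lost.
half_result = to_float16(to_float16(weight) + to_float16(update))

# Python floats are 64-bit, which keeps the update here, as float32 would.
full_result = weight + update

print(half_result, full_result)
```

In Keras, mixed precision is typically switched on with `tf.keras.mixed_precision.set_global_policy("mixed_float16")`, which computes in float16 while keeping variables in float32.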

Optimizers

This release includes refactoring the tf.keras.optimizers.Optimizer class, enabling users of model.fit or custom training loops to write training code that works with any optimizer. All built-in tf.keras.optimizers.Optimizer subclasses now accept gradient_transformers and gradient_aggregator arguments, allowing you to easily define custom gradient transformations.

With the refactor, you can now pass a loss tensor directly to Optimizer.minimize when writing custom training loops:

tape = tf.GradientTape()
with tape:
    y_pred = model(x, training=True)
    # Keras losses take the true labels first, then the predictions.
    loss = loss_fn(y_true, y_pred)

# You can pass in the `tf.GradientTape` when using a loss `Tensor` as shown below.

optimizer.minimize(loss, model.trainable_variables, tape=tape)

These changes are intended to make both Model.fit and custom training loops more agnostic to optimizer details, allowing you to write training code that works with any optimizer without modification.

Functional API model construction internal improvements

Lastly, TensorFlow 2.4 includes a major refactoring of the internals of the Keras Functional API, improving the memory consumption of functional model construction and simplifying triggering logic. This refactoring also ensures TensorFlowOpLayers behave predictably and work with CompositeTensor type signatures.

Introducing tf.experimental.numpy

TensorFlow 2.4 introduces experimental support for a subset of NumPy APIs, available as tf.experimental.numpy. This module enables you to run NumPy code, accelerated by TensorFlow. Because it is built on top of TensorFlow, this API interoperates seamlessly with TensorFlow, allowing access to all of TensorFlow’s APIs and providing optimized execution using compilation and auto-vectorization. For example, TensorFlow ND arrays can interoperate with NumPy functions, and similarly TensorFlow NumPy functions can accept inputs of different types including tf.Tensor and np.ndarray.

import tensorflow as tf
import tensorflow.experimental.numpy as tnp

# Use NumPy code in input pipelines

dataset = tf.data.Dataset.from_tensor_slices(
    tnp.random.randn(1000, 1024)).map(
    lambda z: z.clip(-1,1)).batch(100)

# Compute gradients through NumPy code

def grad(x, wt):
    with tf.GradientTape() as tape:
        tape.watch(wt)
        output = tnp.dot(x, wt)
        output = tf.sigmoid(output)
    return tape.gradient(tnp.sum(output), wt)

You can learn more about how to use this API in the NumPy API on TensorFlow guide.

New Profiler Tools

MultiWorker Support in TensorFlow Profiler

The TensorFlow Profiler is a suite of tools you can use to measure the training performance and resource consumption of your TensorFlow models. The TensorFlow Profiler helps you understand the hardware resource consumption of the ops in your model, diagnose bottlenecks, and ultimately train faster.

Previously, the TensorFlow Profiler supported monitoring multi-GPU, single host training jobs. In 2.4 you can now profile MultiWorkerMirroredStrategy training jobs. For example, you can use the sampling mode API to perform on demand profiling and connect to the same server:port in use by MultiWorkerMirroredStrategy workers:


# Start a profiler server before your model runs.

tf.profiler.experimental.server.start(6009)

# Model code goes here....

# E.g. your worker IP addresses are 10.0.0.2, 10.0.0.3, 10.0.0.4, and you
# would like to profile for a duration of 2 seconds. The profiling data will
# be saved to the Google Cloud Storage path "your_tb_logdir".

tf.profiler.experimental.client.trace(
    'grpc://10.0.0.2:6009,grpc://10.0.0.3:6009,grpc://10.0.0.4:6009',
    'gs://your_tb_logdir',
    2000)

Alternatively, you can use the TensorBoard profile plugin by providing the worker addresses to the Capture Profile tool.

After profiling, you can use the new Pod Viewer tool to choose a training step and view its step-time category breakdown across all workers.

TensorBoard preview

For more information on how to use the TensorFlow Profiler, check out the newly released GPU Performance Guide. This guide shows common scenarios you might encounter when you profile your model training job and provides a debugging workflow to help you get better performance, whether you’re training with one GPU, multiple GPUs, or multiple machines.

TFLite Profiler

The TFLite Profiler enables tracing TFLite internals in Android to identify performance bottlenecks. The TFLite Performance Measurement Guide shows you how to add trace events, enable TFLite tracing, and capture traces with both the Android Studio CPU Profiler and the System Tracing app.

Example trace using the Android System Tracing app

New Features for GPU Support

TensorFlow 2.4 runs with CUDA 11 and cuDNN 8, enabling support for the newly available NVIDIA Ampere GPU architecture. To learn more about CUDA 11 features, check out this NVIDIA developer blog.

Additionally, support for TensorFloat-32 on Ampere-based GPUs is enabled by default. TensorFloat-32, or `TF32` for short, is a math mode for NVIDIA Ampere GPUs that causes certain float32 ops, such as matrix multiplications and convolutions, to run much faster on Ampere GPUs but with reduced precision. To learn more, see the documentation for tf.config.experimental.enable_tensor_float_32_execution.
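To get a feel for what reduced precision means here, the sketch below emulates the TF32 number format in plain Python by keeping the sign bit, the 8 exponent bits, and only the top 10 mantissa bits of a float32 value. It truncates the remaining bits, which is an assumption made for simplicity; the hardware's rounding may differ, and the sketch does not model where inside an op the reduced precision is applied.

```python
import struct


def truncate_to_tf32(value: float) -> float:
    """Keep sign, 8 exponent bits, and the top 10 of float32's 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits &= 0xFFFFE000  # Zero out the 13 least significant mantissa bits.
    return struct.unpack("<f", struct.pack("<I", bits))[0]


fine_step = 1.0 + 2 ** -12    # representable in float32, too fine for 10 mantissa bits
coarse_step = 1.0 + 2 ** -10  # smallest step above 1.0 that 10 mantissa bits can hold

print(truncate_to_tf32(fine_step))
print(truncate_to_tf32(coarse_step))
```

If full float32 precision is required, the mode can be turned off with `tf.config.experimental.enable_tensor_float_32_execution(False)`.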

Next steps

Check out the release notes for more information. To stay up to date, you can read the TensorFlow blog, follow twitter.com/tensorflow, or subscribe to youtube.com/tensorflow. If you’ve built something you’d like to share, please submit it for our Community Spotlight at goo.gle/TFCS. For feedback, please file an issue on GitHub. Thank you!

Making BERT Easier with Preprocessing Models From TensorFlow Hub

Posted by Arno Eigenwillig, Software Engineer and Luiz GUStavo Martins, Developer Advocate

BERT and other Transformer encoder architectures have been very successful in natural language processing (NLP) for computing vector-space representations of text, both in advancing the state of the art in academic benchmarks as well as in large-scale applications like Google Search. BERT has been available for TensorFlow since it was created, but originally relied on non-TensorFlow Python code to transform raw text into model inputs.

Today, we are excited to announce a more streamlined approach to using BERT built entirely in TensorFlow. This solution makes both pre-trained encoders and the matching text preprocessing models available on TensorFlow Hub. BERT in TensorFlow can now be run on text inputs with just a few lines of code:

import tensorflow_hub as hub
import tensorflow_text  # Registers the TF.text ops used by the preprocessing model.

# Load BERT and the preprocessing model from TF Hub.
preprocess = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/1')
encoder = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3')

# Use BERT on a batch of raw text inputs.
encoder_inputs = preprocess(['Batch of inputs', 'TF Hub makes BERT easy!', 'More text.'])
pooled_output = encoder(encoder_inputs)["pooled_output"]
print(pooled_output)

tf.Tensor(
[[-0.8384154 -0.26902363 -0.3839138 ... -0.3949695 -0.58442086 0.8058556 ]
[-0.8223734 -0.2883956 -0.09359277 ... -0.13833837 -0.6251748 0.88950026]
[-0.9045408 -0.37877116 -0.7714909 ... -0.5112085 -0.70791864 0.92950743]],
shape=(3, 768), dtype=float32)

These encoder and preprocessing models have been built with TensorFlow Model Garden’s NLP library and exported to TensorFlow Hub in the SavedModel format. Under the hood, preprocessing uses TensorFlow ops from the TF.text library to do the tokenization of input text – allowing you to build your own TensorFlow model that goes from raw text inputs to prediction outputs without Python in the loop. This accelerates the computation, removes boilerplate code, is less error prone, and enables the serialization of the full text-to-outputs model, making BERT easier to serve in production.
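To show what a preprocessing step hands to the encoder, here is a toy, pure-Python imitation that produces the three inputs BERT encoders expect: `input_word_ids`, `input_mask`, and `input_type_ids`. The tiny vocabulary, the whitespace tokenizer, and the `toy_preprocess` name are all hypothetical simplifications; the real models use wordpiece tokenization implemented with TF.text ops.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; real BERT vocabularies hold tens of thousands of wordpieces.
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "tf": 4, "hub": 5, "makes": 6, "bert": 7, "easy": 8,
}


def toy_preprocess(sentences: List[str], seq_length: int = 8) -> Dict[str, List[List[int]]]:
    """Turn raw sentences into fixed-length id, mask, and segment lists."""
    batch: Dict[str, List[List[int]]] = {
        "input_word_ids": [], "input_mask": [], "input_type_ids": [],
    }
    for sentence in sentences:
        # Leave room for the [CLS] and [SEP] markers, truncating long inputs.
        words = sentence.lower().split()[: seq_length - 2]
        tokens = ["[CLS]"] + words + ["[SEP]"]
        ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
        padding = seq_length - len(ids)

        batch["input_word_ids"].append(ids + [VOCAB["[PAD]"]] * padding)
        batch["input_mask"].append([1] * len(ids) + [0] * padding)  # 1 = real token
        batch["input_type_ids"].append([0] * seq_length)            # single segment
    return batch


features = toy_preprocess(["TF Hub makes BERT easy"])
for name, values in features.items():
    print(name, values)
```

Every list has the same fixed length, which is what lets the encoder process a batch of sentences of different lengths in one call.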

To show in more detail how these models can help you, we’ve published two new tutorials:

  • The beginner tutorial solves a sentiment analysis task and doesn’t need any special customization to achieve great model quality. It’s the easiest way of using BERT and a preprocessing model.
  • The advanced tutorial solves NLP classification tasks from the GLUE benchmark, running on TPU. It also shows how to use the preprocessing model in situations where you need multi-segment input.
BERT Model

Choosing a BERT model

BERT models are pre-trained on a large corpus of text (for example, an archive of Wikipedia articles) using self-supervised tasks like predicting words in a sentence from the surrounding context. This type of training allows the model to learn a powerful representation of the semantics of the text without needing labeled data. However, it also takes a significant amount of computation to train – 4 days on 16 TPUs (as reported in the 2018 BERT paper). Fortunately, after this expensive pre-training has been done once, we can efficiently reuse this rich representation for many different tasks.

TensorFlow Hub offers a variety of BERT and BERT-like models:

  • Eight BERT models come with the trained weights released by the original BERT authors.
  • 24 Small BERTs have the same general architecture but fewer and/or smaller Transformer blocks, which lets you explore tradeoffs between speed, size and quality.
  • ALBERT: these are four different sizes of “A Lite BERT” that reduces model size (but not computation time) by sharing parameters between layers.
  • The 8 BERT Experts all have the same BERT architecture and size but offer a choice of different pre-training domains and intermediate fine-tuning tasks, to align more closely with the target task.
  • Electra has the same architecture as BERT (in three different sizes), but gets pre-trained as a discriminator in a set-up that resembles a Generative Adversarial Network (GAN).
  • BERT with Talking-Heads Attention and Gated GELU [base, large] has two improvements to the core of the Transformer architecture.
  • Lambert has been trained with the LAMB optimizer and several techniques from RoBERTa.
  • … and more to come.

These models are BERT encoders. The links above take you to their documentation on TF Hub, which refers to the right preprocessing model for use with each of them.

We encourage developers to visit these model pages to learn more about the different applications targeted by each model. Thanks to their common interface, it’s easy to experiment and compare the performance of different encoders on your specific task by changing the URLs of the encoder model and its preprocessing.

The Preprocessing model

For each BERT encoder, there is a matching preprocessing model. It transforms raw text to the numeric input tensors expected by the encoder, using TensorFlow ops provided by the TF.text library. Unlike preprocessing with pure Python, these ops can become part of a TensorFlow model for serving directly from text inputs. Each preprocessing model from TF Hub is already configured with a vocabulary and its associated text normalization logic and needs no further set-up.

We’ve already seen the simplest way of using the preprocessing model above. Let’s look again more closely:

preprocess = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/1')
input = preprocess(["This is an amazing movie!"])

{'input_word_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[ 101, 2023, 2003, 2019, 6429, 3185, 999, 102, 0, ...]])>,
'input_mask': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[ 1, 1, 1, 1, 1, 1, 1, 1, 0, ...,]])>,
'input_type_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, ...,]])>}

Calling preprocess() like this transforms raw text inputs into a fixed-length input sequence for the BERT encoder. You can see that it consists of a tensor input_word_ids with numerical ids for each tokenized input, including start, end and padding tokens, plus two auxiliary tensors: an input_mask (that tells non-padding from padding tokens) and input_type_ids for each token (that can distinguish multiple text segments per input, which we will discuss below).

The same preprocessing SavedModel also offers a second, more fine-grained API, which supports putting one or two distinct text segments into one input sequence for the encoder. Let’s look at a sentence entailment task, in which BERT is used to predict if a premise entails a hypothesis or not:

text_premises = ["The fox jumped over the lazy dog.",
                 "Good day."]
tokenized_premises = preprocess.tokenize(text_premises)

<tf.RaggedTensor
[[[1996], [4419], [5598], [2058], [1996], [13971], [3899], [1012]],
[[2204], [2154], [1012]]]>


text_hypotheses = ["The dog was lazy.",  # Entailed.
                   "Axe handle!"]        # Not entailed.
tokenized_hypotheses = preprocess.tokenize(text_hypotheses)

<tf.RaggedTensor
[[[1996], [3899], [2001], [13971], [1012]],
[[12946], [5047], [999]]]>

The result of each tokenization is a RaggedTensor of numeric token ids, representing each of the text inputs in full. If some pairs of premise and hypothesis are too long to fit within the seq_length for BERT inputs in the next step, you can do additional preprocessing here, such as trimming the text segment or splitting it into multiple encoder inputs.

The tokenized input then gets packed into a fixed-length input sequence for the BERT encoder:

encoder_inputs = preprocess.bert_pack_inputs(
    [tokenized_premises, tokenized_hypotheses],
    seq_length=18)  # Optional argument, defaults to 128.

{'input_word_ids': <tf.Tensor: shape=(2, 18), dtype=int32, numpy=
array([[ 101, 1996, 4419, 5598, 2058, 1996, 13971, 3899, 1012,
102, 1996, 3899, 2001, 13971, 1012, 102, 0, 0],
[ 101, 2204, 2154, 1012, 102, 12946, 5047, 999, 102,
0, 0, 0, 0, 0, 0, 0, 0, 0]])>,
'input_mask': <tf.Tensor: shape=(2, 18), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>,
'input_type_ids': <tf.Tensor: shape=(2, 18), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>}

The result of packing is the already-familiar dict of input_word_ids, input_mask and input_type_ids (which are 0 and 1 for the first and second input, respectively). All outputs have a common seq_length (128 by default). Inputs that would exceed seq_length are truncated to approximately equal sizes during packing.
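To make the packing layout concrete, here is a minimal pure-Python sketch that reproduces the three output tensors for the entailment example. It is an illustration only, not the implementation behind bert_pack_inputs: the function name pack_two_segments is hypothetical, the token ids are flattened into plain lists, the special-token ids (101, 102, 0) are read off the outputs shown above, and truncation is deliberately left out.

```python
# Special-token ids as they appear in the example outputs above.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pack_two_segments(first, second, seq_length=128):
    """Pack two lists of token ids into fixed-length BERT-style inputs.

    Hypothetical helper for illustration; it does not truncate.
    """
    word_ids = [CLS_ID] + first + [SEP_ID] + second + [SEP_ID]
    # Segment 0 covers the start token, the first text and its end token;
    # segment 1 covers the second text and its end token.
    type_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    if len(word_ids) > seq_length:
        raise ValueError("truncation is not handled in this sketch")
    padding = seq_length - len(word_ids)
    return {
        "input_word_ids": word_ids + [PAD_ID] * padding,
        "input_mask": [1] * len(word_ids) + [0] * padding,
        "input_type_ids": type_ids + [0] * padding,
    }


premise = [1996, 4419, 5598, 2058, 1996, 13971, 3899, 1012]
hypothesis = [1996, 3899, 2001, 13971, 1012]
packed = pack_two_segments(premise, hypothesis, seq_length=18)
for name, values in packed.items():
    print(name, values)
```

With seq_length=18 the first pair fills 16 positions and receives two padding tokens, which matches the first row of each tensor above.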

Accelerating model training

TensorFlow Hub provides BERT encoder and preprocessing models as separate pieces to enable accelerated training, especially on TPUs.

Tensor Processing Units (TPUs) are Google’s custom-developed accelerator hardware that excels at large scale machine learning computations such as those required to fine-tune BERT. TPUs operate on dense Tensors and expect that variable-length data like strings has already been transformed into fixed-size Tensors by the host CPU.

The split between the BERT encoder model and its associated preprocessing model enables distributing the encoder fine-tuning computation to TPUs as part of model training, while the preprocessing model executes on the host CPU. The preprocessing computation can be run asynchronously on a dataset using tf.data.Dataset.map() with dense outputs ready to be consumed by the encoder model on the TPU. Asynchronous preprocessing like this can improve performance with other accelerators as well.

Our advanced BERT tutorial can be run in a Colab runtime that uses a TPU worker and demonstrates this end-to-end.

Summary

Using BERT and similar models in TensorFlow has just gotten simpler. TensorFlow Hub makes available a large collection of pre-trained BERT encoders and text preprocessing models that are easy to use in just a few lines of code.

Take a look at our interactive beginner and advanced tutorials to learn more about how to use the models for sentence and sentence-pair classification. Let us know what you build with these new BERT models and tag your posts with #TFHub.

Acknowledgements:

We’d like to thank a number of colleagues for their contribution to this work.

The new preprocessing models have been created in collaboration with Chen Chen, Terry Huang, Mark Omernick and Rajagopal Ananthanarayanan.

Additional BERT models have been published to TF Hub on this occasion by Sebastian Ebert (Small BERTs), Le Hou and Hongkun Yu (Lambert, Talking Heads).

Mark Daoust, Josh Gordon and Elizabeth Kemp have greatly improved the presentation of the material in this post and the associated tutorials.

Meet our TensorFlow AI Service Partners

Posted by Amy Hsueh, TensorFlow Partnerships Lead, and Sandeep Gupta, TensorFlow Product Manager

Implementing machine learning solutions can help businesses innovate, but it can be a challenge if companies don’t have the knowledge, experience, or resources in-house to get started.

That’s where our TensorFlow AI Service Partners may be able to help. We’ve selected AI/ML practitioners who have experience helping businesses implement AI/ML and TensorFlow-based solutions. We hope that these partners can help more enterprises benefit from AI-based systems and innovate faster, solve smarter, and scale bigger.

“Service partners are critical in driving large adoption of AI in the enterprise”, said Kemal El Moujahid, Director of Product Management for TensorFlow, “so we are excited to partner with some of the leading AI Service companies, who excel at building powerful solutions with TensorFlow. The breadth and diversity of business problems that these companies solve for their customers are incredible, and we are looking forward to seeing even more real-world impact with TensorFlow.”

Our selected partners are experienced in creating a range of consulting and software solutions powered by TensorFlow and other frameworks that span across the machine learning workflow, including preparing and ingesting data, training and optimizing models, and productionizing them.

TensorFlow AI Service Partners share their insights and product feedback with the TensorFlow team, helping us make enhancements and improvements that address enterprise ML needs.

“We are thrilled to be partnering with companies that have deep TensorFlow expertise and demonstrated track records in solving their customer’s business critical needs,” remarks Sarah Sirajuddin, Engineering Director of TensorFlow, “We value user feedback tremendously, and believe that the feedback and insights gathered from these partners will help us improve TensorFlow for all our users.”

Browse our partners, who vary in geographic reach and industry specialization and all have demonstrated expertise in TensorFlow, on our website. You can also hear from them directly about why they are excited about this program on their respective blogs: Determined AI, Labelbox, Paperspace, SpringML, Stradigi AI, Quantiphi.

We look forward to seeing this program grow and adding additional partners in the future. If you’re interested in becoming a partner, check out our application guide.

Getting Started with Distributed TensorFlow on GCP

Posted by Nikita Namjoshi, Machine Learning Solutions Engineer

For many in the world of data science, distributed training can seem a daunting task. In addition to building and thoughtfully evaluating a high-quality ML model, you have to be aware of how to optimize your model for specific hardware and manage infrastructure. The latter skills are not often included in a data scientist’s toolkit. However, with the help of managed services on the Google Cloud Platform (GCP), you can easily scale your model training job to multiple accelerators or even multiple machines, with no GPU expertise required.

In this tutorial-style article, you’ll get hands-on experience with GCP data science tools and train a TensorFlow model across multiple GPUs. You’ll also learn key terminology in the field of distributed training, such as data parallelism, synchronous training, and AllReduce.

Data parallelism chart
Data parallelism is one of the concepts you will learn about in this article.

Why Distributed Training?

Every data scientist and machine learning engineer has experienced the agony of sitting and waiting for a model to train. Even if you have access to a GPU, with a large dataset it can take days for a large deep learning model to converge. Using the right hardware configuration can reduce training time to hours, or even minutes. And a shorter training time makes for faster iteration to reach your modeling goals.

If you have a GPU available, TensorFlow will use it automatically with no code changes required. Similarly, TensorFlow can make use of multiple CPU cores out of the box. However, if you want to train with two or more GPUs then you’ll have to do a bit of extra work. This extra work is necessary because TensorFlow needs to know how to coordinate the training process across the multiple GPUs in your runtime. Fortunately, with the tf.distribute module, you have access to different distributed training strategies that you can easily incorporate into your program.

When doing distributed training, it’s important to be clear on the distinction between machines and devices. A device refers to a CPU or accelerator, such as GPUs or TPUs, on some machine that TensorFlow can run operations on. The focus in this article will be training with a single machine that has multiple GPU devices, but the tf.distribute.Strategy API also provides support for multi-worker training. In a multi-worker set up, the training is distributed across multiple machines. These machines can be CPU only, or have one or more GPU devices each.

Single GPU Training

In the following Colab notebook, you’ll find the code to train a ResNet50 architecture on the Cassava dataset. If you execute the cells in the notebook and train the model, you’ll notice that the number of steps taken in each epoch is 89, and each epoch takes around 100 seconds. Make note of these numbers; we will come back to them later.

Multi-GPU Training

You can access a single GPU in Colab, but you are out of luck if you want to use multiple GPUs. Moreover, while a Colab notebook is great for quick experimentation, you’ll likely want a more secure and reliable setup that offers you more control over your environment. For that, you can turn to the cloud.

There are many different ways to do distributed training on GCP. Picking the best option for your use case will likely involve different considerations if you are a student/researcher running experiments, versus an engineer at a company training models in a production workflow.

In this article you will use the GCP AI Platform Notebooks. This path provides an easy approach to distributed training and also gives you a chance to explore a managed notebook environment running on GCP. As an alternative, if you already have a local environment set up and are looking for a hassle free transition between your local and GCP environments, you can check out the TensorFlow Cloud library. TensorFlow Cloud can automate many of the steps described in this article; however, we will walk through the steps here so you can get a deeper understanding of the key concepts involved in distributed training.

In the following section, you’ll learn how to modify the single GPU training code using the tf.distribute.Strategy API. The resulting code will be cloud platform agnostic so you could run it in a different environment without any changes. You can also run the same code on your own hardware.

Prepare Code for Distributed Training

The first step in using the tf.distribute.Strategy API is to instantiate your strategy. In this tutorial, you will use MirroredStrategy, which is one of several distribution strategies available in TensorFlow.

strategy = tf.distribute.MirroredStrategy()

Next, you need to wrap the creation of your model parameters within the scope of the strategy. This step is crucial because it tells MirroredStrategy which variables to mirror across your GPU devices.

with strategy.scope():
    model = create_model()
    model.compile(
        loss='sparse_categorical_crossentropy',
        optimizer=tf.keras.optimizers.Adam(0.0001),
        metrics=['accuracy'])

Before we run the updated code, let’s take a brief look at what will actually happen when we call model.fit and how training will differ now that we have added a strategy. For the sake of simplicity, imagine you have a simple linear model instead of the ResNet50 architecture. In TensorFlow, you can think of this simple model in terms of its computational graph.

In the image below, you can see that the matmul op takes in the X and W tensors, which are the training batch and weights respectively. The resulting tensor is then passed to the add op with the tensor b, which is the model’s bias terms. The result of this op is Ypred, which is the model’s predictions.

Chart of matmul op taking the X and W tensors

We want a way of executing this computational graph such that we can leverage two GPUs. There are multiple different ways we can achieve this. For example, you could put different layers of your model on different machines or devices, which is one flavor of model parallelism. Alternatively, you could distribute your dataset such that each device processes a portion of the input batch on each training step with the same model, which is known as data parallelism. Or you might do a combination of both. Data parallelism is the most common (and easiest) approach, and that’s what we’ll do here.

The next image shows an example of data parallelism. The input batch X is split in half, and one slice is sent to GPU 0 and the other to GPU 1. In this case, each GPU calculates the same ops but on different slices of the data.

Data parallelism chart

MirroredStrategy is a data parallelism strategy. So when we call model.fit, MirroredStrategy will make a copy (known as a replica) of the ResNet50 model on both of the GPUs. The CPU (host) is responsible for preparing the tf.data.Dataset batches and sending the data to the GPUs (devices).

The subsequent gradient updates will happen in a synchronous manner. This means that each worker device computes the forward and backward passes through the model on a different slice of the input data. The computed gradients from each of these slices are then aggregated across all of the devices and reduced (usually by averaging) in a process known as AllReduce. The optimizer then performs the parameter updates with these reduced gradients, thereby keeping the devices in sync. Because each worker cannot proceed to the next training step until all the other workers have finished the current step, this gradient synchronization becomes the main overhead in distributed training for synchronous strategies.
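The synchronous step described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not how MirroredStrategy is implemented: the model is a one-parameter linear model, the loss is mean squared error, the two "replicas" are just two halves of a list, and the helper names local_gradients and all_reduce_mean are hypothetical.

```python
def local_gradients(w, b, xs, ys):
    """Gradients of the mean squared error for y_pred = w * x + b on one slice."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


def all_reduce_mean(per_replica_grads):
    """Average each gradient component across all replicas."""
    num_replicas = len(per_replica_grads)
    num_components = len(per_replica_grads[0])
    return tuple(
        sum(grads[i] for grads in per_replica_grads) / num_replicas
        for i in range(num_components)
    )


# One global batch, generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = 0.0, 0.0

# Each replica holds the same parameters and sees half of the batch.
half = len(xs) // 2
per_replica = [
    local_gradients(w, b, xs[:half], ys[:half]),
    local_gradients(w, b, xs[half:], ys[half:]),
]
reduced = all_reduce_mean(per_replica)
full_batch = local_gradients(w, b, xs, ys)

# Every replica applies the same reduced gradients, so parameters stay in sync.
learning_rate = 0.01
w, b = w - learning_rate * reduced[0], b - learning_rate * reduced[1]
print(per_replica, reduced, full_batch)
```

Because both slices have the same size, the averaged gradients equal the gradients computed on the full batch, which is why the replicas end each step with identical parameters.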

While MirroredStrategy is a synchronous strategy, data parallelism strategies can also be asynchronous. In an asynchronous data parallelism strategy, each worker computes the gradients from a slice of the input data and makes updates to the parameters in an asynchronous fashion. Compared to synchronous strategies, asynchronous training has the benefit of fault tolerance because the workers are not dependent on one another, but can result in stale gradients. You can learn more about asynchronous training by experimenting with the TensorFlow Parameter Server Strategy.

With the two easy steps of instantiating MirroredStrategy, and then wrapping your model creation within the strategy scope, TensorFlow will do the heavy lifting of distributing your training job across your GPUs through data parallelism and synchronous gradient updates.

The last change you will want to make is to the batch size.

BATCH_SIZE = 64 * strategy.num_replicas_in_sync

Recall that in the single GPU case, the batch size was 64. This means that on each step of model training, 64 images were processed, and the number of resulting steps in each epoch was the total dataset size / batch size, which we noted previously as 89.

When you do distributed training with the tf.distribute.Strategy API and tf.data, the batch size now refers to the global batch size. In other words, if you pass a batch size of 10, and you have two GPUs, then each GPU will process 5 examples per step. In this case, 10 is known as the global batch size, and 5 as the per replica batch size. To make the most out of your GPUs, you will want to scale the batch size by the number of replicas, which is two in this case because there is one replica on each GPU.
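The relationship between the per replica batch size, the global batch size and the number of steps per epoch can be checked with a few lines of arithmetic. This sketch assumes the final partial batch is kept rather than dropped, and uses the Cassava training set size of 5656 images quoted later in this article; the helper name steps_per_epoch is hypothetical.

```python
import math

PER_REPLICA_BATCH_SIZE = 64
NUM_TRAIN_EXAMPLES = 5656  # Cassava training set size quoted in this article.


def steps_per_epoch(num_examples, global_batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / global_batch_size)


for num_replicas in (1, 2):
    global_batch_size = PER_REPLICA_BATCH_SIZE * num_replicas
    steps = steps_per_epoch(NUM_TRAIN_EXAMPLES, global_batch_size)
    print(f"{num_replicas} replica(s): global batch size {global_batch_size}, {steps} steps per epoch")
```

With one replica this gives the 89 steps noted earlier; with two replicas the global batch size doubles to 128 and the step count drops to 45.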

You can make these code changes yourself, or simply use this other Colab notebook where the changes have been made already. Although MirroredStrategy is designed for a multi-GPU environment, you can actually run this notebook in Colab on a GPU runtime or a CPU runtime without error. TensorFlow will use a single GPU or multiple CPU cores out of the box anyway so you don’t actually need a strategy, but this could come in handy for testing/experimentation purposes.

Set up GCP Project

Now that we’ve made the necessary code changes, the next step is to set up the GCP environment. To do this you will need a GCP project with billing enabled.

  1. Create your project in the UI
  2. Create your billing account

Next, you should enable the Cloud Compute Engine API. If you are working in a brand new project, then this process will likely also prompt you to connect the billing account you created. If you are using a GCP project that you have already worked with, then most likely the Compute Engine API will already be enabled.

Request Quota

Google Cloud enforces quotas on resource usage to prevent abuse and accidental usage. If you need access to more of a particular resource than what is available by default, you’ll have to request more quota. For this tutorial, we will use the NVIDIA T4 GPU. By default, you get access to one T4 GPU per location, but in order to do distributed training you’ll need to request quota for an additional GPU in a location.

In the GCP console, scroll to the hamburger menu on the left side and navigate to IAM & Admin > Quotas

Google Cloud Platform Quotas

On the Quotas page you can add a service filter for the Compute Engine API. Note that if you have not enabled the Compute Engine API or enabled billing, you will not see Compute Engine API as a filter option, so be sure you have completed the earlier steps first.

Compute Engine API Quotas page

When you find the NVIDIA T4 GPU resource in the list, go ahead and click on ALL QUOTAS for that row.

List of all the quotas with NVIDIA T4 GPUs highlighted

Once you’ve made it to the Quota metric details page for NVIDIA T4 GPUs, select the Location: us-west1 and click edit quotas at the top of the page.

If you already have quota for a different type of GPU, or in a different location, you can easily use those instead. Just make sure you remember the GPU type and location as you will need to specify these parameters when setting up your AI Platform Notebook environment later. Additionally, if you prefer to follow along and just use a single GPU instead of requesting quota for two, you can do that as well. Your code will not be distributed, but you will still get the benefit of learning how to set your GCP environment.

Quota metric details with us-west1 highlighted

Fill in your contact details in the Quota changes menu and then set your New Limit to 2. Then click Done when you’re finished.

image of setting new limit to 2

You’ll get a confirmation email first when you have submitted the request, and then when your request has been approved.

Create AI Platform Notebook Instance

While you wait for quota approvals, the next step is to get set up with AI Platform Notebooks, which can be found using the same hamburger menu as before in the console and scrolling to Artificial Intelligence > AI Platform > Notebooks

You’ll need to enable the API if this is your first time using the tool.

UI of setting up AI Platform Notebooks

AI Platform Notebooks is a managed service for doing data science work. This tool is ideal if you like developing in a notebook environment. You can easily add and remove GPUs without having to worry about GPU driver installation, and there are a number of instance images you can choose from depending on your use case so you don’t need to hassle with setting up all the Python packages you need to get your job done.

Once the Notebooks API is enabled, the next step is to create your instance. You can do this by clicking the NEW INSTANCE button at the top of the page, and then selecting the TensorFlow Enterprise 2.3 image (or the most recent TensorFlow image if you’re following along at a later date), with the 1 NVIDIA Tesla T4 option. TensorFlow Enterprise is a TensorFlow distribution optimized for GCP.

UI showing where to find New Instance

Click ADVANCED OPTIONS at the bottom of the New notebook instance window, and then change the following fields:

  • Instance name: give your instance a name
  • Region: us-west1
  • GPU type: NVIDIA Tesla T4
  • Number of GPUs: 2
  • Check the Install NVIDIA GPU driver automatically for me box

Then click CREATE. Note that if you have not yet been approved for the NVIDIA T4 quota, you will get an error message when you click CREATE. So be sure you have received your approval message before completing this step. Additionally, if you plan to use a different GPU type or location other than T4 in us-west1, you will need to change these parameters when creating your notebook.

Your instance will take a few minutes to launch, and when it’s done you’ll see the option to OPEN JUPYTERLAB appear in blue letters.

Option to open JUPYTERLAB

Note that even after you’ve created an AI Platform Notebook instance, you can change the hardware (for example adding or removing GPUs). Should you need to do this in the future, simply stop the instance and follow the steps here.

Train Multi-GPU Model on AI Platform Notebooks

Now that your instance is set up, you can click on OPEN JUPYTERLAB.

Download the Colab Notebook as an .ipynb file, and upload it to your Jupyter Lab environment. When the file is uploaded go to the notebook and run the code.

When you execute the model.fit cell, you should notice that the number of steps per epoch is now 45, which is half of what it was when using a single GPU. This is data parallelism in action. With a global batch size of 64 * 2, your CPU is sending batches of 64 images to each GPU. So while previously the model only saw 64 examples in a single step, it now sees 128 examples on each step and thus each epoch takes less time. Previously each epoch took around 100 seconds, and now each epoch takes around 60 seconds. You’ll notice that adding a second GPU does not cut the time in half, as there is some overhead involved in synchronizing the gradients. The benefits will be more noticeable with a larger dataset (Cassava only has 5656 training images). Additionally, there are lots of techniques you can use to get even more benefit from that second GPU, such as making sure your input pipeline isn’t a bottleneck. To learn more about making the most of your GPUs, see the TensorFlow Performance Debugging guide.
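The epoch times above translate into a speedup and a scaling efficiency as follows. This is simple arithmetic on the approximate timings reported in this article (around 100 and 60 seconds); the function name is hypothetical.

```python
def speedup_and_efficiency(single_device_seconds, multi_device_seconds, num_devices):
    """Return (speedup, scaling efficiency) for a multi-device run."""
    speedup = single_device_seconds / multi_device_seconds
    efficiency = speedup / num_devices
    return speedup, efficiency


speedup, efficiency = speedup_and_efficiency(100, 60, 2)
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
# prints "speedup: 1.67x, scaling efficiency: 83%"
```

A perfectly linear scale-up would give a 2x speedup and 100% efficiency; the gap is the synchronization overhead mentioned above.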

Long Running Jobs on the DLVM

So far you’ve learned how to use the GCP AI Platform Notebooks to run a simple distributed training job. The dataset we used was not very large, and the model achieved fairly high accuracy after only a few epochs. However, in reality your training job will probably run for a lot longer and you might not want to use a notebook.

When you launch an AI Platform Notebook, it creates a Google Compute Engine (GCE) instance using the GCP Deep Learning VM Images. The Deep Learning VM images are Compute Engine virtual machine images optimized for data science and machine learning tasks. In our example we used the TensorFlow Enterprise 2.3 image, but there are many other options available.

In the console, you can use the menu to navigate to Compute Engine > VM instances

Navigating to VM Instances

And you should see an instance with the same name as the notebook you created earlier. Because this is a GCE instance, we can ssh into the machine and run the code there.

VM instances my test notebook

Install Google SDK

Installing the Google Cloud SDK will allow you to manage GCE resources in your project from your terminal. Follow the steps here to install the SDK and connect to your project.

SSH into the VM

Once the SDK is installed and configured, you can use the following command in your terminal to ssh into your vm. Just be sure to change the instance name and project name.

gcloud compute ssh {your-vm-name} --project={your-project-name}

If you run the command nvidia-smi on the vm, you’ll see the two T4 GPUs we provisioned earlier.

interface after running nvidia-smi

To run the distributed training job, simply download the code from the Colab Notebook as a .py file, and use the following command from your local machine to copy it to your vm.

gcloud compute scp --project {your-project-name} {local-path-to-py-file} {your-vm-name}:~/

Finally, you can run the script on your vm with

python dist_strat_blog_multi_gpu.py

And you should see the output of your model training job

If Notebooks are your environment of choice, you can stick with the workflow we used in the previous section. But if you prefer to use vim or emacs, or if you want to run a long running job using Screen for example, you have the option to ssh into the vm from your terminal. Note that you can also launch a Deep Learning VM directly from the command line instead of using the AI Platform Notebooks UI like we did in this tutorial.

When you’re finished experimenting, do not forget to shut your instance down. You can do this by selecting the instance from the Notebook instances page, or GCE Instances page in the console UI and clicking STOP at the top of the window. Shutting down the instance is very important as you will be billed a few dollars for every hour that it is left running. You can easily stop your instance, then restart it when you want to run more experiments and all of your files will still be there.

Take Your Distributed Training Skills to the Next Level

In this article you learned how to use MirroredStrategy, a synchronous data parallelism strategy, to distribute your TensorFlow training job across two GPUs on GCP. You now know the basic mechanics of how to set up your GCP environment and prepare your code, but there’s a lot more to explore in the world of distributed training. For example, if you are interested in building a distributed training job into a production ML pipeline, check out the AI Platform Training Service, which also allows you to configure a training job across multiple machines, each containing multiple GPUs.

On the tensorflow.org site you can check out the other strategies available with the tf.distribute.Strategy API in the overview guide, and also learn how to use a strategy with a custom training loop. For more advanced concepts, there’s a guide on how data gets distributed, and a guide on how to do performance debugging with the TensorFlow Profiler to make sure you are maximizing utilization of your GPUs.

Build sound classification models for mobile apps with Teachable Machine and TFLite

Posted by Khanh LeViet, TensorFlow Developer Advocate

Sound classification is a machine learning task where you input some sound to a machine learning model to categorize it into predefined categories such as dog barking, car horn and so on. There are already many applications of sound classification, including detecting illegal deforestation activities and detecting the sounds of humpback whales to better understand their natural behaviors.

We are excited to announce that Teachable Machine now allows you to train your own sound classification model and export it in the TensorFlow Lite (TFLite) format. Then you can integrate the TFLite model into your mobile applications or your IoT devices. This is an easy way to quickly get up and running with sound classification, and you can then explore building production models in Python and exporting them to TFLite as a next step.

Model architecture

Timeline chart of sound classification model

The model that Teachable Machine uses to classify 1-second audio samples is a small convolutional neural network. As the diagram above illustrates, the model receives a spectrogram (2D time-frequency representation of sound obtained through Fourier transform). It first processes the spectrogram with successive layers of 2D convolution (Conv2D) and max pooling layers. The model ends in a number of dense (fully-connected) layers, which are interleaved with dropout layers for the purpose of reducing overfitting during training. The final output of the model is an array of probability scores, one for each class of sound the model is trained to recognize.
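
The spectrogram input described above can be sketched with a short-time Fourier transform in NumPy. This is only an illustration: the 16 kHz sample rate, the 512-sample frame, the 256-sample hop and the Hann window are assumptions chosen for the example, not Teachable Machine's actual preprocessing, and the `spectrogram` helper is hypothetical.

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and taper each with the window.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Fourier transform of each frame; keep only the magnitude.
    return np.abs(np.fft.rfft(frames, axis=1))

# A 1-second, 1 kHz test tone sampled at an assumed 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 512
```

The resulting 2D array (time frames by frequency bins) is the kind of input that the Conv2D layers would then process like an image.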

You can find a tutorial to train your own sound classification models using this approach in Python here.

Train a model using your own dataset

There are two ways to train a sound classification model using your own dataset:

  • Simple way: Use Teachable Machine to collect training data and train the model all within your browser without writing a single line of code. This approach is useful for those who want to build a prototype quickly and interactively.
  • Robust way: Record sounds to use as your training dataset in advance, then use Python to train and carefully evaluate your model. This approach is also more automated and repeatable than the simple way.

Train a model using Teachable Machine

Teachable Machine is a GUI tool that allows you to create a training dataset and train several types of machine learning models, including image classification, pose classification and sound classification. Teachable Machine uses TensorFlow.js under the hood to train your machine learning model. You can export the trained models in TensorFlow.js format to use in web browsers, or export in TensorFlow Lite format to use in mobile applications or IoT devices.

Here are the steps to train your models:

  1. Go to the Teachable Machine website
  2. Create an audio project
  3. Record some sound clips for each category that you want to recognize. You need only 8 seconds of sound for each category.
  4. Start training. Once it has finished, you can test your model on a live audio feed.
  5. Export the model in TFLite format.
Teachable machine chart

Train a model using Python

If you have a large training dataset with several hours of sound recordings or more than a dozen categories, then training a sound classification model in a web browser will likely take a lot of time. In that case, you can collect the training dataset in advance, convert the recordings to the WAV format and use this Colab notebook (which includes steps to convert the model to TFLite format) to train your sound classification model. Google Colab offers a free GPU so that you can significantly speed up your model training.

Deploy the model to Android with TensorFlow Lite

Once you have trained your TensorFlow Lite sound classification model, you can just put it in this Android sample app to try it out. Just follow these steps:

  1. Clone the sample app from GitHub:
    git clone https://github.com/tensorflow/examples.git
  2. Import the sound classification Android app into Android Studio. You can find it in the lite/examples/sound_classification/android folder.
  3. Add your model (both the soundclassifier.tflite and labels.txt) into the src/main/assets folder replacing the example model that is already there.
  4. Build the app and deploy it on an Android device. Now you can classify sound in real time!
UI of TensorFlow Lite using Sound Classifier

To integrate the model into your own app, you can copy the SoundClassifier.kt class from the sample app and the TFLite model you have trained to your app. Then you can use the model as below:

1. Initialize a `SoundClassifier` instance from your `Activity` or `Fragment` class.

var soundClassifier: SoundClassifier
soundClassifier = SoundClassifier(context).also {
    it.lifecycleOwner = context
}

2. Start capturing live audio from the device’s microphone and classify in real time:

soundClassifier.start()

3. Receive classification results in real time as a map of human-readable class names and probabilities of the current sound belonging to each particular category.

val labelName = soundClassifier.labelList[0] // e.g. "Clap"
soundClassifier.probabilities.observe(this) { resultMap ->
    val probability = resultMap[labelName] // e.g. 0.7
}

What’s next

We are working on an iOS version of the sample app that will be released in a few weeks. We will also extend TensorFlow Lite Model Maker to allow easy training of sound classification in Python. Stay tuned!

Acknowledgements

This project is a joint effort between multiple teams inside Google. Special thanks to:

  • Google Research: Shanqing Cai, Lisie Lillianfeld
  • TensorFlow team: Tian Lin
  • Teachable Machine team: Gautam Bose, Jonas Jongejan
  • Android team: Saryong Kang, Daniel Galpin, Jean-Michel Trivi, Don Turner

TensorFlow Recommenders: Scalable retrieval and feature interaction modelling

Posted by Ruoxi Wang, Phil Sun, Rakesh Shivanna and Maciej Kula (Google)

In September, we open-sourced TensorFlow Recommenders, a library that makes building state-of-the-art recommender system models easy. Today, we’re excited to announce a new release of TensorFlow Recommenders (TFRS), v0.3.0.

The new version brings two important features, both critical to building and deploying high-quality, scalable recommender models.

The first is built-in support for fast, scalable approximate retrieval. By leveraging ScaNN, TFRS now makes it possible to build deep learning recommender models that can retrieve the best candidates out of millions in milliseconds – all while retaining the simplicity of deploying a single “query features in, recommendations out” SavedModel object.

The second is support for better techniques for modelling feature interactions. The new release of TFRS includes an implementation of Deep & Cross Network: efficient architectures for learning interactions between all the different features used in a deep learning recommender model.

If you’re eager to try out the new features, you can jump straight into our efficient retrieval and feature interaction modelling tutorials. Otherwise, read on to learn more!

Efficient retrieval

The goal of many recommender systems is to retrieve a handful of good recommendations out of a pool of millions or tens of millions of candidates. The retrieval stage of a recommender system tackles the “needle in a haystack” problem of finding a short list of promising candidates out of the entire candidate list.

As discussed in our previous blog post, TensorFlow Recommenders makes it easy to build two-tower retrieval models. Such models perform retrieval in two steps:

  1. Mapping user input to an embedding
  2. Finding the top candidates in embedding space

The cost of the first step is largely determined by the complexity of the query tower model. For example, if the user input is text, a query tower that uses an 8-layer transformer will be roughly twice as expensive to compute as one that uses a 4-layer transformer. Techniques such as sparsity, quantization, and architecture optimization all help with reducing this cost.

However, for large databases with millions of candidates, the second step is generally even more important for fast inference. Our two-tower model uses the dot product of the user input and candidate embedding to compute candidate relevancy, and although computing dot products is relatively cheap, computing one for every embedding in a database, which scales linearly with database size, quickly becomes computationally infeasible. A fast nearest neighbor search (NNS) algorithm is therefore crucial for recommender system performance.
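
To make the cost of the second step concrete, here is a minimal NumPy sketch of brute-force retrieval: score every candidate with a dot product, then keep the top k. The function name `brute_force_top_k` and the toy embeddings are hypothetical; this is not how TFRS or ScaNN implement retrieval, only an illustration of why the work grows linearly with the number of candidates.

```python
import numpy as np

def brute_force_top_k(query, candidates, k):
    """Return indices and scores of the k candidates with the largest dot product."""
    # One dot product per candidate: cost is O(num_candidates * embedding_dim).
    scores = candidates @ query
    # Select the k best without fully sorting all candidates...
    top = np.argpartition(-scores, k - 1)[:k]
    # ...then order just those k by descending score.
    top = top[np.argsort(-scores[top])]
    return top, scores[top]

# Toy example: four candidate embeddings in a 2D embedding space.
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 1.0])

indices, top_scores = brute_force_top_k(query, candidates, k=2)
```

Approximate nearest neighbor search avoids scoring every row of `candidates`, trading a small amount of accuracy for a large reduction in work.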

Enter ScaNN. ScaNN is a state-of-the-art NNS library from Google Research. It significantly outperforms other NNS libraries on standard benchmarks. Furthermore, it integrates seamlessly with TensorFlow Recommenders. As seen below, the ScaNN Keras layer acts as a seamless drop-in replacement for brute force retrieval:

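import tensorflow as tf
import tensorflow_recommenders as tfrs

# `model` and `movies` are assumed to be the trained two-tower model and the
# movies dataset from the retrieval tutorial.
# Create a model that takes in raw query features, and
# recommends movies out of the entire movies dataset.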
# Before
# index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
# index.index(movies.batch(100).map(model.movie_model), movies)
# After
scann = tfrs.layers.factorized_top_k.ScaNN(model.user_model)
scann.index(movies.batch(100).map(model.movie_model), movies)

# Get recommendations.
# Before
# _, titles = index(tf.constant(["42"]))
# After
_, titles = scann(tf.constant(["42"]))
print(f"Recommendations for user 42: {titles[0, :3]}")

Because it’s a Keras layer, the ScaNN index serializes and automatically stays in sync with the rest of the TensorFlow Recommender model. There is also no need to shuttle requests back and forth between the model and ScaNN because everything is already wired up properly. As NNS algorithms improve, ScaNN will become more efficient as well, further improving retrieval accuracy and latency.

ScaNN can speed up large retrieval models by over 10x while still providing almost the same retrieval accuracy as brute force vector retrieval.

We believe that ScaNN’s features will lead to a transformational leap in the ease of deploying state-of-the-art deep retrieval models. If you’re interested in the details of how to build and serve ScaNN-based models, have a look at our tutorial.

Deep cross networks

Effective feature crosses are the key to the success of many prediction models. Imagine that we are building a recommender system to sell blenders using users’ past purchase history. Individual features such as the number of bananas and cookbooks purchased give us some information about the user’s intent, but it is their combination – having bought both bananas and cookbooks – that gives us the strongest signal of the likelihood that the user will buy a blender. This combination of features is referred to as a feature cross.
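
The blender example can be made concrete with a toy dataset. The sketch below assumes, purely for illustration, that a user buys a blender exactly when they bought both bananas and cookbooks; the variable names and the helper `max_fit_error` are hypothetical. A linear model on the individual features cannot fit this target, while adding the crossed feature fits it exactly.

```python
import numpy as np

# All four combinations of two binary purchase features.
bananas = np.array([0, 0, 1, 1], dtype=float)
cookbooks = np.array([0, 1, 0, 1], dtype=float)
# Assumed target: a blender is bought only when both were purchased.
buys_blender = bananas * cookbooks

def max_fit_error(features, target):
    """Largest absolute error of the best least-squares linear fit (with intercept)."""
    design = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.max(np.abs(design @ coef - target)))

# Individual features only: the best linear fit is still off on every row.
err_individual = max_fit_error([bananas, cookbooks], buys_blender)
# Adding the feature cross lets a linear model represent the target exactly.
err_with_cross = max_fit_error(
    [bananas, cookbooks, bananas * cookbooks], buys_blender
)
```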

Chart of cross features in deep cross networks

In web-scale applications, data are mostly categorical, leading to a large and sparse feature space. Identifying effective feature crosses in this setting often requires manual feature engineering or exhaustive search. Traditional feed-forward multilayer perceptron (MLP) models are universal function approximators; however, they cannot efficiently approximate even 2nd or 3rd-order feature crosses, as pointed out in the Deep & Cross Network and Latent Cross papers.

What is a Deep & Cross Network (DCN)?

DCN was designed to learn explicit and bounded-degree cross features more effectively. A DCN starts with an input layer (typically an embedding layer), followed by a cross network that models explicit feature interactions, and finally a deep network that models implicit feature interactions.

Cross Network

This is the core of a DCN. It explicitly applies feature crossing at each layer, and the highest polynomial degree (feature cross order) increases with layer depth. The following figure shows the (𝑖+1)-th cross layer.

Cross layer visualization. x0 is the base layer (typically set as the embedding layer), xi is the input to the cross layer, ☉ represents element-wise multiplications, and matrix W and vector b are the parameters to be learned.
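
Written as an equation, and assuming the figure follows the usual DCN formulation with the symbols named in the caption, the cross layer computes:

```latex
x_{i+1} = x_0 \odot \left( W x_i + b \right) + x_i
```

The term $W x_i + b$ has the same polynomial degree in the input features as $x_i$. Multiplying it element-wise by $x_0$ raises that degree by one, and the added $x_i$ is a residual connection that keeps the lower-order terms. This is why the highest feature cross order grows by one with each cross layer.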

When we only have a single cross layer, it creates 2nd-order (pairwise) feature crosses among input features. In the blender example above, the input to the cross layer would be a vector that concatenates three features: [country, purchased_bananas, purchased_cookbooks]. Then, the first dimension of the output would contain a weighted sum of pairwise interactions between country and all three input features; the second dimension would contain weighted interactions of purchased_bananas and all the other features, and so on.

The weights of these interaction terms form the matrix W: if an interaction is unimportant, its weight will be close to zero. If it is important, its weight will be far from zero.

To create higher-order feature crosses, we could stack more cross layers. For example, we now know that a single cross layer outputs 2nd-order feature crosses such as the interaction between purchased_bananas and purchased_cookbooks. We could further feed these 2nd-order crosses to another cross layer. Then, the feature crossing part would multiply those 2nd-order crosses with the original (1st-order) features to create 3rd-order feature crosses, e.g., interactions among country, purchased_bananas and purchased_cookbooks. The residual connection would carry over those feature crosses that have already been created in the previous layer.

If we stack k cross layers together, the k-layered cross network would create all the feature crosses up to order k+1, with their importance characterized by parameters in the weight matrices and bias vectors.
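
The stacking behaviour can be checked with a small NumPy sketch. It assumes the cross layer has the form x0 ⊙ (W·xi + b) + xi described in the figure caption; the helper names `cross_layer` and `cross_network` are hypothetical, and this is a plain illustration rather than the TFRS implementation.

```python
import numpy as np

def cross_layer(x0, xi, W, b):
    """One cross layer: element-wise product with x0 plus a residual connection."""
    return x0 * (W @ xi + b) + xi

def cross_network(x0, weights, biases):
    """Apply k cross layers in sequence; every layer crosses with the same x0."""
    x = x0
    for W, b in zip(weights, biases):
        x = cross_layer(x0, x, W, b)
    return x

# With W = identity and b = 0 the polynomial structure is easy to read off:
# one layer gives x + x^2, two layers give x + 2x^2 + x^3 (per dimension).
x0 = np.array([1.0, 2.0, 3.0])
identity = np.eye(3)
zero_bias = np.zeros(3)

one_layer = cross_network(x0, [identity], [zero_bias])
two_layers = cross_network(x0, [identity, identity], [zero_bias, zero_bias])
```

With two layers the output contains terms up to 3rd order in the inputs, matching the "order k+1" statement for k = 2. A non-diagonal W would mix different features into each of those terms.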

Deep Network

The deep part of a Deep & Cross Network is a traditional feedforward multilayer perceptron (MLP).

The deep network and cross network are then combined to form a DCN. Commonly, we could stack a deep network on top of the cross network (stacked structure); we could also place them in parallel (parallel structure).
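
The two ways of combining the networks can be sketched in NumPy as follows. The layer sizes, the random weights and the function names (`stacked_dcn`, `parallel_dcn`) are assumptions made for this illustration, and the final prediction layer that would normally follow is left out.

```python
import numpy as np

def cross_layer(x0, xi, W, b):
    # Explicit feature crossing with a residual connection.
    return x0 * (W @ xi + b) + xi

def dense_relu(x, W, b):
    # One layer of a feedforward MLP.
    return np.maximum(W @ x + b, 0.0)

def run_cross(x0, cross_params):
    x = x0
    for W, b in cross_params:
        x = cross_layer(x0, x, W, b)
    return x

def run_deep(x, deep_params):
    for W, b in deep_params:
        x = dense_relu(x, W, b)
    return x

def stacked_dcn(x0, cross_params, deep_params):
    # Stacked structure: the deep network consumes the cross network's output.
    return run_deep(run_cross(x0, cross_params), deep_params)

def parallel_dcn(x0, cross_params, deep_params):
    # Parallel structure: both networks see x0; their outputs are concatenated.
    return np.concatenate([run_cross(x0, cross_params), run_deep(x0, deep_params)])

rng = np.random.default_rng(0)
dim, hidden = 3, 4  # assumed sizes for the example
cross_params = [(rng.normal(size=(dim, dim)), np.zeros(dim))]
deep_params = [(rng.normal(size=(hidden, dim)), np.zeros(hidden))]
x0 = rng.normal(size=dim)

stacked_out = stacked_dcn(x0, cross_params, deep_params)
parallel_out = parallel_dcn(x0, cross_params, deep_params)
```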

Deep & Cross Network (DCN) visualization. Left: parallel structure; Right: stacked structure.

Model Understanding

A good understanding of the learned feature crosses helps improve model understandability. Fortunately, the weight matrix 𝑊 in the cross layer reveals what feature crosses the model has learned to be important.

Take the example of selling a blender to a customer. If purchasing both bananas and cookbooks is the most predictive signal in the data, a DCN model should be able to capture this relationship. The following figure shows the learned matrix of a DCN model with one cross layer, trained on synthetic data where the joint purchase feature is most important. We see that the model itself has learned that the interaction between `purchased_bananas` and `purchased_cookbooks` is important, without any manual feature engineering applied.

Learned weight matrix in the cross layer.

Cross layers are now implemented in TensorFlow Recommenders, and you can easily adopt them as building blocks in your models. To learn how, check out our tutorial for example usage and practical lessons. If you are interested in more detail, have a look at our research papers DCN and DCN v2.

Acknowledgements

We would like to give a special thanks to Derek Zhiyuan Cheng, Sagar Jain, Shirley Zhe Chen, Dong Lin, Lichan Hong, Ed H. Chi, Bin Fu, Gang (Thomas) Fu and Mingliang Wang for their critical contributions to Deep & Cross Network (DCN). We also would like to thank everyone who has helped with and supported the DCN effort from research idea to productionization: Shawn Andrews, Sugato Basu, Jakob Bauer, Nick Bridle, Gianni Campion, Jilin Chen, Ting Chen, James Chen, Tianshuo Deng, Evan Ettinger, Eu-Jin Goh, Vidur Goyal, Julian Grady, Gary Holt, Samuel Ieong, Asif Islam, Tom Jablin, Jarrod Kahn, Duo Li, Yang Li, Albert Liang, Wenjing Ma, Aniruddh Nath, Todd Phillips, Ardian Poernomo, Kevin Regan, Olcay Sertel, Anusha Sriraman, Myles Sussman, Zhenyu Tan, Jiaxi Tang, Yayang Tian, Jason Trader, Tatiana Veremeenko‎, Jingjing Wang, Li Wei, Cliff Young, Shuying Zhang, Jie (Jerry) Zhang, Jinyin Zhang, Zhe Zhao and many more (in alphabetical order). We’d also like to thank David Simcha, Erik Lindgren, Felix Chern, Nathan Cordeiro, Ruiqi Guo, Sanjiv Kumar, Sebastian Claici, and Zonglin Li for their contributions to ScaNN.

My experience with TensorFlow Quantum

A guest post by Owen Lockwood, Rensselaer Polytechnic Institute

Quantum mechanics was once a very controversial theory. Early detractors included Albert Einstein, who famously said of quantum mechanics that “God does not play dice” (referring to the probabilistic nature of quantum measurements), to which Niels Bohr replied, “Einstein, stop telling God what to do”. However, all agreed that, to quote John Wheeler, “If you are not completely confused by quantum mechanics, you do not understand it”. As our understanding of quantum mechanics has grown, it has not only led to numerous important physical discoveries but also given rise to the field of quantum computing. Quantum computing is a different paradigm of computing from classical computing. It relies on and exploits the principles of quantum mechanics to achieve speedups (in some cases superpolynomial) over classical computers.

In this article I will be discussing some of the challenges I faced as a beginning researcher in quantum machine learning (QML) and how TensorFlow Quantum (TFQ) and Cirq enabled me and can help other researchers investigate the field of quantum computing and QML. I have previous experience with TensorFlow, which made the transition to using TensorFlow Quantum seamless. TFQ proved instrumental in enabling my work and ultimately my work utilizing TFQ culminated in my first publication on quantum reinforcement learning at the 16th AIIDE conference. I hope that this article helps and inspires other researchers, neophytes and experts alike, to leverage TFQ to help advance the field of QML.

QML Background

QML has important similarities and differences to traditional neural network/deep learning approaches to machine learning. Both methodologies can be seen as using “stacked layers” of transformations that make up a larger model. In both cases, data is used to inform updates to model parameters, typically to minimize some loss function (usually, but not exclusively, via gradient-based methods). Where they differ is that QML models have access to the power of quantum mechanics and deep neural networks do not. An important type of QML that TFQ provides techniques for is called variational quantum circuits (QVCs). QVCs are also called quantum neural networks (QNNs).

A QVC can be visualized below (from the TFQ white paper). The diagram is read from left to right, with the qubits being represented by the horizontal lines. In a QVC there are three important and distinct parts: the encoder circuit, the variational circuit and the measurement operators. The encoder circuit either takes naturally quantum data (i.e. a nonparametrized quantum circuit) or converts classical data into quantum data. This circuit is connected to the variational circuit, which is defined by its learnable parameters. The parametrized part of the circuit is the part that is updated during the learning process. The last part of the QVC is the measurement operators. In order to extract information from the QVC, some sort of quantum measurement (such as a Pauli X, Y, or Z basis measurement) must be applied. With the information extracted from these measurements, a loss function (and gradients) can be calculated on a classical computer and the parameters can be updated. The parameters can be optimized with the same optimizers as traditional neural networks, such as Adam or RMSProp. QVCs can also be combined with traditional neural networks (as is shown in the diagram), as the quantum circuit is differentiable and thus gradients can be backpropagated through it.
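
The three parts of a QVC can be illustrated on a single simulated qubit with NumPy. This is a hedged sketch, not TFQ or Cirq code: the choice of an RX gate as the encoder, an RY gate as the variational circuit, the parameter-shift rule for gradients and plain gradient descent (instead of Adam or RMSProp) are all assumptions made to keep the example small, and every function name is hypothetical.

```python
import numpy as np

PAULI_Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rx(angle):
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def ry(angle):
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def expectation(x, theta):
    """Run encoder and variational circuit on |0>, then measure <Z>."""
    state = np.array([1, 0], dtype=complex)
    state = rx(x) @ state       # encoder: classical value x becomes a rotation
    state = ry(theta) @ state   # variational circuit: learnable parameter theta
    return float(np.real(np.conj(state) @ PAULI_Z @ state))

def parameter_shift_grad(x, theta):
    """Gradient of <Z> w.r.t. theta from two shifted circuit evaluations."""
    shift = np.pi / 2
    return 0.5 * (expectation(x, theta + shift) - expectation(x, theta - shift))

def train(x, target, theta=0.1, learning_rate=0.5, steps=100):
    """Minimize the squared error (<Z> - target)^2 on a classical computer."""
    for _ in range(steps):
        prediction = expectation(x, theta)
        grad = 2 * (prediction - target) * parameter_shift_grad(x, theta)
        theta -= learning_rate * grad
    return theta

trained_theta = train(x=0.3, target=0.0)
final_prediction = expectation(0.3, trained_theta)
```

For this particular circuit the measured expectation works out to cos(x)·cos(θ), so the training loop drives θ towards π/2 to reach a target of 0.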

TensorFlow image

However, the intuitive models and mathematical framework surrounding QVCs have some important differences from traditional neural networks, and in these differences lies the potential for quantum speedups. Quantum computing and QML can harness quantum phenomena, such as superposition and entanglement. Superposition stems from the wavefunction being a linear combination of multiple states and enables a qubit to represent two different states simultaneously (in a probabilistic manner). The ability to operate on these superpositions, i.e. operate on multiple states simultaneously, is integral to the power of quantum computing. Entanglement is a complex phenomenon that is induced via multi-qubit gates. Getting a basic understanding of these concepts and of quantum computing is an important first step for QML. There are a number of great resources available for this such as Preskill’s Quantum Computing Course, and de Wolf’s lecture notes.
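
Both phenomena can be seen in a tiny state-vector simulation. The sketch below uses only NumPy with the standard Hadamard and CNOT matrices; the variable names are hypothetical, and the basis ordering |00>, |01>, |10>, |11> is the convention chosen for this example.

```python
import numpy as np

HADAMARD = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
# CNOT with the first qubit as control and the second as target.
CNOT = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
])
zero = np.array([1.0, 0.0])

# Superposition: a Hadamard gate puts |0> into an equal mix of |0> and |1>.
plus = HADAMARD @ zero
single_qubit_probs = np.abs(plus) ** 2

# Entanglement: a two-qubit gate (CNOT) applied to the superposed qubit and a
# fresh |0> qubit produces a Bell state.
bell = CNOT @ np.kron(plus, zero)
two_qubit_probs = np.abs(bell) ** 2  # over |00>, |01>, |10>, |11>
```

Measuring the single qubit yields 0 or 1 with equal probability. Measuring the Bell state yields only 00 or 11, so the two qubits' outcomes are perfectly correlated even though each one on its own looks random.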

Currently, access to real quantum hardware is limited and as such, many quantum computing researchers conduct work on simulations of quantum computers. Current and near-term quantum devices, such as the Google Sycamore processor, have tens to hundreds of quantum bits (qubits). Because of their size and noise, this hardware is often called Noisy Intermediate Scale Quantum (NISQ) technology. TFQ and Cirq are built for these near-term NISQ devices. These devices are far smaller than what some of the most famous quantum algorithms require to achieve quantum speedups given current error correction techniques; e.g. Shor’s algorithm requires upwards of thousands of qubits and the Quantum Approximate Optimization Algorithm (QAOA) could require at least 420 qubits for quantum advantages. However, there is still significant potential for NISQ devices to achieve quantum speedups (as Google demonstrated with 53 qubits).

My Work With TFQ

TFQ was announced in mid-March this year (2020) and I began to use it shortly after. Around that time I had begun research into QML, specifically QML for reinforcement learning (RL). While there have been great strides in the accessibility of quantum circuit simulators, QML can be a difficult field to get into. Not only are there difficulties from a mathematical and physical perspective, but the time investment for implementation can be substantial. The time it would take to code QVCs from scratch and properly test and debug them (not to mention the optimization) is a challenge, especially for those coming from classical machine learning. Spending so much time building something for an experiment that has the potential to not work is a big risk – especially for an undergraduate student on a deadline! Thankfully I did not have to take this risk. With the release of TFQ it was easy to immediately start on the implementations of my ideas. Realistically, I would never have done this work if TFQ had not been released. Inspired by previous work, we expanded upon applying QML to RL tasks.

In our work we demonstrate the potential to use QVCs in place of neural networks in contemporary RL algorithms (specifically DQN and DDQN). We also show the potential to use multiple types of QVC models, using QVCs with either a dense layer or quantum pooling layers (denoted hybrid and pure respectively) to shrink the number of qubits to the correct output space.

TensorFlow Graph

The representational power of QVCs is also put on display; using a QVC with ~50 parameters we were able to achieve comparable performance to neural networks with orders of magnitude more parameters. See the graphs for a comparison of the reward achieved on the canonical CartPole environment (balancing a pole on a cart); the left graph includes all neural networks and the right shows only the largest neural network. The number in front of the NN represents the size of the parameter space.

We are continuing to work with QML applications to RL and have more manuscripts in submission. A continuation of this work has been accepted into the 2020 NeurIPS workshop: “The pre-registration experiment: an alternative publication model for machine learning research”.

Suggested use of TFQ

TFQ can be an incredible tool for anyone interested in QML research no matter your background. All too common in scientific communities is a ‘publish or perish’ mentality which can stifle innovative work and is prohibitive to intellectual risk taking, especially for experiments that require significant implementation efforts. Not only can TFQ help speed up any experiments you may have, but it also allows for easy implementation of ideas that would otherwise never get tested. Implementation is a common hindrance to new and interesting ideas, and far too many projects never progress out of the idea stage due to difficulties in transitioning the idea to reality, something TFQ makes easy.

For beginners, TFQ enables a first foray into the field without substantial time investment in coding and allows for significant learning. Being able to try and experiment with QVCs without having to build from the ground up is an incredible tool. For classical ML researchers with experience in TensorFlow, TFQ makes it easy to transition and experiment with QML at small or large scales. The API of TFQ and the modules it provides (i.e. Keras-esque layers and differentiators) share design principles with TF and their similarities make for an easier programming transition. For researchers already in the QML field, TFQ can certainly help.

In order to get started with TFQ it is important to become familiar with the basics of quantum computing, with either the references mentioned above or with any of the many other great resources out there. Another important step that often gets overlooked is reading the TFQ white paper. The white paper is accessible to QML beginners and is an invaluable introduction to QML and the basic as well as advanced usage of TFQ. Just as important is to play around with TFQ. Try different things out and experiment; it is a great way to expand your understanding not only of the software but of the theory and mathematics as well. Reading other contemporary papers and the papers that cite TFQ is a great way to become immersed in the current research going on in the field.

Videos from the TensorFlow User Group Summit in India

Posted by Siddhant Agarwal and Biswajeet Mallik, Program Managers

Logo of TFUG India Summit

TensorFlow has a strong developer community in India with 13 TensorFlow User Groups and 20+ Google Developer Experts. In September, these groups came together to organise the “TFUG India Summit”, a 4-day online event with four tracks. You can check out the recordings for these talks below.
