Posted by the TensorFlow Team
TensorFlow 2.10 has been released! Highlights of this release include user-friendly features in Keras to help you develop transformers, deterministic and stateless initializers, updates to the optimizers API, and new tools to help you load audio data. We’ve also made performance enhancements with oneDNN, expanded GPU support on Windows, and more. This release also marks TensorFlow Decision Forests 1.0! Read on to learn more.
Keras
Expanded, unified mask support for Keras attention layers
Starting from TensorFlow 2.10, mask handling for Keras attention layers, such as tf.keras.layers.Attention, tf.keras.layers.AdditiveAttention, and tf.keras.layers.MultiHeadAttention, has been expanded and unified. In particular, we’ve added two features:
Causal attention: All three layers now support a use_causal_mask argument to call (Attention and AdditiveAttention used to take a causal argument to __init__).
Implicit masking: Keras Attention, AdditiveAttention, and MultiHeadAttention layers now support implicit masking (set mask_zero=True in tf.keras.layers.Embedding).
Combined, this simplifies the implementation of any Transformer-style model since getting the masking right is often a tricky part.
A basic Transformer self-attention block can now be written as:
import tensorflow as tf
embedding = tf.keras.layers.Embedding(
    input_dim=10,
    output_dim=3,
    mask_zero=True)  # Infer a correct padding mask.
# Instantiate a Keras multi-head attention (MHA) layer,
# a layer normalization layer, and an `Add` layer object.
mha = tf.keras.layers.MultiHeadAttention(key_dim=4, num_heads=1)
layernorm = tf.keras.layers.LayerNormalization()
add = tf.keras.layers.Add()
# Test input.
x = tf.constant([[1, 2, 3, 4, 5, 0, 0, 0, 0],
                 [1, 2, 1, 0, 0, 0, 0, 0, 0]])
# The embedding layer sets the mask.
x = embedding(x)
# The MHA layer uses and propagates the mask.
a = mha(query=x, key=x, value=x, use_causal_mask=True)
x = add([x, a]) # The `Add` layer propagates the mask.
x = layernorm(x)
# The mask made it through all layers.
print(x._keras_mask)
And here’s the output:
tf.Tensor(
[[ True  True  True  True  True False False False False]
 [ True  True  True False False False False False False]], shape=(2, 9), dtype=bool)
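To make the interaction between the two kinds of mask more concrete, here is a minimal NumPy sketch of how a padding mask and a causal mask can be combined into one attention mask. It illustrates the idea only and is not the Keras implementation; the token values and variable names are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical batch of token IDs, where 0 marks padding
# (the convention used by mask_zero=True).
tokens = np.array([[1, 2, 3, 0],
                   [4, 5, 0, 0]])

# Padding mask: True where a real token is present. Shape: (batch, seq).
padding_mask = tokens != 0

# Causal mask: position q may attend only to positions k <= q. Shape: (seq, seq).
seq_len = tokens.shape[1]
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Combined mask: the query and the key must both be real tokens,
# and the key must not lie in the query's future. Shape: (batch, seq, seq).
attention_mask = (padding_mask[:, :, None]
                  & padding_mask[:, None, :]
                  & causal_mask)

print(attention_mask[0].astype(int))
```

Each row of the printed matrix corresponds to one query position; padded positions end up with an all-zero row and column.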
Try out the new Keras Optimizers API
In the previous release, TensorFlow 2.9, we published a new version of the Keras Optimizer API, in tf.keras.optimizers.experimental, which will replace the current tf.keras.optimizers namespace in TensorFlow 2.11. To prepare for the upcoming formal switch of the optimizer namespace to the new API, we’ve also exported all of the current Keras optimizers under tf.keras.optimizers.legacy in TensorFlow 2.10.
Most users won’t be affected by this change, but please check the API doc to see if any API used in your workflow has changed. If you decide to keep using the old optimizer, please explicitly switch to the corresponding optimizer class under tf.keras.optimizers.legacy.
You can also find more details about new Keras Optimizers in this article.
Deterministic and Stateless Keras initializers
In TensorFlow 2.10, we’ve made Keras initializers (the tf.keras.initializers API) stateless and deterministic, built on top of stateless TF random ops. Starting in TensorFlow 2.10, both seeded and unseeded Keras initializers will always generate the same values every time they are called (for a given variable shape). The stateless initializer enables Keras to support new features such as multi-client model training with DTensor.
init = tf.keras.initializers.RandomNormal()
a = init((3, 2))
b = init((3, 2))
# a == b
init_2 = tf.keras.initializers.RandomNormal(seed=1)
c = init_2((3, 2))
d = init_2((3, 2))
# c == d
# a != c
init_3 = tf.keras.initializers.RandomNormal(seed=1)
e = init_3((3, 2))
# e == c
init_4 = tf.keras.initializers.RandomNormal()
f = init_4((3, 2))
# f != a
For unseeded initializers (seed=None), a random seed will be created and assigned at initializer creation (different initializer instances get different seeds). An unseeded initializer will raise a warning if it is reused (called) multiple times. This is because it would produce the same values each time, which may not be intended.
BackupAndRestore checkpoints with step level granularity
In the previous release, TensorFlow 2.9, the tf.keras.callbacks.BackupAndRestore Keras callback backed up the model and training state at epoch boundaries. In TensorFlow 2.10, the callback can also back up the model every N training steps. However, keep in mind that when BackupAndRestore is used with tf.distribute.MultiWorkerMirroredStrategy, the distributed dataset iterator state will be reinitialized and won’t be restored when restoring the model. More information and code examples can be found in the guide on migrating the fault tolerance mechanism.
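As a rough sketch of step-level backups, the snippet below passes an integer save_freq to the callback so that a backup is written every N training steps. The toy model, the random data, the temporary backup directory, and the value save_freq=2 are all illustrative assumptions.

```python
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical location for the backup files.
backup_dir = tempfile.mkdtemp()

# An integer save_freq asks for a backup every N training steps
# instead of only at epoch boundaries.
backup = tf.keras.callbacks.BackupAndRestore(backup_dir=backup_dir, save_freq=2)

# Toy regression model and random data, just to drive a few training steps.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
x = np.random.rand(32, 4).astype("float32")
y = np.random.rand(32, 1).astype("float32")

# If training is interrupted, running the same fit() call again with the
# same backup_dir resumes from the most recent backup.
history = model.fit(x, y, batch_size=8, epochs=2, callbacks=[backup], verbose=0)
```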
Easily generate an audio classification dataset from a directory of audio files
You can now use a new utility, tf.keras.utils.audio_dataset_from_directory, to easily generate audio classification datasets from directories of .wav files. Just sort your audio files into one directory per class, and a single line of code will give you a labeled tf.data.Dataset that you can pass to a Keras model. You can find an example here.
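The following sketch shows the expected directory layout and the one-line call. To stay self-contained it first synthesizes a few one-second noise clips with Python's wave module; the class names ("cat", "dog"), file names, sample rate, and clip length are hypothetical choices, not part of the API.

```python
import os
import tempfile
import wave

import numpy as np
import tensorflow as tf


def write_noise_wav(path, num_samples, sample_rate=8000):
    """Writes a mono, 16-bit PCM .wav file filled with random noise."""
    samples = (np.random.uniform(-1.0, 1.0, num_samples) * 32767).astype("<i2")
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(samples.tobytes())


# Layout: one sub-directory per class, each holding that class's .wav files.
root = tempfile.mkdtemp()
for class_name in ("cat", "dog"):
    os.makedirs(os.path.join(root, class_name))
    for i in range(2):
        write_noise_wav(os.path.join(root, class_name, f"clip_{i}.wav"), 8000)

# The single line that builds the labeled dataset. output_sequence_length
# pads or truncates every clip to the same number of samples.
dataset = tf.keras.utils.audio_dataset_from_directory(
    root, batch_size=4, output_sequence_length=8000)

class_names = dataset.class_names
audio, labels = next(iter(dataset))
print(class_names, audio.shape, labels.shape)
```

Labels are inferred from the sub-directory names, so renaming a directory is enough to rename a class.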
The EinsumDense layer is no longer experimental
The einsum function is the Swiss Army knife of linear algebra. It can efficiently and explicitly describe a wide variety of operations. The tf.keras.layers.EinsumDense layer brings some of that power to Keras.
Operations like einsum, einops.rearrange, and the EinsumDense layer operate based on a string “equation” that describes the axes of the inputs and outputs. For EinsumDense the equation lists the axes of the input argument, the axes of the weights, and the axes of the output. A basic Dense layer can be written as:
from tensorflow import keras
dense = keras.layers.Dense(units=10, activation='relu')
# The same layer, written with EinsumDense:
dense = keras.layers.EinsumDense('...i,ij->...j', output_shape=(10,), activation='relu')
Notes:
- ...i: The layer operates only on the last axis of the input, which is named i.
- ij: The weights are a matrix with shape (i, j).
- ...j: The result sums out the i axis and leaves j.
For example, here is a stack of 5 Dense layers with 10 units each:
dense = keras.layers.EinsumDense('...i,nij->...nj', output_shape=(5, 10))
Here is a stack of Dense layers, where each one operates on a different input vector:
dense = keras.layers.EinsumDense('...ni,nij->...nj', output_shape=(5, 10))
Here is a stack of Dense layers where each one operates on each input vector independently:
dense = keras.layers.EinsumDense('...ni,mij->...nmj', output_shape=(None, 5, 10))
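To see what these equations compute, it can help to run them through a plain einsum. The sketch below uses NumPy's einsum rather than the Keras layer, with random arrays standing in for the layer's input and weights; the shapes are arbitrary assumptions, and bias and activation are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 7, 4))         # input with axes (..., i), here i = 4
w = rng.normal(size=(4, 10))           # weights with axes (i, j)

# '...i,ij->...j' is the matrix product applied by a basic Dense layer.
dense_out = np.einsum("...i,ij->...j", x, w)
print(np.allclose(dense_out, x @ w))

# '...i,nij->...nj' applies a stack of n = 5 weight matrices to the same input.
w_stack = rng.normal(size=(5, 4, 10))  # weights with axes (n, i, j)
stack_out = np.einsum("...i,nij->...nj", x, w_stack)
print(stack_out.shape)

# Slice k of the n axis equals the output of the k-th Dense layer.
print(np.allclose(stack_out[..., 3, :], x @ w_stack[3]))
```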
Performance and collaborations
Improved aarch64 CPU performance: ACL/oneDNN integration
We have worked with Arm, AWS, and Linaro to integrate Compute Library for the Arm® Architecture (ACL) with TensorFlow through oneDNN to accelerate performance on aarch64 CPUs. Starting with TensorFlow 2.10, you can try these experimental optimizations by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1 before running your TensorFlow program.
There may be slightly different numerical results due to different computation and floating-point round-off approaches. If this causes issues for you, turn the optimizations off by setting TF_ENABLE_ONEDNN_OPTS=0 before running your program.
To verify that the optimizations are on, look for a message beginning with “oneDNN custom operations are on” in your program log. We welcome feedback on GitHub and the TensorFlow Forum.
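If you would rather set the flag from Python than from the shell, one possible approach is to set it through os.environ before TensorFlow is imported. This is a minimal sketch that assumes the variable is read when TensorFlow starts up; it does not check the log message for you.

```python
import os

# Set the flag before TensorFlow is imported so that it is already in place
# when TensorFlow starts. Use "0" instead of "1" to turn the optimizations off.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf  # noqa: E402  (deliberately imported after the flag is set)

print(tf.__version__)
```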
Expanded GPU support on Windows
TensorFlow can now leverage a wider range of GPUs on Windows through the TensorFlow-DirectML plug-in. To enable model training on DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm, install the plug-in alongside standard TensorFlow CPU packages on native Windows or WSL2. The preview package currently supports a limited number of basic machine learning models, with a goal to increase model coverage in the future. You can view the open-source code and leave feedback at the TensorFlow-DirectML GitHub repository.
New features in tf.data
Create tf.data Dataset from lists of elements
TensorFlow 2.10 introduces a convenient new experimental API, tf.data.experimental.from_list, which creates a tf.data.Dataset comprising the given list of elements. The returned dataset will produce the items in the list one by one. The functionality is identical to tf.data.Dataset.from_tensor_slices when elements are scalars, but different when elements have structure.
Consider the following example:
dataset = tf.data.experimental.from_list([(1, 'a'), (2, 'b'), (3, 'c')])
list(dataset.as_numpy_iterator())
# [(1, b'a'), (2, b'b'), (3, b'c')]
In contrast, to get the same output with `from_tensor_slices`, the data needs to be reorganized:
dataset = tf.data.Dataset.from_tensor_slices(([1, 2, 3], ['a', 'b', 'c']))
list(dataset.as_numpy_iterator())
# [(1, b'a'), (2, b'b'), (3, b'c')]
Unlike the from_tensor_slices method, from_list supports non-rectangular input (achieving the same with from_tensor_slices requires the use of ragged tensors).
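A short sketch of the non-rectangular case: each element below has a different length, and from_list yields them one at a time. The list contents are arbitrary example values.

```python
import tensorflow as tf

# Each element has a different length, so the input is not rectangular.
dataset = tf.data.experimental.from_list([[1], [2, 3], [4, 5, 6]])

# Elements come back one by one, each with its own shape.
elements = list(dataset.as_numpy_iterator())
lengths = [len(element) for element in elements]
print(lengths)
```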
Sharing tf.data service with concurrent trainers
If you run multiple trainers concurrently using the same training data, it could save resources to cache the data in one tf.data service cluster and share the cluster with the trainers. For example, if you use Vizier to tune hyperparameters, the Vizier jobs can run concurrently and share one tf.data service cluster.
To enable this feature, each trainer needs to generate a unique trainer ID and pass it to tf.data.experimental.service.distribute. Once a job has consumed the data, the data remains in the cache and is re-used by jobs with different trainer_ids. Requests with the same trainer_id do not re-use data. For example:
dataset = expensive_computation()
dataset = dataset.apply(tf.data.experimental.service.distribute(
    processing_mode=tf.data.experimental.service.ShardingPolicy.OFF,
    service=FLAGS.tf_data_service_address,
    job_name="job",
    cross_trainer_cache=tf.data.experimental.service.CrossTrainerCache(
        trainer_id=trainer_id())))
tf.data service uses a sliding-window cache to store shared data. When one trainer consumes data, the data remains in the cache. When other trainers need data, they can get data from the cache instead of repeating the expensive computation. The cache has a bounded size, so some workers may not read the full dataset. To ensure all the trainers get sufficient training data, we require the input dataset to be infinite. This can be achieved, for example, by repeating the dataset and performing random augmentation on the training instances.
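The toy model below illustrates why a bounded sliding-window cache can cause a late trainer to miss early elements. It is a simplified pure-Python sketch, not the tf.data service implementation; the class name, the capacity, and the trainer IDs are hypothetical.

```python
from collections import deque
from itertools import count


class SlidingWindowCache:
    """Toy bounded cache shared by several readers (illustration only)."""

    def __init__(self, source, capacity):
        self._source = iter(source)
        self._window = deque(maxlen=capacity)  # most recent elements only
        self._next_index = 0                   # index of the next element to compute
        self._cursors = {}                     # trainer_id -> next index to read
        self.computed = 0                      # number of elements computed so far

    def get_next(self, trainer_id):
        cursor = self._cursors.get(trainer_id, 0)
        oldest = self._next_index - len(self._window)
        # Elements older than the window were evicted, so skip ahead.
        cursor = max(cursor, oldest)
        if cursor == self._next_index:
            # Nothing cached at this position yet: compute a new element.
            self._window.append(next(self._source))
            self._next_index += 1
            self.computed += 1
            oldest = self._next_index - len(self._window)
        element = self._window[cursor - oldest]
        self._cursors[trainer_id] = cursor + 1
        return element


# An infinite source stands in for a repeated dataset.
cache = SlidingWindowCache(count(), capacity=3)
trainer_a = [cache.get_next("a") for _ in range(5)]
trainer_b = [cache.get_next("b") for _ in range(3)]
print(trainer_a, trainer_b, cache.computed)
```

In this run the second trainer is served entirely from the cache, but it never sees the elements that had already slid out of the window. That is the reason an infinite input dataset is needed for every trainer to receive enough data.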
TensorFlow Decision Forests 1.0
In conjunction with the release of TensorFlow 2.10, TensorFlow Decision Forests (TF-DF) reaches version 1.0. With this milestone we want to communicate more broadly that TensorFlow Decision Forests has become a more stable and mature library. We’ve improved our documentation and established more comprehensive testing to make sure that TF-DF is ready for professional environments.
The new release of TF-DF also offers a first look at the JavaScript and Go APIs for inference of TF-DF models. While these APIs are still in beta, we are actively looking for feedback on them. TF-DF 1.0 also improves the performance of oblique splits. Oblique splits allow decision trees to express more complex patterns by conditioning on multiple features at the same time; learn more in our Decision Forests class on developers.google.com. Benchmarks and real-world observations show that oblique splits outperform classical axis-aligned splits on the majority of datasets. Finally, the new release includes our latest bug fixes.
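To illustrate the difference between the two kinds of split mentioned above, here is a small pure-Python sketch. It does not use TF-DF; the six points, the diagonal labeling rule, and the weights are made-up values chosen so that a single oblique condition separates the classes while no single axis-aligned condition does.

```python
# Toy 2-D points, labeled positive when x0 + x1 >= 1 (a diagonal boundary).
points = [(0.2, 0.1), (0.9, 0.8), (0.1, 0.7), (0.7, 0.6), (0.6, 0.1), (0.3, 0.9)]
labels = [x0 + x1 >= 1.0 for x0, x1 in points]


def axis_aligned_split(point, feature, threshold):
    """Classical condition: compares a single feature to a threshold."""
    return point[feature] >= threshold


def oblique_split(point, weights, threshold):
    """Oblique condition: compares a weighted sum of features to a threshold."""
    return sum(w * x for w, x in zip(weights, point)) >= threshold


def accuracy(predictions):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# A single oblique condition can follow the diagonal boundary.
oblique_acc = accuracy([oblique_split(p, (1.0, 1.0), 1.0) for p in points])

# Try every axis-aligned condition whose threshold is an observed feature value.
best_axis_acc = max(
    accuracy([axis_aligned_split(p, feature, threshold) for p in points])
    for feature in (0, 1)
    for threshold in [p[feature] for p in points]
)

print(oblique_acc, best_axis_acc)
```

A tree restricted to axis-aligned conditions would need several levels to approximate the same boundary.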
Next steps
Check out the release notes for more information. To stay up to date, you can read the TensorFlow blog, follow twitter.com/tensorflow, or subscribe to youtube.com/tensorflow. If you’ve built something you’d like to share, please submit it for our Community Spotlight at goo.gle/TFCS. For feedback, please file an issue on GitHub or post to the TensorFlow Forum. Thank you!