Making Better Future Predictions by Watching Unlabeled Videos

Posted by Dave Epstein, Student Researcher and Chen Sun, Staff Research Scientist, Google Research

Machine learning (ML) agents are increasingly deployed in the real world to make decisions and assist people in their daily lives. Making reasonable predictions about the future at varying timescales is one of the most important capabilities for such agents because it enables them to predict changes in the world around them, including other agents’ behaviors, and plan how to act next. Importantly, successful future prediction requires both capturing meaningful transitions in the environment (e.g., dough transforming into bread) and adapting to how transitions unfold over time in order to make decisions.

Previous work in future prediction from visual observations has largely been constrained by the format of its output (e.g., pixels that represent an image) or a manually-defined set of human activities (e.g., predicting if someone will keep walking, sit down, or jump). These are either too detailed and hard to predict or lack important information about the richness of the real world. For example, predicting “person jumping” does not capture why they’re jumping, what they’re jumping onto, etc. Also, with very few exceptions, previous models were designed to make predictions at a fixed offset into the future, which is a limiting assumption because we rarely know when meaningful future states will happen.

For example, in a video about making ice cream (depicted below), the meaningful transition from “cream” to “ice cream” occurs over 35 seconds, so models predicting such transitions would need to look 35 seconds ahead. But this time interval varies widely across different activities and videos — meaningful transitions can occur at any distance into the future. Learning to make such predictions at flexible intervals is hard because the desired ground truth may be relatively ambiguous. For example, the correct prediction could be the just-churned ice cream in the machine, or scoops of the ice cream in a bowl. In addition, collecting such annotations at scale (i.e., frame-by-frame for millions of videos) is infeasible. However, many existing instructional videos come with speech transcripts, which often offer concise, general descriptions throughout entire videos. This source of data can guide a model’s attention toward important parts of the video, obviating the need for manual labeling and allowing a flexible, data-driven definition of the future.

In “Learning Temporal Dynamics from Cycles in Narrated Video”, published at ICCV 2021, we propose an approach that is self-supervised, using a recent large unlabeled dataset of diverse human action. The resulting model operates at a high level of abstraction, can make predictions arbitrarily far into the future, and chooses how far into the future to predict based on context. Called Multi-Modal Cycle Consistency (MMCC), it leverages narrated instructional video to learn a strong predictive model of the future. We demonstrate how MMCC can be applied, without fine-tuning, to a variety of challenging tasks, and qualitatively examine its predictions. In the example below, MMCC predicts the future (d) from present frame (a), rather than less relevant potential futures (b) or (c).

This work uses cues from vision and language to predict high-level changes (such as cream becoming ice cream) in video (video from HowTo100M).

Viewing Videos as Graphs
The foundation of our method is to represent narrated videos as graphs. We view videos as a collection of nodes, where nodes are either video frames (sampled at 1 frame per second) or segments of narrated text (extracted with automatic speech recognition systems), encoded by neural networks. During training, MMCC constructs a graph from the nodes, using cross-modal edges to connect video frames and text segments that refer to the same state, and temporal edges to connect the present (e.g., strawberry-flavored cream) and the future (e.g., soft-serve ice cream). The temporal edges operate on both modalities equally — they can start from a video frame, a text segment, or both, and can connect to a future (or past) state in either modality. MMCC achieves this by learning a latent representation shared by frames and text and then making predictions in this representation space.
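
To make this structure concrete, here is a minimal sketch of the node set. It is only schematic: the names (Node, video_to_nodes) are hypothetical, and NumPy arrays stand in for the neural-network encoders:

from dataclasses import dataclass
import numpy as np

@dataclass
class Node:
  embedding: np.ndarray  # point in the latent space shared by frames and text
  modality: str          # "frame" or "text"
  timestamp: float       # seconds into the video

def video_to_nodes(frame_embs, text_embs, text_times):
  """One node per video frame (sampled at 1 fps) plus one per ASR text segment."""
  nodes = [Node(emb, "frame", float(t)) for t, emb in enumerate(frame_embs)]
  nodes += [Node(emb, "text", t) for emb, t in zip(text_embs, text_times)]
  return nodes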

Multi-modal Cycle Consistency
To learn the cross-modal and temporal edge functions without supervision, we apply the idea of cycle consistency. Here, cycle consistency refers to the construction of cycle graphs, in which the model constructs a series of edges from an initial node to other nodes and back again: Given a start node (e.g., a sample video frame), the model is expected to find its cross-modal counterpart (i.e., text describing the frame) and combine them as the present state. To do this, at the start of training, the model assumes that frames and text with the same timestamps are counterparts, but then relaxes this assumption later. The model then predicts a future state, and the node most similar to this prediction is selected. Finally, the model attempts to invert the above steps by predicting the present state backward from the future node, and thus connecting the future node back with the start node.

The discrepancy between the model’s prediction of the present from the future and the actual present is the cycle-consistency loss. Intuitively, this training objective requires the predicted future to contain enough information about its past to be invertible, leading to predictions that correspond to meaningful changes to the same entities (e.g., tomato becoming marinara sauce, or flour and eggs in a bowl becoming dough). Moreover, the inclusion of cross-modal edges ensures future predictions are meaningful in either modality.

To learn the temporal and cross-modal edge functions end-to-end, we use the soft attention technique, which first outputs how likely each node is to be the target node of the edge, and then “picks” a node by taking the weighted average among all possible candidates. Importantly, this cyclic graph constraint makes few assumptions for the kind of temporal edges the model should learn, as long as they end up forming a consistent cycle. This enables the emergence of long-term temporal dynamics critical for future prediction without requiring manual labels of meaningful changes.
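
The soft attention selection and the cycle-consistency objective can be sketched as follows. This is a schematic NumPy version, not the paper's implementation: predict_future and predict_past stand in for the learned temporal edge functions, and fusing the two modalities into a present state by averaging is a hypothetical simplification:

import numpy as np

def softmax(scores):
  e = np.exp(scores - scores.max())
  return e / e.sum()

def soft_select(query, candidates):
  """Soft attention: score how likely each candidate node is to be the edge's
  target, then "pick" a node as the weighted average over all candidates,
  keeping the selection differentiable end-to-end."""
  weights = softmax(candidates @ query)  # dot-product similarity scores
  return weights @ candidates

def cycle_consistency_loss(start, text_nodes, all_nodes, predict_future, predict_past):
  counterpart = soft_select(start, text_nodes)              # cross-modal edge
  present = (start + counterpart) / 2                       # fuse modalities (hypothetical)
  future = soft_select(predict_future(present), all_nodes)  # forward temporal edge
  predicted_present = predict_past(future)                  # backward temporal edge
  return np.sum((predicted_present - present) ** 2)         # discrepancy closing the cycle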

An example of the training objective: A cycle graph is expected to be constructed between the chicken with soy sauce and the chicken in chili oil because they are two adjacent steps in the chicken’s preparation (video from HowTo100M).

Discovering Cycles in Real-World Video
MMCC is trained without any explicit ground truth, using only long video sequences and randomly sampled starting conditions (a frame or text excerpt) and asking the model to find temporal cycles. After training, MMCC can identify meaningful cycles that capture complex changes in video.

Given frames as input (left), MMCC selects relevant text from video narrations and uses both modalities to predict a future frame (middle). It then finds text relevant to this future and uses it to predict the past (right). Using its knowledge of how objects and scenes change over time, MMCC “closes the cycle” and ends up where it started (videos from HowTo100M).
The model can also start from narrated text rather than frames and still find relevant transitions (videos from HowTo100M).

Zero-Shot Applications
For MMCC to identify meaningful transitions over time in an entire video, we define a “likely transition score” for each pair (A, B) of frames in a video, according to the model’s predictions — the closer B is to our model’s prediction of the future of A, the higher the score assigned. We then rank all pairs according to this score and show the highest-scoring pairs of present and future frames detected in previously unseen videos (examples below).
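
Concretely, the scoring could look like the sketch below, reusing the schematic embedding setup from earlier (predict_future is again a stand-in for the learned temporal edge function):

import numpy as np

def transition_scores(frame_embs, predict_future):
  """Score each ordered pair (A, B): the closer B is to the model's predicted
  future of A, the higher the score (negative squared distance as similarity)."""
  n = len(frame_embs)
  scores = np.full((n, n), -np.inf)
  for a in range(n):
    pred = predict_future(frame_embs[a])
    for b in range(n):
      if b != a:
        scores[a, b] = -np.sum((frame_embs[b] - pred) ** 2)
  return scores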

The highest-scoring pairs from eight random videos, which showcase the versatility of the model across a wide range of tasks (videos from HowTo100M).

We can use this same approach to temporally sort an unordered collection of video frames without any fine-tuning by finding an ordering that maximizes the overall confidence scores between all adjacent frames in the sorted sequence.
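
For a small collection this can be a brute-force search over orderings, as in the sketch below (reusing the hypothetical transition_scores above; larger collections would need a more efficient search):

from itertools import permutations

def unshuffle(frame_embs, predict_future):
  """Return the frame ordering that maximizes the summed transition score
  over all adjacent pairs (tractable only for a handful of frames)."""
  scores = transition_scores(frame_embs, predict_future)
  indices = range(len(frame_embs))
  return max(permutations(indices),
             key=lambda order: sum(scores[a, b] for a, b in zip(order, order[1:])))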

Left: Shuffled frames from three videos. Right: MMCC unshuffles the frames. The true order is shown under each frame. Even when MMCC does not recover the ground-truth order, its predictions often appear reasonable, presenting a plausible alternate ordering (videos from HowTo100M).

Evaluating Future Prediction
We evaluate the model’s ability to anticipate action, potentially minutes in advance, using the top-k recall metric, which here measures a model’s ability to retrieve the correct future (higher is better). On CrossTask, a dataset of instruction videos with labels describing key steps, MMCC outperforms the previous self-supervised state-of-the-art models in inferring possible future actions.

Model          Top-1 Recall   Top-5 Recall   Top-10 Recall
Cross-modal        2.9            14.2           24.3
Repr. Ant.         3.0            13.3           26.0
MemDPC             2.9            15.8           27.4
TAP                4.5            17.1           27.9
MMCC               5.4            19.9           33.8
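
The metric itself is simple to compute. A minimal sketch with hypothetical inputs (one vector of scores over candidate futures per query, plus the index of the correct future):

import numpy as np

def topk_recall(score_matrix, true_indices, k):
  """Fraction of queries whose ground-truth future ranks among the top k."""
  topk = np.argsort(-score_matrix, axis=1)[:, :k]
  hits = [truth in row for row, truth in zip(topk, true_indices)]
  return float(np.mean(hits))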

Conclusions
We have introduced a self-supervised method to learn temporal dynamics by cycling through narrated instructional videos. Despite the simplicity of the model’s architecture, it can discover meaningful long-term transitions in vision and language, and can be applied without further training to challenging downstream tasks, such as anticipating far-away action and ordering collections of images. An interesting future direction is transferring the model to agents so they can use it to conduct long-term planning.

Acknowledgements
The core team includes Dave Epstein, Jiajun Wu, Cordelia Schmid, and Chen Sun. We thank Alexei Efros, Mia Chiquier, and Shiry Ginosar for their feedback, and Allan Jabri for inspiration in figure design. Dave would like to thank Dídac Surís and Carl Vondrick for insightful early discussions on cycling through time in video.


Women in Machine Learning Symposium – Event Recap

Posted by Joana Carrasqueira, Program Manager, TensorFlow.

Thank you to everyone who joined us at the first Women in Machine Learning Symposium!

Hundreds of practitioners joined from all over the world to share tips and insights for careers in ML, how to be involved in the community, contribute to open source, and much more. It was very inspiring to learn from each other’s experiences. Following is a quick recap, and an overview of the resources we discussed at the event. Thanks again.

  • Online education
  • Get involved in the community
  • Build your portfolio
  • Connect with (or become) a GDE


NVIDIA Omniverse Enterprise Delivers the Future of 3D Design and Real-Time Collaboration

For millions of professionals around the world, 3D workflows are essential.

Everything they build, from cars to products to buildings, must first be designed or simulated in a virtual world. At the same time, more organizations are tackling complex designs while adjusting to a hybrid work environment.

As a result, design teams need a solution that helps them improve remote collaboration while managing 3D production pipelines. And NVIDIA Omniverse is the answer.

NVIDIA Omniverse Enterprise, now available, helps professionals across industries transform complex 3D design workflows. The groundbreaking platform lets global teams working across multiple software suites collaborate in real time in a shared virtual space.

Designed for the Present, Built for the Future

With Omniverse Enterprise, professionals gain new capabilities to boost traditional visualization workflows. It’s a newly launched subscription that brings fully supported software to 3D organizations of any scale.

The foundation of Omniverse is Pixar’s Universal Scene Description, an open-source file format that enables users to enhance their design process with real-time interoperability across applications. Additionally, the platform is built on NVIDIA RTX technology, so creators can render faster, do multiple iterations at no opportunity cost, and quickly achieve their final designs with stunning, photorealistic detail.

Ericsson, a leading telecommunications company, is using Omniverse Enterprise to create a digital twin of a 5G radio network to simulate and visualize signal propagation and performance. Within Omniverse, Ericsson has built a true-to-reality city-scale simulation environment, bringing in scenes, models and datasets from Esri CityEngine.

A New Experience for 3D Design

Omniverse Enterprise is available worldwide through global computer makers BOXX Technologies, Dell Technologies, HP, Lenovo and Supermicro. Many companies have already experienced the advanced capabilities of the platform.

Epigraph creates physically accurate 3D assets and product experiences for e-commerce, serving companies such as Black & Decker, Yamaha and Wayfair. BOXX Technologies helped Epigraph achieve faster rendering with Omniverse Enterprise and NVIDIA RTX A6000 graphics. The advanced RTX Renderer in Omniverse enabled Epigraph to render images at final-frame quality faster, while significantly reducing the amount of computational resources needed.

Media.Monks is exploring ways to enhance and extend their workflows in a virtual world with Omniverse Enterprise, together with HP. The combination of remote computing and collocated workstations enables the Media.Monks design, creative and solutions teams to accelerate their clients’ digital transformation toward a more decentralized future. In collaboration with NVIDIA and HP, Media.Monks is exploring new approaches and the convergence of collaboration, real-time graphics, and live broadcast for a new era of brand virtualization.

Dell Technologies is presenting at GTC to show how Omniverse is advancing the hybrid workforce with Dell Precision workstations, Dell EMC PowerEdge servers and Dell Technologies Validated Designs. The interactive panel discussion will dive into why users need Omniverse today, and how Dell is helping more professionals adopt this solution, from the desktop to the data center.

And Lenovo is showcasing how advanced technologies like Omniverse are making remote collaboration seamless. Whether it’s connecting to a powerful mobile workstation on the go, a physical workstation back in the office, or a virtual workstation in the data center, Lenovo, TGX and NVIDIA are providing remote workers with the same experience they get at the office.

These systems manufacturers have also enabled other Omniverse Enterprise customers such as Kohn Pedersen Fox, Woods Bagot and WPP to improve their efficiency and productivity with real-time collaboration.

Experience Virtual Worlds With NVIDIA Omniverse

NVIDIA Omniverse Enterprise is now generally available by subscription from BOXX Technologies, Dell Technologies, HP, Lenovo and Supermicro.

The platform is optimized and certified to run on NVIDIA RTX professional mobile workstations and NVIDIA-Certified Systems, including desktops and servers on the NVIDIA EGX platform.

With Omniverse Enterprise, creative and design teams can connect their Autodesk 3ds Max, Maya and Revit, Epic Games’ Unreal Engine, McNeel & Associates Rhino, Grasshopper and Trimble SketchUp workflows through live-edit collaboration. Learn more about NVIDIA Omniverse Enterprise and our 30-day evaluation program. For individual artists, there’s also a free beta version of the platform available for download.

Watch NVIDIA founder and CEO Jensen Huang’s GTC keynote address.



Catch Some Rays This GFN Thursday With ‘Jurassic World Evolution 2’ and ‘Bright Memory: Infinite’ Game Launches

This week’s GFN Thursday packs a prehistoric punch with the release of Jurassic World Evolution 2. It also gets infinitely brighter with the release of Bright Memory: Infinite.

Both games feature NVIDIA RTX technologies and are part of the six titles joining the GeForce NOW library this week.

GeForce NOW RTX 3080 members will get the peak cloud gaming experience in these titles and more. In addition to RTX ON, they’ll stream both games at up to 1440p and 120 frames per second on PC and Mac, and up to 4K on SHIELD.

Preorders for six-month GeForce NOW RTX 3080 memberships are currently available in North America and Europe for $99.99. Sign up today to be among the first to experience next-generation gaming.

The Latest Tech, Streaming From the Cloud

GeForce RTX GPUs give PC gamers the best visual quality and highest frame rates. They also power NVIDIA RTX technologies. And with GeForce RTX 3080-class GPUs making their way to the cloud in the GeForce NOW SuperPOD, the most advanced platform for ray tracing and AI is now available across nearly any low-powered device.

The next generation of cloud gaming is powered by the GeForce NOW SuperPOD, built on second-generation RTX technology and the NVIDIA Ampere architecture.

Real-time ray tracing creates the most realistic and immersive graphics in supported games, rendering environments in cinematic quality. NVIDIA DLSS gives games a speed boost with uncompromised image quality, thanks to advanced AI.

With GeForce NOW’s Priority and RTX 3080 memberships, gamers can take advantage of these features in numerous top games, including new releases like Jurassic World Evolution 2 and Bright Memory: Infinite.

The added performance from the latest generation of NVIDIA GPUs also means GeForce NOW RTX 3080 members have exclusive access to stream at up to 1440p at 120 FPS on PC, 1600p at 120 FPS on most MacBooks, 1440p at 120 FPS on most iMacs, 4K HDR at 60 FPS on NVIDIA SHIELD TV and up to 120 FPS on select Android devices.

Welcome to …

Immerse yourself in a world evolved in a compelling, original story, experience the chaos of “what-if” scenarios from the iconic Jurassic World and Jurassic Park films and discover over 75 awe-inspiring dinosaurs, including brand-new flying and marine reptiles. Play with support for NVIDIA DLSS this week on GeForce NOW.

GeForce NOW gives your low-end rig the power to play Jurassic World Evolution 2 with even higher graphics settings thanks to NVIDIA DLSS, streaming from the cloud.

Blinded by the (Ray-Traced) Light

FYQD-studio, a one-man development team that released Bright Memory in 2020, is back with a full-length sequel, Bright Memory: Infinite, streaming from the cloud with RTX ON.

Bright Memory: Infinite combines the FPS and action genres with dazzling visuals, amazing set pieces and exciting action. Mix and match available skills and abilities to unleash magnificent combos on enemies. Cut through the opposing forces with your sword, or lock and load with ranged weaponry, customized with a variety of ammunition. The choice is yours.

Priority and GeForce NOW RTX 3080 members can experience every moment of the action the way FYQD-studio intended, gorgeously rendered with ray-traced reflections, ray-traced shadows, ray-traced caustics and dazzling RTX Global Illumination. And GeForce NOW RTX 3080 members can play at up to 1440p and 120 FPS on PC and Mac.

Never Run Out of Gaming

GFN Thursday always means more games.

Members can find these six new games streaming on the cloud this week:

  • Bright Memory: Infinite (new game launch on Steam)
  • Epic Chef (new game launch on Steam)
  • Jurassic World Evolution 2 (new game launch on Steam and Epic Games Store)
  • MapleStory (Steam)
  • Severed Steel (Steam)
  • Tale of Immortal (Steam)

We make every effort to launch games on GeForce NOW as close to their release as possible, but, in some instances, games may not be available immediately.

What are you planning to play this weekend? Let us know on Twitter or in the comments below.



How Researchers Use NVIDIA AI to Help Mitigate Misinformation

Researchers tackling the challenge of visual misinformation — think the TikTok video of Tom Cruise supposedly golfing in Italy during the pandemic — must continuously advance their tools to identify AI-generated images.

NVIDIA is furthering this effort by collaborating with researchers to support the development and testing of detector algorithms on our state-of-the-art image-generation models.

By crafting a dataset of highly realistic images with StyleGAN3 — our latest, state-of-the-art media generation algorithm — NVIDIA provided crucial information to researchers testing how well their detector algorithms work when tested on AI-generated images created by previously unseen techniques. These detectors help experts identify and analyze synthetic images to combat visual misinformation.

At this week’s NVIDIA GTC, this work was shared in a session titled “Alias-Free Generative Adversarial Networks,” which provided an overview of StyleGAN3. To watch on demand, register free for GTC.

“This has been a unique situation in that people doing image generation detection have worked closely with the people at NVIDIA doing image generation,” said Edward Delp, a professor at Purdue University and principal investigator of one of the research teams. “This collaboration with NVIDIA has allowed us to build even better and more robust detectors. The ‘early access’ approach used by NVIDIA is an excellent way to further forensics research.”

Advancing Media Forensics With StyleGAN3 Images

When researchers know the underlying code or neural network of an image-generation technique, developing a detector that can identify images created by that AI model is a comparatively straightforward task.

It’s more challenging — and useful — to build a detector that can spot images generated by brand-new AI models.

StyleGAN3, a model developed by NVIDIA Research that will be presented at the NeurIPS 2021 AI conference in December, advances the state of the art in generative adversarial networks used to synthesize images. The breakthrough brings graphics principles in signal processing and image processing to GANs to avoid aliasing: a kind of image corruption often visible when images are rotated, scaled or translated.

NVIDIA researchers developed StyleGAN3 using a publicly released dataset of 70,000 images. Another 27,000 unreleased images from that collection, alongside AI-generated images from StyleGAN3, were shared with forensic research collaborators as a test dataset.

The collaboration with researchers enabled the community to assess how a diversity of different detector approaches performs in identifying images synthesized by StyleGAN3 — before the generator’s code was publicly released.

These detectors work in many different ways: Some may look for telltale correlations among groups of pixels produced by the neural network, while others might look for inconsistencies or asymmetries that give away synthetic images. Yet others attempt to reverse engineer the synthesis approach to estimate if a particular neural network could have created the image.

One of these detectors, GAN-Scanner, reaches up to 95 percent accuracy in identifying synthetic images generated with StyleGAN3, despite never having seen an image created by that model during training. Another detector, created by Politecnico di Milano, achieves an area under the curve of 0.999 (where a perfect classifier would achieve an AUC of 1.0).

Our work with researchers on StyleGAN3 showcases and supports the important, cutting-edge research done by media forensics groups. We hope it inspires others in the image-synthesis research community to participate in forensics research as well.

Source code for NVIDIA StyleGAN3 is available on GitHub, as well as results and links for the detector collaboration discussed here. The paper behind the research can be found on arXiv.

The GAN detector collaboration is part of Semantic Forensics (SemaFor), a program focused on forensic analysis of media organized by DARPA, the U.S. federal agency for technology research and development.

To learn more about the latest in AI research, watch NVIDIA CEO Jensen Huang’s keynote presentation at GTC.



What’s new in TensorFlow 2.7?

Posted by Goldie Gadde and Josh Gordon for the TensorFlow team

TensorFlow 2.7 is here! This release improves usability with clearer error messages and simplified stack traces, and adds new tools and documentation for users migrating to TF2.

Improved Debugging Experience

The process of debugging your code is a fundamental part of the user experience of a machine learning framework. In this release, we’ve considerably improved the TensorFlow debugging experience to make it more productive and more enjoyable, via three major changes: simplified stack traces, displaying additional context information in errors that originate from custom Keras layers, and a wide-ranging audit of all error messages in Keras and TensorFlow.

Simplified stack traces

By default, TensorFlow now filters the stack traces displayed upon error to hide any frame that originates from TensorFlow-internal code, keeping the information focused on what matters to you: your own code. This makes stack traces simpler and shorter, and it makes the problems in your code easier to understand and fix.

If you’re actually debugging the TensorFlow codebase itself (for instance, because you’re preparing a PR for TensorFlow), you can turn off the filtering mechanism by calling tf.debugging.disable_traceback_filtering().
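
For example:

import tensorflow as tf

# Show full, unfiltered stack traces again, e.g., while working on TensorFlow itself.
tf.debugging.disable_traceback_filtering()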

Automatic context injection for Keras layer exceptions

One of the most common use cases for writing low-level code is creating custom Keras layers, so we wanted to make debugging your layers as easy and productive as possible. The first thing you do when you’re debugging a layer is to print the shapes and dtypes of its inputs, as well the value of its training and mask arguments. We now add this information automatically to all stack traces that originate from custom Keras layers.
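
As a small illustration, consider a hypothetical custom layer with a deliberate shape bug. When the multiply below fails, the TF 2.7 stack trace now includes the layer’s input shapes and dtypes along with its call arguments:

import tensorflow as tf

class ScaleLayer(tf.keras.layers.Layer):  # hypothetical custom layer
  def call(self, inputs):
    # Deliberate bug: a [2]-shaped constant cannot broadcast against [4, 3] inputs.
    return inputs * tf.constant([1.0, 2.0])

# Raises InvalidArgumentError, with the layer call context appended to the trace.
ScaleLayer()(tf.ones((4, 3)))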

See the effect of stack trace filtering and call context information display in practice in the image below:

Simplified stack traces in TensorFlow 2.7

Audit and improve all error messages in the TensorFlow and Keras codebases

Lastly, we’ve audited every error message in the Keras and TensorFlow codebases (thousands of error locations!) and improved them to make sure they follow UX best practices. A good error message should tell you what the framework expected, what you did that didn’t match the framework’s expectations, and should provide tips to fix the problem.

Improve tf.function error messages

We have improved two common types of tf.function error messages, runtime error messages and “Graph” tensor error messages, by including tracebacks that point to the error source in the user’s code. We have also updated other vague or inaccurate tf.function error messages to make them clearer and more accurate.

For the runtime error message caused by the following user code:

@tf.function
def f():
  l = tf.range(tf.random.uniform((), minval=1, maxval=10, dtype=tf.int32))
  return l[20]

A summary of the old error message looks like

# … Python stack trace of the function call …

InvalidArgumentError: slice index 20 of dimension 0 out of bounds.
[[node strided_slice (defined at <ipython-input-8-250c76a76c0e>:5) ]] [Op:__inference_f_75]

Errors may have originated from an input operation.
Input Source operations connected to node strided_slice:
range (defined at <ipython-input-8-250c76a76c0e>:4)

Function call stack:
f

A summary of the new error message looks like

# … Python stack trace of the function call …

InvalidArgumentError: slice index 20 of dimension 0 out of bounds.
[[node strided_slice
(defined at <ipython-input-3-250c76a76c0e>:5)
]] [Op:__inference_f_15]

Errors may have originated from an input operation.
Input Source operations connected to node strided_slice:
In[0] range (defined at <ipython-input-3-250c76a76c0e>:4)
In[1] strided_slice/stack:
In[2] strided_slice/stack_1:
In[3] strided_slice/stack_2:

Operation defined at: (most recent call last)
# … Stack trace of the error within the function …
>>> File "<ipython-input-3-250c76a76c0e>", line 7, in <module>
>>> f()
>>>
>>> File "<ipython-input-3-250c76a76c0e>", line 5, in f
>>> return l[20]
>>>

The main difference is that runtime errors raised while executing a tf.function now include a stack trace which shows the source of the error in the user’s code.

# … Original error message and information …
# … More stack frames …
>>> File "<ipython-input-3-250c76a76c0e>", line 7, in <module>
>>> f()
>>>
>>> File "<ipython-input-3-250c76a76c0e>", line 5, in f
>>> return l[20]
>>>

For the “Graph” tensor error messages caused by the following user code:

x = None

@tf.function
def leaky_function(a):
  global x
  x = a + 1  # Bad - leaks local tensor
  return a + 2

@tf.function
def captures_leaked_tensor(b):
  b += x
  return b

leaky_function(tf.constant(1))
captures_leaked_tensor(tf.constant(2))

A summary of the old error message looks like

# … Python stack trace of the function call …

TypeError: An op outside of the function building code is being passed
a "Graph" tensor. It is possible to have Graph tensors
leak out of the function building context by including a
tf.init_scope in your function building code.
For example, the following function will fail:
  @tf.function
  def has_init_scope():
    my_constant = tf.constant(1.)
    with tf.init_scope():
      added = my_constant * 2
The graph tensor has name: add:0

A summary of the new error message looks like

# … Python stack trace of the function call …

TypeError: Originated from a graph execution error.

The graph execution error is detected at a node built at (most recent call last):
# … Stack trace of the error within the function …
>>> File <ipython-input-5-95ca3a98778f>, line 6, in leaky_function
# … More stack trace of the error within the function …

Error detected in node 'add' defined at: File "<ipython-input-5-95ca3a98778f>", line 6, in leaky_function

TypeError: tf.Graph captured an external symbolic tensor. The symbolic tensor 'add:0' created by node 'add' is captured by the tf.Graph being executed as an input. But a tf.Graph is not allowed to take symbolic tensors from another graph as its inputs. Make sure all captured inputs of the executing tf.Graph are not symbolic tensors. Use return values, explicit Python locals or TensorFlow collections to access it. Please see https://www.tensorflow.org/guide/function#all_outputs_of_a_tffunction_must_be_return_values for more information.

The main difference is that errors for attempting to capture a tensor that was leaked from an unreachable graph now include a stack trace which shows where the tensor was created in the user’s code:

# … Original error message and information …
# … More stack frames …
>>> File <ipython-input-5-95ca3a98778f>, line 6, in leaky_function

Error detected in node 'add' defined at: File "<ipython-input-5-95ca3a98778f>", line 6, in leaky_function

TypeError: tf.Graph captured an external symbolic tensor. The symbolic tensor 'add:0' created by node 'add' is captured by the tf.Graph being executed as an input. But a tf.Graph is not allowed to take symbolic tensors from another graph as its inputs. Make sure all captured inputs of the executing tf.Graph are not symbolic tensors. Use return values, explicit Python locals or TensorFlow collections to access it. Please see https://www.tensorflow.org/guide/function#all_outputs_of_a_tffunction_must_be_return_values for more information.

Introducing tf.experimental.ExtensionType

User-defined types can make your projects more readable, modular, and maintainable. TensorFlow 2.7.0 introduces the ExtensionType API, which can be used to create user-defined object-oriented types that work seamlessly with TensorFlow’s APIs. Extension types are a great way to track and organize the tensors used by complex models. Extension types can also be used to define new tensor-like types, which specialize or extend the basic concept of “Tensor.” To create an extension type, simply define a Python class with tf.experimental.ExtensionType as its base, and use type annotations to specify the type for each field:

import typing
import tensorflow as tf

class TensorGraph(tf.experimental.ExtensionType):
  """A collection of labeled nodes connected by weighted edges."""
  edge_weights: tf.Tensor                       # shape=[num_nodes, num_nodes]
  node_labels: typing.Mapping[str, tf.Tensor]   # shape=[num_nodes]; dtype=any

class MaskedTensor(tf.experimental.ExtensionType):
  """A tensor paired with a boolean mask, indicating which values are valid."""
  values: tf.Tensor
  mask: tf.Tensor       # shape=values.shape; false for missing/invalid values.

class CSRSparseMatrix(tf.experimental.ExtensionType):
  """Compressed sparse row matrix (https://en.wikipedia.org/wiki/Sparse_matrix)."""
  values: tf.Tensor     # shape=[num_nonzero]; dtype=any
  col_index: tf.Tensor  # shape=[num_nonzero]; dtype=int64
  row_index: tf.Tensor  # shape=[num_rows+1]; dtype=int64

The ExtensionType base class adds a constructor and special methods based on the field type annotations (similar to typing.NamedTuple and @dataclasses.dataclass from the standard Python library). You can optionally customize the type by overriding these defaults, or adding new methods, properties, or subclasses.
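
For example, the generated constructor and fields can be used directly. A minimal sketch using the MaskedTensor type defined above:

mt = MaskedTensor(values=tf.constant([1.0, 2.0, 3.0]),
                  mask=tf.constant([True, True, False]))
print(mt.values)  # constructor and fields follow the type annotations
print(mt.mask)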

Extension types are supported by the following TensorFlow APIs:

  • Keras: Extension types can be used as inputs and outputs for Keras Models and Layers.
  • Dataset: Extension types can be included in Datasets, and returned by dataset Iterators.
  • TensorFlow hub: Extension types can be used as inputs and outputs for tf.hub modules.
  • SavedModel: Extension types can be used as inputs and outputs for SavedModel functions.
  • tf.function: Extension types can be used as arguments and return values for functions wrapped with the @tf.function decorator (see the sketch after this list).
  • control flow: Extension types can be used by control flow operations, such as tf.cond and tf.while_loop. This includes control flow operations added by autograph.
  • tf.py_function: Extension types can be used as arguments and return values for the func argument to tf.py_function.
  • Tensor ops: Extension types can be extended to support most TensorFlow ops that accept Tensor inputs (e.g., tf.matmul, tf.gather, and tf.reduce_sum), using dispatch decorators.
  • distribution strategy: Extension types can be used as per-replica values.
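
As one illustration of the tf.function support above, an extension type can flow through a traced function like any tensor. This is only a sketch, reusing the hypothetical MaskedTensor value mt constructed earlier:

@tf.function
def masked_sum(mt: MaskedTensor) -> tf.Tensor:
  # Zero out invalid entries, then reduce; the extension type traces like a regular input.
  return tf.reduce_sum(tf.where(mt.mask, mt.values, tf.zeros_like(mt.values)))

masked_sum(mt)  # ==> 3.0 (1.0 + 2.0; the masked-out third value is ignored)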

For more information about extension types, see the Extension Type guide.

Note: The tf.experimental prefix indicates that this is a new API, and we would like to collect feedback from real-world usage; barring any unforeseen design issues, we plan to migrate ExtensionType out of the experimental package in accordance with the TF experimental policy.

TF2 Migration made easier!

To support users interested in migrating their workloads from TF1 to TF2, we have created a new Migrate to TF2 tab on the TensorFlow website, which includes updated guides and completely new documentation with concrete, runnable examples in Colab.

A new shim tool has been added that dramatically eases migration of variable_scope-based models to TF2. It is expected to enable most TF1 users to run existing model architectures as-is (or with only minor adjustments) in TF2 pipelines without having to rewrite their modeling code. You can learn more about it in the model mapping guide.

New community contributed models on TensorFlow Hub

Since the last TensorFlow release, the community really came together to make many new models available on TensorFlow Hub. Now you can find models like MLP-Mixer, Vision Transformers, Wav2Vec2, RoBERTa, ConvMixer, DistilBERT, YoloV5 and many more. All of these models are ready to use via TensorFlow Hub. You can learn more about publishing your models here.

Next steps

Check out the release notes for more information. To stay up to date, you can read the TensorFlow blog, follow twitter.com/tensorflow, or subscribe to youtube.com/tensorflow. If you’ve built something you’d like to share, please submit it for our Community Spotlight at goo.gle/TFCS. For feedback, please file an issue on GitHub or post to the TensorFlow Forum. Thank you!


Model Ensembles Are Faster Than You Think

Posted by Xiaofang Wang, Intern and Yair Alon (prev. Movshovitz-Attias), Software Engineer, Google Research

When building a deep model for a new machine learning application, researchers often begin with existing network architectures, such as ResNets or EfficientNets. If the initial model’s accuracy isn’t high enough, a larger model may be a tempting alternative, but may not actually be the best solution for the task at hand. Instead, better performance potentially could be achieved by designing a new model that is optimized for the task. However, such efforts can be challenging and usually require considerable resources.

In “Wisdom of Committees: An Overlooked Approach to Faster and More Accurate Models”, we discuss model ensembles and a subset called model cascades, both of which are simple approaches that construct new models by collecting existing models and combining their outputs. We demonstrate that ensembles of even a small number of models that are easily constructed can match or exceed the accuracy of state-of-the-art models while being considerably more efficient.

What Are Model Ensembles and Cascades?
Ensembles and cascades are related approaches that leverage the advantages of multiple models to achieve a better solution. Ensembles execute multiple models in parallel and then combine their outputs to make the final prediction. Cascades are a subset of ensembles, but execute the collected models sequentially, and merge the solutions once the prediction has a high enough confidence. For simple inputs, cascades use less computation, but for more complex inputs, may end up calling on a greater number of models, resulting in higher computation costs.

Overview of ensembles and cascades. While this example shows 2-model combinations for both ensembles and cascades, any number of models can potentially be used.
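
In code, the two strategies differ mainly in when they stop. A minimal NumPy sketch, where each model maps an input to a vector of class probabilities and the confidence threshold is a hypothetical choice:

import numpy as np

def ensemble_predict(models, x):
  """Ensemble: run every model (in parallel, in practice) and average their outputs."""
  return np.mean([m(x) for m in models], axis=0)

def cascade_predict(models, x, threshold=0.9):
  """Cascade: run models sequentially, exiting early once the running average
  prediction is confident enough (confidence = max class probability)."""
  outputs = []
  for m in models:
    outputs.append(m(x))
    avg = np.mean(outputs, axis=0)
    if avg.max() >= threshold:
      break
  return avg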

Compared to a single model, ensembles can provide improved accuracy if there is variety in the collected models’ predictions. For example, the majority of images in ImageNet are easy for contemporary image recognition models to classify, but there are many images for which predictions vary between models and that will benefit most from an ensemble.

While ensembles are well-known, they are often not considered a core building block of deep model architectures and are rarely explored when researchers are developing more efficient models (with a few notable exceptions [1, 2, 3]). Therefore, we conduct a comprehensive analysis of ensemble efficiency and show that a simple ensemble or cascade of off-the-shelf pre-trained models can enhance both the efficiency and accuracy of state-of-the-art models.

To encourage the adoption of model ensembles, we demonstrate the following beneficial properties:

  1. Simple to build: Ensembles do not require complicated techniques (e.g., early exit policy learning).
  2. Easy to maintain: Ensembles are trained independently, making them easy to maintain and deploy.
  3. Affordable to train: The total training cost of models in an ensemble is often lower than a similarly accurate single model.
  4. On-device speedup: The reduction in computation cost (FLOPS) successfully translates to a speedup on real hardware.

Efficiency and Training Speed
It’s not surprising that ensembles can increase accuracy, but using multiple models in an ensemble may introduce extra computational cost at runtime. So, we investigate whether an ensemble can be more accurate than a single model that has the same computational cost. We analyze a series of models, EfficientNet-B0 to EfficientNet-B7, that have different levels of accuracy and FLOPS when applied to ImageNet inputs. The ensemble predictions are computed by averaging the predictions of each individual model.

We find that ensembles are significantly more cost-effective in the large computation regime (>5B FLOPS). For example, an ensemble of two EfficientNet-B5 models matches the accuracy of a single EfficientNet-B7 model, but does so using ~50% fewer FLOPS. This demonstrates that instead of using a large model, in this situation, one should use an ensemble of multiple considerably smaller models, which will reduce computation requirements while maintaining accuracy. Moreover, we find that the training cost of an ensemble can be much lower (e.g., two B5 models: 96 TPU days total; one B7 model: 160 TPU days). In practice, model ensemble training can be parallelized using multiple accelerators leading to further reductions. This pattern holds for the ResNet and MobileNet families as well.

Ensembles outperform single models in the large computation regime (>5B FLOPS).

Power and Simplicity of Cascades
While we have demonstrated the utility of model ensembles, applying an ensemble is often wasteful for easy inputs where a subset of the ensemble will give the correct answer. In these situations, cascades save computation by allowing for an early exit, potentially stopping and outputting an answer before all models are used. The challenge is to determine when to exit from the cascade.

To highlight the practical benefit of cascades, we intentionally choose a simple heuristic to measure the confidence of the prediction — we take the confidence of the model to be the maximum of the probabilities assigned to each class. For example, if the predicted probabilities for an image being either a cat, dog, or horse were 10%, 80% and 10%, respectively, then the confidence of the model’s prediction (dog) would be 0.8. We use a threshold on the confidence score to determine when to exit from the cascade.

To test this approach, we build model cascades for the EfficientNet, ResNet, and MobileNetV2 families to match either computation costs or accuracy (limiting the cascade to a maximum of four models). By design in cascades, some inputs incur more FLOPS than others, because more challenging inputs go through more models in the cascade than easier inputs. So we report the average FLOPS computed over all test images. We show that cascades outperform single models in all computation regimes (when FLOPS range from 0.15B to 37B) and can enhance accuracy or reduce the FLOPS (sometimes both) for all models tested.

Cascades of EfficientNet (left), ResNet (middle) and MobileNetV2 (right) models on ImageNet. When using similar FLOPS, cascades obtain a higher accuracy than single models (shown by the red arrows pointing up). Cascades can also match the accuracy of single models with significantly fewer FLOPS, e.g., 5.4x fewer for B7 (green arrows pointing left).
Summary of accuracy vs. FLOPS for ensembles and cascades. Squares and stars represent ensembles and cascades, respectively, and the “+” notation indicates the models that comprise the ensemble or cascade. For example, “B3+B4+B5+B7” at a star refers to a cascade of EfficientNet-B3, B4, B5 and B7 models.

In some cases it is not the average computation cost but the worst-case cost that is the limiting factor. By adding a simple constraint to the cascade building procedure, one can guarantee an upper bound to the computation cost of the cascade. See the paper for more details.

Other than convolutional neural networks, we also consider a Transformer-based architecture, ViT. We build a cascade of ViT-Base and ViT-Large models to match the average computation or accuracy of a single state-of-the-art ViT-Large model, and show that the benefit of cascades also generalizes to Transformer-based architectures.

Model        Single Models            Cascades – Similar Throughput        Cascades – Similar Accuracy
             Top-1 (%)   Throughput   Top-1 (%)   Throughput   △Top-1      Top-1 (%)   Throughput   SpeedUp
ViT-L-224    82.0        192          83.1        221          1.1         82.3        409          2.1x
ViT-L-384    85.0        54           86.0        69           1.0         85.2        125          2.3x
Cascades of ViT models on ImageNet. “224” and “384” indicate the image resolution on which the model is trained. Throughput is measured as the number of images processed per second. Our cascades can achieve a 1.0% higher accuracy than ViT-L-384 with a similar throughput or achieve a 2.3x speedup over that model while matching its accuracy.

Earlier works on cascades have also shown efficiency improvements for state-of-the-art models, but here we demonstrate that a simple approach with a handful of models is sufficient.

Inference Latency
In the above analysis, we average FLOPS to measure the computational cost. It is also important to verify that the FLOPS reduction obtained by cascades actually translates into speedup on hardware. We examine this by comparing on-device latency and speed-up for similarly performing single models versus cascades. We find a reduction in the average online latency on TPUv3 of up to 5.5x for cascades of models from the EfficientNet family compared to single models with comparable accuracy. The larger the models, the greater the speedup we find with comparable cascades.

Average latency of cascades on TPUv3 for online processing. Each pair of same colored bars has comparable accuracy. Notice that cascades provide drastic latency reduction.

Building Cascades from Large Pools of Models
Above, we limit the model types and only consider ensembles/cascades of at most four models. While this highlights the simplicity of using ensembles, it also allows us to check all combinations of models in very little time, so we can find optimal model collections with only a few CPU hours on a held-out set of predictions.
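
Such a brute-force search is simple to sketch: enumerate every combination of up to four models (assuming the pool is ordered cheap to expensive), replay each candidate cascade over held-out per-model predictions using the same confidence rule as the sketch above, and record its accuracy and average FLOPS. All names below are hypothetical:

from itertools import combinations
import numpy as np

def cascade_metrics(order, probs, labels, flops, threshold=0.9):
  """Evaluate one cascade on held-out predictions.
  probs[m] holds [num_examples, num_classes] probabilities for model m."""
  num_examples = len(labels)
  total_flops = correct = 0
  for i in range(num_examples):
    outputs = []
    for m in order:
      outputs.append(probs[m][i])
      total_flops += flops[m]
      avg = np.mean(outputs, axis=0)
      if avg.max() >= threshold:  # confident enough: early exit
        break
    correct += int(np.argmax(avg) == labels[i])
  return correct / num_examples, total_flops / num_examples

def search_cascades(model_names, probs, labels, flops, max_size=4):
  """Exhaustively score every cascade of up to max_size models."""
  return {combo: cascade_metrics(combo, probs, labels, flops)
          for k in range(1, max_size + 1)
          for combo in combinations(model_names, k)}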

When a large pool of models exists, we would expect cascades to be even more efficient and accurate, but brute force search is not feasible. However, efficient cascade search methods have been proposed. For example, the algorithm of Streeter (2018), when applied to a large pool of models, produced cascades that matched the accuracy of state-of-the-art neural architecture search–based ImageNet models with significantly fewer FLOPS, for a range of model sizes.

Conclusion
As we have seen, ensemble/cascade-based models obtain superior efficiency and accuracy over state-of-the-art models from several standard architecture families. In our paper we show more results for other models and tasks. For practitioners, this outlines a simple procedure to boost accuracy while retaining efficiency using off-the-shelf models. We encourage you to try it out!

Acknowledgement
This blog presents research done by Xiaofang Wang (while interning at Google Research), Dan Kondratyuk, Eric Christiansen, Kris M. Kitani, Yair Alon (prev. Movshovitz-Attias), and Elad Eban. We thank Sergey Ioffe, Shankar Krishnan, Max Moroz, Josh Dillon, Alex Alemi, Jascha Sohl-Dickstein‎, Rif A Saurous, and Andrew Helton for their valuable help and feedback.
