High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks

High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks

Fig. 1: The central hypothesis: within a dataset with finite samples, there are correlations between the high-frequency component and the “semantic” component of the images. As a result, the model will perceive both the high-frequency component as well as the “semantic” ones, leading to generalization behavior counterintuitive to humans.

It’s all about data

There are many works aiming to explain the generalization behavior of neural networks using heavy mathematical machinery, but we will do something different here: with a simple and intuitive twist of data, we will show that many generalization mysteries (like adversarial vulnerability, BatchNorm’s efficacy, and the “generalization paradox”) might be results of our overconfidence in processing data through naked eyes. Or simply: 

The models may have not outsmarted us, but the data has.

Let’s start with an interesting observation (Fig. 2): we trained a ResNet-18 with the Cifar10 dataset, picked a test sample, and plotted the model’s prediction confidence for this sample. Then we mapped the sample into the frequency domain through Fourier transform, and cut the frequency representation into its high-frequency component (HFC) and low-frequency component (LFC). We reconstructed the image through these two components and fed the reconstructed image into the model:

  • HFC-reconstructed images look distinctly different from the original sample but predicted to be the same label.
  • LFC-reconstructed images look similar to the original sample but the model classifies them differently.
Fig. 2: The striking misalignment between human and models: HFC-reconstructed images look distinctly different from the original sample, but predicted to be the same label; LFC-reconstructed images look similar to the original sample but the model classifies them differently.

Although this phenomenon can only be observed with a subset of samples (~600 images), it’s striking enough to raise an alarm.

Why does a model behave like this?

We believe the underlying reason is the coincidental correlation between HFC and the “semantics” depicted within a dataset (Fig. 1). With a finite number of samples from the same distribution, chances are that the human-imperceptible HFC is correlated with how a human annotates the image; thus, when a model is optimized to reduce the training loss, it can pick up either the “semantics” or HFC to reduce the loss, leading to high prediction accuracy even though the model may not truly “understand” the data. 

Please note: we are not claiming that the model itself has a tendency to capture HFC. Instead, our main argument is that a generic model does not have the incentive to learn LFC only; thus, it may end up learning a mixture of LFC and HFC.

Also, one may wonder whether the fact that models can capture HFC is promising or worrisome. One side may argue that this enables the development of models that can surpass humans on the test data (likely only when the test data is from the same distribution as the training data), while the other side may argue that the resulting models, despite performing better on test data from the same distribution, may underperform after deployed on similar data from other distributions (i.e., HFC may be dataset-specific). This post does not intend to resolve this argument, but only to offer the observations we made.

Observations

We can leverage the main observation to help explain multiple previously elusive empirical phenomena. In this post, we will highlight two discussions from our paper.

One of the roots of adversarial vulnerability

Conceptually similar to the argument made in a preceding paper, we show that the predictive signal from HFC is one of the roots of adversarial vulnerability. However, in contrast to this prior work, we offer a concrete proposal regarding what the adversarial features might be: signals from HFC.

To investigate the relationship, we trained an adversarially robust model with Madry’s adversarial training method and studied the convolutional kernels of a robust model and a vanilla model. We notice that the convolutional kernels of a robust model tend to be smoother (“smooth” in the sense that differences between adjacent values are small), as shown in Fig. 3. Relevant mathematical tools suggest that smooth convolutional kernels only consider a negligible amount of HFC in data, thus linking HFC to adversarial vulnerability.

Fig. 3. Left: visualization of convolutional kernels of a vanilla CNN; Right: visualization of an adversarially robust CNN from adversarial training.

With this knowledge, a more enticing question is whether we can directly smooth a vanilla model’s convolutional kernel to improve its adversarial robustness. To answer this question, we tested multiple methods to smooth convolutional kernels:

  • To heuristically adjust the weights of trained kernels to improve the smoothness.
  • To filter out the high-frequency information of trained kernels (not reported in our paper).
  • To design regularization schemes limiting the differences of adjacent values of kernels during training (not reported in our paper).

Unfortunately, only marginal improvements are observed. Thus, we can conclude that adversarially robust models tend to have smooth convolutional kernels, but the inverse is not necessarily true. In other words, HFC is one of the issues of adversarial vulnerability, but not all of the issues.

However, one solution inspired by this observation can indeed defend against most adversarial attacks at a remarkable rate:

  • To filter out HFC of input images before feeding the data into models.

The method can improve the robustness of a model, but the method is effectively masking the gradient of the model, thus not solving the adversarial attack & defense problem the research community focuses on.

The Efficacy of BatchNormalization

Another interesting observation concerns the efficacy of BatchNorm. BatchNorm is one of the most effective training heuristics in modern deep learning research, especially in computer vision applications. However, why BatchNorm can help training so significantly is not yet well understood. Interestingly, our experiments offer another perspective on why BatchNorm often helps.

Fig. 4. How test accuracy changes along with the increment of epochs during training. Each panel depicts the performances of different heuristics. Each color represents the performance from a different radius to cut LFC and HFC. Solid lines represent performances for LFC and dashed lines show those for HFC. The higher the curves of dashed lines, the more HFC a method exploits.

In Fig. 4, along with the increment of training epochs, we report the accuracies of training data and various copies of test data, where r refers to the radius we used to cut LFC and HFC, and solid/dashed line denotes the performances from LFC/HFC, respectively. Thus, the higher the curves get in dashed lines, the more HFC a model takes advantage of.

Surprisingly, the model trained with BatchNorm exploits a significant amount of HFC, as the dashed curves of the 4th panel are remarkably higher than those of the other panels. This observation suggests that one of the reasons why BatchNorm helps is that it encourages the usage of HFC. As we argued previously, there are multiple predictive signals (LFC and HFC) in the data. It is expected that the more signals a model uses, the higher test accuracy a model can get, consistent with the fact that BatchNorm is widely known as a method to improve testing accuracy.

Intuitively, we conjecture that the performance boost is due to the fact that HFC usually only involves pixels with very small magnitude (as the reconstructed images look mostly black to humans). BatchNorm conveniently rescales these signals through normalization, thus leading to improved accuracy.

One may naturally wonder what our observation implies regarding the usage of BatchNorm: we suggest that we might need to reevaluate the value of BatchNorm, especially for models to meet the expectations of being robust in multiple datasets. Our observation also conveniently aligns with another observation suggesting that BatchNorm may encourage adversarial vulnerability.

In our paper, we also have discussions relating to the paradox widely known as “rethinking generalization”, formal results on the tradeoff between accuracy and robustness, and experiments suggesting that these interesting phenomena may appear in other vision tasks such as object detection.

Conclusions

For more information, please refer to our paper, where we mainly draw three conclusions:

  • Since HFC may be dataset-specific, SOTA (state-of-the-art) may not be as important as the community thought; the alignment between human and the models may be more important.
  • We may need a new testing regime for computer vision; for example, a simple way is to always test the models over LFC-reconstructed images in addition to the original testing images.
  • Explicit inductive bias design to align the models and the human may play an important role.

Relevant resources:

DISCLAIMER: All opinions expressed in this post are those of the author and do not represent the views of CMU.

Read More

Maintaining the Illusion of Reality:  Transfer in RL by Keeping Agents in the DARC

Maintaining the Illusion of Reality: Transfer in RL by Keeping Agents in the DARC

Reinforcement learning (RL) is often touted as a promising approach for costly and risk-sensitive applications, yet practicing and learning in those domains directly is expensive. It costs time (e.g., OpenAI’s Dota2 project used 10,000 years of experience), it costs money (e.g., “inexpensive” robotic arms used in research typically cost $10,000 to $30,000), and it could even be dangerous to humans. How can an intelligent agent learn to solve tasks in environments in which it cannot practice?

For many tasks, such as assistive robotics and self-driving cars, we may have access to a different practice area, which we will call the source domain. While the source domain has different dynamics than the target domain, experience in the source domain is much cheaper to collect. For example, a computer simulation of the real world can run much faster than real time, collecting (say) a year of experience in an hour; it is much cheaper to simulate 1,000 robot manipulators in parallel than to maintain 1,000 robot manipulators. The source domain need not be a simulator, but rather could be any practice facility, such as a farm of robot arms, a playpen for learning to walk, or a controlled testing facility for self-driving vehicles. Our aim is to learn a policy in the source domain that will achieve high reward in a different target domain, using a limited amount of experience from the target domain.

This problem (domain adaptation for RL) is challenging because the source and target domains might have different physics and appearances. These differences mean that strategies that work well in the practice facility often don’t transfer to the target domain. For example, a good approach to driving a car around a dry racetrack (the source domain) may entail aggressive acceleration and cutting corners. If the target domain is an icy, public road, this approach may cause the car to skid off the road or hit oncoming traffic. While prior work has thoroughly studied the domain adaptation of appearance in RL [Bousmalis 2017, Ganin 2016, Higgins 2017], it ignores the domain adaptation of the physics.

Method

Our approach stems from the idea that the agent’s experience in the source domain should look similar to its experience in the target domain. We can achieve this goal by compensating for the difference in dynamics by modifying the reward function. The figure below above shows an illustrative example, where the real world (left) contains an obstacle that is unmodeled in the simulator (center).  Our method automatically learns to assign a high penalty when the agent takes actions in the simulator that would not be possible in the real world. For this example, our method puts a high penalty (red region) in the middle of the practice facility, as transitions through that region will not be possible in the real world. In effect, the agent is penalized for transitions that would indicate that the agent is interacting with the source domain, rather than the target domain.

How can a robot solve tasks in a target domain (left) that is different from the source domain where it practices (center)? Our method modifies the reward function to force the agent to learn behaviors that will be feasible in the target domain.

Formally, we will search for a policy whose behavior in the source domain both receives high reward and has high likelihood under the target domain dynamics. It turns out that this objective can be written as a combination of the standard MaxEnt RL objective [Ziebart 2010] together with a reward correction, $$max_pi mathbb{E}_pibigg[sum_t underbrace{r(s_t, a_t) + mathcal{H}_pi[a_t mid s_t]}_{text{MaxEnt RL}} + underbrace{Delta r(s_t, a_t, s_{t+1})}_{text{reward correction}} bigg],$$ where the reward correction ( Delta r) measures the discrepancy between the dynamics in the source domain (p_source) and the target domain (p_target): $$Delta r(s_t, a_t, s_{t+1}) = log p_text{target}(s_{t+1} mid s_t, a_t) – log p_text{source}(s_{t+1} mid s_t, a_t).$$

The idea behind our method, DARC, is to avoid taking actions that would reveal that you’re not in the real world. (left) As an example, In the Truman Show, the main character discovers that he has been living in a simulated world when he sails his boat into the wall of the set. (right) When we train a robot with our method, it avoids stepping off the stage, as doing so would indicate whether the robot is in the “real world” or the simulator. In short, our method avoids breaking the fourth wall.

The combined objective has a number of intuitive interpretations:

  • Viewing the next state shouldn’t give you any additional information about whether you are in the source or target domain.
  • You should maximize reward, subject to the constraint that physics seen by your robot in the source domain looks like the physics of the target domain.
  • Our method aims to acquire a policy that remains ignorant of whether it is interacting with the simulator or the real world.

While (Delta r) is defined in terms of dynamics, it turns out that we can estimate (Delta r) without learning a dynamics model. Applying Bayes’ Rule, we can rewrite (Delta r) as $$begin{aligned} Delta r(s_t, a_t, s_{t+1}) =& color{orange}{log p(text{target} mid s_t, a_t, s_{t+1})} – color{blue}{log p(text{target} mid s_t, a_t)} \ & – color{orange}{log p(text{source} mid s_t, a_t, s_{t+1})} + color{blue}{log p(text{source} mid s_t, a_t)}. end{aligned}$$ The orange terms can be estimated by learning a classifier that distinguishes source vs target given ((s, a, s’)); the blue terms can be estimated using a classifier that distinguishes source vs target given ((s, a)). We therefore call our method Domain Adaptation with Rewards from Classifiers, or DARC. While we do require a limited amount of experience from the target domain to learn these classifiers, our experiments show that learning a classifier (which has a binary output) is easier than learning a dynamics model (which must predict a high-dimensional output). DARC is simple to implement on top of any MaxEnt RL algorithm (e.g., on-policy, off-policy, and model-based).

Connection with Observation Adaptation: Now that we’ve introduced the DARC algorithm, we pause briefly to highlight the relationship between domain adaptation of dynamics versus observations. Consider tasks where the state (s_t triangleq (z_t, o_t)) is a combination of the system latent state (z_t) (e.g., the poses of all objects in a scene) and an observation (o_t) (e.g., a camera observation). We will define (q(o_t mid z_t)) and (p(o_t mid z_t)) as the observation models for the source and target domains (e.g., as defined by the rendering engine or the camera model). In this special case, we can write (Delta r) as the sum of two terms, which correspond to observation adaptation and dynamics adaptation: $$begin{aligned} Delta r(s_t, a_t, s_{t+1}) =& underbrace{log p_{text{target}}(o_t mid z_t) – log p_{text{source}}(o_t mid z_t)}_{text{Observation Adaptation}} \ & + underbrace{log p_{text{target}}(z_{t+1} mid z_t, a_t) – log p_{text{source}}(z_{t+1} mid z_t, a_t)}_{text{Dynamics Adaptation}} . end{aligned}$$ While prior methods excel at addressing observation adaptation [Bousmalis 2017, Gamrian 2019, Fernando 2013, Hoffman 2016, Wulfmeier 2017], only our method captures both the adaptation and dynamics terms.

Experiments

We apply DARC to tasks where the robot is either broken in the target domain and where new obstacles are added in the target domain. Note that naïvely ignoring the shift in dynamics (green dashed line) performs quite poorly, and learning an explicit dynamics model (pink line) learns more slowly than our method.

We applied DARC to a number of simulated robotics tasks. On some tasks, we crippled one of the robot’s joints in the target domain, so the robot has to use practice on the fully-functional robot (source domain) to learn a policy that will work well on the broken robot. In another task, the target domain includes a new obstacle, so the robot has to use practice in a source domain (without the obstacle) to learn a policy that will avoid the obstacle. We show the results of this experiment in the figure above, plotting the reward on the target domain as a function of the number of transitions in the target domain. On all tasks, the policy learned in the simulator does not transfer to the target domain. On three of the four tasks our method matches (or even surpasses) the asymptotic performance of doing RL on the target domain, despite never doing RL on experience from the target domain, and despite observing 5 – 10x less experience from the target domain. However, as we scale to more complex tasks (“broken ant” and “half cheetah obstacle”), we observe that all baselines perform poorly.

Our method accounts for domain shift in the termination condition. In this experiment, the source domain includes safeguards that “catch” the robot and end the episode if the robot starts to fall; the target domain does not contain such safeguards. As shown on the right, our method learns to avoid taking actions that would trigger the safeguards.

Our final experiment showed how DARC can be used for safety. In many applications, robots have freedom to practice in a safe and cushioned practice facility, where they are preemptively stopped when they try to take unsafe actions. However, the real world may not contain such safeguards, so we want our robots to avoid relying on these safeguards. To study this setting, we used a simulated human robot. In the source domain, the episode terminates if the agent starts to fall; the target domain does not include this safeguard. As shown in the plot above, our method learns to remain standing for nearly the entire episode, while baselines fall almost immediately. While DARC was not designed as a method for safe RL [Tamar 2013, Achaim 2017, Berkenkamp 2017, Eysenbach 2017], this experiment suggests that safety may emerge automatically from DARC, without any manual reward function design.

Conclusion

The main idea of this project is that agents should avoid taking actions that would reveal whether they are in the source domain or the target domain. Even when practicing in a simulator, the agent should attempt to maintain the illusion that it is in the real world. The main limitation of our method is that the source dynamics must be sufficiently stochastic, an assumption that can usually be satisfied by adding noise to the dynamics, or ensembling a collection of sources.

For more information, check out the paper or code!

Acknowledgements: Thanks to Swapnil Asawa, Shreyas Chaudhari, Sergey Levine, Misha Khodak, Paul Michel, and Ruslan Salakhutdinov for feedback on this post.

Read More

In defense of weight-sharing for neural architecture search: an optimization perspective

In defense of weight-sharing for neural architecture search: an optimization perspective

Figure 1: Animation of how DARTS and other weight-sharing methods replace the discrete assignment of one of four operations (oin O) to an edge (e) with a (theta)-weighted combination of their outputs. At each edge (e) in the network, the value at input node (e_{in}) is passed to each operation in (O={texttt{1:Conv 3×3,  2:Conv 5×5, 3:Pool 3×3, 4:Skip Connect}}); the value at output node (e_{out}) will then be the sum of the operation outputs weighted by parameters (theta_{e,o}in[0,1]) that satisfy (sum_{oin O}theta_{e,o}=1).

This post is cross-listed on the Determined AI blog.

Neural architecture search (NAS) — selecting which neural model to use for your learning problem — is a promising but computationally expensive direction for automating and democratizing machine learning. The weight-sharing method, whose initial success at dramatically accelerating NAS surprised many in the field, has come under scrutiny due to its poor performance as a surrogate for full model-training (a miscorrelation problem known as rank disorder) and inconsistent results on recent benchmarks. In this post, we give a quick overview of weight-sharing and argue in favor of its continued use for NAS. To do so, we will consider a simple optimization formulation that reveals the following key takeaways:

  1. The fact that weight-sharing works should not really be too surprising to a community that embraces non-convex optimization of over-parameterized models.
  2. Rank disorder is not a concern for most weight-sharing methods, since we care about obtaining high-quality architectures rather than a ranking of them.
  3. The sometimes-poor performance of weight-sharing is a result of optimization issues that can be fixed while still using weight-sharing. We propose such a fix — a geometry-aware exponentiated algorithm (GAEA) — that is applicable to many popular NAS methods and achieves state-of-the-art results across several settings.

A brief history of NAS with weight-sharing

NAS is usually formulated as a bilevel optimization problem in which we are searching over some architecture domain (A) for the architecture (ain A) that achieves the lowest validation loss (ell_V(w_a,a)) after training weights (w_a) that minimize the training loss (ell_T(cdot,a)):

$$
min_{ain A}~~ell_V(w_a,a)qquadtextrm{s.t.}qquad w_ainargmin_{winmathbb R^d}~~ell_T(w,a)
$$

First-generation NAS methods were astronomically expensive due to the combinatorially large search space, requiring the training of thousands of neural networks to completion. Then, in their 2018 ENAS (for Efficient NAS) paper, Pham et al. introduced the idea of weight-sharing, in which only one shared set of model parameters (winmathbb R^d) is trained for all architectures. The validation losses (ell_V(w,a)) of different architectures (ain A) computed using these shared weights are then used as estimates of their validation losses (ell_V(w_a,a)) using standalone weights (w_a) (i.e. weights trained individually for each architecture by solving the inner optimization problem above). Because only one set of parameters has to be trained, weight-sharing led to a massive speedup over earlier methods, reducing search time on CIFAR-10 from 2,000-20,000 GPU-hours to just 16. Of great surprise to many was that the validation accuracies (ell_V(w,a)) computed using shared weights (w) were sufficiently good surrogates for (ell_V(w_a,a)), meaning ENAS was able to find good models cheaply. We will see that this correlation is actually a sufficient but not necessary condition for weight-sharing to do well and that weight-sharing’s overall success is less surprising than initially thought.

Following the ENAS breakthrough, several simpler methods such as DARTS and GDAS were proposed in which the categorical architecture decisions (e.g. for each network edge (ein E), which of some fixed set of operations (O) to use at (e)) are relaxed into continuous parameters (thetainTheta) (e.g. so (Theta) is a product of (|E|) simplices of size (|O|)). As animated in Figure 1, these architecture parameters govern the architecture used for updating the shared weights using the gradient of the training loss; for example, in DARTS, (theta_{e,o}) determines the weighting in the weighted sum of operations output by edge (e) in the network. Together, this parameterization leads to a continuous relaxation of the earlier bilevel problem:

$$
min_{thetainTheta}~~ell_V(w_theta,theta)qquadtextrm{s.t.}qquad w_thetainargmin_{winmathbb R^d}~~ell_T(w,theta)
$$

Since (Theta) is a constrained subset of (mathbb R^{|E||O|}), DARTS and GDAS avoid simplex projections by instead re-parameterizing to “logit” parameters (alphainmathbb R^{|E||O|}), with (theta=textrm{Softmax}(alpha)) defined as

$$
theta_{e,o}=frac{exp(alpha_{e,o})}{sum_{o’in O}exp(alpha_{e,o’})}
$$

The relaxed optimization problem can then be approximated via the following alternating gradient update scheme (here (eta>0) is a stepsize):

$$
begin{aligned}
1.quad&texttt{initialize }w^{(0)}inmathbb R^d,alpha^{(0)}inmathbb R^{|E||O|}\
2.quad&texttt{for iteration }i=0,dots,n-1texttt{ do:}\
3.quad&quad w^{(i+1)}gets w^{(i)}-etanabla_well_T(w^{(i)},alpha^{(i)})\
4.quad&quad alpha^{(i+1)}getsalpha^{(i)}-etanabla_alphaell_V(w^{(i+1)},alpha^{(i)})\
5.quad &texttt{return }theta=textrm{Softmax}(alpha^{(n)})
end{aligned}
$$

Note that at the end of search, we need to recover a discrete architecture (ain A) from the architecture weights (theta); in DARTS this is done in a pruning step that simply chooses the highest-weighted operations. This post-hoc pruning is a source of error that our method, GAEA, ameliorates, as we discuss later.

A further simplification, and perhaps the most striking example of using shared weights as surrogates for standalone weights, is the Random Search with Weight-Sharing (RS-WS) method, in which the shared parameters are optimized by taking gradient steps using architectures sampled uniformly at random from the search space. Despite not even updating architecture parameters, RS-WS achieved competitive and, in some cases, state-of-the-art results.

Should we be using weight-sharing?

More recently, weight-sharing has come under increased scrutiny: does sharing weights between models really accelerate NAS? Can it preclude the recovery of optimal architectures? In particular, several papers have observed the issue of rank disorder, which is when the shared-weight performance of architectures does not correlate well with their stand-alone performance; this issue is illustrated in the figure below. Rank disorder is a problem for methods such as RS-WS that rely on the shared-weights performance to rank architectures for evaluation, as it will cause them to ignore networks that achieve high accuracy when their parameters are trained without sharing. 

Figure 2: Illustration of rank-disorder issue from Yu et al., 2020. It occurs when the performance ordering of architectures evaluated using shared-weights (right) does not match the true architecture performance when individual weights are trained from scratch (left).

This skepticism has been reinforced by recent cases where well-known weight-sharing methods have performed poorly; in particular, DARTS was found to overfit to the upper-level validation set in a recent robustness evaluation, and both GDAS and DARTS were outperformed by standard hyperparameter optimization methods on NAS-Bench-201 given a similar time budget. Here DARTS had especially poor performance, converging to architectures consisting of only identity maps (skip-connections).

Given the questions raised about weight-sharing and recent poor results, is it time to rethink its use? In the next section we answer in the negative by showing that (a) correlation of the performance of the weight-sharing “supernet” with that of fully trained models is a sufficient but not necessary condition for weight-sharing to work, i.e. we need not be afraid of rank disorder, and (b) obtaining high-quality architectural solutions using weight-sharing mostly comes down to regularization and optimization, two well-understood aspects of machine learning.

Vanilla weight-sharing is just empirical risk minimization

Weight-sharing is made possible by the fact that architecture parameters, unlike hyperparameters such as regularization and stepsize, directly affect the loss function, in that changing from one architecture (ain A) to a different architecture (a’in A) causes a change in the loss from (ell(w,a)) to (ell(w,a’)), as in the latter case a different function is being used for prediction. On the other hand, changing the stepsize setting would not change the loss unless the weights were also changed by training (w) using the new stepsize; this would mean the weights were no longer shared by different hyperparameter settings.

In fact, we can use the fact that architecture parameters can be subsumed as parameters of a supernet to pose NAS with weight-sharing as empirical risk minimization (ERM) over an extended class of functions encoded by both weights (winmathbb R^d) and architecture parameters (thetainTheta):

$$min_{winmathbb R^d,thetainTheta}ell_T(w,theta)$$

Most (but not all) empirical work on NAS uses the bilevel formulation described earlier rather than solving this single-level ERM problem. However, we believe ERM should be the baseline object of study in NAS because it is better understood statistically and algorithmically; the more common bilevel optimization can be viewed as a (possibly critical) method of regularizing by splitting the data.

The single-level formulation makes clear that, rather than rank disorder, NAS can fail for either of the usual reasons ERM fails: unsuccessful optimization or poor generalization. For example, these respective failures can largely explain the issues faced by DARTS on NAS-Bench-201 and NAS-Bench-1Shot1 that were discussed above. Of course, it is not surprising that supernets might face optimization and generalization issues, since they are non-convex and over-parameterized models; however, NAS practitioners are usually very comfortable training regular deep nets, which are also non-convex and have almost as many parameters. One major difference is that, in the latter case, we have had many years of intensive effort designing regularization schemes and optimization algorithms; if we were to dedicate a similar amount of NAS research to these two issues, then perhaps we would be no more afraid of weight-sharing than we are of training standard deep nets.

One caveat to this discussion is that NAS has the additional challenge of recovering discrete architectures from continuously-relaxed architecture weights. The optimization scheme we propose next ameliorates this issue by promoting sparse architectural parameters in the first place.

Fixing differentiable NAS with weight-sharing: a geometry-aware approach

Our discussion above reduces the problem of designing NAS methods to that of designing good regularization and optimization schemes for the supernet. There has been a good amount of recent work on better regularization schemes for the supernet, including partial channel connections, penalizing the Hessian of the validation loss, and the bilevel formulation itself; we thus focus instead on improving the optimization scheme, which we do with our geometry-aware exponentiated algorithm (GAEA).

As usual, we want an optimization method that can converge to a better solution faster. In the case of weight-sharing NAS, a high-quality solution will not only have good generalization but also induce sparse architecture weights (thetainTheta), which recall is related to the optimized parameters (alphainmathbb R^{|E||O|}) via a softmax. We expect sparse architecture parameters to be better because, as discussed earlier, the architecture parameters are rounded in a post-processing step to derive the final discrete architecture.

Figure 3: We want the final architecture weights (theta) to be sparse so that, when rounded, the resulting discrete architecture (ain A) is close to the supernet encoded by (theta). Otherwise the difference between the validation loss of the discrete architecture can be very different from that of the supernet. Since the elements of (Theta) lie in a product of simplices, sparsity means that, in each simplex, a single entry is 1 and the rest are 0.

In order to achieve this, we use exponentiated gradient descent to directly optimize over the elements (thetainTheta) instead of over unconstrained values (alphainmathbb R^{|E||O|}). The update scheme replaces subtracting the gradient w.r.t. (alpha) (line 4 in the pseudo-code) with element-wise multiplication by the negative exponentiated gradient w.r.t. (theta) (4.a), followed by projections to the simplices comprising (Theta) (4.b):

$$
begin{align*}
4.aquad&qquadtildethetagetstheta^{(i)}odotexpleft(-etanabla_thetaell(w^{(i+1)},theta^{(i)})right)\
4.bquad&qquadtexttt{for }ein E,oin Otexttt{ do: }theta_{e,o}^{(i+1)}getsfrac{tildetheta_{e,o}}{sum_{o’in O}tildetheta_{e,o’}}
end{align*}
$$

Note that each iteration is roughly as expensive as in SGD.

For convex problems, exponentiated gradient is known to be well-suited for the simplex geometry, with iteration complexity depending only logarithmically on the size (k) of the simplex, rather than the (mathcal O(k)) dependence of gradient descent. Under the mirror descent view of this result for linear prediction (video link), the improvement stems from the implicit regularizer encouraging larger updates when far away from a sparse target solution. For our non-convex problem, we obtain a similar guarantee by extending recent mirror descent results of Zhang & He to show that alternating the exponentiated update to the architecture parameters with SGD updates to the shared-weights yields an (varepsilon)-stationary-point in (mathcal O(log k/varepsilon^2)) iterations. We also show experimentally in Figure 4 that this approach encourages sparser solutions than DARTS and PC-DARTS.


Figure 4: Sparsity of architecture weights obtained by GAEA compared to the method it modifies on two different search spaces. Sparsity is measured using average entropy across architecture decision simplices.

Our GAEA approach, which is applicable to any method using the softmax formulation described earlier (this includes DARTS, GDAS, PC-DARTS, and others), can be summarized in two simple steps:

  1. Replace architecture weights (alpha) passed to the softmax with weights (theta) lying directly on the architecture decision simplices.
  2. Use the exponentiated gradient scheme (4.a & 4.b) to update these architecture weights (theta).

Empirical Evaluation of GAEA


Figure 5: Evaluation of GAEA on NAS-Bench-201. Standard hyperparameter optimization methods are in blue, while weight-sharing schemes are in purple. The optimal in the search space is in black. GAEA is the first weight-sharing scheme to outperform standard hyperparameter optimization on this search space and the only one to get within a standard deviation of the optimum on two of the three datasets, CIFAR-10 and CIFAR-100.

So does the sparsity and faster convergence rates of GAEA result in better performance empirically?  To test this, we simply apply the two steps above to modify existing state-of-the-art NAS methods. First, we evaluate GAEA applied to DARTS on the NAS-Bench-201 search space by Dong et al.  Of the methods evaluated by Dong et al., non-weight-sharing methods outperformed weight-sharing methods on all three datasets.  However, GAEA DARTS applied to the single-level ERM objective achieves the best accuracy across all three datasets, reaching near oracle-optimal performance on two of them. Notably, it fixes the catastrophically poor performance of DARTS on this space and is the only weight-sharing method that beats standard hyperparameter optimization.


Figure 6: Evaluation of GAEA on the DARTS search space. Weight-sharing methods are in purple, while non-weight-sharing methods are in blue. Note that the non-weight-sharing methods require more than 10,000 times as many GPU-hours as GAEA for search.

We also evaluated GAEA applied to PC-DARTS on the original DARTS CNN search space.  With improved regularization for the weight-sharing optimization problem, PC-DARTS was able to recently match the performance of computationally expensive non-weight-sharing methods on CIFAR-10 and ImageNet.  We are able to further boost the performance of PC-DARTS with GAEA and achieve state-of-the-art performance on this search space, demonstrating the importance of efficiently optimizing in the right geometry.

Interested in learning more?

To learn more about our results, weight-sharing, and NAS you can:

  • See our paper, joint with Nina Balcan and Ameet Talwalkar, for more details.
  • Download our code for a full implementation of GAEA.
  • Read about the original weight-sharing method, ENAS, and about the baseline weight-sharing method, RS-WS.
  • Read a survey of the field (Elsken, Metzen, & Hutter, JMLR 2019).

DISCLAIMER: All opinions expressed in this posts are those of the authors and do not represent the views of CMU.

Read More

Carnegie Mellon University at ICML 2020

Carnegie Mellon University at ICML 2020

Carnegie Mellon University is proud to present 44 papers at the 37th International Conference on Machine Learning (ICML 2020), which will be held virtually this week. CMU is also involved in organizing 5 workshops at the conference, and our faculty and researchers are giving invited talks at 6 workshops.

Here is a quick overview of the areas our researchers are working on:

We are also proud to collaborate with many other researchers in academia and industry:

Publications

Check out the full list of papers below, along with their presentation times and links to preprints/code so you can check out our work.

Learning Theory

Familywise Error Rate Control by Interactive Unmasking
Boyan Duan (Carnegie Mellon University); Aaditya Ramdas (Carnegie Mellon University); Larry Wasserman (Carnegie Mellon University)
Learning Theory, Tue Jul 14 07:00 AM — 07:45 AM & Tue Jul 14 06:00 PM — 06:45 PM (PDT)

Stochastic Regret Minimization in Extensive-Form Games
Gabriele Farina (Carnegie Mellon University); Christian Kroer (Columbia University); Tuomas Sandholm (CMU, Strategy Robot, Inc., Optimized Markets, Inc., Strategic Machine, Inc.)
Learning Theory, Tue Jul 14 07:00 AM — 07:45 AM & Tue Jul 14 06:00 PM — 06:45 PM (PDT)

Strategyproof Mean Estimation from Multiple-Choice Questions
Anson Kahng (Carnegie Mellon University); Gregory Kehne (Carnegie Mellon University); Ariel D Procaccia (Harvard University)
Learning Theory, Tue Jul 14 08:00 AM — 08:45 AM & Tue Jul 14 07:00 PM — 07:45 PM (PDT)

On Learning Language-Invariant Representations for Universal Machine Translation
Han Zhao (Carnegie Mellon University); Junjie Hu (Carnegie Mellon University); Andrej Risteski (CMU)
Learning Theory, Wed Jul 15 05:00 AM — 05:45 AM & Wed Jul 15 04:00 PM — 04:45 PM (PDT)

Class-Weighted Classification: Trade-offs and Robust Approaches
Ziyu Xu (Carnegie Mellon University); Chen Dan (Carnegie Mellon University); Justin Khim (Carnegie Mellon University); Pradeep Ravikumar (Carnegie Mellon University)
Learning Theory, Wed Jul 15 08:00 AM — 08:45 AM & Wed Jul 15 09:00 PM — 09:45 PM (PDT)

Sparsified Linear Programming for Zero-Sum Equilibrium Finding
Brian H Zhang (Carnegie Mellon University); Tuomas Sandholm (CMU, Strategy Robot, Inc., Optimized Markets, Inc., Strategic Machine, Inc.)
Learning Theory, Thu Jul 16 06:00 AM — 06:45 AM & Thu Jul 16 06:00 PM — 06:45 PM (PDT)

Sharp Statistical Guarantees for Adversarially Robust Gaussian Classification
Chen Dan (Carnegie Mellon University); Yuting Wei (CMU); Pradeep Ravikumar (Carnegie Mellon University)
Learning Theory, Thu Jul 16 06:00 AM — 06:45 AM & Thu Jul 16 06:00 PM — 06:45 PM (PDT)

Uniform Convergence of Rank-weighted Learning
Justin Khim (Carnegie Mellon University); Liu Leqi (Carnegie Mellon University); Adarsh Prasad (Carnegie Mellon University); Pradeep Ravikumar (Carnegie Mellon University)
Learning Theory, Thu Jul 16 07:00 AM — 07:45 AM & Thu Jul 16 06:00 PM — 06:45 PM (PDT)

Online Control of the False Coverage Rate and False Sign Rate
Asaf Weinstein (The Hebrew University of Jerusalem); Aaditya Ramdas (Carnegie Mellon University)
Online Learning, Active Learning, and Bandits, Tue Jul 14 10:00 AM — 10:45 AM & Tue Jul 14 09:00 PM — 09:45 PM (PDT)

On conditional versus marginal bias in multi-armed bandits
Jaehyeok Shin (Carnegie Mellon University); Aaditya Ramdas (Carnegie Mellon University); Alessandro Rinaldo (Carnegie Mellon University)
Online Learning, Active Learning, and Bandits, Wed Jul 15 08:00 AM — 08:45 AM & Wed Jul 15 07:00 PM — 07:45 PM (PDT)

General Machine Learning

Influence Diagram Bandits: Variational Thompson Sampling for Structured Bandit Problems
Tong Yu (Carnegie Mellon University); Branislav Kveton (Google Research); Zheng Wen (DeepMind); Ruiyi Zhang (Duke University); Ole J. Mengshoel (Carnegie Mellon University)
Online Learning, Active Learning, and Bandits, Tue Jul 14 10:00 AM — 10:45 AM & Tue Jul 14 10:00 PM — 10:45 PM (PDT)

The Implicit Regularization of Stochastic Gradient Flow for Least Squares
Alnur Ali (Stanford University); Edgar Dobriban (University of Pennsylvania); Ryan Tibshirani (Carnegie Mellon University)
Supervised Learning, Thu Jul 16 09:00 AM — 09:45 AM & Thu Jul 16 08:00 PM — 08:45 PM (PDT)

Near Input Sparsity Time Kernel Embeddings via Adaptive Sampling
David Woodruff (CMU); Amir Zandieh (EPFL)
General Machine Learning Techniques, Tue Jul 14 02:00 PM — 02:45 PM & Wed Jul 15 03:00 AM — 03:45 AM (PDT)

InfoGAN-CR and Model Centrality: Self-supervised Model Training and Selection for Disentangling GANs [code]Zinan Lin (Carnegie Mellon University); Kiran K Thekumparampil (University of Illinois at Urbana-Champaign); Giulia Fanti (CMU); Sewoong Oh (University of Washington)
Representation Learning, Wed Jul 15 08:00 AM — 08:45 AM & Wed Jul 15 08:00 PM — 08:45 PM (PDT)

LTF: A Label Transformation Framework for Correcting Label Shift
Jiaxian Guo (The University of Sydney); Mingming Gong (University of Melbourne); Tongliang Liu (The University of Sydney); Kun Zhang (Carnegie Mellon University); Dacheng Tao (The University of Sydney)
Transfer, Multitask and Meta-learning, Tue Jul 14 07:00 AM — 07:45 AM & Tue Jul 14 07:00 PM — 07:45 PM (PDT)

Optimizing Dynamic Structures with Bayesian Generative Search
Minh Hoang (Carnegie Mellon University); Carleton Kingsford (Carnegie Mellon University)
Transfer, Multitask and Meta-learning, Thu Jul 16 06:00 AM — 06:45 AM & Thu Jul 16 05:00 PM — 05:45 PM (PDT)

Label-Noise Robust Domain Adaptation
Xiyu Yu (Baidu Inc.); Tongliang Liu (The University of Sydney); Mingming Gong (University of Melbourne); Kun Zhang (Carnegie Mellon University); Kayhan Batmanghelich (University of Pittsburgh); Dacheng Tao (The University of Sydney)
Unsupervised and Semi-Supervised Learning, Wed Jul 15 05:00 AM — 05:45 AM & Wed Jul 15 07:00 PM — 07:45 PM (PDT)

Input-Sparsity Low Rank Approximation in Schatten Norm
Yi Li (Nanyang Technological University); David Woodruff (Carnegie Mellon University)
Unsupervised and Semi-Supervised Learning, Thu Jul 16 06:00 AM — 06:45 AM & Thu Jul 16 06:00 PM — 06:45 PM (PDT)

A Pairwise Fair and Community-preserving Approach to k-Center Clustering
Brian Brubach (University of Maryland); Darshan Chakrabarti (Carnegie Mellon University); John P Dickerson (University of Maryland); Samir Khuller (Northwestern University); Aravind Srinivasan (University of Maryland College Park); Leonidas Tsepenekas (University of Maryland, College Park)
Unsupervised and Semi-Supervised Learning, Thu Jul 16 06:00 AM — 06:45 AM & Thu Jul 16 05:00 PM — 05:45 PM (PDT)

Poisson Learning: Graph Based Semi-Supervised Learning At Very Low Label Rates
Jeff Calder (University of Minnesota); Brendan Cook (University of Minnesota); Matthew Thorpe (University of Manchester); Dejan Slepcev (Carnegie Mellon University)
Unsupervised and Semi-Supervised Learning, Thu Jul 16 07:00 AM — 07:45 AM & Thu Jul 16 06:00 PM — 06:45 PM (PDT)

Trustworthy Machine Learning

Explaining Groups of Points in Low-Dimensional Representations [code]Gregory Plumb CMU); Jonathan Terhorst (U-M LSA); Sriram Sankararaman (UCLA); Ameet Talwalkar (CMU)
Accountability, Transparency and Interpretability, Tue Jul 14 07:00 AM — 07:45 AM & Tue Jul 14 06:00 PM — 06:45 PM (PDT)

Overfitting in adversarially robust deep learning [code]Leslie Rice (Carnegie Mellon University); Eric Wong (Carnegie Mellon University); Zico Kolter (Carnegie Mellon University)
Adversarial Examples, Tue Jul 14 08:00 AM — 08:45 AM & Tue Jul 14 07:00 PM — 07:45 PM (PDT)

Adversarial Robustness Against the Union of Multiple Perturbation Models
Pratyush Maini (IIT Delhi); Eric Wong (Carnegie Mellon University); Zico Kolter (Carnegie Mellon University)
Adversarial Examples, Wed Jul 15 09:00 AM — 09:45 AM & Wed Jul 15 09:00 PM — 09:45 PM (PDT)

Characterizing Distribution Equivalence and Structure Learning for Cyclic and Acyclic Directed Graphs
AmirEmad Ghassami (UIUC); Alan Yang (University of Illinois at Urbana-Champaign); Negar Kiyavash (École Polytechnique Fédérale de Lausanne); Kun Zhang (Carnegie Mellon University)
Causality, Tue Jul 14 07:00 AM — 07:45 AM & Tue Jul 14 08:00 PM — 08:45 PM (PDT)

Certified Robustness to Label-Flipping Attacks via Randomized Smoothing
Elan Rosenfeld (Carnegie Mellon University); Ezra Winston (Carnegie Mellon University); Pradeep Ravikumar (Carnegie Mellon University); Zico Kolter (Carnegie Mellon University)
Trustworthy Machine Learning, Tue Jul 14 09:00 AM — 09:45 AM & Tue Jul 14 08:00 PM — 08:45 PM (PDT)

FACT: A Diagnostic for Group Fairness Trade-offs
Joon Sik Kim (Carnegie Mellon University); Jiahao Chen (JPMorgan AI Research); Ameet Talwalkar (CMU)
Fairness, Equity, Justice, and Safety, Thu Jul 16 06:00 AM — 06:45 AM & Thu Jul 16 05:00 PM — 05:45 PM (PDT)

Is There a Trade-Off Between Fairness and Accuracy? A Perspective Using Mismatched Hypothesis Testing
Sanghamitra Dutta (CMU); Dennis Wei (IBM Research); Hazar Yueksel (IBM Research); Pin-Yu Chen (IBM Research); Sijia Liu (IBM Research); Kush R Varshney (IBM Research)
Fairness, Equity, Justice, and Safety, Thu Jul 16 07:00 AM — 07:45 AM & Thu Jul 16 06:00 PM — 06:45 PM (PDT)

Deep Learning

Combining Differentiable PDE Solvers and Graph Neural Networks for Fluid Flow Prediction [code]Filipe de Avila Belbute-Peres (Carnegie Mellon University); Thomas D. Economon (SU2 Foundation); Zico Kolter (Carnegie Mellon University)
Deep Learning – General, Wed Jul 15 05:00 AM — 05:45 AM & Wed Jul 15 04:00 PM — 04:45 PM (PDT)

Optimizing Data Usage via Differentiable Rewards
Xinyi Wang (Carnegie Mellon University); Hieu Pham (Carnegie Mellon University); Paul Michel (Carnegie Mellon University); Antonios  Anastasopoulos (Carnegie Mellon University); Jaime Carbonell (Carnegie Mellon University); Graham Neubig (Carnegie Mellon University)
Deep Learning – Algorithms, Thu Jul 16 08:00 AM — 08:45 AM & Thu Jul 16 07:00 PM — 07:45 PM (PDT)

A Sample Complexity Separation between Non-Convex and Convex Meta-Learning
Nikunj Saunshi (Princeton University); Yi Zhang (Princeton); Mikhail Khodak (Carnegie Mellon University); Sanjeev Arora (Princeton University)
Deep Learning – Theory, Tue Jul 14 10:00 AM — 10:45 AM & Tue Jul 14 09:00 PM — 09:45 PM (PDT)

Stabilizing Transformers for Reinforcement Learning
Emilio Parisotto (Carnegie Mellon University); Francis Song (DeepMind); Jack Rae (Deepmind); Razvan Pascanu (Google Deepmind); Caglar Gulcehre (DeepMind); Siddhant Jayakumar (DeepMind); Max Jaderberg (DeepMind); Raphaël Lopez Kaufman (DeepMind); Aidan Clark (DeepMind); Seb Noury (DeepMind); Matthew Botvinick (Google); Nicolas Heess (DeepMind); Raia Hadsell (Deepmind)
Reinforcement Learning – Deep RL, Wed Jul 15 05:00 AM — 05:45 AM & Wed Jul 15 04:00 PM — 04:45 PM (PDT)

Planning to Explore via Self-Supervised World Models [code]Ramanan Sekar (University of Pennsylvania); Oleh Rybkin (University of Pennsylvania); Kostas Daniilidis (University of Pennsylvania); Pieter Abbeel (UC Berkeley); Danijar Hafner (Google); Deepak Pathak (CMU, FAIR)
Reinforcement Learning – Deep RL, Wed Jul 15 08:00 AM — 08:45 AM & Wed Jul 15 07:00 PM — 07:45 PM (PDT)

One Policy to Control Them All: Shared Modular Policies for Agent-Agnostic Control [code]Wenlong Huang (UC Berkeley); Igor Mordatch (Google); Deepak Pathak (CMU, FAIR)
Reinforcement Learning – Deep RL, Thu Jul 16 08:00 AM — 08:45 AM & Thu Jul 16 08:00 PM — 08:45 PM (PDT)

VideoOneNet: Bidirectional Convolutional Recurrent OneNet with Trainable Data Steps for Video Processing
“Zoltán Á Milacski (Eötvös Loránd University); Barnabas Poczos (Carnegie Mellon University); Andras Lorincz (Eötvös Loránd University)
Sequential, Network, and Time-Series Modeling, Wed Jul 15 02:00 PM — 02:45 PM & Thu Jul 16 01:00 AM — 01:45 AM (PDT)

Applications

An EM Approach to Non-autoregressive Conditional Sequence Generation
Zhiqing Sun (Carnegie Mellon University); Yiming Yang (Carnegie Mellon University)
Applications – Language, Speech and Dialog, Tue Jul 14 08:00 AM — 08:45 AM & Tue Jul 14 07:00 PM — 07:45 PM (PDT)

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation [code]Junjie Hu (Carnegie Mellon University); Sebastian Ruder (DeepMind); Aditya Siddhant (Google Research); Graham Neubig (Carnegie Mellon University); Orhan Firat (Google); Melvin Johnson (Google)
Applications – Language, Speech and Dialog, Tue Jul 14 10:00 AM — 10:45 AM & Tue Jul 14 09:00 PM — 09:45 PM (PDT)

Learning Factorized Weight Matrix for Joint Filtering
Xiangyu Xu (Carnegie Mellon University); Yongrui Ma (SenseTime); Wenxiu Sun (SenseTime Research)
Applications – Computer Vision, Thu Jul 16 03:00 PM — 03:45 PM & Fri Jul 17 04:00 AM — 04:45 AM (PDT)

Uncertainty-Aware Lookahead Factor Models for Quantitative Investing
Lakshay Chauhan (Euclidean Technologies); John Alberg (Euclidean Technologies LLC); Zachary Lipton (Carnegie Mellon University)
Applications – Other, Tue Jul 14 08:00 AM — 08:45 AM & Tue Jul 14 08:00 PM — 08:45 PM (PDT)

Learning Robot Skills with Temporal Variational Inference
Tanmay Shankar (Facebook AI Research); Abhinav Gupta (CMU/FAIR)
Applications – Other, Thu Jul 16 06:00 AM — 06:45 AM & Thu Jul 16 05:00 PM — 05:45 PM (PDT)

Optimization

Nearly Linear Row Sampling Algorithm for Quantile Regression
Yi Li (Nanyang Technological University); Ruosong Wang (Carnegie Mellon University); Lin Yang (UCLA); Hanrui Zhang (Duke University)
Optimization – Large Scale, Parallel and Distributed, Tue Jul 14 09:00 AM — 09:45 AM & Tue Jul 14 08:00 PM — 08:45 PM (PDT)

The Non-IID Data Quagmire of Decentralized Machine Learning
Kevin Hsieh (Microsoft Research); Amar Phanishayee (Microsoft Research); Onur Mutlu (ETH Zurich); Phillip B Gibbons (CMU)
Optimization – Large Scale, Parallel and Distributed, Wed Jul 15 08:00 AM — 08:45 AM & Wed Jul 15 09:00 PM — 09:45 PM (PDT)

Refined bounds for algorithm configuration: The knife-edge of dual class approximability
Maria-Florina Balcan (Carnegie Mellon University); Tuomas Sandholm (CMU, Strategy Robot, Inc., Optimized Markets, Inc., Strategic Machine, Inc.); Ellen Vitercik (Carnegie Mellon University)
Optimization – General, Thu Jul 16 06:00 AM — 06:45 AM & Thu Jul 16 06:00 PM — 06:45 PM (PDT)

Probabilistic Inference

Confidence Sets and Hypothesis Testing in a Likelihood-Free Inference Setting [code]Niccolo Dalmasso (Carnegie Mellon University); Rafael Izbicki (UFSCar); Ann Lee (Carnegie Mellon University)
Probabilistic Inference – Models and Probabilistic Programming, Tue Jul 14 07:00 AM — 07:45 AM & Tue Jul 14 06:00 PM — 06:45 PM (PDT)

Empirical Study of the Benefits of Overparameterization in Learning Latent Variable Models [code]Rares-Darius Buhai (MIT); Yoni Halpern (Google); Yoon Kim (Harvard University); Andrej Risteski (CMU); David Sontag (MIT)
Probabilistic Inference – Models and Probabilistic Programming, Thu Jul 16 06:00 AM — 06:45 AM & Thu Jul 16 05:00 PM — 05:45 PM (PDT)

Workshops

Check out the full list of organized workshops below, along with their times and links to the program.

Invited Speakers

Participatory Approaches to Machine Learning
Alexandra Chouldechova (CMU)
Fri Jul 17 06:00 AM — 01:45 PM (PDT)

Workshop on AI for Autonomous Driving
Drew Bagnell (Aurora and CMU)
Fri Jul 17 05:00 AM — 03:00 PM (PDT)

Bridge Between Perception and Reasoning: Graph Neural Networks & Beyond
Zico Kolter (CMU)
Sat Jul 18 05:50 AM — 02:30 PM (PDT)

Real World Experiment Design and Active Learning
Aaditya Ramdas (CMU)
Sat Jul 18 07:00 AM — 03:35 PM (PDT)

Incentives in Machine Learning
Nihar Shah (CMU)
Sat Jul 18 08:00 AM — 11:00 AM (PDT)

2nd ICML Workshop on Human in the Loop Learning (HILL)
Christian Lebiere (CMU); Pradeep Ravikumar, (CMU)
Sat Jul 18 11:00 AM — 03:00 AM (PDT)

Organizers

Federated Learning for User Privacy and Data Confidentiality
Nathalie Baracaldo (IBM Research Almaden, USA); Olivia Choudhury (Amazon, USA); Gauri Joshi (Carnegie Mellon University, USA); Ramesh Raskar (MIT Media Lab, USA); Shiqiang Wang (IBM T. J. Watson Research Center, USA); Han Yu (Nanyang Technological University, Singapore)
Sat Jul 18 05:45 AM — 02:35 PM (PDT)

MLRetrospectives: A Venue for Self-Reflection in ML Research
Ryan Lowe, (Mila / McGill University); Jessica Forde, (Brown University); Jesse Dodge, CMU); Mayoore Jaiswal, IBM Research); Rosanne Liu, (Uber AI Labs); Joelle Pineau, (Mila / McGill University / Facebook AI); Yoshua Bengio, (Mila)
Sat Jul 18 05:50 AM — 02:30 PM (PDT)

Bridge Between Perception and Reasoning: Graph Neural Networks & Beyond
Jian Tang (HEC Montreal & MILA); Le Song (Georgia Institute of Technology); Jure Leskovec (Stanford University); Renjie Liao (University of Toronto); Yujia Li (DeepMimd); Sanja Fidler (University of Toronto, NVIDIA); Richard Zemel (U Toronto); Ruslan Salakhutdinov (CMU)
Sat Jul 18 05:50 AM — 02:30 PM (PDT)

Workshop on Learning in Artificial Open Worlds
William H. Guss (CMU and OpenAI); Katja Hofmann (Microsoft); Brandon Houghton (CMU and OpenAI); Noburu (Sean) Kuno (Microsoft); Ruslan Salakhutdinov (CMU); Kavya Srinet (Facebook AI Research); Arthur Szlam (Facebook AI Research)
Sat Jul 18 07:00 AM — 02:00 PM (PDT)

Real World Experiment Design and Active Learning
Ilija Bogunovic (ETH Zurich); Willie Neiswanger (Carnegie Mellon University); Yisong Yue (Caltech)
Sat Jul 18 07:00 AM — 03:35 PM (PDT)

Read More

CATER: A diagnostic dataset for Compositional Actions and Temporal Reasoning

CATER: A diagnostic dataset for Compositional Actions and Temporal Reasoning

Introducing CATER: A diagnostic dataset for video understanding that, by design, requires temporal reasoning to be solved.

While deep features have revolutionized static image analysis, deep video descriptors have struggled to outperform classic hand-crafted descriptors. Though recent works have shown improvements by through spatio-temporal architectures, simpler frame-based architectures still routinely appear among top performers in video challenge benchmarks. This raises the natural question: are videos trivially understandable by simply aggregating the predictions over a sampled set of frames?

At some level, the answer must be no. Reasoning about high-level cognitive concepts such as intentions, goals, and causal relations requires reasoning over long-term temporal structure and order. For example, consider the cup-and-ball parlor game shown next. In these games, an operator puts a target object (ball) under one of multiple container objects (cups), and moves them about, possibly revealing the target at various times and recursively containing cups within other cups. The task at the end is to tell which of the cups is covering the ball. Even in its simplest instantiation, one can expect any human or computer system that solves this task to require the ability to model state of the world over long temporal horizons, reason about occlusion, understand the spatiotemporal implications of containment, etc.

Humans (and apparently even cats!) are able to solve long temporal reasoning tasks like locating the ball as the cups are shuffled. Can we design similarly hard spatiotemporal reasoning tasks for computers?

Given such intricate requirements for spatiotemporal reasoning, why don’t spatiotemporal models readily outperform their frame-based counterparts? We posit that this is due to limitations of existing video benchmarks. Even though datasets have evolved significantly in the past few years, tasks have remained highly correlated to the scene and object context. In this work, we take an alternate approach to developing a video understanding dataset. Inspired by the CLEVR dataset, and the adversarial parlor games described above, we introduce CATER, a diagnostic dataset for Compositional Actions and TEmporal Reasoning in dynamic tabletop scenes. We define three tasks on the dataset, each with an increasingly higher level of complexity, but set up as classification problems in order to be comparable to existing benchmarks for easy transfer of existing models and approaches. Next, we introduce the dataset, associated tasks, and evaluate state of the art video models, showing that they struggle on CATER’s temporal reasoning tasks.

The CATER Dataset

  • 2 objects move at a time
  • All objects move
  • The camera also moves
Sample videos from our proposed CATER dataset.
The CLEVR dataset was designed to evaluate the visual reasoning ability of question-answering models as shown above. Given a complex scene as shown, the task was to answer corresponding automatically generated questions shown below. Figure taken from the original paper.

Our CATER dataset builds upon the CLEVR dataset, which was originally proposed for question-answering based visual reasoning tasks (an example on left). CATER inherits and extends the set of object shapes, sizes, colors and materials present from CLEVR. Specifically, we add two new object shapes: inverted cones and a special object called a ‘snitch’, which we created to look like three intertwined toruses in metallic gold color (see if you can find it in the above videos!). In addition, we define four atomic actions: ‘rotate’, ‘pick-place’, ‘slide’ and ‘contain’; a subset of which is afforded by each object. While the first three are self-explanatory, the final action, ‘contain’, is a special operation, only afforded by the cones. In it, a cone is pick-placed on top of another object, which may be a sphere, a snitch or even a smaller cone. This also allows for recursive containment, as a cone can contain a smaller cone that contains another object. Once a cone ‘contains’ an object, it is constrained to only ‘slide’ actions and effectively slides all objects contained within the cone. This holds until the top-most cone is ‘pick-place’d to another location, effectively ending the containment for that top-most cone. Given these objects and actions, the videos are generated by first spawning a random number of objects with random parameters at random locations, and then adding a series of actions for all or subset of the objects. As the samples from CATER above show, we can control for the complexity of the data by limiting the number of objects, number of moving objects, as well as the motion of the camera.

Tasks on CATER

We define three tasks with increasing temporal reasoning complexity on CATER, as illustrated in the figure above.

  • The first and perhaps the simplest task is Atomic Action Recognition. This task requires the models to classify what individual actions happen in a given video, out of a total of 14 action classes we define in CATER. This is set up as a multi-label classification problem, since multiple actions can happen in a video, and the model needs to predict all of them. The predictions are evaluated using mean Average Precision (mAP), which is a standard metric metric used in similar problem settings.
  • The second task on CATER is Compositional Action Recognition. Here, the models need to reason about ordering of different actions in the video (i.e., X happens “before”/”after”/”during” Y), and predict all the combinations that are active in the given video, out of the 301 unique combinations possible using the 14 atomic actions. Similar to Task 1, it is evaluated using mAP.
  • Finally, the flagship task on CATER is Snitch Localization, which is directly inspired from the cup-and-ball shell game described earlier. The task now is to predict the location of the snitch at the end of the video. While it may seem trivial to localize it from the last frame, it may not always be possible to do that due to occlusions and recursive containments. For simplicity, we pose this as a classification problem by quantizing the (6 times 6) grid on the floor into 36 cells, and evaluate the performance using a standard accuracy metric.

How do video models fare on CATER?

We consider state of the art video architecture: both based on the 2D convolution over frames (left) and 3D, or spatio-temporal, convolutions over video clips (right).

We evaluate multiple state of the art video recognition models on all tasks defined on CATER. Specifically, we focus on three broad model architectures:

  • Frame-level models: These models are similar to image classification models, and are applied to a sampled set of frames from the video. The specific architecture we use is Inception-V2, based on the popular TSN paper, which obtained strong performance on multiple video understanding tasks.
  • Spatio-temporal models: These models extend the idea of 2D convolution (over the height-width of an image) to a 3D convolution (over the height-width-time of a video clip), as shown in the figure above. These models are better capable of capturing the temporal dynamics in videos. Specifically, we experiment with the ResNet-3D architecture, along with its non-local extension, from the Non-Local Neural Networks paper, which also obtains strong performance on video recognition tasks.
  • Tracker: For task 3, we also experiment with a tracking based solution, where we initialize a state of the art tracker to a box around the initial ground truth position of the snitch on the first frame. At the last frame, we take the center point of the tracked box in the image plane, and project it to the 3D world plane using a precomputed homography between the image and world planes. The projected 3D point is then converted to the quantized grid position and evaluated for the top-1 accuracy.

Since most video models are designed to operate on short clips of the whole video, we experiment with two approaches to aggregate the predictions over the whole video:

  • Average pooling (Avg): This is the typical approach used on most benchmarks, where we average the last layer predictions (logits) for each frame/clip of the video.
  • LSTM: We train a 2-layer LSTM over the features from clips in the video. While LSTMs haven’t shown strong improvements on standard video classification benchmarks, we experiment with them here given the temporal nature of our tasks.
Model Top-1
(Avg)
Top-1
(LSTM)
Random 2.8 2.8
Frame-level RGB model 14.1 25.6
Spatio-temporal RGB model 57.4 60.2
Tracker 33.9 33.9
Comparing models’ performance on Task 3.
Task 1 and 2 are presented in the paper.
Tracker failure case

We observed that most models are easily able to solve the Task 1, since that only requires short-term temporal reasoning to recognize individual actions. Task 2 and 3, on the other hand, pose more of a challenge to video models, given their temporal nature. While full results are provided in the paper, we highlight some key results on the snitch localization task (Task 3) in the attached table. First, we note that the 2D, or frame-level models, perform quite poorly on this task. This is unlike the observation on other standard video datasets, where the frame-level models are quite competitive with spatio-temporal models. Second, we note that a state of the art tracking method (Zhu et al., ECCV’18), is only able to solve about a third of the videos, showing that low-level trackers may not solve this task either (see, for example, the attached video of a tracker failure case). Third, the performance of LSTM aggregation was always better than average pooling, since that is better suited to capture the temporal nature of the tasks. And finally, we note that best performance is still fairly low on this task, suggesting scope for future work.

Diagnostics using CATER

CATER allows for diagnosing model performance by splitting the test set based on various parameters. For example, left graph shows models’ performance w.r.t whether the snitch is visible at the end or not, and the right one shows the performance w.r.t the frame at which the snitch last moves (out of 300 total frames in each video).

Having close control over the dataset generation process enables us to perform diagnostics impossible with any previous dataset. While full analysis over multiple control parameters (like camera motion, number of objects etc.) is provided in the paper, we highlight two ablations in the figure above. First, we evaluate the performance of our models with different aggregation schemes, over the cases when the snitch was visible at the end of the video, and when it was contained. As expected, the performance drops when it’s contained, with the tracking based model dropping the most suggesting that the tracker is not effective at handling the containments and occlusions. In the second ablation, we evaluate the performance as a function of the last time in the video the snitch moves. We observe that the performance of all models consistently drops as the snitch keeps moving throughout the video. Interestingly, the tracker is the most resilient approach here, since it attempts to explicitly keep track of the snitch’s position until the end, unlike the other models that rely on aggregating belief over the snitch’s location over all clips of the video. The following video provides some qualitative examples of the videos where models perform well (easy cases) and the ones where they do not (hard cases).

Analysis of which videos are easiest for models and which are the hardest. We find that videos where the snitch suddenly moves in the end are the hardest, corroborating the diagnostic analysis seen earlier.

Discussion

To conclude, in this work we introduced and used CATER to analyze several leading network designs on hard spatiotemporal reasoning tasks. We found most models struggle on CATER, especially on the snitch localization task which requires long term reasoning. Such temporal reasoning challenges are common in the real world, and solving those would be the cornerstone of the next improvements in machine video understanding. That said, CATER is, by no means, a complete solution to the video understanding problem. Like any other synthetic or simulated dataset, it should be considered in addition to real world benchmarks. One of our findings is that while high-level semantic tasks such as activity recognition may be addressable with current architectures given a richly labeled dataset, “mid-level” tasks such as tracking still pose tremendous challenges, particularly under long-term occlusions and containment. We believe addressing such challenges will enable broader temporal reasoning tasks that capture intentions, goals, and causal behavior.

Additionally, note that CATER is also not limited to video classification. Recent papers have also used CATER for other related tasks like video reconstruction and for learning object permanence. We have released a pre-generated version of CATER, along with full metadata, which can be used to define arbitrarily fine-grained spatio-temporal reasoning tasks. Additionally, we have released the code to generate CATER as well as the baselines discussed above, which can be used to reproduce our experiments or generate a custom variant of CATER for other tasks.

Interested in more details?

Check out the links to the paper, complete codebase with pre-trained models, talk and project webpage below.

Read More