Referring to Screen Texts with Voice Assistants

Voice assistants help users make phone calls, send messages, create events, navigate, and much more. However, assistants have a limited ability to understand their users’ context. In this work, we take a step toward closing that gap. We explore a new experience that lets users refer to phone numbers, addresses, email addresses, URLs, and dates shown on their phone screens. Our focus is reference understanding, which becomes particularly interesting when multiple similar texts are present on screen, much as in visual grounding. We collect a dataset and propose a lightweight…
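
As a concrete illustration of the task, here is a minimal, hypothetical sketch (not the paper’s model) of resolving a spoken reference against entities detected on screen; the `ScreenEntity` fields and the ordinal heuristic are assumptions for illustration only:

```python
# Toy reference resolver: filter on-screen entities by the type mentioned in
# the query, then use ordinal words to disambiguate duplicates.
from dataclasses import dataclass

@dataclass
class ScreenEntity:
    text: str        # surface string, e.g. "+1 415 555 0100"
    kind: str        # "phone" | "address" | "email" | "url" | "date"
    position: int    # 0-based top-to-bottom order on screen

def resolve_reference(query: str, entities: list[ScreenEntity]) -> ScreenEntity:
    kind_words = {"number": "phone", "address": "address",
                  "email": "email", "link": "url", "date": "date"}
    ordinals = {"first": 0, "second": 1, "third": 2, "last": -1}

    q = query.lower()
    candidates = [e for e in entities
                  if any(w in q for w, k in kind_words.items() if k == e.kind)]
    candidates = candidates or entities
    candidates.sort(key=lambda e: e.position)
    for word, idx in ordinals.items():
        if word in q:
            return candidates[idx]
    return candidates[0]

screen = [ScreenEntity("+1 415 555 0100", "phone", 0),
          ScreenEntity("+1 415 555 0199", "phone", 1)]
print(resolve_reference("call the second number", screen).text)
```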

5IDER: Unified Query Rewriting for Steering, Intent Carryover, Disfluencies, Entity Carryover and Repair

*= Equal Contributors
Providing voice assistants with the ability to navigate multi-turn conversations is a challenging problem. Handling multi-turn interactions requires the system to understand various conversational use-cases, such as steering, intent carryover, disfluencies, entity carryover, and repair. The complexity of this problem is compounded by the fact that these use-cases mix with each other, often appearing simultaneously in natural language. This work proposes a non-autoregressive query rewriting architecture that can handle not only the five aforementioned tasks, but also complex…
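
To make the setting concrete, below is an illustrative sketch of why non-autoregressive rewriting is attractive: a tagger can predict one edit per source token in parallel, and a deterministic pass applies them. The tag set and the repair example are hypothetical, not the 5IDER architecture:

```python
# Edit-based (non-autoregressive) rewriting: instead of generating the
# rewrite token by token, predict one edit per source token, then apply
# all edits in a single deterministic pass.
def apply_edits(tokens, edits):
    """Each edit is (op, payload): KEEP, DELETE, or REPLACE with payload tokens."""
    out = []
    for tok, (op, payload) in zip(tokens, edits):
        if op == "KEEP":
            out.append(tok)
        elif op == "REPLACE":
            out.extend(payload)
        # DELETE drops the token (handles disfluencies and repair)
    return out

# Repair example: "call mom no wait call dad" -> "call dad"
tokens = "call mom no wait call dad".split()
edits = [("DELETE", None), ("DELETE", None), ("DELETE", None),
         ("DELETE", None), ("KEEP", None), ("KEEP", None)]
print(" ".join(apply_edits(tokens, edits)))
```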

Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps

Optimal transport (OT) theory focuses, among all maps $T: \mathbb{R}^d \rightarrow \mathbb{R}^d$ that can morph a probability measure onto another, on those that are the “thriftiest”, i.e. such that the averaged cost $c(x, T(x))$ between $x$ and its image $T(x)$ be as small as possible. Many computational approaches have been proposed to estimate such Monge maps when $c$ is the $\ell_2^2$ distance, e.g., using entropic maps (Pooladian and Niles-Weed, 2021), or neural networks (Makkuva et al., 2020; Korotin et al., 2020). We propose a new model for transport maps, built on a family of translation invariant costs $c(x, y) := h(x - y)$, where $h := \tfrac{1}{2}\lVert\cdot\rVert_2^2 + \tau$ and $\tau$ is a regularizer. We propose a…
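
A natural way to see why a cost of the form $h = \tfrac{1}{2}\lVert\cdot\rVert_2^2 + \tau$ can encourage feature-sparse maps is through the proximal operator of $\tau$: for $\tau(z) = \lambda \lVert z \rVert_1$ it is coordinate-wise soft-thresholding, which drives most coordinates of a displacement exactly to zero. A minimal NumPy sketch, as an illustration rather than the paper’s exact construction:

```python
import numpy as np

def prox_l1(z: np.ndarray, lam: float) -> np.ndarray:
    """prox_{lam*||.||_1}(z) = argmin_u 0.5*||u - z||^2 + lam*||u||_1,
    i.e. coordinate-wise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Small displacement coordinates are set exactly to zero, so a map built on
# such a prox moves points along only a few features.
displacement = np.array([0.05, -1.30, 0.02, 0.80, -0.01])
print(prox_l1(displacement, lam=0.1))
```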

The Monge Gap: A Regularizer to Learn All Transport Maps

Optimal transport (OT) theory has been used in machine learning to study and characterize maps that can efficiently push forward a probability measure onto another. Recent works have drawn inspiration from Brenier’s theorem, which states that when the ground cost is the squared-Euclidean distance, the “best” map to morph a continuous measure in $\mathbb{R}^d$ into another must be the gradient of a convex function. To exploit that result, Makkuva et al. (2020) and Korotin et al. (2020) consider maps $T = \nabla f_\theta$, where $f_\theta$ is an input convex neural network (ICNN), as defined by Amos et al. (2017), and fit $\theta$ with SGD using…
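
For readers unfamiliar with ICNNs, here is a minimal NumPy sketch of such a network’s forward pass in the spirit of Amos et al. (2017): $f$ stays convex in its input as long as the hidden-to-hidden weights are elementwise nonnegative and the activation is convex and non-decreasing. The two-layer shape and sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 16                            # input dim, hidden width
W_yx, b0 = rng.normal(size=(h, d)), rng.normal(size=h)
w_z = np.abs(rng.normal(size=h))        # nonnegative z-weights: keeps f convex
w_y, b1 = rng.normal(size=d), 0.1

relu = lambda v: np.maximum(v, 0.0)     # convex and non-decreasing

def f(x: np.ndarray) -> float:
    """Two-layer ICNN: convex in x by construction."""
    z1 = relu(W_yx @ x + b0)            # convex coordinates of x
    return float(w_z @ z1 + w_y @ x + b1)

# Sanity check of convexity along a random segment: f(midpoint) <= average.
x0, x1 = rng.normal(size=d), rng.normal(size=d)
assert f(0.5 * (x0 + x1)) <= 0.5 * (f(x0) + f(x1)) + 1e-9
```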

Private Online Prediction from Experts: Separations and Faster Rates

*= Equal Contributors
Online prediction from experts is a fundamental problem in machine learning, and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $\tilde{O}(\sqrt{T \log d} + \log d / \varepsilon)$ for the stochastic setting and $\tilde{O}(\sqrt{T \log d} + T^{1/3} \log d / \varepsilon)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the…
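
For context, the non-private baseline these bounds refine is the classic exponential-weights algorithm, whose expected regret is $O(\sqrt{T \log d})$ over $d$ experts. A minimal NumPy version follows (no privacy noise; the paper’s algorithms add differentially private mechanisms on top of this kind of update):

```python
import numpy as np

def exponential_weights(losses: np.ndarray, eta: float) -> float:
    """losses: (T, d) array, losses[t, i] = loss of expert i at round t.
    Returns the algorithm's total expected loss."""
    T, d = losses.shape
    log_w = np.zeros(d)                    # log-weights, for stability
    total = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                       # play expert i with probability p[i]
        total += p @ losses[t]
        log_w -= eta * losses[t]           # multiplicative-weights update
    return total

rng = np.random.default_rng(0)
T, d = 1000, 10
losses = rng.uniform(size=(T, d))
eta = np.sqrt(np.log(d) / T)               # standard tuning
regret = exponential_weights(losses, eta) - losses.sum(axis=0).min()
print(f"regret: {regret:.1f}  vs sqrt(T log d) ~ {np.sqrt(T * np.log(d)):.1f}")
```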

Stabilizing Transformer Training by Preventing Attention Entropy Collapse

*= Equal Contributors
Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy of each attention head over the course of training, as a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We denote the pathologically low attention entropy…
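
The tracked quantity is simple to compute: the Shannon entropy of each head’s softmax attention distribution, averaged over queries (low entropy means each query attends to very few keys). A small NumPy sketch; the shapes and toy logits below are illustrative:

```python
import numpy as np

def attention_entropy(scores: np.ndarray) -> np.ndarray:
    """scores: (heads, queries, keys) pre-softmax attention logits.
    Returns the mean attention entropy per head, in nats."""
    s = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(s)
    p /= p.sum(axis=-1, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)       # entropy of each query row
    return ent.mean(axis=-1)                          # average over queries

rng = np.random.default_rng(0)
diffuse = rng.normal(size=(1, 8, 32))          # moderate logits: high entropy
sharp = 50.0 * rng.normal(size=(1, 8, 32))     # extreme logits: entropy near 0
print(attention_entropy(diffuse), attention_entropy(sharp))
```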