SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

This paper was accepted at the UniReps Workshop at NeurIPS 2023.
The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual…Apple Machine Learning Research

Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs

Recent work in Natural Language Processing and Computer Vision has been using textual information – e.g., entity names and descriptions – available in knowledge graphs to ground neural models to high-quality structured data. However, when it comes to non-English languages, the quantity and quality of textual information are comparatively scarce. To address this issue, we introduce the novel task of automatic Knowledge Graph Enhancement (KGE) and perform a thorough investigation on bridging the gap in both the quantity and quality of textual information between English and non-English…Apple Machine Learning Research

What Algorithms can Transformers Learn? A Study in Length Generalization

This paper was accepted at the MATH workshop at NeurIPS 2023.
Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of if and when Transformer models can learn the true algorithm for solving a task. We study the scope of Transformers’ abilities in the specific setting of length generalization on algorithmic tasks. Here, we propose a unifying framework to understand when and how Transformers can exhibit strong length generalization on a given task. Specifically, we…Apple Machine Learning Research

Diffusion Models as Masked Audio-Video Learners

This paper was accepted at the Machine Learning for Audio Workshop at NeurIPS 2023.
Over the past several years, the synchronization between audio and visual signals has been leveraged to learn richer audio-visual representations. Aided by the large availability of unlabeled videos, many unsupervised training frameworks have demonstrated impressive results in various downstream audio and video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a state-of-the-art audio-video pre-training framework. MAViL couples contrastive learning with masked autoencoding to jointly…Apple Machine Learning Research

How to Scale Your EMA

Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule; for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum…Apple Machine Learning Research
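The two tools named in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's method: the function names are hypothetical, and the convention that `momentum` weights the existing EMA copy (so values near 1 mean slow movement) is an assumption, since the abstract is truncated before it defines the momentum.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule for SGD: the learning rate grows in proportion to batch size."""
    kappa = new_batch_size / base_batch_size  # batch-size scaling factor
    return base_lr * kappa


def ema_update(ema_params, target_params, momentum):
    """Move each EMA parameter towards its target parameter.

    Assumed convention: `momentum` multiplies the old EMA value, so a
    momentum close to 1 makes the EMA copy change slowly.
    """
    return [
        momentum * ema + (1.0 - momentum) * target
        for ema, target in zip(ema_params, target_params)
    ]


# Toy usage: quadruple the batch size, then take one EMA step.
scaled_lr = scale_learning_rate(base_lr=0.1, base_batch_size=256, new_batch_size=1024)
ema_after_step = ema_update(ema_params=[0.0], target_params=[1.0], momentum=0.9)
```

The sketch covers only the linear learning-rate rule and a plain EMA step. How the momentum itself should change with batch size is the subject of the paper and is not reproduced here.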

Automating Behavioral Testing in Machine Translation

Behavioral testing in NLP allows fine-grained evaluation of systems by examining their linguistic capabilities through the analysis of input-output behavior. Unfortunately, existing work on behavioral testing in Machine Translation (MT) is currently restricted to largely handcrafted tests covering a limited range of capabilities and languages. To address this limitation, we propose using Large Language Models (LLMs) to generate a diverse set of source sentences tailored to test the behavior of MT models in a range of situations. We can then verify whether the MT model exhibits the expected…Apple Machine Learning Research

Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

Spotting user-defined flexible keywords in real time is challenging because the keyword is represented in text. In this work, we propose a novel architecture to efficiently detect flexible keywords based on the following ideas. We construct the representative acoustic embedding of a keyword using grapheme-to-phone conversion. The phone-to-embedding conversion is done by looking up an embedding dictionary, which is built by averaging the corresponding embeddings (from the audio encoder) of each phone during training. The key benefit of our approach is that both text embedding and audio…Apple Machine Learning Research
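The dictionary lookup described in this abstract can be illustrated with a small sketch. It assumes the grapheme-to-phone step has already produced a phone sequence, and it uses toy two-dimensional vectors in place of real audio-encoder outputs. The function names and phone labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_phone_dictionary(phones, embeddings):
    """Average the audio-encoder embeddings observed for each phone."""
    grouped = defaultdict(list)
    for phone, vector in zip(phones, embeddings):
        grouped[phone].append(vector)
    # Element-wise mean over all vectors collected for the same phone.
    return {
        phone: [sum(column) / len(vectors) for column in zip(*vectors)]
        for phone, vectors in grouped.items()
    }


def keyword_embedding(phone_sequence, phone_dictionary):
    """Look up one averaged embedding per phone of the keyword, in order."""
    return [phone_dictionary[phone] for phone in phone_sequence]


# Toy training data: the phone "HH" is observed twice, "AY" once.
training_phones = ["HH", "AY", "HH"]
training_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]

phone_dictionary = build_phone_dictionary(training_phones, training_embeddings)

# Assumed output of grapheme-to-phone conversion for a keyword such as "hi".
hi_embedding = keyword_embedding(["HH", "AY"], phone_dictionary)
```

Because the dictionary entries are averages of audio-encoder outputs, the keyword's text-derived embedding lies in the same space as the audio embeddings it is compared against.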

ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. Despite recent trends favoring alternative activation functions such as GELU or SiLU, known for increased computation, this study strongly advocates for reinstating ReLU activation in LLMs. We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer…Apple Machine Learning Research
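One way to see why ReLU can reduce both computation and weight transfer: every hidden unit that ReLU sets to exactly zero contributes nothing to the next matrix product, so its row of weights never needs to be loaded or multiplied. The sketch below is an illustration under that assumption, with hypothetical names and toy numbers. It is not the implementation evaluated in the paper.

```python
def relu(values):
    """ReLU activation: negative inputs become exactly zero."""
    return [max(0.0, v) for v in values]


def sparse_matvec(weight_rows, activations):
    """Compute sum_i activations[i] * weight_rows[i], skipping zero activations.

    Returns the output vector and the number of weight rows actually used.
    """
    output = [0.0] * len(weight_rows[0])
    rows_used = 0
    for activation, row in zip(activations, weight_rows):
        if activation == 0.0:
            continue  # this row is neither fetched nor multiplied
        rows_used += 1
        for j, weight in enumerate(row):
            output[j] += activation * weight
    return output, rows_used


# Toy hidden layer with four units, two of which are negative before ReLU.
pre_activations = [-1.0, 2.0, -3.0, 0.5]
hidden = relu(pre_activations)

# One weight row per hidden unit, projecting to a 2-dimensional output.
weight_rows = [[1.0, 1.0], [0.5, -1.0], [2.0, 2.0], [4.0, 0.0]]
output, rows_used = sparse_matvec(weight_rows, hidden)
```

In this toy case only two of the four weight rows are touched. Smooth activations such as GELU or SiLU rarely produce exact zeros, so the same skip would seldom apply to them.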

Agnostically Learning Single-Index Models using Omnipredictors

We give the first result for agnostically learning Single-Index Models (SIMs) with arbitrary monotone and Lipschitz activations. All prior work either held only in the realizable setting or required the activation to be known. Moreover, we only require the marginal to have bounded second moments, whereas all prior work required stronger distributional assumptions (such as anticoncentration or boundedness). Our algorithm is based on recent work by [GHK+23] on omniprediction using predictors satisfying calibrated multiaccuracy. Our analysis is simple and relies on the relationship between…Apple Machine Learning Research