ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. Despite recent trends favoring alternative activation functions such as GELU or SiLU, which incur more computation, this study strongly advocates for reinstating ReLU activation in LLMs. We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer…Apple Machine Learning Research
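
As a concrete illustration of why ReLU helps at inference time, here is a minimal numpy sketch (toy sizes, not the paper's models) of exploiting activation sparsity in a feed-forward block: once ReLU zeroes an activation, the corresponding row of the down-projection weight never needs to be loaded or multiplied.

    import numpy as np

    # Toy FFN sizes (illustrative, not from the paper).
    d_model, d_ff = 8, 32
    rng = np.random.default_rng(0)
    W_up = rng.standard_normal((d_model, d_ff))
    W_down = rng.standard_normal((d_ff, d_model))
    x = rng.standard_normal(d_model)

    # Dense path: compute the full down-projection.
    h = np.maximum(x @ W_up, 0.0)   # ReLU leaves many entries exactly 0
    dense_out = h @ W_down

    # Sparse path: only rows of W_down whose activation is nonzero
    # contribute, so the rest can be skipped entirely.
    nz = np.flatnonzero(h)
    sparse_out = h[nz] @ W_down[nz]

    assert np.allclose(dense_out, sparse_out)
    print(f"active neurons: {nz.size}/{d_ff}")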

Guiding Instruction-based Image Editing via Multimodal Large Language Models

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing…Apple Machine Learning Research
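
A rough sketch of the pipeline the abstract describes, with purely hypothetical stand-in functions (mllm_expand and diffusion_edit are illustrative names, not the paper's API): the MLLM first rewrites a terse command into an expressive, visually grounded instruction, which then conditions the editing model.

    # Hypothetical stand-ins for MGIE's components; names are illustrative only.
    def mllm_expand(instruction: str, image_desc: str) -> str:
        # In MGIE, an MLLM turns a terse command into an expressive,
        # visually grounded edit instruction. Here we just template it.
        return (f"Edit the image ({image_desc}) so that: {instruction}, "
                f"keeping other content unchanged.")

    def diffusion_edit(image, guidance: str):
        # Placeholder for the editing model conditioned on derived guidance.
        return f"<edited image conditioned on: {guidance!r}>"

    brief = "make it healthier"
    image = "a photo of a pepperoni pizza"
    expressive = mllm_expand(brief, image)
    print(diffusion_edit(image, expressive))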

How Far Are We from Intelligent Visual Deductive Reasoning?

This paper was accepted at the How Far Are We from AGI? workshop at ICLR 2024.
Vision-Language Models (VLMs) such as GPT-4V have recently demonstrated incredible strides on diverse vision-language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blind spots in the current SOTA VLMs. Specifically, we leverage Raven’s Progressive Matrices (RPMs) to assess VLMs’ abilities to perform multi-hop relational and deductive reasoning relying solely on visual clues. We perform comprehensive evaluations of several popular VLMs…Apple Machine Learning Research
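
To make the evaluation setup concrete, here is a toy, text-only stand-in for an RPM-style probe. The real benchmark renders visual panels and queries the VLM; the solver below is a hand-coded heuristic, not a model.

    import random

    # Toy RPM-style probe (illustrative only): panels are (shape, count)
    # pairs and the hidden rule is "count increases left to right within a
    # row". The solver must choose the missing ninth panel from options.
    random.seed(0)
    rows = [[("circle", c) for c in (1, 2, 3)],
            [("square", c) for c in (1, 2, 3)],
            [("triangle", 1), ("triangle", 2)]]   # final panel hidden
    answer = ("triangle", 3)
    choices = [("triangle", 3), ("triangle", 1), ("circle", 3), ("square", 2)]
    random.shuffle(choices)

    def toy_solver(rows, choices):
        # Stand-in for a VLM call; a real evaluation renders the panels as
        # an image and asks the model to pick an option. This heuristic
        # applies the row rule directly.
        shape, count = rows[-1][-1]
        guess = (shape, count + 1)
        return choices.index(guess) if guess in choices else 0

    pred = toy_solver(rows, choices)
    print("correct" if choices[pred] == answer else "wrong")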

MOFI: Learning Image Representation from Noisy Entity Annotated Images

In this paper, we introduce a novel approach to automatically assign entity labels to images from existing noisy image-text pairs. The approach employs a named entity recognition model to extract entities from text, and uses a CLIP model to select the right entities as the labels of the paired image. The approach is simple and can be readily scaled up to billions of image-text pairs mined from the web, through which we have successfully created a dataset with 2 million distinct entities. We study new training approaches on the collected dataset with large-scale entity labels…Apple Machine Learning Research
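
A minimal sketch of the labeling pipeline described above, with stand-in components (extract_entities and clip_similarity are hypothetical stubs for a real NER model and CLIP embedding similarity): run NER over the alt-text, score each entity against the image, and keep the entities that match.

    def extract_entities(text: str) -> list[str]:
        # Stand-in for a real NER model (e.g., a spaCy pipeline).
        known = {"Eiffel Tower", "Paris"}
        return [e for e in known if e in text]

    def clip_similarity(image, entity: str) -> float:
        # Stand-in for cosine similarity of CLIP image and text embeddings.
        return {"Eiffel Tower": 0.31, "Paris": 0.12}.get(entity, 0.0)

    def entity_labels(image, alt_text: str, threshold: float = 0.2) -> list[str]:
        return [e for e in extract_entities(alt_text)
                if clip_similarity(image, e) >= threshold]

    print(entity_labels("<image>", "The Eiffel Tower at night, Paris"))
    # -> ['Eiffel Tower']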

Large Language Models as Generalizable Policies for Embodied Tasks

We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions. We show that LLaRP is robust to complex paraphrasings of task instructions and can generalize to new tasks that require novel optimal behavior. In…Apple Machine Learning Research
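
Schematically, the setup pairs a frozen LM backbone with trainable input and output adapters. The PyTorch sketch below uses toy sizes and a generic transformer encoder as a stand-in for the pre-trained LLM; only the visual adapter and action head train, matching the abstract's description of a frozen LM.

    import torch
    import torch.nn as nn

    # Schematic of a LLaRP-style setup: frozen LM backbone, trainable
    # visual adapter, trainable action head. Sizes are toy values.
    d_model, n_actions = 64, 6

    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
        num_layers=2,
    )
    for p in backbone.parameters():      # the LM stays frozen
        p.requires_grad = False

    visual_adapter = nn.Linear(128, d_model)  # egocentric features -> LM space
    action_head = nn.Linear(d_model, n_actions)

    instr_tokens = torch.randn(1, 8, d_model)  # stand-in: embedded instruction
    obs_feats = torch.randn(1, 4, 128)         # stand-in: visual encoder output

    seq = torch.cat([instr_tokens, visual_adapter(obs_feats)], dim=1)
    logits = action_head(backbone(seq)[:, -1])  # policy over discrete actions
    action = torch.distributions.Categorical(logits=logits).sample()
    print(action.item())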

Compressing LLMs: The Truth is Rarely Pure and Never Simple

Despite their remarkable achievements, modern Large Language Models (LLMs) incur exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit-width down to 3 or 4 bits per weight, with negligible perplexity degradation over the uncompressed baseline. As recent research efforts are focused on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing…Apple Machine Learning Research
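
For reference, the two training-free operations the abstract mentions are simple to state; a numpy sketch of their most naive forms (magnitude pruning to 50% sparsity and symmetric round-to-nearest 4-bit quantization) follows. The paper's point is precisely that perplexity after such compression can look benign while harder capabilities degrade, so a sketch like this is a starting point, not an evaluation.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256)).astype(np.float32)

    # Magnitude pruning: zero out the 50% smallest-magnitude weights.
    thresh = np.quantile(np.abs(W), 0.5)
    W_pruned = np.where(np.abs(W) >= thresh, W, 0.0)

    # Symmetric round-to-nearest 4-bit quantization (integer levels -8..7).
    n_bits = 4
    scale = np.abs(W).max() / (2 ** (n_bits - 1) - 1)
    W_quant = np.clip(np.round(W / scale),
                      -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1) * scale

    for name, W_hat in [("50% sparsity", W_pruned), ("4-bit RTN", W_quant)]:
        err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
        print(f"{name}: relative weight error {err:.3f}")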

When can transformers reason with abstract symbols?

We investigate the capabilities of transformer models on relational reasoning tasks. In these tasks, models are trained on a set of strings encoding abstract relations, and are then tested out-of-distribution on data that contains symbols that did not appear in the training dataset. We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relations and generalize to the test set when trained by gradient descent on sufficiently large quantities of training data. This is in contrast to classical fully-connected networks, which we prove fail to…Apple Machine Learning Research
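
The train/test split the abstract describes is easy to picture with a toy relational task: the relation ("are the two symbols equal?") stays fixed, but test-time symbols are disjoint from training symbols, so success requires learning the abstract relation rather than memorizing symbol identities. A small data-generation sketch (illustrative, not the paper's task family):

    import random

    random.seed(0)
    train_syms = list(range(100))
    test_syms = list(range(100, 120))   # never seen during training

    def sample(symbols):
        a = random.choice(symbols)
        b = a if random.random() < 0.5 else random.choice(symbols)
        return (a, b), int(a == b)      # pair of symbols, "same?" label

    train_set = [sample(train_syms) for _ in range(10_000)]
    test_set = [sample(test_syms) for _ in range(1_000)]
    print(train_set[0], test_set[0])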

Poly-View Contrastive Learning

Contrastive learning typically matches pairs of related views among a number of unrelated negative views.
Views can be generated (e.g. by augmentations) or be observed. We investigate matching when there are more than two related views, which we call poly-view tasks, and derive new representation learning objectives using information maximization and sufficient statistics.
We show that with unlimited computation, one should maximize the number of related views, and with a fixed compute budget, it is beneficial to decrease the number of unique samples whilst increasing the number of views of…Apple Machine Learning Research
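
One simple way to use M > 2 related views, shown below, is to average a standard InfoNCE loss over all ordered pairs of views; this is an illustrative PyTorch baseline, not the information-maximization objectives the paper derives.

    import torch
    import torch.nn.functional as F

    def polyview_info_nce(z, temperature=0.1):
        # z: (M, N, D) = M related views of N samples, D-dim embeddings.
        # Other samples in the batch act as the unrelated negatives.
        M, N, _ = z.shape
        z = F.normalize(z, dim=-1)
        loss, labels = 0.0, torch.arange(N)
        for i in range(M):
            for j in range(M):
                if i == j:
                    continue
                logits = z[i] @ z[j].T / temperature   # (N, N) similarities
                loss = loss + F.cross_entropy(logits, labels)
        return loss / (M * (M - 1))

    z = torch.randn(4, 8, 16)   # M=4 views, N=8 samples, D=16
    print(polyview_info_nce(z).item())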

Label-Efficient Sleep Staging Using Transformers Pre-trained with Position Prediction

Sleep staging is a clinically important task for diagnosing various sleep disorders but remains challenging to deploy at scale because it requires clinical expertise, among other reasons. Deep learning models can perform the task but at the expense of large labeled datasets, which are infeasible to procure at scale. While self-supervised learning (SSL) can mitigate this need, recent studies on SSL for sleep staging have shown that performance gains saturate after training with labeled data from only tens of subjects, and hence cannot match the peak performance attained with larger datasets. We…Apple Machine Learning Research
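
A sketch of a position-prediction pretext task of the kind the title refers to: embed a sequence of signal epochs, shuffle them, and train the encoder to recover each epoch's original position. The sizes and the generic transformer encoder below are toy stand-ins, not the paper's architecture or preprocessing.

    import torch
    import torch.nn as nn

    seq_len, d_model = 10, 32
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
        num_layers=2,
    )
    pos_head = nn.Linear(d_model, seq_len)    # classify position 0..seq_len-1

    epochs = torch.randn(1, seq_len, d_model)  # stand-in: epoch embeddings
    perm = torch.randperm(seq_len)             # shuffle the sequence
    logits = pos_head(encoder(epochs[:, perm]))  # (1, seq_len, seq_len)

    # Each shuffled slot must predict the original position of its epoch.
    loss = nn.functional.cross_entropy(logits.squeeze(0), perm)
    loss.backward()
    print(loss.item())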

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7× Faster Pre-training on Web-scale Image-Text Data

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, the pairwise similarity computation in the contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training method for vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable 2.7…Apple Machine Learning Research
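
The reframing is straightforward to sketch: turn each caption into a multi-label target over a fixed vocabulary (CatLIP maps caption nouns to WordNet synsets; naive word matching stands in for that here) and train the image encoder with binary cross-entropy, so no pairwise image-text similarity matrix is ever formed.

    import torch
    import torch.nn as nn

    # Toy label vocabulary and encoder stand-ins; not the paper's setup.
    vocab = {"dog": 0, "frisbee": 1, "beach": 2, "car": 3}

    def caption_targets(caption: str) -> torch.Tensor:
        # Multi-hot target: which vocabulary entries the caption mentions.
        y = torch.zeros(len(vocab))
        for word in caption.lower().split():
            if word in vocab:
                y[vocab[word]] = 1.0
        return y

    image_feats = torch.randn(1, 128)          # stand-in: encoder output
    classifier = nn.Linear(128, len(vocab))
    target = caption_targets("a dog catching a frisbee on the beach").unsqueeze(0)

    # Multi-label BCE replaces the pairwise contrastive objective.
    loss = nn.functional.binary_cross_entropy_with_logits(
        classifier(image_feats), target)
    loss.backward()
    print(loss.item())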