Accelerating LLM Inference on NVIDIA GPUs with ReDrafter

Accelerating LLM inference is an important ML research problem, as auto-regressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry.
Earlier this year, we published and open sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state of the art…Apple Machine Learning Research

ARMADA: Augmented Reality for Robot Manipulation and Robot-Free Data Acquisition

Teleoperation for robot imitation learning is bottlenecked by hardware availability. Can high-quality robot data be collected without a physical robot? We present a system for augmenting Apple Vision Pro with real-time virtual robot feedback. By providing users with an intuitive understanding of how their actions translate to robot motions, we enable the collection of natural barehanded human data that is compatible with the limitations of physical robot hardware. We conducted a user study with 15 participants demonstrating 3 different tasks each under 3 different feedback conditions and…Apple Machine Learning Research

BayesCNS: A Unified Bayesian Approach to Address Cold Start and Non-Stationarity in Search Systems at Scale

Information Retrieval (IR) systems used in search and recommendation platforms frequently employ Learning-to-Rank (LTR) models to rank items in response to user queries. These models heavily rely on features derived from user interactions, such as clicks and engagement data. This dependence introduces cold start issues for items lacking user engagement and poses challenges in adapting to non-stationary shifts in user behavior over time. We address both challenges holistically as an online learning problem and propose BayesCNS, a Bayesian approach designed to handle cold start and…Apple Machine Learning Research

Evaluating Gender Bias Transfer between Pre-trained and Prompt-Adapted Language Models

*Equal Contributors
Large language models (LLMs) are increasingly being adapted to achieve task-specificity for deployment in real-world decision systems. Several previous works have investigated the bias transfer hypothesis (BTH) by studying the effect of the fine-tuning adaptation strategy on model fairness to find that fairness in pre-trained masked language models have limited effect on the fairness of models when adapted using fine-tuning. In this work, we expand the study of BTH to causal models under prompt adaptations, as prompting is an accessible, and compute-efficient way to deploy…Apple Machine Learning Research

Momentum Approximation in Asynchronous Private Federated Learning

This paper was accepted for presentation at the International Workshop on Federated Foundation Models (FL@FM-NeurIPS’24), held in conjunction with NeurIPS 2024.
Asynchronous protocols have been shown to improve the scalability of federated learning (FL) with a massive number of clients. Meanwhile, momentum-based methods can achieve the best model quality in synchronous FL. However, naively applying momentum in asynchronous FL algorithms leads to slower convergence and degraded model performance. It is still unclear how to effective combinie these two techniques together to achieve a win-win…Apple Machine Learning Research

Apple Machine Learning Research at NeurIPS 2024

Apple researchers are advancing the field of ML through fundamental research that improves the world’s understanding of this technology and helps to redefine what is possible with it. This work may lead to advancements in Apple’s products and services, and the benefits of the research extend beyond the Apple ecosystem as it is shared with the broader research community through publication, open source resources, and engagement at industry and research community events.
Next week, the 38th annual Conference on Neural Information Processing Systems (NeurIPS), will be held in Vancouver, Canada…Apple Machine Learning Research

Private and Personalized Frequency Estimation in a Federated Setting

*Equal Contributors
Motivated by the problem of next word prediction on user devices we introduce and study the problem of personalized frequency histogram estimation in a federated setting. In this problem, over some domain, each user observes a number of samples from a distribution which is specific to that user. The goal is to compute for all users a personalized estimate of the user’s distribution with error measured in KL divergence. We focus on addressing two central challenges: statistical heterogeneity and protection of user privacy. Our approach to the problem relies on discovering…Apple Machine Learning Research

How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

The remarkable advancements in Multimodal Large Language Models (MLLMs) have not rendered them immune to challenges, particularly in the context of handling deceptive information in prompts, thus producing hallucinated responses under such conditions. To quantitatively assess this vulnerability, we present MAD-Bench, a carefully curated benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship. We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models…Apple Machine Learning Research

Learning Elastic Costs to Shape Monge Displacements

Given a source and a target probability measure supported on Rdmathbb{R}^dRd, the Monge problem aims for the most efficient way to map one distribution to the other.
This efficiency is quantified by defining a cost function between source and target data.
Such a cost is often set by default in the machine learning literature to the squared-Euclidean distance, ℓ22(x,y)=12∥x−y∥22ell^2_2(x,y)=tfrac12|x-y|_2^2ℓ22​(x,y)=21​∥x−y∥22​.
The benefits of using elastic costs, defined through a regularizer τtauτ as c(x,y)=ℓ22(x,y)+τ(x−y)c(x, y)=ell^2_2(x,y)+tau(x-y)c(x,y)=ℓ22​(x,y)+τ(x−y), was…Apple Machine Learning Research