Cubify Anything: Scaling Indoor 3D Object Detection

We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations to scale, accuracy, and diversity of objects. As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer…Apple Machine Learning Research

Humanoid Policy ~ Human Policy

Training manipulation policies for humanoid robots with
diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive, requiring expensive tele-operated data
collection which is difficult to scale. This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans
from both the data and modeling perspectives. We collect an egocentric task-oriented dataset (PH2D)…Apple Machine Learning Research

Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs

Current Large Language Models (LLMs) are predominantly designed with English as the primary language, and even the few that are multilingual tend to exhibit strong English-centric biases. Much like speakers who might produce awkward expressions when learning a second language, LLMs often generate unnatural outputs in non-English languages, reflecting English-centric patterns in both vocabulary and grammar. Despite the importance of this issue, the naturalness of multilingual LLM outputs has received limited attention. In this paper, we address this gap by introducing novel automatic…Apple Machine Learning Research

StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing…Apple Machine Learning Research

Matrix3D: Large Photogrammetry Model All-in-One

We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis using just the same model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D’s large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs…Apple Machine Learning Research

Local Pan-Privacy for Federated Analytics

Pan-privacy was proposed by Dwork et al. (2010) as an approach to designing a private analytics system that retains its privacy properties in the face of intrusions that expose the system’s internal state. Motivated by federated telemetry applications, we study local pan-privacy, where privacy should be retained under repeated unannounced intrusions on the local state. We consider the problem of monitoring the count of an event in a federated system, where event occurrences on a local device should be hidden even from an intruder on that device. We show that under reasonable constraints, the…Apple Machine Learning Research

Mechanisms of Projective Composition of Diffusion Models

We study the theoretical foundations of composition in diffusion models, with a particular focus on out-of-distribution extrapolation and length-generalization. Prior work has shown that composing distributions via linear score combination can achieve promising results, including length-generalization in some cases (Du et al., 2023; Liu et al., 2022). However, our theoretical understanding of how and why such compositions work remains incomplete. In fact, it is not even entirely clear what it means for composition to “work”. This paper starts to address these fundamental gaps. We begin by…Apple Machine Learning Research

Improved Sample Complexity for Private Nonsmooth Nonconvex Optimization

We study differentially private (DP) optimization algorithms for stochastic and empirical objectives which are neither smooth nor convex, and propose methods that return a Goldstein-stationary point with sample complexity bounds that improve on existing works.
We start by providing a single-pass (ϵ,δ)(epsilon,delta)(ϵ,δ)-DP algorithm that returns an (α,β)(alpha,beta)(α,β)-stationary point as long as the dataset is of size…Apple Machine Learning Research

Classifier-Free Guidance is a Predictor-Corrector

We investigate the theoretical foundations of classifier-free guidance (CFG). CFG is the dominant method of conditional sampling for text-to-image diffusion models, yet unlike other aspects of diffusion, it remains on shaky theoretical footing. In this paper, we disprove common misconceptions, by showing that CFG interacts differently with DDPM (Ho et al., 2020) and DDIM (Song et al., 2021), and neither sampler with CFG generates the gamma-powered distribution p(x|c)^γp(x)^{1−γ}. Then, we clarify the behavior of CFG by showing that it is a kind of predictor-corrector method (Song et al., 2020)…Apple Machine Learning Research

How to Verify Any (Reasonable) Distribution Property: Computationally Sound Argument Systems for Distributions

As statistical analyses become more central to science, industry and society, there is a growing need to ensure correctness of their results. Approximate correctness can be verified by replicating the entire analysis, but can we verify without replication? Building on a recent line of work, we study proof-systems that allow a probabilistic verifier to ascertain that the results of an analysis are approximately correct, while drawing fewer samples and using less computational resources than would be needed to replicate the analysis. We focus on distribution testing problems: verifying that an…Apple Machine Learning Research