While federated learning (FL) and differential privacy (DP) have been extensively studied, their application to automatic speech recognition (ASR) remains largely unexplored due to the challenges in training large transformer models. Specifically, large models further exacerbate issues in FL as they are particularly susceptible to gradient heterogeneity across layers, unlike the relatively uniform gradient behavior observed in shallow models. As a result, prior works struggle to converge with standard optimization techniques, even in the absence of DP mechanisms. To the best of our knowledge…Apple Machine Learning Research
Overcoming Vocabulary Constraints with Pixel-level Fallback
Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that…Apple Machine Learning Research
ILuvUI: Instruction-Tuned Language-Vision Modeling of UIs from Machine Conversations
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but
many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image
training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike
prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a
dataset of 335K conversational examples paired with UIs that cover Q&A, UI…Apple Machine Learning Research
AXLearn: Modular Large Model Training on Heterogeneous Infrastructure
We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. AXLearn’s internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on heterogeneous compute infrastructure. We introduce a novel method of quantifying modularity via…Apple Machine Learning Research
Faster Rates for Private Adversarial Bandits
We design new differentially private algorithms for the problems of adversarial bandits and bandits with expert advice. For adversarial bandits, we give a simple and efficient conversion of any non-private bandit algorithms to private bandit algorithms. Instantiating our conversion with existing non-private bandit algorithms gives a regret upper bound of O(KTε)Oleft(frac{sqrt{KT}}{sqrt{varepsilon}}right)O(εKT), improving upon the existing upper bound O(KTlog(KT)ε)Oleft(frac{sqrt{KT log(KT)}}{varepsilon}right)O(εKTlog(KT)) in all privacy regimes. In particular, our algorithms…Apple Machine Learning Research
Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions
Wearable devices record physiological and behavioral signals that can improve health predictions. While foundation models are increasingly used for such predictions, they have been primarily applied to low-level sensor data, despite behavioral data often being more informative due to their alignment with physiologically relevant timescales and quantities. We develop foundation models of such behavioral signals using over 2.5B hours of wearable data from 162K individuals, systematically optimizing architectures and tokenization strategies for this unique dataset. Evaluated on 57 health-related…Apple Machine Learning Research
Addressing Misspecification in Simulation-based Inference through Data-driven Calibration
Driven by steady progress in deep generative modeling, simulation-based inference (SBI) has emerged as the workhorse for inferring the parameters of stochastic simulators. However, recent work has demonstrated that model misspecification can compromise the reliability of SBI, preventing its adoption in important applications where only misspecified simulators are available. This work introduces robust posterior estimation~(RoPE), a framework that overcomes model misspecification with a small real-world calibration set of ground-truth parameter measurements. We formalize the misspecification…Apple Machine Learning Research
Target Concrete Score Matching: A Holistic Framework for Discrete Diffusion
Discrete diffusion is a promising framework for modeling and generating discrete data. In this work, we present Target Concrete Score Matching (TCSM), a novel and versatile objective for training and fine-tuning discrete diffusion models. TCSM provides a general framework with broad applicability. It supports pre-training discrete diffusion models directly from data samples, and many existing discrete diffusion approaches naturally emerge as special cases of our more general TCSM framework. Furthermore, the same TCSM objective extends to post-training of discrete diffusion models, including…Apple Machine Learning Research
Shielded Diffusion: Generating Novel and Diverse Images using Sparse Repellency
Diffusion models are generating ever more realistic images. Yet, when gener-
ating images repeatedly with the same prompt, practitioners often obtain slight
variations of the same, highly-likely mode. As a result, most models fail to re-
flect the inherent diversity seen in data, which hinders their relevance to creative
tasks or ability to power world models. This work proposes a highly effective and
general method to repel generated images away from a reference set of images.
This is achieved by introducing data-driven repellence terms within diffusions dy-
namically, throughout their…Apple Machine Learning Research
A Variational Framework for Improving Naturalness in Generative Spoken Language Models
The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range…Apple Machine Learning Research