Accessibility is crucial for inclusive app usability, yet developers often struggle to identify and fix app accessibility issues due to a lack of awareness and expertise and to inadequate tools. Current accessibility testing tools can identify accessibility issues but may not always provide guidance on how to address them. We introduce FixAlly, an automated tool designed to suggest source code fixes for accessibility issues detected by automated accessibility scanners. FixAlly employs a multi-agent LLM architecture to generate fix strategies, localize issues within the source code, and propose code… Apple Machine Learning Research
Retrieval-Augmented Correction of Named Entity Speech Recognition Errors
In recent years, end-to-end automatic speech recognition (ASR) systems have proven themselves remarkably accurate and performant, but these systems still have a significant error rate for entity names which appear infrequently in their training data. In parallel to the rise of end-to-end ASR systems, large language models (LLMs) have proven to be a versatile tool for various natural language processing (NLP) tasks. In NLP tasks where a database of relevant knowledge is available, retrieval augmented generation (RAG) has achieved impressive results when used with LLMs. In this work, we propose…
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate “any resolution” on top of Ferret to…
UI-JEPA: Towards Active Perception of User Intent Through Onscreen User Activity
Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advancements in multimodal large language models (MLLMs) have led to substantial progress in this area, but their demands for extensive model parameters and computing power, together with their high latency, make them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a…
Optimizing Byte-level Representation for End-to-End ASR
This paper was accepted at the IEEE Spoken Language Technology Workshop (SLT) 2024.
In this paper, we propose an algorithm to optimize a byte-level representation for end-to-end (E2E) automatic speech recognition (ASR). Byte-level representation is often used by large-scale multilingual ASR systems when the character set of the supported languages is large. The compactness and universality of byte-level representation allow ASR models to use a smaller output and therefore provide more flexibility. UTF-8 is the most commonly used byte-level representation and has been successfully applied…
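As a rough illustration of what a UTF-8 byte-level representation looks like (this is not the optimized representation proposed in the paper), the sketch below maps text to byte values, so every language shares the same 256 possible output symbols. The helper names `to_byte_tokens` and `from_byte_tokens` are hypothetical and exist only for this example.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Encode text as a sequence of UTF-8 byte values, each in 0..255."""
    return list(text.encode("utf-8"))


def from_byte_tokens(tokens: list[int]) -> str:
    """Decode byte values back to text.

    Invalid byte sequences are replaced with U+FFFD rather than raising,
    since a model emitting bytes can produce sequences that are not valid UTF-8.
    """
    return bytes(tokens).decode("utf-8", errors="replace")


if __name__ == "__main__":
    for sample in ["speech", "音声", "भाषण"]:
        tokens = to_byte_tokens(sample)
        # Non-Latin scripts need several bytes per character, so sequences
        # get longer while the symbol inventory stays fixed at 256.
        print(sample, len(sample), len(tokens), tokens)
```

The trade-off visible here is that the output inventory stays small and language-independent, while sequences become longer and some byte sequences do not decode to valid text.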
Apple Workshop on Privacy-Preserving Machine Learning 2024
At Apple, we believe privacy is a fundamental human right. It’s also one of our core values, influencing both our research and the design of Apple’s products and services.
Understanding how people use their devices often helps in improving the user experience. However, accessing the data that provides such insights (for example, what users type on their keyboards and the websites they visit) can compromise user privacy. We develop system architectures that enable learning at scale by leveraging advances in machine learning (ML), such as private federated learning (PFL), combined with…
Positional Description for Numerical Normalization
We present a Positional Description Scheme (PDS) tailored for digit sequences, integrating placeholder value information for each digit. Given the structural limitations of subword tokenization algorithms, language models encounter critical Text Normalization (TN) challenges when handling numerical tasks. Our scheme addresses this challenge through straightforward pre-processing, preserving the model architecture while significantly simplifying number normalization, rendering the problem tractable. This simplifies the task and facilitates more compact production-ready models capable of…
Classifier-Free Guidance Is a Predictor-Corrector
We investigate the unreasonable effectiveness of classifier-free guidance (CFG). CFG is the dominant method of conditional sampling for text-to-image diffusion models, yet unlike other aspects of diffusion, it remains on shaky theoretical footing. In this paper, we disprove common misconceptions by showing that CFG interacts differently with DDPM and DDIM, and that neither sampler with CFG generates the gamma-powered distribution. Then, we clarify the behavior of CFG by showing that it is a kind of Predictor-Corrector (PC) method that alternates between denoising and sharpening, which we call…
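For orientation, the two objects named in this abstract are commonly written as below. The notation is an assumption taken from the usual CFG formulation, not from the text above: $p_t$ is the noised data distribution at time $t$, $c$ is the conditioning signal, and $\gamma$ is the guidance strength.

```latex
% CFG replaces the conditional score with a weighted combination of the
% unconditional and conditional scores (guidance strength \gamma):
\tilde{s}_t(x, c) \;=\; (1-\gamma)\,\nabla_x \log p_t(x) \;+\; \gamma\,\nabla_x \log p_t(x \mid c)

% The "gamma-powered" distribution is the one CFG is often assumed to sample:
p_{\gamma}(x \mid c) \;\propto\; p(x)^{\,1-\gamma}\; p(x \mid c)^{\,\gamma}
```

With $\gamma = 1$ the combined score reduces to the ordinary conditional score; the abstract's claim concerns $\gamma \neq 1$, where the samples produced by DDPM or DDIM with this score need not follow $p_{\gamma}$.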
On the Benefits of Pixel-Based Hierarchical Policies for Task Generalization
Reinforcement learning practitioners often avoid hierarchical policies, especially in image-based observation spaces. Typically, the single-task performance improvement over flat-policy counterparts does not justify the additional complexity associated with implementing a hierarchy. However, by introducing multiple decision-making levels, hierarchical policies can compose lower-level policies to more effectively generalize between tasks, highlighting the need for multi-task evaluations. We analyze the benefits of hierarchy through simulated multi-task robotic control experiments from pixels…
Can You Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-banks as inputs, and thus, training them on self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised speech features inherently include information required for a downstream speaker verification task, and therefore, we can simplify the downstream model without sacrificing performance. To this end, we revisit the…