Enhancing CTC-based Speech Recognition with Diverse Modeling Units

In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable, largely due to advances in deep learning architectures like the Transformer. On top of E2E systems, researchers have achieved substantial accuracy improvements by rescoring an E2E model’s N-best hypotheses with a phoneme-based model. This raises an interesting question about where the improvements come from, other than the system combination effect. We examine the underlying mechanisms driving these gains and propose an efficient joint training approach, where E2E models are trained jointly…
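
As a minimal sketch of what N-best rescoring with a second model looks like, here is a log-linear combination of per-hypothesis scores. The scores and the interpolation weight `lam` are illustrative assumptions, not values from the paper:

```python
# Minimal sketch of N-best rescoring with a second model (hypothetical scores;
# the interpolation weight `lam` is an assumption, not a value from the paper).

def rescore_nbest(hypotheses, e2e_scores, phoneme_scores, lam=0.3):
    """Combine two models' log-probabilities log-linearly; return the best hypothesis."""
    combined = [(1.0 - lam) * e + lam * p for e, p in zip(e2e_scores, phoneme_scores)]
    best = max(range(len(hypotheses)), key=lambda i: combined[i])
    return hypotheses[best]

# Toy usage:
hyps = ["recognize speech", "wreck a nice beach"]
print(rescore_nbest(hyps, e2e_scores=[-3.2, -3.5], phoneme_scores=[-2.1, -4.0]))
```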

Transfer Learning for Structured Pruning under Limited Task Data

This paper was accepted at the Efficient Natural Language and Speech Processing (ENLSP-III) Workshop at NeurIPS.
Large, pre-trained models are problematic to use in resource-constrained applications. Fortunately, task-aware structured pruning methods offer a solution. These approaches reduce model size by dropping structural units like layers and attention heads in a manner that takes the end task into account. However, these pruning algorithms require more task-specific data than is typically available. We propose a framework that combines structured pruning with transfer learning to reduce…
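
A rough sketch of task-aware structured pruning at the attention-head level follows. The importance scores below are random stand-ins; in practice they would come from a task-specific criterion such as the sensitivity of the task loss to masking each head:

```python
import numpy as np

# Task-aware head pruning, sketched: rank attention heads by an importance score
# and drop the bottom fraction. The scores here are random toy data; real scores
# would reflect the end task (e.g., loss sensitivity to masking each head).

rng = np.random.default_rng(0)
n_layers, n_heads = 12, 12
head_importance = rng.random((n_layers, n_heads))   # hypothetical scores

sparsity = 0.5                                      # assumed pruning target
threshold = np.quantile(head_importance, sparsity)
head_mask = (head_importance >= threshold).astype(np.float32)

# The binary mask is applied multiplicatively to each head's output, so the
# pruned model keeps only the heads that matter for the end task.
print(f"kept {int(head_mask.sum())} of {head_mask.size} heads")
```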

Accurate Knowledge Distillation via N-best Reranking

We propose utilizing n-best reranking to enhance Sequence-Level Knowledge Distillation (Kim and Rush, 2016): we extract pseudo-labels for the student model’s training data from the top n-best hypotheses and leverage a diverse set of models with different inductive biases, objective functions, or architectures, including some publicly available large language models, to pick the highest-quality hypotheses as labels. The effectiveness of our proposal is validated through experiments on the WMT’21 German ↔ English and Chinese ↔ English translation tasks. Our results demonstrate that utilizing…
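
A hedged sketch of the pseudo-label selection step: score each hypothesis in the n-best list with several rerankers and keep the best. The uniform ensemble weighting and the toy scoring functions are assumptions for illustration:

```python
# Sketch: pick the distillation pseudo-label as the n-best hypothesis with the
# highest average score across a set of rerankers (uniform weights assumed).

def pick_pseudo_label(nbest, rerankers):
    def ensemble_score(hyp):
        return sum(score(hyp) for score in rerankers) / len(rerankers)
    return max(nbest, key=ensemble_score)

# Toy usage: two trivial "models" with different inductive biases.
nbest = ["das ist gut", "das ist sehr gut"]
rerankers = [lambda h: -abs(len(h.split()) - 4),      # prefers ~4-word outputs
             lambda h: len(set(h.split()))]           # prefers lexical variety
print(pick_pseudo_label(nbest, rerankers))            # -> "das ist sehr gut"
```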

Bytes Are All You Need: Transformers Operating Directly On File Bytes

Modern deep learning approaches usually utilize modality-specific processing. For example, the most common deep learning approach to image classification involves decoding image file bytes into an RGB tensor which is passed into a neural network. Instead, we investigate modality-independent representation learning by performing classification directly on file bytes, without the need for decoding files at inference time. This enables models to operate on various modalities without any hand-designed, modality-specific processing. Our model, ByteFormer, improves ImageNet Top-1 classification…
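
To make the idea concrete, here is a minimal byte-level Transformer classifier in PyTorch. The dimensions, mean-pooling, and the omission of positional encodings are simplifying assumptions; the actual ByteFormer architecture differs (for instance, in how it handles long byte sequences):

```python
import torch
import torch.nn as nn

class ByteClassifier(nn.Module):
    """Toy byte-level Transformer: embed raw bytes, encode, mean-pool, classify."""

    def __init__(self, n_classes=1000, d_model=128, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)          # one vector per byte value
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, byte_ids):                         # (batch, seq_len) in [0, 255]
        x = self.encoder(self.embed(byte_ids))           # positional info omitted here
        return self.head(x.mean(dim=1))                  # mean-pool over the sequence

# Toy usage: classify 512 "file bytes" with no decoding step.
model = ByteClassifier()
fake_file = torch.randint(0, 256, (1, 512))
print(model(fake_file).shape)                            # torch.Size([1, 1000])
```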

Private Vector Mean Estimation in the Shuffle Model: Optimal Rates Require Many Messages

We study the problem of private vector mean estimation in the shuffle model of privacy where n users each have a unit vector in d dimensions. We propose a new multi-message protocol that achieves the optimal error using Õ(min(nε², d)) messages per user. Moreover, we show that any (unbiased) protocol that achieves optimal error requires each user to send Ω(min(nε², d)/log(n)) messages, demonstrating the optimality of our message complexity up to logarithmic…
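
For intuition, the sketch below simulates the structure of the shuffle model: each user emits several randomized messages, a shuffler discards which user sent what, and the analyzer sees only the anonymized multiset. The noise scale and the per-user message count are illustrative assumptions, not the paper's protocol:

```python
import numpy as np

# Toy shuffle-model simulation (not the paper's protocol): each user splits its
# unit vector into k noisy additive shares, the shuffler permutes all messages,
# and the analyzer averages them. `k` and `sigma` are illustrative assumptions.

rng = np.random.default_rng(0)
n, d, k, sigma = 1000, 16, 4, 0.5

vectors = rng.normal(size=(n, d))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)    # unit vectors

shares = vectors[:, None, :] / k + rng.normal(scale=sigma / k, size=(n, k, d))
messages = shares.reshape(n * k, d)
rng.shuffle(messages, axis=0)                                # anonymize origins

estimate = messages.sum(axis=0) / n                          # analyzer's mean estimate
print(np.linalg.norm(estimate - vectors.mean(axis=0)))       # estimation error
```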

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models’ compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and…
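
A hypothetical evaluation loop in the spirit of such a benchmark is sketched below; the item schema, model interface, and judge function are all assumed names, not MIA-Bench's actual API:

```python
# Hypothetical benchmark loop: each item pairs an image with a layered prompt,
# and a judge scores the response against every sub-instruction. All names here
# are assumed for illustration.

def evaluate(model, benchmark, judge):
    scores = []
    for item in benchmark:
        response = model(item["image"], item["prompt"])
        per_inst = [judge(response, inst) for inst in item["instructions"]]
        scores.append(sum(per_inst) / len(per_inst))
    return sum(scores) / len(scores)     # mean compliance across all pairs

# Toy usage with stub model and judge:
bench = [{"image": None, "prompt": "Describe the scene.",
          "instructions": ["mention colors", "answer in one sentence"]}]
print(evaluate(lambda image, prompt: "A red car on a gray street.",
               bench, judge=lambda resp, inst: 1.0))
```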

Optimization Without Retraction on the Random Generalized Stiefel Manifold

Optimization over the set of matrices X that satisfy X^T B X = I_p, referred to as the generalized Stiefel manifold, appears in many applications involving sampled covariance matrices, such as canonical correlation analysis (CCA), independent component analysis (ICA), and the generalized eigenvalue problem (GEVP). Solving these problems is typically done by iterative methods that require a fully formed B. We propose a cheap stochastic iterative method that solves the optimization problem while having access only to random estimates of B. Our method does not enforce the constraint in every…
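
The sketch below illustrates one retraction-free, penalty-based iteration in this spirit: take a plain gradient step on the objective plus a penalty on the constraint violation, using only a noisy random estimate of B at each step. The toy objective, step size, and penalty weight are assumptions; this is not the paper's exact method:

```python
import numpy as np

# Penalty-style sketch: minimize f(X) = -tr(X^T A X) subject to X^T B X = I_p,
# taking plain gradient steps (no retraction) and seeing only a noisy random
# estimate of B each iteration. All constants are illustrative assumptions.

rng = np.random.default_rng(0)
n, p, eta, lam = 10, 2, 2e-3, 1.0

A = rng.normal(size=(n, n)) / n
A = A + A.T                                          # toy symmetric objective matrix
G = rng.normal(size=(n, n)) / np.sqrt(n)
B_true = np.eye(n) + G @ G.T                         # SPD constraint matrix
X = rng.normal(size=(n, p)) / np.sqrt(n)

for _ in range(5000):
    B_hat = B_true + 0.05 * rng.normal(size=(n, n))  # random estimate of B
    B_hat = (B_hat + B_hat.T) / 2
    R = X.T @ B_hat @ X - np.eye(p)                  # constraint residual
    grad_f = -2.0 * A @ X
    grad_pen = 4.0 * B_hat @ X @ R                   # grad of ||X^T B X - I||_F^2
    X -= eta * (grad_f + lam * grad_pen)             # plain step, no retraction

print(np.linalg.norm(X.T @ B_true @ X - np.eye(p)))  # near-feasible at the end
```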

Revisiting Non-separable Binary Classification and its Applications in Anomaly Detection

The inability to linearly classify XOR has motivated much of deep learning. We revisit this age-old problem and show that linear classification of XOR is indeed possible. Instead of separating data between halfspaces, we propose a slightly different paradigm, equality separation, that adapts the SVM objective to distinguish data within or outside the margin. Our classifier can then be integrated into neural network pipelines with a smooth approximation. From its properties, we intuit that equality separation is suitable for anomaly detection. To formalize this notion, we introduce closing…
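
A minimal sketch of equality separation and why it handles XOR: classify a point as positive when it lies within an ε-band of a hyperplane, rather than on one side of it. The Gaussian bump used as the smooth approximation is one plausible choice, not necessarily the paper's exact formulation:

```python
import numpy as np

# Equality separation, sketched: a point is positive when it lies within an
# eps-band of the hyperplane w.x + b = 0, not on one side of it. The Gaussian
# bump below is one plausible smooth approximation for use inside networks.

def equality_separator(X, w, b, eps):
    return (np.abs(X @ w + b) <= eps).astype(int)

def smooth_equality(X, w, b, sigma=0.5):
    return np.exp(-((X @ w + b) ** 2) / (2 * sigma**2))   # differentiable, in (0, 1]

# XOR becomes linearly classifiable under this paradigm:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
w, b = np.array([1.0, 1.0]), -1.0   # positives lie exactly on w.x + b = 0
print(equality_separator(X, w, b, eps=0.5))   # -> [0 1 1 0], matching y
```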

Applying RLAIF for Code Generation with API-usage in Lightweight LLMs

This paper was accepted at the Natural Language Reasoning and Structured Explanations workshop at ACL 2024.
Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains, including mitigating harm in LLM outputs, enhancing text summarization, and mathematical reasoning. This paper introduces an RLAIF framework for improving the code generation abilities of lightweight…
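
A high-level sketch of such an RLAIF loop is below. Every function is a hypothetical stand-in: a real system would use a lightweight policy LLM, an AI critic that scores properties like correct API usage, and a policy-gradient-style update:

```python
import random

# RLAIF skeleton for code generation; every function is a hypothetical stand-in.

def policy_generate(prompt):                    # stand-in for the lightweight LLM
    return random.choice(["import math\nmath.sqrt(2)", "math.sqrt(2"])

def ai_feedback(prompt, code):                  # stand-in for the AI critic
    try:
        compile(code, "<candidate>", "exec")    # e.g., reward parseable code
        return 1.0
    except SyntaxError:
        return 0.0

def update_policy(samples):                     # stand-in for the RL update
    pass                                        # e.g., PPO on (prompt, code, reward)

prompts = ["compute the square root of 2 using the standard library"]
batch = [(p, c, ai_feedback(p, c)) for p in prompts for c in [policy_generate(p)]]
update_policy(batch)
print(batch)
```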