This paper was accepted at the NeurIPS 2023 workshop on Diffusion Models.
We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in the production of music in 44.1kHz stereo audio with sampling-time guidance. The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer of desired stylistic characteristics to existing audio clips. We achieve this by applying guidance at sampling time in a simple framework that…Apple Machine Learning Research
Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation
This paper was accepted at the workshop I Can’t Believe It’s Not Better! (ICBINB) at NeurIPS 2023.
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap, and find that pre-trained language models offer limited help in auto-regressive text-to-image generation. We provide a two-fold explanation by analyzing tokens from each modality…Apple Machine Learning Research
Generating Molecular Conformers with Manifold Diffusion Fields
This paper was accepted at Generative AI and Biology workshop at NeurIPS 2023.
In this paper we tackle the problem of generating a molecule conformation in 3D space given its 2D structure. We approach this problem through the lens of a diffusion model for functions in Riemannian Manifolds. Our approach is simple and scalable, and obtains results that are on par with state-of-the-art while making no assumptions about the explicit structure of molecules.Apple Machine Learning Research
Manifold Diffusion Fields
This paper was accepted at the Diffusion Models workshop at NeurIPS 2023.
Score-based models have quickly become the de facto choice for generative modeling of images, text and more recently molecules. However, to adapt a score-based generative modeling to these domains the score network needs to be carefully designed, hampering its applicability to arbitrary data domains. In this paper we tackle this problem by taking a textit{functional} view of data. This functional view allows to cast seemingly different domains to a common shared representation. We then re-formulate the score function to…Apple Machine Learning Research
Fast Optimal Locally Private Mean Estimation via Random Projections
We study the problem of locally private mean estimation of high-dimensional vectors in the Euclidean ball. Existing algorithms for this problem either incur sub-optimal error or have high communication and/or run-time complexity. We propose a new algorithmic framework, ProjUnit, for private mean estimation that yields algorithms that are computationally efficient, have low communication complexity, and incur optimal error up to a 1+o(1)-factor. Our framework is deceptively simple: each randomizer projects its input to a random low-dimensional subspace, normalizes the result, and then runs an…Apple Machine Learning Research
FLEEK: Factual Error Detection and Correction with Evidence Retrieved from External Knowledge
Large language models’ inability to attribute their claims to external knowledge and their tendency to hallucinate makes it difficult to trust their responses. Even humans are prone to factual errors in their writing. Therefore verifying the factual accuracy of textual information, whether generated by large language models or curated by humans, is an important task. However, manually validating and correcting factual errors tends to be a tedious and labor-intensive process. In this paper, we propose FLEEK for automatic fact verification and correction. FLEEK automatically extracts factual…Apple Machine Learning Research
4M: Massively Multimodal Masked Modeling
*=Equal Contributors
Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a single unified Transformer encoder-decoder using a masked modeling objective across a wide range of input/output modalities – including text, images, geometric, and semantic…Apple Machine Learning Research
Adaptive Weight Decay
We propose adaptive weight decay, which automatically tunes the hyper-parameter for weight decay during each training iteration. For classification problems, we propose changing the value of the weight decay hyper-parameter on the fly based on the strength of updates from the classification loss (i.e., gradient of cross-entropy), and the regularization loss (i.e., -norm of the weights). We show that this simple modification can result in large improvements in adversarial robustness — an area which suffers from robust overfitting — without requiring extra data across various datasets and…Apple Machine Learning Research
What Algorithms can Transformers Learn? A Study in Length Generalization
This paper was accepted at the MATH workshop at NeurIPS 2023.
Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of if and when Transformer models can learn the true algorithm for solving a task. We study the scope of Transformers’ abilities in the specific setting of length generalization on algorithmic tasks. Here, we propose a unifying framework to understand when and how Transformers can exhibit strong length generalization on a given task. Specifically, we…Apple Machine Learning Research