Apple Machine Learning Research
Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection
Identifying mistakes (i.e., miscues) made while reading aloud is commonly approached post-hoc by comparing automatic speech recognition (ASR) transcriptions to the target reading text. However, post-hoc methods perform poorly when ASR inaccurately transcribes verbatim speech. To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. Our contributions include: first, demonstrating that…
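A minimal sketch of the prompting idea described above, using the openai-whisper package's initial_prompt to condition decoding on the target passage and a simple word-level alignment to flag miscues. The alignment step is a simplified post-hoc stand-in, not the paper's trained end-to-end detection head; the audio file and passage are hypothetical examples.

# Sketch (not the paper's model): prompt Whisper with the target reading text,
# then align the transcript to that text to surface candidate miscues.
# "reading.wav" and the target passage are hypothetical examples.
import difflib
import whisper  # openai-whisper

target_text = "the quick brown fox jumps over the lazy dog"

model = whisper.load_model("base")
# initial_prompt conditions decoding on the target passage (the prompting step).
result = model.transcribe("reading.wav", initial_prompt=target_text, language="en")
hypothesis = result["text"].lower().split()
reference = target_text.split()

# Word-level alignment: insertions, deletions, and substitutions are candidate miscues.
matcher = difflib.SequenceMatcher(a=reference, b=hypothesis)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(op, "target:", reference[i1:i2], "spoken:", hypothesis[j1:j2])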
Distillation Scaling Laws
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level…
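For context, a sketch of the standard teacher-student distillation objective that underlies studies like this one. This is the generic soft-label formulation, not the paper's scaling law; the temperature and mixing weight are illustrative choices.

# Generic knowledge-distillation loss (standard formulation, not the paper's
# scaling law): the student matches the teacher's softened logits while also
# fitting the ground-truth labels. `temperature` and `alpha` are illustrative.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # KL divergence between temperature-softened distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Standard cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce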
SpeakStream: Streaming Text-to-Speech with Interleaved Data
With the increasing integration of speech front-ends and large language models (LLMs), there is a need to explore architectures that integrate these modalities. While end-to-end models have been explored extensively, cascaded models that stream LLM outputs into TTS remain oddly under-explored, even though they are potentially much simpler. Using traditional text-to-speech systems to convert LLM outputs to audio, however, poses a technical problem, because they need entire utterances to generate stylistic audio. In this paper we present a ‘streaming’ TTS that can generate audio from…
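A hedged sketch of the cascaded pattern the abstract describes: stream text from an LLM and hand short chunks to an incremental TTS as they arrive, rather than waiting for the full utterance. Both `llm_tokens` and `synthesize_chunk` are hypothetical placeholders, not the SpeakStream implementation.

# Cascade sketch: emit audio at natural text boundaries to keep latency low.
from typing import Iterable, Iterator

def stream_speech(llm_tokens: Iterable[str],
                  synthesize_chunk,
                  flush_on=(".", ",", "?", "!")) -> Iterator[bytes]:
    buffer = []
    for token in llm_tokens:
        buffer.append(token)
        # Synthesize as soon as a chunk ends at a natural boundary.
        if token and token[-1] in flush_on:
            yield synthesize_chunk("".join(buffer))
            buffer.clear()
    if buffer:  # flush whatever text remains at the end of generation
        yield synthesize_chunk("".join(buffer))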
World-Consistent Video Diffusion With Explicit 3D Modeling
With diffusion models dominating visual content generation, efforts have been made to adapt these models for multi-view image generation to create 3D content. Traditionally, these methods implicitly learn 3D consistency by generating only RGB frames, which can lead to artifacts and inefficiencies in training. In contrast, we propose generating Normalized Coordinate Space (NCS) frames alongside RGB frames. NCS frames capture each pixel’s global coordinate, providing strong pixel correspondence and explicit supervision for 3D consistency. Additionally, by jointly estimating RGB and NCS frames…
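To make the NCS idea concrete, a conceptual sketch of how a per-pixel global-coordinate frame can be formed from depth and camera pose and normalized to a shared range. This illustrates what such a frame encodes, not the paper's exact construction; `K`, `cam_to_world`, and the scene bounds are assumed inputs.

# Back-project each pixel to a global 3D coordinate, then normalize to [0, 1].
import numpy as np

def ncs_frame(depth, K, cam_to_world, scene_min, scene_max):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Rays in camera space, scaled by per-pixel depth.
    cam_pts = (pix @ np.linalg.inv(K).T) * depth[..., None]
    cam_h = np.concatenate([cam_pts, np.ones((h, w, 1))], axis=-1)
    world = (cam_h @ cam_to_world.T)[..., :3]
    # Normalize global coordinates to a shared [0, 1] range across all frames.
    return (world - scene_min) / (scene_max - scene_min)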
Interleaved Reasoning for Large Language Models via Reinforcement Learning
Long chain-of-thought (CoT) significantly enhances large language models’ (LLMs) reasoning capabilities. However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective rule-based reward to incentivize correct intermediate steps…
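A hedged sketch of what a rule-based reward for interleaved reasoning might look like: partial credit when intermediate answers match gold sub-answers, plus credit for the final answer. The <answer> tag format and the weights are assumptions for illustration, not the paper's exact reward.

# Simple rule-based reward over interleaved generations (illustrative only).
import re

def interleaved_reward(generation: str, gold_steps: list[str], gold_final: str,
                       step_weight: float = 0.2, final_weight: float = 1.0) -> float:
    answers = re.findall(r"<answer>(.*?)</answer>", generation, flags=re.S)
    answers = [a.strip().lower() for a in answers]
    reward = 0.0
    for gold in gold_steps:
        if gold.strip().lower() in answers:
            reward += step_weight          # credit a correct intermediate step
    if answers and answers[-1] == gold_final.strip().lower():
        reward += final_weight             # credit the correct final answer
    return reward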
Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation
Auscultation, particularly of heart sounds, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs. However, there has been little exploration of the extent to which auscultation is encoded in these pre-trained FM representations. In this work, using a publicly available phonocardiogram (PCG) dataset and a heart rate (HR) estimation model, we conduct a layer-wise investigation of six acoustic representation FMs: HuBERT, wav2vec2…
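A sketch of a layer-wise probe in the spirit of this study: extract hidden states from every layer of one FM (HuBERT via Hugging Face Transformers) and fit a simple regressor per layer to predict heart rate. The checkpoint, probe model, and data arrays are illustrative assumptions, not the paper's exact setup.

# Layer-wise probing sketch: mean-pooled features per layer, one regressor per layer.
import torch
from transformers import AutoFeatureExtractor, HubertModel
from sklearn.linear_model import Ridge

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def layer_embeddings(waveform, sample_rate=16000):
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Mean-pool each layer's frame-level features into one vector per recording.
    return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states]

def probe_per_layer(features_by_layer, heart_rates):
    # features_by_layer: list over layers of (num_recordings, dim) arrays.
    return [Ridge().fit(feats, heart_rates) for feats in features_by_layer]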
CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling
Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16…
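A sketch of sparse upcycling in general terms: each expert in a new MoE layer starts as a copy of the pretrained dense FFN, and only the router is freshly initialized. A simplified top-1 router is used here for illustration; this is not the exact CLIP-UP recipe or its auxiliary losses.

# Upcycling sketch: experts are copies of the dense FFN; the router is new.
import copy
import torch
import torch.nn as nn

class UpcycledMoE(nn.Module):
    def __init__(self, dense_ffn: nn.Module, hidden_dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        self.router = nn.Linear(hidden_dim, num_experts)  # newly initialized router

    def forward(self, x):  # x: (tokens, hidden_dim)
        gates = self.router(x).softmax(dim=-1)
        top1 = gates.argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                # Each token is processed only by its top-scoring expert.
                out[mask] = gates[mask][:, i:i + 1] * expert(x[mask])
        return out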
What Makes for a Good Stereoscopic Image?
This paper was accepted at the CV4Metaverse Workshop at CVPR 2025.
With rapid advancements in virtual reality (VR) headsets, effectively measuring stereoscopic quality of experience (SQoE) has become essential for delivering immersive and comfortable 3D experiences. However, most existing stereo metrics focus on isolated aspects of the viewing experience such as visual discomfort or image quality, and have traditionally faced data limitations. To address these gaps, we present SCOPE (Stereoscopic COntent Preference Evaluation), a new dataset composed of real and synthetic stereoscopic images…
SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models
With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that…
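To illustrate the sync point in question: in tensor parallelism the attention output projection is row-parallel, and each rank's partial result is normally summed across ranks with an all-reduce; skipping that collective for selected blocks removes a communication point. This simplified module is a conceptual sketch, not SPD's block design or selection scheme.

# Row-parallel output projection with an optional, conditionally dropped all-reduce.
import torch.nn as nn
import torch.distributed as dist

class RowParallelOutProj(nn.Module):
    def __init__(self, local_in_dim: int, out_dim: int, drop_sync: bool = False):
        super().__init__()
        self.proj = nn.Linear(local_in_dim, out_dim, bias=False)
        self.drop_sync = drop_sync  # True for blocks chosen to skip the sync point

    def forward(self, local_attn_out):
        partial = self.proj(local_attn_out)      # each rank holds a partial sum
        if not self.drop_sync and dist.is_initialized():
            dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # the usual sync point
        return partial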