Merge Vision Foundation Models via Multi-Task Distillation

As the collection of publicly available pre-trained vision foundation models (VFMs), such as CLIP, DINOv2, and SAM, grows, users face challenges in storage, memory, and computational efficiency when deploying multiple models concurrently. To address these concerns, we introduce an approach that merges the capabilities of multiple VFMs into a single, efficient multi-task model. Our method, termed "joint distillation," seamlessly integrates teacher-student learning with self-distillation, operating on unlabeled image data alone and drastically cutting computational requirements.
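
To make the multi-teacher setup concrete, below is a minimal sketch of distilling several frozen VFM teachers into one shared student backbone with a lightweight head per teacher, trained only on unlabeled images. The module names, feature dimensions, cosine-similarity loss, and toy teachers are illustrative assumptions for this sketch, not the paper's exact recipe (which additionally combines teacher-student learning with self-distillation).

```python
# Sketch: multi-teacher feature distillation into a single multi-task student.
# All names, dimensions, and the cosine loss here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTeacherStudent(nn.Module):
    """Shared student backbone with one projection head per teacher."""

    def __init__(self, backbone: nn.Module, feat_dim: int, teacher_dims: dict):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feat_dim, dim) for name, dim in teacher_dims.items()}
        )

    def forward(self, images: torch.Tensor) -> dict:
        feats = self.backbone(images)  # shared representation
        return {name: head(feats) for name, head in self.heads.items()}


def distillation_loss(student_outs: dict, teacher_outs: dict) -> torch.Tensor:
    """Average cosine-distance between each student head and its frozen teacher."""
    loss = 0.0
    for name, t_feat in teacher_outs.items():
        s_feat = student_outs[name]
        loss = loss + (1 - F.cosine_similarity(s_feat, t_feat.detach(), dim=-1)).mean()
    return loss / len(teacher_outs)


if __name__ == "__main__":
    # Toy stand-ins for real VFM teachers (e.g. CLIP, DINOv2); kept frozen.
    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
    teachers = {
        "clip": nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512)),
        "dino": nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 384)),
    }
    for t in teachers.values():
        t.requires_grad_(False)

    student = MultiTeacherStudent(backbone, feat_dim=256,
                                  teacher_dims={"clip": 512, "dino": 384})
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

    images = torch.randn(8, 3, 32, 32)  # a batch of unlabeled images
    with torch.no_grad():
        teacher_feats = {name: t(images) for name, t in teachers.items()}
    loss = distillation_loss(student(images), teacher_feats)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"distillation loss: {loss.item():.4f}")
```

After training, the single student backbone serves all downstream uses of the merged teachers, so only one model needs to be stored and run at inference time.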