This paper has been accepted to the UniReps Workshop in NeurIPS 2023.
Contrastive language image pretraining has become the standard approach for training vision language models. Despite the utility of CLIP visual features as global representations for images, they have limitations when it comes to tasks involving object localization, pixel-level understanding of the image, or 3D perception. Multi-task training is a popular solution to address this drawback, but collecting a large-scale annotated multi-task dataset incurs significant costs. Furthermore, training on separate task specific…Apple Machine Learning Research