Enhancing JEPAs with Spatial Conditioning: Robust and Efficient Representation Learning

This paper was accepted at the Self-Supervised Learning – Theory and Practice (SSLTP) Workshop at NeurIPS 2024.
Image-based Joint-Embedding Predictive Architecture (IJEPA) offers an attractive alternative to Masked Autoencoder (MAE) for representation learning using the Masked Image Modeling framework. IJEPA drives representations to capture useful semantic information by predicting in latent rather than input space. However, IJEPA relies on carefully designed context and target windows to avoid representational collapse. The encoder modules in IJEPA cannot adaptively modulate the type of…

Apple Machine Learning Research
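
The abstract's contrast between predicting in latent space and predicting in input space can be illustrated with a toy sketch. This is not the paper's implementation: the linear `encode` function, the weight matrices, the patch and latent sizes, the mean-pooled context summary, and the choice of target patches are all hypothetical stand-ins, and the target encoder is simply assumed to be a copy of the context encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(patches, weights):
    """Toy encoder (assumption): linear map plus tanh, patches -> latents."""
    return np.tanh(patches @ weights)

# Hypothetical sizes: 16 patches of 48 pixel values, 8-dimensional latents.
num_patches, patch_dim, latent_dim = 16, 48, 8
patches = rng.normal(size=(num_patches, patch_dim))

# Split patches into a visible context window and a masked target window.
target_idx = np.array([3, 4, 7, 8])
context_idx = np.setdiff1d(np.arange(num_patches), target_idx)

w_context = rng.normal(size=(patch_dim, latent_dim)) * 0.1
w_target = w_context.copy()  # assumed stand-in for a separate target encoder
w_pred = rng.normal(size=(latent_dim, latent_dim)) * 0.1

# Pool the context embeddings into one summary vector (a simplification).
context_summary = encode(patches[context_idx], w_context).mean(axis=0)

# Latent-space objective (JEPA-style): match the target patches' embeddings.
predicted_latents = np.tile(context_summary @ w_pred, (len(target_idx), 1))
target_latents = encode(patches[target_idx], w_target)
latent_loss = float(np.mean((predicted_latents - target_latents) ** 2))

# Input-space objective (MAE-style): reconstruct the masked pixels themselves.
w_decode = rng.normal(size=(latent_dim, patch_dim)) * 0.1
reconstructed = np.tile(context_summary @ w_decode, (len(target_idx), 1))
pixel_loss = float(np.mean((reconstructed - patches[target_idx]) ** 2))

print(f"latent-space loss: {latent_loss:.4f}")
print(f"input-space loss:  {pixel_loss:.4f}")
```

The only structural difference between the two objectives here is the regression target: embeddings of the masked patches in the first case, raw pixel values in the second. The sketch does not model the spatial conditioning the paper proposes, since the excerpt above is cut off before describing it.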