Understanding contrastive versus reconstructive self-supervised learning of Vision Transformers

Self-supervised learning (SSL) Workshop at NeurIPS


While self-supervised learning on Vision Transformers (ViTs) has led to state-of-the-art results on image classification benchmarks, there has been little research on how the representations produced by different training methods differ. We address this by using Centred Kernel Alignment (CKA) to compare neural representations learned by contrastive learning and reconstructive learning, two leading paradigms for self-supervised learning. We find that the representations learned by reconstructive learning are significantly dissimilar from those learned by contrastive learning. We analyze these differences and find that they start to arise early in the network depth and are driven mostly by the attention and normalization layers in a transformer block. We also find that these representational differences translate to class predictions and linear separability of classes in the pretrained models. Finally, we analyze how fine-tuning affects these representational differences and discover that a fine-tuned reconstructive model becomes more similar to a pretrained contrastive model.
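To make the comparison method concrete, here is a minimal sketch of Centred Kernel Alignment (CKA) between two sets of layer activations. It assumes the linear variant of CKA computed on full activation matrices; the abstract does not say which variant or estimator the authors use, and the function name `linear_cka` and the random stand-in activations are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_examples, n_features).

    Assumption: linear kernel, full-batch estimate. Both inputs must describe
    the same n_examples in the same order; feature widths may differ.
    """
    # Centre each feature (column) so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 measures the alignment between the two feature spaces.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Normalising by the self-alignment terms bounds the score to [0, 1].
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical stand-ins for activations of one layer in two models,
# evaluated on the same 200 images.
rng = np.random.default_rng(0)
acts_a = rng.standard_normal((200, 16))
acts_b = rng.standard_normal((200, 16))

same = linear_cka(acts_a, acts_a)       # identical representations
different = linear_cka(acts_a, acts_b)  # unrelated representations
```

Under these assumptions, a score near 1 means two layers encode the same structure up to rotation and isotropic scaling, while a low score indicates dissimilar representations. Computing the score for every pair of layers across two models gives the kind of layer-by-layer similarity map that such an analysis relies on.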
