Self-supervision deep learning models are better models of human high-level visual cortex: The roles of multi-modality and dataset training size
Abstract
With the rapid development of visual models based on artificial neural networks, many studies have shown that these models display unprecedented power in predicting neural responses to images in the visual cortex. Recently, advances in computer vision have introduced self-supervised models, in which a model is trained using supervision derived from natural properties of the training set. This has led to examination of their neural prediction performance, which revealed better predictions for self-supervised than for supervised models, both for models trained with language supervision and for those trained with image-only supervision. In this work, we delve deeper into these models' ability to explain neural representations of object categories. We compare models that differ in their training objectives to examine where they diverge in their ability to predict fMRI and MEG recordings while participants are presented with images of different object categories. Results from both fMRI and MEG show that self-supervision was advantageous compared with classification training. In addition, language supervision better predicts later stages of visual processing, while image-only supervision shows a consistent advantage over a longer duration, beginning as early as 80 ms after stimulus onset. Examining the effect of training dataset size revealed that larger datasets did not necessarily improve neural predictions, particularly for visual self-supervised models. Finally, examining the correspondence between the hierarchy of each model and that of the visual cortex showed that image-only self-supervision led to better correspondence than language-supervised models.
We conclude that while self-supervision consistently yields better predictions of fMRI and MEG recordings, each type of supervision reveals a different property of neural activity: language supervision explains later stages of the neural response, whereas image-only self-supervision explains very early and long-lasting latencies, with the model hierarchy naturally sharing a hierarchical structure with the visual cortex.