Beyond category-supervision: instance-level contrastive learning models predict human visual system responses to objects

Read the full article See related articles


Anterior regions of the ventral visual stream have substantial information about object categories, prompting theories that category-level forces are critical for shaping visual representation. The strong correspondence between category-supervised deep neural networks and ventral stream representation supports this view, but does not provide a viable learning model, as these deepnets rely upon millions of labeled examples. Here we present a fully self-supervised model which instead learns to represent individual images, where views of the same image are embedded nearby in a low-dimensional feature space, distinctly from other recently encountered views. We find category information implicitly emerges in the feature space, and critically that these models achieve parity with category-supervised models in predicting the hierarchical structure of brain responses across the human ventral visual stream. These results provide computational support for learning instance-level representation as a viable goal of the ventral stream, offering an alternative to the category-based framework that has been dominant in visual cognitive neuroscience.

Article activity feed