Does Human-Like Contextual Object Recognition Emerge from Language Supervision and Language-Guided Inference?
Abstract
Human vision is an active, context-sensitive process that interprets objects in relation to their surroundings. While behavioral research has long shown that scene context facilitates object recognition, the underlying computational mechanisms, and the extent to which artificial vision models replicate this ability, remain unclear. Here, we addressed this gap by combining human behavioral experiments with computational modeling to investigate how structured scene context influences object recognition. Using a novel 3D simulation framework, we embedded target objects into indoor scenes and manipulated contextual coherence between objects and scenes by presenting either intact scenes or their phase-scrambled versions. Humans showed a robust object recognition advantage in coherent scenes, particularly under challenging conditions such as occlusion, crowding, or non-canonical viewpoints. Conventional vision models, including convolutional neural networks (CNNs) and vision transformers (ViTs), failed to replicate this effect. In contrast, vision-language models (VLMs), particularly those using ViT architectures and trained with language supervision (e.g., CLIP), approached human-like accuracy. This indicates that semantically rich, category-structured representations are required to model context sensitivity. Notably, context-sensitive behavior in VLMs was closest to that of humans when language-guided inference was used at test time, suggesting that how a model accesses its representations during inference matters for enabling context-sensitive behavior. Together, this work offers steps toward a computational account of the contextual facilitation of object recognition by scenes, and highlights zero-shot inference as an informative alignment metric when benchmarking artificial against biological vision.
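For readers unfamiliar with language-guided (zero-shot) inference, the sketch below illustrates the general idea: an image of a scene is scored against text prompts for candidate object categories, as popularized by CLIP. It uses the Hugging Face transformers CLIP API with an illustrative checkpoint, a placeholder image path, and a hypothetical category list; it is a minimal sketch of the technique, not the authors' exact evaluation pipeline.

```python
# Minimal sketch of CLIP-style zero-shot ("language-guided") object recognition.
# Model checkpoint, prompts, categories, and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical target categories that might appear in an indoor scene.
categories = ["toaster", "microwave", "kettle", "blender"]
prompts = [f"a photo of a {c}" for c in categories]

image = Image.open("scene_with_target_object.png")  # placeholder path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax over candidate labels
# turns them into a distribution over categories for this image.
probs = outputs.logits_per_image.softmax(dim=-1)
predicted = categories[probs.argmax().item()]
print(predicted, probs.squeeze().tolist())
```

Because classification here amounts to comparing the image embedding with text embeddings of category prompts, the same procedure can be run on intact versus phase-scrambled scene backgrounds to probe whether recognition accuracy benefits from coherent context.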