Does Human-Like Contextual Object Recognition Emerge from Language Supervision and Language-Guided Inference?
Abstract
Human vision is an active, context-sensitive process that interprets objects in relation to their surroundings. While behavioral research has long shown that scene context facilitates object recognition, the underlying computational mechanisms, and the extent to which artificial vision models replicate this ability, remain unclear. Here, we addressed this gap by combining human behavioral experiments with computational modeling to investigate how structured scene context influences object recognition. Using a novel 3D simulation framework, we embedded target objects into indoor scenes and manipulated contextual coherence between objects and scenes by presenting either intact scenes or their phase-scrambled versions. Humans showed a robust object recognition advantage in coherent scenes, particularly under challenging conditions such as occlusion, crowding, or non-canonical viewpoints. Conventional vision models, including convolutional neural networks (CNNs) and vision transformers (ViTs), failed to replicate this effect. In contrast, vision-language models (VLMs), particularly those using ViT architectures and trained with language supervision (e.g., CLIP), approached human-like accuracy. This indicates that semantically rich, category-structured representations are required to model context sensitivity. Notably, context-sensitive behavior in VLMs was closest to that of humans when language-guided inference was used at test time, suggesting that how a model accesses its representations during inference matters for enabling context-sensitive behavior. Together, this work offers steps toward a computational account of the contextual facilitation of object recognition by scenes, and highlights zero-shot inference as an informative alignment metric when benchmarking artificial against biological vision.
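For readers unfamiliar with language-guided (zero-shot) inference, the sketch below illustrates the general idea: an image of a scene is scored against text prompts for candidate object categories, as popularized by CLIP. It uses the Hugging Face transformers CLIP API with an illustrative checkpoint, a placeholder image path, and a hypothetical category list; it is a minimal sketch of the technique, not the authors' exact evaluation pipeline.

```python
# Minimal sketch of CLIP-style zero-shot ("language-guided") object recognition.
# Model checkpoint, prompts, categories, and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical target categories that might appear in an indoor scene.
categories = ["toaster", "microwave", "kettle", "blender"]
prompts = [f"a photo of a {c}" for c in categories]

image = Image.open("scene_with_target_object.png")  # placeholder path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax over candidate labels
# turns them into a distribution over categories for this image.
probs = outputs.logits_per_image.softmax(dim=-1)
predicted = categories[probs.argmax().item()]
print(predicted, probs.squeeze().tolist())
```

Because classification here amounts to comparing the image embedding with text embeddings of category prompts, the same procedure can be run on intact versus phase-scrambled scene backgrounds to probe whether recognition accuracy benefits from coherent context.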