The Semantic Scaffold: Functional Dissociation of Visual and Language-derived Features Shapes Human Natural Scene Understanding
Abstract
Natural scene understanding requires the seamless integration of high-resolution sensory inputs with abstract conceptual knowledge. Conventional computational models often treat scene comprehension as a feed-forward, visual-centric process. Here, we challenge this view by proposing the Semantic Scaffold framework, positing that language-derived semantic knowledge acts as a foundational component that actively shapes visual perception. To test this, we leveraged unimodal (visual-only, language-only) and multimodal (visual-language) encoding models as computational probes on the massive 7T fMRI Natural Scenes Dataset (NSD) to systematically dissect the functional topography of the human cortex. We reveal a fundamental cortical dissociation: perceptually driven visual features are confined to the visual cortex, whereas language-derived features robustly predict activity across expansive frontal and temporal association cortices. Crucially, multimodal integration is necessary to model neural activity at the interface of these systems, providing empirical support for an integrated mechanism in which top-down semantic knowledge contextually modulates visual input. Furthermore, we characterize the internal structure of this semantic scaffold, revealing a unified atlas organized along a dominant animate-inanimate axis with robust left-hemisphere lateralization. Our study repositions language-derived knowledge from a secondary consequence of perception to a primary cognitive scaffold, advancing an integrated mechanistic understanding of how the human brain constructs a coherent perception of the world.
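To make the model-comparison logic concrete, the sketch below shows one common way such voxelwise encoding analyses are set up. It is a minimal illustration, not the authors' pipeline: the feature matrices X_vis and X_lang stand in for embeddings from unspecified vision and language models, Y stands in for NSD voxel responses, and the ridge regression, train/test split, and placeholder random data are all illustrative assumptions.

```python
# Minimal sketch of a unimodal vs. multimodal voxelwise encoding comparison.
# Assumptions (not from the paper): X_vis / X_lang are precomputed stimulus
# feature matrices from a vision model and a language model; Y is a matrix of
# fMRI responses (stimuli x voxels). All data here are random placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_stim, n_vox = 1000, 500
X_vis = rng.standard_normal((n_stim, 512))   # visual-only features (placeholder)
X_lang = rng.standard_normal((n_stim, 768))  # language-derived features (placeholder)
Y = rng.standard_normal((n_stim, n_vox))     # voxel responses (placeholder)

def encoding_r2(X, Y, alpha=1.0):
    """Fit a ridge encoding model and return per-voxel test R^2."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
    model = Ridge(alpha=alpha).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)
    ss_res = ((Y_te - Y_hat) ** 2).sum(axis=0)
    ss_tot = ((Y_te - Y_te.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

r2_vis = encoding_r2(X_vis, Y)                         # visual-only model
r2_lang = encoding_r2(X_lang, Y)                       # language-only model
r2_multi = encoding_r2(np.hstack([X_vis, X_lang]), Y)  # multimodal model

# Voxels where the multimodal model outperforms both unimodal models would
# mark candidate visual-language "interface" regions in this kind of analysis.
interface = (r2_multi > r2_vis) & (r2_multi > r2_lang)
print(f"{interface.mean():.1%} of voxels favor the multimodal model")
```

Mapping the three resulting R^2 maps onto the cortical surface is what yields the kind of functional topography the abstract describes: visual-only fits in visual cortex, language-only fits in association cortex, and multimodal gains at their interface.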