More than meets the eye: Neural evidence for scene grammar representations during individual object processing.

Abstract

Objects in real-world scenes adhere to regular arrangements. The resulting compositions can be described by a scene grammar – a framework that captures hierarchically structured relationships between objects in real-world scenes, wherein phrases refer to clusters of frequently co-occurring objects. Within such phrases, anchor objects (e.g., sink) are predictive of surrounding local objects (e.g., toothbrush). Do neural representations of objects follow the structure suggested by scene grammar? In this EEG study, we characterize the temporal dynamics of phrase-specific shared representations, quantified via cross-classification analysis: we train classifiers on an object categorization task using neural data from one type of object (anchor or local) and test their generalization to the held-out set of objects. We find an early cluster of timepoints, between 130 and 160 ms, that carries phrase-specific shared representations. Next, we predict the format of these shared representations from a range of encoded features in a generalized linear model using representational similarity analysis (RSA). We find that, in general, classifiers rely on similarity in high-level visual and semantic features when generalizing between anchor and local objects, not merely on similarity in low-level visual features between stimuli. Crucially, “upward” generalization from local to anchor objects was driven by co-occurrence statistics and high-level visual features, whereas “downward” generalization was driven by high-level semantic features, action similarity, and co-occurrence statistics. We offer novel insights into the dynamics and format of the neural representations underlying the intricate hierarchical network of scene grammar. We suggest that shared representations emerge not from a purely visual diet alone but from agents’ active interactions with their environment, using multiple objects to achieve behavioral goals, and from semantic representations derived from language.
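The cross-classification logic described in the abstract (train a classifier per timepoint on one object type, test its generalization on the other) can be illustrated in code. Below is a minimal sketch in Python with scikit-learn; the array names, shapes, and synthetic data are assumptions for illustration, not the authors' actual pipeline.

```python
# Minimal sketch of time-resolved cross-classification on EEG data.
# Assumed (hypothetical) inputs: X_* of shape (n_trials, n_channels,
# n_times) with integer category labels y_*; synthetic random data
# stand in for real recordings.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_trials, n_channels, n_times = 80, 64, 120
X_anchor = rng.standard_normal((n_trials, n_channels, n_times))
X_local = rng.standard_normal((n_trials, n_channels, n_times))
y_anchor = rng.integers(0, 4, n_trials)  # e.g., 4 object categories
y_local = rng.integers(0, 4, n_trials)

def cross_decode(X_train, y_train, X_test, y_test):
    """Train a classifier at each timepoint on one object type and
    test its generalization to the held-out object type."""
    accuracy = np.empty(X_train.shape[2])
    for t in range(X_train.shape[2]):
        clf = make_pipeline(StandardScaler(), LinearSVC())
        clf.fit(X_train[:, :, t], y_train)
        accuracy[t] = clf.score(X_test[:, :, t], y_test)
    return accuracy

# "Upward": local -> anchor; "downward": anchor -> local.
acc_up = cross_decode(X_local, y_local, X_anchor, y_anchor)
acc_down = cross_decode(X_anchor, y_anchor, X_local, y_local)
```

Timepoints where accuracy exceeds chance (here 25%) would indicate shared representations; cluster-based statistics, as commonly used with EEG decoding, then identify reliable windows such as the 130–160 ms cluster reported above. Likewise, the RSA step, predicting the neural representational geometry from several feature models at once, can be sketched as a regression of model RDMs onto a neural RDM. Again, all names and data below are illustrative placeholders.

```python
# Minimal RSA/GLM sketch: regress predictor RDMs (e.g., low-level
# visual, semantic, and co-occurrence models) onto a neural RDM.
# neural_patterns and the model RDMs are synthetic placeholders.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
n_objects, n_features = 16, 50
neural_patterns = rng.standard_normal((n_objects, n_features))
neural_rdm = pdist(neural_patterns, metric="correlation")  # vectorized upper triangle

# Three hypothetical model RDMs in the same vectorized form.
model_rdms = [pdist(rng.standard_normal((n_objects, 10))) for _ in range(3)]
design = np.column_stack([np.ones_like(neural_rdm)] + model_rdms)
betas, *_ = np.linalg.lstsq(design, neural_rdm, rcond=None)
print(betas[1:])  # one weight per feature model
```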
