Investigating the temporal dynamics and modelling of mid-level feature representations in humans
Abstract
Scene perception is a key function of biological visual systems. According to the hierarchical processing view, scene perception in the human brain begins with low-level features, progresses to mid-level features, and ends with high-level features. While low- and high-level feature processing is well studied, research on mid-level features remains limited. Here, we addressed this gap by investigating when mid-level features are processed in humans, using a novel stimulus set of naturalistic scenes presented as images and videos, accompanied by ground-truth annotations for five mid-level features (reflectance, lighting, world normals, scene depth and skeleton position) and two framing features: one low-level (edges) and one high-level (action). To reveal when low-, mid- and high-level features are represented in the brain, we collected electroencephalography (EEG) data from human participants during stimulus presentation and trained encoding models to predict the EEG data from the ground-truth annotations. We found that mid-level features were best represented between ~100 and ~250 ms post-stimulus, after low-level and before high-level features. Moreover, we assessed scene- and action-trained convolutional neural networks (CNNs) as models of mid-level feature processing in humans, and found that the CNNs showed a processing order comparable to that of humans for mid- but not for low- or high-level features. Overall, our results characterize the temporal dynamics of mid-level feature processing in humans and reveal CNNs as suitable models of the processing hierarchy of mid-level vision in humans.