Investigating the temporal dynamics and modelling of mid-level feature representations in humans

Abstract

Scene perception is a key function of biological visual systems. According to the hierarchical processing view, scene perception in the human brain begins with low-level features, progresses to mid-level features, and ends with high-level features. While low- and high-level feature processing is well studied, research on mid-level features remains limited. Here, we addressed this gap by investigating when mid-level features are processed in humans, using a novel stimulus set of naturalistic scenes presented as images and videos, accompanied by ground-truth annotations for five mid-level features (reflectance, lighting, world normals, scene depth and skeleton position) and two framing features: one low-level (edges) and one high-level (action). To reveal when low-, mid- and high-level features are represented in the brain, we collected electroencephalography (EEG) data from human participants during stimulus presentation and trained encoding models to predict EEG data from the ground-truth annotations. We found that mid-level features were best represented between ~100 and ~250 ms post-stimulus, after low-level and before high-level features. Moreover, we assessed scene- and action-trained convolutional neural networks (CNNs) as models of mid-level feature processing in humans, and found that their processing order matched that of humans for mid-level, but not low- or high-level, features. Overall, our results characterize the temporal dynamics of mid-level feature processing in humans and identify CNNs as suitable models of the mid-level visual processing hierarchy in humans.
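The encoding-model analysis described in the abstract can be sketched as regularized regression fit at each time point, predicting EEG channel amplitudes from the ground-truth feature annotations. Below is a minimal illustration on synthetic data; the function names, dimensions, and ridge penalty are hypothetical choices for the sketch, not details taken from the study:

```python
import numpy as np

def fit_encoding_model(features, eeg, alpha=1.0):
    """Ridge regression from feature annotations to EEG amplitudes.

    features: (n_stimuli, n_features) ground-truth annotations
    eeg:      (n_stimuli, n_channels) EEG amplitudes at one time point
    Returns weights of shape (n_features, n_channels).
    """
    X = features - features.mean(axis=0)
    Y = eeg - eeg.mean(axis=0)
    # Closed-form ridge solution: (X'X + alpha*I)^-1 X'Y
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ Y)

def predict(features, weights):
    return (features - features.mean(axis=0)) @ weights

# Synthetic demo: 200 stimuli, 5 mid-level features, 64 EEG channels
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
true_w = rng.standard_normal((5, 64))
Y = X @ true_w + 0.1 * rng.standard_normal((200, 64))

w = fit_encoding_model(X, Y)
# Prediction accuracy (correlation between predicted and observed EEG);
# in a time-resolved analysis this would be computed per time point.
r = np.corrcoef(predict(X, w).ravel(), Y.ravel())[0, 1]
```

In the time-resolved setting, one such model is fit independently for each post-stimulus time point, and the resulting prediction accuracies traced over time indicate when each feature class is best represented.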