Shared texture-like representations, not global form, underlie deep neural network alignment with human visual processing
Abstract
Deep neural networks (DNNs) are a leading computational framework for understanding neural visual processing. A standard approach for evaluating their similarity to brain function uses DNN activations to predict human neural responses to the same images, yet which visual properties drive this alignment remains unclear. Here, we show that texture-like representations – operationalized as global summaries of local image statistics – largely underlie this alignment. We recorded electroencephalography (EEG) from 57 participants viewing three image types: natural scenes, ‘texture-synthesized’ versions that preserve global summaries of local statistics while disrupting global form, and isolated objects without backgrounds. Representational similarity analysis showed the strongest DNN-EEG alignment when both systems processed texture-synthesized images. Cross-prediction – using features from one image condition to predict EEG responses to another – showed that features from texture-synthesized images generalized to natural scenes. Crucially, we observed a dissociation between DNN-EEG alignment and decodable object category information: alignment increased for texture-synthesized images even when object information was reduced. Together, our findings identify global summaries of local image statistics as a common currency linking DNNs and human visual processing, and clarify that global form features are not required for high DNN-EEG alignment. They underscore the shared importance of local image statistics in artificial and biological visual systems.
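To make the representational similarity analysis concrete, the sketch below correlates a DNN-derived representational dissimilarity matrix (RDM) with an EEG-derived RDM, which is the standard form of the DNN-EEG alignment measure referenced above. All array names, shapes, and the random data are hypothetical stand-ins for extracted DNN activations and EEG response patterns, not the authors' actual pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Hypothetical inputs: one row per image.
# dnn_features: (n_images, n_units)    -- activations from one DNN layer
# eeg_patterns: (n_images, n_channels) -- EEG amplitudes at one time point
rng = np.random.default_rng(0)
dnn_features = rng.standard_normal((120, 512))
eeg_patterns = rng.standard_normal((120, 64))

def rdm(patterns):
    """Representational dissimilarity matrix, returned as a condensed
    vector of 1 - Pearson correlation for every pair of images."""
    return pdist(patterns, metric="correlation")

# DNN-EEG alignment: rank correlation between the two RDMs.
rho, _ = spearmanr(rdm(dnn_features), rdm(eeg_patterns))
print(f"DNN-EEG representational alignment: rho = {rho:.3f}")
```

In a real analysis, this correlation would be computed separately per DNN layer, EEG time point, and image condition (natural, texture-synthesized, isolated object) to compare alignment across conditions.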
Significance Statement
Deep neural networks (DNNs) accurately predict human neural responses to images, but the image properties driving this alignment remain unclear. We recorded brain activity from people viewing natural photographs of objects, texture-only versions of those photos (which preserved fine visual detail but contained no recognizable objects), and isolated objects. DNN predictions matched human brain signals best for the texture-only images, despite their lack of semantic information; those same texture-based features also generalized to predicting brain responses to the natural photos. Strikingly, the DNNs' ability to predict brain responses was dissociated from the decodable object category information present in the brain activity. These findings suggest that broad texture patterns, rather than object shapes, underlie the alignment between DNNs and human vision, challenging shape-centric theories of visual processing.
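For readers who want to see the cross-prediction logic in code, here is a minimal sketch: a ridge-regression encoding model maps DNN features to EEG responses, is fit on one image condition (texture-synthesized), and is evaluated on another (natural scenes). The variable names, data shapes, regularization strength, and random data are illustrative assumptions, not the authors' reported pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n_train, n_test, n_feat, n_chan = 100, 40, 512, 64

# Hypothetical data: DNN features and EEG responses per image condition.
X_texture = rng.standard_normal((n_train, n_feat))  # texture-synthesized images
y_texture = rng.standard_normal((n_train, n_chan))  # EEG to those images
X_natural = rng.standard_normal((n_test, n_feat))   # natural scenes
y_natural = rng.standard_normal((n_test, n_chan))   # EEG to natural scenes

# Fit the encoding model on the texture-synthesized condition...
model = Ridge(alpha=1.0).fit(X_texture, y_texture)

# ...and cross-predict EEG responses to natural scenes.
y_pred = model.predict(X_natural)
r_per_channel = [np.corrcoef(y_pred[:, c], y_natural[:, c])[0, 1]
                 for c in range(n_chan)]
print(f"mean cross-condition prediction r = {np.mean(r_per_channel):.3f}")
```

Generalization from texture-trained weights to natural-scene responses is the pattern the abstract reports; with the random placeholder data above, the mean correlation will of course hover near zero.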