Ubiquitous cortical sensitivity to visual information during naturalistic, audiovisual movie viewing
Abstract
Both vision and language carry rich information useful for social understanding in the real world, yet the neural processing of these signals has mostly been studied separately. Even most prior work with naturalistic stimuli does not model the contributions of vision and language signals together. Here we combined established fMRI localizer experiments, which identify social interaction perception- and language-selective regions, with an fMRI movie-viewing paradigm in the same individual participants (n=34). To pinpoint how multi-modal signals contribute to movie responses, we densely labeled the movie using vision and language deep neural networks (DNNs) and used these labels to predict neural responses. We found that vision model (motion and image) embeddings of movie frames predict significant activity across the cortex, while language model (speech, word, and sentence) embeddings of the spoken language predict well only in portions of the STS. The individually localized motion and social interaction regions are best explained by vision model embeddings. Language regions, on the other hand, are well predicted by speech, word, and sentence language model embeddings and, surprisingly, are predicted equally well by vision model embeddings. In an analysis of the vision model's layer-wise and unit-wise predictivity, we find that the most predictive model units in social interaction and language regions are distinct from those in lower-level motion regions. Exploratory analyses suggest that the most predictive vision model units in social interaction and language regions contain social-semantic information conveyed by vision. Together, these results suggest that high-level visual information drives neural responses across the cortex, even in language-selective regions, with varying integration of spoken language information across the STS.