Ubiquitous cortical sensitivity to visual information during naturalistic, audiovisual movie viewing
Abstract
Both vision and language carry rich information useful for social understanding in the real world, yet the neural processing of these signals has mostly been studied separately. Even most prior work with naturalistic stimuli does not model the contributions of vision and language signals together. Here we combined established fMRI localizer experiments, which identify social interaction perception- and language-selective regions, with an fMRI movie-viewing paradigm in the same individual participants (n=34). To pinpoint how multi-modal signals contribute to movie responses, we densely labeled the movie using vision and language deep neural networks (DNNs) and used these labels to predict neural responses. We found that vision model (motion and image) embeddings of movie frames predict significant activity across the cortex, while language model (speech, word, and sentence) embeddings of the spoken language predict well only in portions of the STS. The individually localized motion and social interaction regions are best explained by vision model embeddings. Language regions, on the other hand, are well predicted by speech, word, and sentence language model embeddings and, surprisingly, are predicted equally well by vision model embeddings. In an analysis of the vision model's layer-wise and unit-wise predictivity, we find that the most predictive model units in social interaction and language regions are distinct from those in lower-level motion regions. Exploratory analyses suggest that the most predictive vision model units in social interaction and language regions contain social-semantic information conveyed by vision. Together, these results suggest that high-level visual information drives neural responses across the cortex, even in language-selective regions, with varying integration of spoken language information across the STS.