Comprehensive Neural Representations of Naturalistic Stimuli through Multimodal Deep Learning
Abstract
A central challenge in cognitive neuroscience is understanding how the brain represents and predicts complex, multimodal experiences in naturalistic settings. Traditional neural encoding models, often based on unimodal or static features, fall short of capturing the rich, dynamic structure of real-world cognition. Here, we address this challenge by introducing a video-text alignment encoding framework that predicts whole-brain neural responses by integrating visual and linguistic features across time. Using a state-of-the-art deep learning model (VALOR), we achieve more accurate and generalizable encoding than unimodal (AlexNet, WordNet) and static multimodal (CLIP) baselines. Beyond improving prediction, our model automatically maps cortical semantic spaces, aligning with human-annotated dimensions without requiring manual labeling. We further uncover a hierarchical predictive-coding gradient in which different brain regions anticipate future events over distinct timescales, an organization that correlates with individual cognitive abilities. These findings provide new evidence that temporal multimodal integration is a core mechanism of real-world brain function. Our results demonstrate that deep learning models aligned with naturalistic stimuli can reveal ecologically valid neural mechanisms, offering a powerful, scalable approach for investigating perception, semantics, and prediction in the human brain. This framework advances naturalistic neuroimaging by bridging computational modeling and real-world cognition.
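To make the general modeling approach concrete, the following is a minimal, illustrative Python sketch of a voxel-wise encoding analysis of the kind summarized above. It is not the authors' code: the per-timepoint stimulus features are random arrays standing in for embeddings from a video-text model such as VALOR (the actual feature extraction is not shown), the voxel responses are simulated, and ridge regression is used as the feature-to-voxel mapping, a regularized linear fit commonly used in encoding studies. All variable names, array sizes, and parameter values are hypothetical.

# Illustrative sketch of a voxel-wise encoding model (not the authors' code).
# Assumptions: stimulus features are random stand-ins for per-TR multimodal
# embeddings; voxel responses are simulated; ridge regression is the mapping.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_timepoints = 600   # fMRI volumes (TRs) recorded during movie viewing
n_features = 512     # dimensionality of the stimulus embedding per TR
n_voxels = 1000      # cortical voxels to predict

# Hypothetical per-TR stimulus features (in practice, embeddings from a
# video-text model such as VALOR, aligned to the fMRI acquisition times).
stimulus_features = rng.standard_normal((n_timepoints, n_features))

# Simulated voxel responses: a linear mapping of the features plus noise.
true_weights = rng.standard_normal((n_features, n_voxels)) * 0.1
brain_responses = (stimulus_features @ true_weights
                   + rng.standard_normal((n_timepoints, n_voxels)))

# Fit the encoding model on a training split; evaluate on held-out timepoints.
X_train, X_test, Y_train, Y_test = train_test_split(
    stimulus_features, brain_responses, test_size=0.2, shuffle=False
)
model = Ridge(alpha=100.0)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Encoding accuracy: Pearson correlation between predicted and measured
# responses, computed independently for each voxel.
def voxelwise_correlation(y_true, y_pred):
    y_true_z = (y_true - y_true.mean(0)) / y_true.std(0)
    y_pred_z = (y_pred - y_pred.mean(0)) / y_pred.std(0)
    return (y_true_z * y_pred_z).mean(0)

r = voxelwise_correlation(Y_test, Y_pred)
print(f"median voxel-wise prediction accuracy: r = {np.median(r):.3f}")

In an actual analysis, the stimulus features, the hemodynamic-delay handling, the regularization strength, and the cross-validation scheme would all be chosen to match the dataset; the sketch only shows the generic feature-to-voxel regression and held-out evaluation that such encoding frameworks share.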