The Human Brain as a Dynamic Mixture of Expert Models in Video Understanding

Abstract

The human brain is the most efficient and versatile system for processing dynamic visual input. By comparing representations from deep video models to brain activity, we can gain insights into mechanistic solutions for effective video processing, which is important both for understanding the brain and for building better models. Current work on model-brain alignment primarily focuses on fMRI measurements, leaving open questions about fine-grained dynamic processing. Here, we introduce the first large-scale benchmarking of both static and temporally-integrating deep neural networks on brain alignment to dynamic electroencephalography (EEG) recordings of short natural videos. We analyze 100+ models across the axes of temporal integration, classification task, architecture, and pretraining using our proposed Cross-Temporal Representational Similarity Analysis (CT-RSA), which matches the best time-unfolded model features to dynamically evolving brain responses, distilling 10⁷ alignment scores. Our findings reveal novel insights into how continuous visual input is integrated in the brain, beyond the standard temporal processing hierarchy from low- to high-level representations. Responses in posterior electrodes, after initial alignment to hierarchical static object processing, best align to mid-level representations of temporally-integrative actions and closely match the unfolding video content. In contrast, responses in frontal electrodes best align with high-level static action representations and show no temporal correspondence to the video. Additionally, state space models show superior alignment to intermediate posterior activity through mid-level action features, where self-supervised pretraining is also beneficial. We draw a metaphor to a dynamic mixture of expert models to describe the changing neural preference in task and temporal integration, reflected in alignment to different model types across time. We posit that a single best-aligned model would need task-independent training to combine these capacities, as well as an architecture that supports dynamic switching.
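To make the CT-RSA idea concrete, the sketch below shows one plausible way to compare time-unfolded model features to time-resolved EEG responses: build a representational dissimilarity matrix (RDM) per EEG timepoint and per model time step, correlate all pairs, and keep the best-matching model time step per EEG timepoint. All variable names, input shapes, and the max-over-model-time step are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal CT-RSA-style sketch (assumptions: hypothetical input shapes and
# a max-over-model-time readout; not the paper's exact pipeline).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features):
    """Condensed condition-by-condition RDM using correlation distance."""
    return pdist(features, metric="correlation")

def ct_rsa(eeg_data, model_feats):
    """eeg_data: (conditions, electrodes, eeg_timepoints);
    model_feats: list of (conditions, features) arrays, one per model time step."""
    n_eeg_t = eeg_data.shape[2]
    eeg_rdms = [rdm(eeg_data[:, :, t]) for t in range(n_eeg_t)]
    model_rdms = [rdm(f) for f in model_feats]
    # Full cross-temporal alignment matrix: EEG time x model time.
    alignment = np.empty((len(eeg_rdms), len(model_rdms)))
    for i, e in enumerate(eeg_rdms):
        for j, m in enumerate(model_rdms):
            rho, _ = spearmanr(e, m)
            alignment[i, j] = rho
    # For each EEG timepoint, keep the best-matching model time step.
    return alignment.max(axis=1), alignment

# Toy usage: 20 video conditions, 64 electrodes, 50 EEG timepoints,
# and a model unfolded over 8 time steps with 128 features each.
rng = np.random.default_rng(0)
eeg = rng.standard_normal((20, 64, 50))
feats = [rng.standard_normal((20, 128)) for _ in range(8)]
best_per_eeg_t, full_matrix = ct_rsa(eeg, feats)
```

Under these assumptions, the full EEG-time-by-model-time matrix is what allows the analysis to test temporal correspondence between unfolding video content and neural responses, rather than collapsing over time as in standard RSA.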