Decoding the Moving Mind: Multi-Subject fMRI-to-Video Retrieval with MLLM Semantic Grounding

Abstract

Decoding dynamic visual information from brain activity remains challenging due to inter-subject neural heterogeneity, limited per-subject data availability, and the substantial temporal resolution gap between fMRI signals (0.5 Hz) and video dynamics (30 Hz). Current approaches struggle to address these temporal mismatches, show limited capacity to integrate subject-specific neural patterns with shared representational frameworks, and lack the semantic granularity needed to align neural responses with visual content. To bridge these gaps, we propose a framework with three innovations: (1) a Dynamic Temporal Alignment module that resolves temporal mismatches via exponentially weighted multi-frame fusion with adaptive decay coefficients; (2) a Brain Mixture-of-Experts architecture that combines subject-specific extractors with shared expert layers through parameter-efficient tri-modal contrastive learning; and (3) a Multi-perspective Semantic Hyper-Anchoring module that resolves cross-subject attention bias via multi-dimensional semantic decomposition, leveraging multimodal LLMs for fine-grained video semantic extraction—enabling the model to match individual attention patterns, as different subjects naturally focus on distinct aspects of the same visual stimulus. This module boosts Top-10/Top-100 retrieval by 17.7%/6.6%. Experiments on two video-fMRI datasets demonstrate state-of-the-art performance, with 39%/30% improvements in Top-10/Top-100 accuracy over single-subject baselines and 27% gains over multi-subject models. The framework exhibits remarkable few-shot adaptability, retaining 97% of its performance when using only 10% of the training data for new subjects. Visualization analysis confirms that this generalization capability stems from effective disentanglement of subject-specific and shared neural representations.
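To make the temporal-alignment idea concrete: at 0.5 Hz fMRI against 30 Hz video, each fMRI volume spans roughly 60 video frames, which the Dynamic Temporal Alignment module collapses via exponentially weighted fusion. The sketch below is an illustrative simplification, not the paper's implementation: it uses a fixed decay coefficient (`decay`) where the paper describes adaptive coefficients, and all names and shapes are assumed.

```python
import numpy as np

def fuse_frames(frame_feats: np.ndarray, decay: float = 0.05) -> np.ndarray:
    """Collapse per-frame features of shape (T, D) into one (D,) vector
    using exponentially decaying weights: frames nearer the fMRI sample
    time (later frames) receive higher weight."""
    T = frame_feats.shape[0]
    # age of each frame relative to the fMRI sample; the last frame has age 0
    ages = np.arange(T - 1, -1, -1, dtype=np.float64)
    weights = np.exp(-decay * ages)      # weight_i = exp(-decay * age_i)
    weights /= weights.sum()             # normalize to a convex combination
    return weights @ frame_feats         # weighted sum over the time axis

# 60 video frames per fMRI volume (30 Hz video, 2 s TR), 16-dim frame features
rng = np.random.default_rng(0)
feats = rng.standard_normal((60, 16))
fused = fuse_frames(feats)
print(fused.shape)  # (60, 16) -> (16,)
```

In the full model the decay coefficient would be learned jointly with the encoders rather than fixed, letting the network adapt how far back in the frame window each fMRI sample should attend.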
