Reconstructing What the Brain Hears: Cross-Subject Music Decoding from fMRI via Prior-Guided Diffusion Model

Abstract

Reconstructing music directly from brain activity offers a unique window onto the representational geometry of the auditory system and paves the way for next-generation brain–computer interfaces. We introduce a fully data-driven pipeline that combines cross-subject functional alignment with Bayesian decoding in the latent space of a diffusion-based audio generator. Functional alignment projects individual fMRI responses onto a shared representational manifold, improving cross-participant decoding accuracy relative to anatomically normalized baselines. A Bayesian search over latent trajectories then selects the most plausible waveform candidate, stabilizing reconstructions against neural noise. Crucially, we bridge CLAP’s multi-modal embeddings to music-domain latents through a dedicated aligner, eliminating the need for hand-crafted captions and preserving the intrinsic structure of musical features. Evaluated on ten diverse genres, the model achieves a cross-subject-averaged Identification Accuracy of 0.914 ± 0.019 and produces audio that naïve listeners recognize above chance in 85.7% of trials. Voxel-wise analyses localize the predictive signal to a bilateral circuit spanning early auditory, inferior-frontal, and premotor cortices, consistent with hierarchical and sensorimotor theories of music perception. The framework establishes a principled bridge between generative audio models and cognitive neuroscience, opening avenues for thought-driven composition, objective metrics for music-based therapy, and translational applications in non-verbal communication and neurotechnology.
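The abstract does not specify how the functional alignment is implemented. A minimal sketch of one common approach, a per-subject linear (ridge) mapping from individual voxel space into a shared response space, is given below; the function names, array shapes, and the choice of ridge regression are illustrative assumptions, not the authors' method.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_alignment(subject_data, template, alpha=1.0):
    """Fit one linear map per subject from that subject's voxel space
    into a shared template space.

    subject_data: list of (n_timepoints, n_voxels_s) arrays, one per subject,
                  all recorded while subjects heard the same training music.
    template:     (n_timepoints, n_shared_dims) shared response matrix,
                  e.g. a reference subject or a group-average representation.
    """
    maps = []
    for X in subject_data:
        reg = Ridge(alpha=alpha, fit_intercept=False)
        reg.fit(X, template)          # learn voxels -> shared space
        maps.append(reg)
    return maps

def project(maps, new_data):
    """Project held-out responses from each subject into the shared space,
    so a single decoder can be applied across participants."""
    return [m.predict(X) for m, X in zip(maps, new_data)]
```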
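The reported Identification Accuracy of 0.914 is, in similar decoding studies, typically computed as pairwise two-alternative identification in a feature space, with chance at 0.5. A sketch under that assumption follows; the embedding and the use of Pearson correlation as the similarity measure are placeholders, not details confirmed by the abstract.

```python
import numpy as np

def identification_accuracy(pred, target):
    """Pairwise identification accuracy.

    pred, target: (n_clips, n_features) embeddings of reconstructed and
    ground-truth audio (e.g. spectrogram or CLAP-style features).
    For each clip i and each distractor j != i, a trial counts as correct
    when the reconstruction is more similar to its own target than to
    clip j's target. Chance level is 0.5.
    """
    n, d = pred.shape
    # Row-wise z-scoring turns dot products into Pearson correlations.
    zp = (pred - pred.mean(1, keepdims=True)) / pred.std(1, keepdims=True)
    zt = (target - target.mean(1, keepdims=True)) / target.std(1, keepdims=True)
    sim = zp @ zt.T / d                # sim[i, j] = corr(pred_i, target_j)
    correct, total = 0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            correct += sim[i, i] > sim[i, j]
            total += 1
    return correct / total
```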
