Convergent representations and spatiotemporal dynamics of speech and language in brain and deep neural networks
Abstract
Recent studies have explored the correspondence between single-modality DNN models (speech or text) and brain networks for speech and language. However, the key factors underlying these correlations, and how they evolve spatiotemporally within the brain's language network, remain unclear, particularly across DNN modalities. To address these questions, we analyzed the representational similarity between self-supervised learning (SSL) models for speech (Wav2Vec2.0) and language (GPT-2) and neural responses to naturalistic speech recorded with high-density electrocorticography. Both types of SSL models predicted neural activity with high accuracy both before and after word onsets, and the components shared between Wav2Vec2.0 and GPT-2 explained the majority of the SSL-brain similarity. Furthermore, we observed distinct spatiotemporal dynamics: both models showed high encoding accuracy 40 milliseconds before word onset, especially in the mid-superior temporal gyrus (mid-STG), which was explained by the contextual components shared between the SSL models; Wav2Vec2.0 also peaked 200 milliseconds after word onset around the posterior STG, which was mainly attributed to the acoustic-phonetic and static semantic information encoded in the model. These results highlight how contextual and acoustic-phonetic cues encoded in DNNs align with spatiotemporal patterns of neural activity, suggesting substantial overlap in how artificial and biological systems process linguistic information.
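To make the analysis concrete, the sketch below illustrates one common way such SSL-brain comparisons are implemented: ridge encoding models that predict per-electrode responses from model embeddings, plus a simple variance-partitioning step to estimate the contribution of components shared between the two models. The synthetic data, array shapes, ridge penalties, and the partitioning rule are all illustrative assumptions, not the authors' actual pipeline.

```python
# A minimal sketch of an SSL-brain encoding analysis with variance
# partitioning. All data here are synthetic stand-ins: in practice,
# X_speech / X_text would be Wav2Vec2.0 / GPT-2 hidden states aligned to
# word onsets, and Y would be high-gamma ECoG responses at a fixed lag
# (e.g. -40 ms or +200 ms relative to word onset).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, d_speech, d_text, n_electrodes = 2000, 64, 48, 10

# Hypothetical per-word embeddings with a common latent component, so that
# the two feature spaces overlap partially, as the abstract describes.
shared = rng.standard_normal((n_words, 16))
X_speech = np.hstack([shared, rng.standard_normal((n_words, d_speech - 16))])
X_text = np.hstack([shared, rng.standard_normal((n_words, d_text - 16))])

# Hypothetical electrode responses, driven mostly by the shared component.
W = rng.standard_normal((16, n_electrodes))
Y = shared @ W + 0.5 * rng.standard_normal((n_words, n_electrodes))

def encoding_r2(X, Y):
    """Fit a cross-validated ridge encoding model; return held-out R^2."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
    model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)
    ss_res = ((Y_te - Y_hat) ** 2).sum(axis=0)
    ss_tot = ((Y_te - Y_te.mean(axis=0)) ** 2).sum(axis=0)
    return 1 - ss_res / ss_tot  # per-electrode R^2

r2_speech = encoding_r2(X_speech, Y)
r2_text = encoding_r2(X_text, Y)
r2_joint = encoding_r2(np.hstack([X_speech, X_text]), Y)

# Variance partitioning: prediction accuracy attained by both single-model
# fits but not added by the joint model is attributed to shared components.
r2_shared = r2_speech + r2_text - r2_joint
print(f"mean R^2  speech={r2_speech.mean():.3f}  text={r2_text.mean():.3f}  "
      f"joint={r2_joint.mean():.3f}  shared={r2_shared.mean():.3f}")
```

On this toy data, the shared term dominates both single-model scores, mirroring the finding that shared components explain the majority of SSL-brain similarity; fitting the same models at a sweep of lags around word onset would yield the kind of spatiotemporal profile the abstract reports.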