Naturalistic Audiovisual Stimulation Reveals the Tonotopic Organization of Human Auditory Cortex
Curation statements for this article:-
Curated by eLife
eLife Assessment
This valuable study uses large-scale 7T naturalistic fMRI data and nonlinear pRF modeling to map the tonotopic organization of the human auditory cortex, linking spectral tuning to speech selectivity and cortical hierarchy. The evidence is solid, providing a compelling demonstration that movie-based stimuli can recover robust population-level auditory maps and offering useful tools for leveraging existing datasets, although there is room for improvement in relating static tonotopy to dynamic speech processing and in presentation clarity. The study will be of interest to a broad audience working on auditory cortex organization and mapping.
This article has been Reviewed by the following groups
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
- Evaluated articles (eLife)
Abstract
Despite the importance of audition in spatial, semantic, and social function, there is no consensus regarding the detailed organisation of human auditory cortex. Using a novel application of a population receptive field model to a high-powered naturalistic audiovisual movie-watching dataset, we simultaneously estimate the basic spectral tuning properties and category selectivity of human auditory cortex. This revealed unprecedentedly clear tonotopic maps which showcase the modes of organization and computational motifs of the auditory cortex. Specifically, we find that regions more remote from the auditory core exhibit more compressive, non-linear response properties with finely-tuned, speech selectivity in low frequency portions of their tonotopic maps. These patterns of organisation mirror aspects of the visual cortical hierarchy, wherein tuning properties progress from a stimulus category-agnostic ‘front end’ towards more advanced regions increasingly optimised for behaviorally relevant stimulus categories.
Article activity feed
-
eLife Assessment
This valuable study uses large-scale 7T naturalistic fMRI data and nonlinear pRF modeling to map the tonotopic organization of the human auditory cortex, linking spectral tuning to speech selectivity and cortical hierarchy. The evidence is solid, providing a compelling demonstration that movie-based stimuli can recover robust population-level auditory maps and offering useful tools for leveraging existing datasets, although there is room for improvement in relating static tonotopy to dynamic speech processing and in presentation clarity. The study will be of interest to a broad audience working on auditory cortex organization and mapping.
-
Reviewer #1 (Public review):
This paper reports an auditory-directed analysis of the HCP 7T short movie dataset. It has the goal of using the film audio to create tonotopic (pRF) maps and combine these with other HCP-provided data (e.g., T1/T2 ratio) to improve understanding of auditory cortex organization and relative functional segregation, particularly in reference to speech processing.
The paper is ambitious, uses well-founded existing tools for combining data across subjects, and in the Discussion in particular, makes a lot of careful points about interpretation. The paper shows that, at least for a very large dataset on 7T (and for at least a few individual participants) good quality cross-subject-average tonotopic maps can be extracted from fMRI movie datasets via basic spectral modelling of the movie soundtracks. It also …
Reviewer #1 (Public review):
This paper reports an auditory-directed analysis of the HCP 7T short movie dataset. It has the goal of using the film audio to create tonotopic (pRF) maps and combine these with other HCP-provided data (e.g., T1/T2 ratio) to improve understanding of auditory cortex organization and relative functional segregation, particularly in reference to speech processing.
The paper is ambitious, uses well-founded existing tools for combining data across subjects, and in the Discussion in particular, makes a lot of careful points about interpretation. The paper shows that, at least for a very large dataset on 7T (and for at least a few individual participants) good quality cross-subject-average tonotopic maps can be extracted from fMRI movie datasets via basic spectral modelling of the movie soundtracks. It also suggests ways that these movie-based maps can be combined to come up with potential models of cortical organization. The PCA analysis is a creative way of combining maps (see below for comments)
These are valuable tools for the field in exploiting/exploring existing data, and I look forward to trying them out myself. I want to emphasize that this is not 'damning with faint praise' - a concrete demonstration of this approach with freely available tools/examples is not only the product of a lot of effort (thank you!), but will be an impetus to research going forward.
In terms of the contribution to our understanding of auditory cortex organization, using this large N cohort, they replicate a number of findings in the literature from the last couple of decades, including the overlap of low frequency preference with greater speech stimulus preference (e.g. Moerel, de Martino, & Formisano, 2012, J Neuro), patterns of BF width across cortex (Moerel et al., various; Thomas et al. 2015), use of shorter and longer natural sounds (Moerel et al., 2012, 2014; Dick et al., 2012), the importance/influence of sustained spectral attention for tonotopic mapping (da Costa et al., 2013; Dick et al., 2017; Riecke et al. 2017), the use of tonotopy and 'myelin' mapping to establish areal or regional boundaries (Dick et al., 2012; de Martino et al., 2015; Besle et al., 2018, etc) and the overall shape and consistency of tonotopic maps (e.g., Talavage et al., 2004, Humphries et al., 2010 and many others). To my knowledge/memory, this is the first tonotopy paper that has used the cross-subject cortical-surface-based averaging techniques that are driven by more than curvature/sulcal alignment.
The paper focuses in particular on creating new sets of ROIs based on the various maps derived from the data. Despite being quite familiar with this body of work, I found it difficult to follow how the ROIs were derived, and how and why they were different and/or an improvement over existing parcellation schemes (see for instance Sereno, Sood, & Huang, 2022 for a comprehensive parcellation framework across modalities including auditory, based on combined receptive surface mapping, myelin estimates, and other metrics).
Given the hour of fast(ish) fMRI data on a 7T with pretty big voxels (so high SNR), one aspect of the results that I found surprising - and potentially informative - was the lack of reliable tonotopic 'mappability' in the majority of participants. The authors' analytic approach to computing the pRFs seems completely reasonable (and shows good average maps), and yet individual maps seem unreliable except for the very best examples. I wondered if this might be due to problems in data collection with earbuds becoming slightly uncoupled and therefore delivering a lot less lower-frequency response and also not preventing scanner noise from getting to the ear; this is often a problem with any in-scanner earbud system (including the Sensimetrics). I wondered if the robustness of the 'speech maps' was associated with that of tonotopy; if they are highly associated, that would suggest that either there were huge individual differences in auditory attention, or perhaps that there was some variability in the acoustic signal delivered to each participant.
-
Reviewer #2 (Public review):
Summary:
In this manuscript, the authors leverage a high-powered 7T fMRI dataset of subjects viewing naturalistic audiovisual movies to elucidate the topographic organization of the human auditory cortex. By applying a nonlinear pRF model, they successfully map tonotopic gradients extending beyond the auditory core into the STG and STS areas. A primary finding is a medial-to-lateral gradient of increasing response compressivity, which the authors claim mirrors the hierarchical cascade architecture of the visual system. Furthermore, the modeling reveals that regions exhibiting high speech selectivity predominantly occupy the low-frequency portions of non-primary tonotopic maps. The authors argue that this architecture reflects an efficient coding mechanism where the cortex magnifies specific spectral features …
Reviewer #2 (Public review):
Summary:
In this manuscript, the authors leverage a high-powered 7T fMRI dataset of subjects viewing naturalistic audiovisual movies to elucidate the topographic organization of the human auditory cortex. By applying a nonlinear pRF model, they successfully map tonotopic gradients extending beyond the auditory core into the STG and STS areas. A primary finding is a medial-to-lateral gradient of increasing response compressivity, which the authors claim mirrors the hierarchical cascade architecture of the visual system. Furthermore, the modeling reveals that regions exhibiting high speech selectivity predominantly occupy the low-frequency portions of non-primary tonotopic maps. The authors argue that this architecture reflects an efficient coding mechanism where the cortex magnifies specific spectral features to facilitate the transition from acoustic encoding to flexible speech representation.
Overall, the study presents concise analyses and compelling high-resolution results that advance our understanding of auditory cortical organization. However, the manuscript currently exhibits several significant theoretical and methodological gaps that temper its broader claims. Most notably, the authors' reliance on a spatial, retinotopic-like analogy overlooks the fundamentally temporal nature of audition. Decoding continuous, natural speech relies heavily on dynamic, full-spectrum temporal integration and contextual recurrent computations, which are difficult to reconcile with the purely static, low-frequency spatial tuning observed here.
Strengths:
(1) The utilization of ultra-high-field 7T functional imaging combined with large-scale, naturalistic continuous stimuli provides an excellent signal-to-noise ratio and captures cortical responses under ecologically valid conditions.
(2) The application of a non-linear pRF encoding model provides a robust, quantitative method for parameterizing and mapping tonotopic features across the cortex, moving beyond simple contrast-based parcellations.
(3) The manuscript effectively demonstrates the relationship between category selectivity (e.g., speech) and underlying tonotopy, drawing an elegant and structurally useful analogy to the well-established relationship between category selectivity and retinotopy in the visual cortex.
Weaknesses:
(1) While the PCA mapping of the functional and structural parameter space is visually compelling, the robustness of this representational geometry across varying acoustic contexts remains ambiguous. Because the model relies on the specific statistical regularities of a single naturalistic audiovisual stimulus set, it is unclear if this low-dimensional structure would hold when tested against isolated speech sounds, environmental noise, or spectrally matched non-speech control stimuli.
(2) The methodological descriptions currently lack the computational precision required for replication and deep evaluation. I would suggest that the exact mathematical formulation of the encoding model be fully specified in the Methods section. This should include an explicit definition of the objective function, a clear accounting of all terms and hyperparameters utilized during the fitting process, and the exact dimensionalities of both the input feature space and the resulting parameter space.
(3) There is a critical theoretical disconnect between the observed static, low-frequency tuning in the STG and the known acoustic requirements for continuous speech perception. Speech is a full-spectrum signal; while fundamental frequencies and formants dominate the lower spectrum (which is vital for processing dynamic pitch contours), high-frequency bands (>1 kHz) carry indispensable phonetic information, such as the rapid spectrotemporal dynamics of consonants, especially fricatives. If the speech-responsive cortex is primarily and statically tuned to a low-frequency spectrum, it is unclear how the dynamic, high-frequency spectral information required for semantic decoding is represented. A rich body of electrophysiological literature documents diverse spectrogram coding in the STG. For example, Mesgarani et al. (Science, 2014) demonstrated using spectrotemporal receptive field models that neural populations in the STG are tuned to both low and high-frequency spectrograms well above 1 kHz. The authors must address this discrepancy and attempt to reconcile their static tonotopic findings with the existing literature on dynamic speech encoding.
(4) While drawing parallels between visual and auditory processing hierarchies is conceptually attractive, the modalities face fundamentally different computational challenges. Vision is largely resolved in space, making a retinotopic spatial coding strategy ecologically and computationally sound. Audition, however, evolves continuously in time. Complex temporal structure, continuous temporal integration, and contextual recurrent computations are paramount for auditory processing, particularly for speech comprehension. In this sense, a purely spatial or tonotopic coding framework is insufficient to fully explain the complex temporal processing dynamics required in the higher-order auditory domain.
-
Reviewer #3 (Public review):
Summary:
The work has the potential to identify the topographical organization of the auditory cortex, which remains controversial with current unnaturalistic sound stimulation, using an elegant approach developed in the visual domain with population receptive field mapping to study the organization of the visual system with naturalistic stimulation conditions.
Strengths:
This work presents an analysis of the topographic study of auditory cortical organization, using a substantial Human Connectome Project 7-Tesla functional imaging dataset in which 174 participants viewed naturalistic movies.
Weaknesses:
The key issue for the paper is that even the authors seem undecided on what the topographical results are and whether these results are consistent with, refute, or expand our notion of human auditory …
Reviewer #3 (Public review):
Summary:
The work has the potential to identify the topographical organization of the auditory cortex, which remains controversial with current unnaturalistic sound stimulation, using an elegant approach developed in the visual domain with population receptive field mapping to study the organization of the visual system with naturalistic stimulation conditions.
Strengths:
This work presents an analysis of the topographic study of auditory cortical organization, using a substantial Human Connectome Project 7-Tesla functional imaging dataset in which 174 participants viewed naturalistic movies.
Weaknesses:
The key issue for the paper is that even the authors seem undecided on what the topographical results are and whether these results are consistent with, refute, or expand our notion of human auditory cortical field organization using this massive dataset obtained under movie-watching conditions. Short of this clarity, and much of the discussion of the issues surrounding topographic mapping is buried in the Supplementary materials section, it is not clear what the authors think the advance of the current work is beyond the large datasets.
On the flip side, there is little consideration of the challenges of mapping the auditory cortex using naturalistic stimuli that prevent dissociating visual from auditory stimulation conditions, contributing to this clarity or lack thereof in tonotopic mapping.
As such, the current manuscript struggles to achieve its full potential.
-