Largely distinct networks mediate perceptually-relevant auditory and visual speech representations

Abstract

Visual speech is an integral part of communication, but it remains unclear whether information carried by lip movements is represented in the same brain regions that mediate acoustic speech comprehension. Our ability to understand acoustic speech seems independent of our ability to understand visual speech, yet neuroimaging studies suggest that the neural representations largely overlap. Addressing this discrepancy, we tested where the brain represents acoustically and visually conveyed word identities in a full-brain MEG study. Our analyses dissociate cerebral representations that merely reflect the physical stimulus from those that also predict comprehension, and suggest that these overlap only in specific temporal and frontal regions. Moreover, representations predictive of auditory and visual comprehension converge only in angular and inferior frontal regions. These results provide a neural explanation for the behavioural dissociation of acoustic and visual speech comprehension and suggest that cerebral representations encoding word identities may be more modality-specific than often assumed.

Article activity feed

  1. This manuscript is in revision at eLife

    The decision letter after peer review, sent to the authors on April 22, 2020, follows.

    Summary

    The revised paper presents a better-fitting analysis and discusses the results in a more nuanced way than the original manuscript did. However, we still have a few major criticisms of the analysis, detailed below.

    Essential Revisions

    1. Brain-wide, multiple-comparison-corrected tests comparing auditory versus visual decoding are still lacking. The authors have now provided vertex-wise Bayes factors within areas that showed significant decoding in each individual condition. Unfortunately, this is not satisfactory, because these statistics are (1) potentially circular, since the ROIs were pre-selected based on an analysis of the individual conditions, (2) not corrected for multiple comparisons, and (3) reliant on an arbitrary prior that is not calibrated to the expected effect size. Even setting these issues aside, the only area that appears to contain vertices with "strong evidence" for a difference in neuro-behavioral decoding is the MOG, which would not really support the claim of "largely distinct networks" supporting auditory versus visual speech representation.

    The authors may address these issues, for instance, by (i) presenting additional whole-brain results, e.g. a direct comparison of auditory and visual classification (as in Figure 2) and of perceptual prediction (as in Figure 3); (ii) presenting vertex-wise maps of Bayesian evidence values (as in Supplementary Figure 3) for the statistical comparisons shown in Figures 2D and 3D; and (iii) making clear in the text accompanying Figures 2D and 3D which hypotheses correspond to the null and to the alternative hypothesis (i.e., auditory = visual versus auditory ≠ visual).
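    One way to address point (i), offered purely as an illustration rather than as the authors' pipeline, is a paired sign-flip permutation test on the subject-wise difference in decoding accuracy at each vertex, with the maximum statistic across vertices used to control the family-wise error rate. The variable names and shapes below (`acc_aud` and `acc_vis`, each subjects × vertices) are assumptions made for the sketch.

    ```python
    # Minimal sketch (not the authors' code): whole-brain paired comparison of
    # auditory vs. visual decoding accuracy with max-statistic FWE correction.
    import numpy as np

    def paired_perm_max_t(acc_aud, acc_vis, n_perm=5000, seed=0):
        """acc_aud, acc_vis: arrays of shape (n_subjects, n_vertices)."""
        rng = np.random.default_rng(seed)
        diff = acc_aud - acc_vis
        n_sub = diff.shape[0]

        def t_map(d):
            # one-sample t-statistic of the paired difference, per vertex
            return d.mean(0) / (d.std(0, ddof=1) / np.sqrt(n_sub))

        t_obs = t_map(diff)
        max_null = np.empty(n_perm)
        for p in range(n_perm):
            flips = rng.choice([-1.0, 1.0], size=(n_sub, 1))  # sign-flip under H0
            max_null[p] = np.abs(t_map(diff * flips)).max()   # max |t| over vertices
        # two-sided, family-wise-error-corrected p-value per vertex
        p_fwe = (max_null[:, None] >= np.abs(t_obs)[None, :]).mean(0)
        return t_obs, p_fwe
    ```

    The same subject-wise accuracy maps could then be summarised as vertex-wise Bayes factors, provided the prior on the effect size is explicitly justified rather than left at a default value.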

    2. As noted before, the classifiers used in this study do not discriminate between the temporal and spatial dimensions of decoding accuracy. This leaves it unclear whether the reported results are driven by the (dis)similarity of spatial patterns of activity (as in fMRI-based MVPA), of temporal patterns of activity (e.g., oscillatory "tracking" of the speech signal), or of some combination of the two. As these three possibilities could lead to very different interpretations of the data, it seems critical to distinguish between them. For example, the authors write that "the encoding of the acoustic speech envelope is seen widespread in the brain, but correct word comprehension correlates only with focal activity in temporal and motor regions", but, as it stands, their results could be partly driven by this non-specific entrainment to the acoustic envelope.

    In their response, the authors show that classifier accuracy breaks down when spatial or temporal information is degraded, but it would be more informative to show how these two factors interact. For example, the methods article cited by the authors (Grootswagers et al., 2017) shows classification accuracy for successive time bins after stimulus onset (i.e., a separate classifier is trained for each time bin: 0-100 ms, 100-200 ms, etc.). The timing of decoding accuracy in different areas could also help to distinguish between the different plausible explanations of the results; a time-resolved analysis of this kind is sketched below.

    Finally, it is somewhat unclear how spatial and temporal information are combined in the current classifier. Supplementary Figure 5 creates the impression that the time series for each vertex within a spotlight were simply concatenated. However, this would conflate within-vertex (temporal) and across-vertex (spatial) variance.
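    To make these two points concrete, here is a minimal, hypothetical sketch of time-resolved decoding within a single spotlight: a separate classifier is trained on the spatial pattern across vertices in each time bin, rather than on concatenated vertex-by-time features. The input shapes and the choice of classifier (LDA from scikit-learn) are assumptions, not the authors' implementation.

    ```python
    # Minimal sketch: one classifier per time bin within a spotlight, so that
    # temporal and spatial contributions to decoding accuracy can be separated.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def timebin_accuracy(X, y, bin_size=10, cv=5):
        """X: (n_trials, n_vertices, n_times) for one spotlight; y: word labels."""
        n_trials, n_vertices, n_times = X.shape
        n_bins = n_times // bin_size
        acc = np.zeros(n_bins)
        for b in range(n_bins):
            # spatial pattern averaged within the bin -> (n_trials, n_vertices)
            Xb = X[:, :, b * bin_size:(b + 1) * bin_size].mean(axis=2)
            clf = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
            acc[b] = cross_val_score(clf, Xb, y, cv=cv).mean()
        return acc
    ```

    Plotting such bin-wise accuracies over post-stimulus latency for each area would show when decodable information emerges and whether it rests on spatial patterns, temporal patterns, or both.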

    3. The concern that the classifier could conceivably index factors influencing "accuracy", rather than the perceived stimulus, does not appear to have been addressed sufficiently. Indeed, the classifier is referred to as identifying "sensory representations" throughout the manuscript, when it could just as well identify areas involved in other functions (e.g., attention, motor function) that contribute to accurate behavioral performance. This limitation should be acknowledged in the manuscript. The authors could consider using the timing of decoding accuracy in different areas to disambiguate these explanations.

    The authors state in their response that classifying based on the participant's reported stimulus (rather than response accuracy) could "possibly capture representations not related to speech encoding but relevant for behaviour only (e.g. pre-motor activity). These could be e.g. brain activity that leads to perceptual errors based on intrinsic fluctuations in neural activity in sensory pathways, noise in the decision process favouring one alternative response among four choices, or even noise in the motor system that leads to a wrong button press without having any relation to sensory representations at all."

    But it seems that all of these issues would affect the accuracy-based classifier as well. Moreover, it seems that intrinsic fluctuations in sensory pathways, or possibly noise in the decision process, are part of what the authors are after. If noise in a sensory pathway can be used to predict particular inaccurate responses, isn't that strong evidence that it encodes behaviorally relevant sensory representations? For example, intrinsic noise in V1 has been found to predict responses in a simple visual task in non-human primates, with false-alarm trials exhibiting noise patterns similar to target responses (Seidemann & Geisler, 2018). Showing accurate trial-by-trial decoding of participants' incorrect responses could similarly provide stronger evidence that a certain area contributes to behavior.
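    As a concrete, hypothetical sketch of that last suggestion (variable names assumed, not taken from the manuscript): train a classifier on correctly answered trials and then test whether its predictions on error trials follow the participant's report rather than the presented word.

    ```python
    # Minimal sketch: does decoding on error trials track the report or the stimulus?
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def error_trial_decoding(X, presented, reported):
        """X: (n_trials, n_features) single-trial MEG features;
        presented, reported: arrays of word labels per trial."""
        correct = presented == reported
        clf = LinearDiscriminantAnalysis().fit(X[correct], presented[correct])
        pred = clf.predict(X[~correct])
        match_report = (pred == reported[~correct]).mean()      # follows the percept?
        match_stimulus = (pred == presented[~correct]).mean()   # follows the input?
        return match_report, match_stimulus
    ```

    If the match to the reported word reliably exceeded both chance and the match to the presented word in a given area, that would be stronger evidence that the area carries behaviorally relevant sensory representations.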