A tradeoff between acoustic and linguistic feature encoding in spoken language comprehension

Curation statements for this article:
  • Curated by eLife


    eLife assessment

    This study provides convincing evidence supporting the important finding that acoustic and linguistic features contribute to brain responses as people listen to speech. However, the innovation of the methodological advance relative to other papers in the subfield is not entirely clear.


Abstract

When we comprehend language from speech, the phase of the neural response aligns with particular features of the speech input, a phenomenon referred to as neural tracking. In recent years, a large body of work has demonstrated tracking of the acoustic envelope and of abstract linguistic units at the phoneme and word levels, and beyond. However, the degree to which speech tracking is driven by acoustic edges of the signal, by internally generated linguistic units, or by the interplay of both remains contentious. In this study, we used naturalistic story listening to investigate (1) whether phoneme-level features were tracked over and above acoustic edges, (2) whether word entropy, which can reflect sentence- and discourse-level constraints, impacted the encoding of acoustic and phoneme-level features, and (3) whether the tracking of acoustic edges was enhanced or suppressed during comprehension of a first language (Dutch) compared to a statistically familiar but uncomprehended language (French). We first show that encoding models including phoneme-level linguistic features in addition to acoustic features uncovered an increased neural tracking response; this signal was further amplified in a comprehended language, putatively reflecting the transformation of acoustic features into internally generated phoneme-level representations. Phonemes were tracked more strongly in a comprehended language, suggesting that language comprehension functions as a neural filter over acoustic edges of the speech signal as it transforms sensory signals into abstract linguistic units. We then show that word entropy enhances neural tracking of both acoustic and phonemic features when sentence and discourse context are less constraining. When the language was not comprehended, acoustic features, but not phonemic ones, were more strongly modulated; in contrast, when the native language was comprehended, phoneme features were more strongly modulated. Taken together, our findings highlight the flexible modulation of acoustic and phonemic features by sentence- and discourse-level constraints during language comprehension, and document the neural transformation from speech perception to language comprehension, consistent with an account of language processing as a neural filter from sensory to abstract representations.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    Here the authors set out to disentangle neural responses to acoustic and linguistic aspects of speech. Participants heard a short story, which could be in a language they understood or did not (French vs. Dutch stories, presented to Dutch listeners). Additional predictors included a combination of acoustic and linguistic factors: Acoustic, Phoneme Onsets, Phoneme Surprisal, Phoneme Entropy, and Word Frequency. Accuracy of reconstruction of the acoustic amplitude envelope was used as an outcome measure.

    The use of continuous speech and the use of comprehended vs. uncomprehended speech are both significant strengths of the approach. Overall, the analyses are largely appropriate to answer the questions posed.

    1. The reconstruction accuracies (e.g., R^2 values in Figure 1) seem perhaps lower than might be expected - some direct comparisons with prior literature would be welcome here. Specifically, the accuracies in Figure 1A are around .002-.003 whereas the range seen in some other papers is about an order of magnitude or more larger (e.g. Broderick et al. 2019 J Neurosci; Ding and Simon 2013 J Neurosci).

    We thank the reviewer for their constructive comments and careful review of our paper. The discrepancy the reviewer notes arises because the reconstruction accuracies we reported are averaged over the whole brain/sensor space, whereas prior studies report accuracies from selected channels (Broderick) or selected sources (Ding & Simon). Moreover, we used the R2 score as our measure of reconstruction accuracy, which is generally of a different order of magnitude than the correlation coefficients used in Ding and Simon (2013). Crucially, when we restrict the analysis to the auditory cortex, we can also report reconstruction accuracies around the language network on the same scale as in the previous studies. In Figure 2A and B (Figure 1 in the first version of the manuscript), we averaged model accuracies across all source points in the whole brain, without selecting any region of interest, to test whether each speech feature incrementally increased the averaged model accuracy; this is a more conservative method than selecting the sources with a stronger response to the stimuli. For example, the average R2 of the acoustic model in auditory cortex across all participants is 0.01187 for French stories and 0.01315 for Dutch stories, which is similar in magnitude to, e.g., Broderick et al. (2019, J Neurosci). TRF accuracies in brain regions outside the language network are quite small, so the whole-brain average in Figure 2A and B is almost an order of magnitude lower than in previous studies. (Ding and Simon 2013, J Neurosci: “To reduce computational complexity, the MEG sensors in each hemisphere were compressed into 3 components using denoising source separation”; their average accuracy over all subjects is around 0.2 because they used correlation rather than R2 as the accuracy measure and backward modeling (decoding) rather than forward modeling, and reconstruction accuracies of decoding models are usually higher than those of forward models. Broderick et al. 2019, J Neurosci: averaged across frontocentral channels, the average R2 over all subjects is 0.0171.) Figure 2C shows the sources where the accuracy of the base acoustic model was significantly different from 0; reconstruction accuracies around the language network are on a similar scale to the previous studies. Figure 2D shows the sources where each feature significantly improved the reconstruction accuracy relative to the previous model. These accuracy values are smaller than those of the base acoustic model because they quantify how much each speech feature incrementally increased the accuracy (e.g., phoneme-onset accuracy = accuracy of the model with acoustic features + phoneme onsets minus accuracy of the model with acoustic features only). Figure captions have been updated in the manuscript.

    Figure 2. A) Accuracy improvement (averaged over all sources in the whole brain) by each feature for Dutch stories. B) Accuracy improvement (averaged over all sources in the whole brain) by each feature for French stories. Braces in panels A and B show the significance of the contrasts (differences between consecutive models; **** <0.0001, *** <0.001, ** <0.01, * <0.05) in linear mixed-effects models (Tables 2 and 3). C) Source points where the accuracy of the base acoustic model was significantly different from 0. D) Source points where the reconstruction accuracy of each model was significantly different from that of the previous model; accuracy values show how much each linguistic feature increased the reconstruction accuracy relative to the previous model.
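    A minimal sketch of this nested-model logic is given below, assuming hypothetical data and feature matrices and using ridge regression in place of the actual TRF estimator (all names, dimensions, lags, and regularization values here are illustrative assumptions, not the authors' pipeline): each nested model adds one feature set, and that feature's contribution is the gain in held-out R2 over the previous model.

    ```python
    # Sketch of incremental forward-model comparison (hypothetical data shapes).
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_samples = 10_000                               # time points (hypothetical)
    meg = rng.standard_normal(n_samples)             # response at one source point

    # Hypothetical time-lagged design matrices, one per feature set
    features = {
        "acoustic":          rng.standard_normal((n_samples, 40)),
        "phoneme_onset":     rng.standard_normal((n_samples, 40)),
        "phoneme_surprisal": rng.standard_normal((n_samples, 40)),
    }

    increments, parts, prev_r2 = {}, [], 0.0
    for name, X_new in features.items():
        parts.append(X_new)
        X = np.hstack(parts)                         # nested model: all features so far
        # Held-out evaluation; no shuffling, to respect the temporal structure
        X_tr, X_te, y_tr, y_te = train_test_split(X, meg, test_size=0.2, shuffle=False)
        r2 = r2_score(y_te, Ridge(alpha=1.0).fit(X_tr, y_tr).predict(X_te))
        increments[name] = r2 - prev_r2              # gain over the previous model
        prev_r2 = r2

    print(increments)
    ```

    In this scheme the increments are necessarily smaller than the accuracy of the base acoustic model itself, which is the point made above about the values shown in Figure 2D.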

    2. One theoretical point relevant to this and similar studies concerns the use of acoustic envelope reconstruction accuracy as the dependent measure. On the one hand, reconstruction accuracy provides an objective measure of "success", and a satisfying link between stimulus and brain activity. On the other hand, as the authors point out, envelope reconstruction is probably not the primary goal of listeners in a conversation: comprehension is. Some discussion of the implications of envelope reconstruction accuracy might be useful in guiding interpretation of the current work, and importantly, helping the field as a whole grapple with this issue.

    Overall, the results support the authors' conclusions that acoustic edges and phoneme features are treated differently depending on whether a listener comprehends the language being spoken. In particular, phoneme features contribute to a greater degree when language is comprehended, whereas acoustic edges contribute similarly regardless of comprehension. These findings are important in part because of prior work suggesting that acoustic edges are critically important for "chunking" continuous speech into linguistic units; the current results re-center language units (phonemes) as critical to comprehension.

    Reviewer #2 (Public Review):

    In this study, the authors used an audiobook listening paradigm and encoding analysis of MEG to examine the independent contributions to MEG responses of putative acoustic and phoneme-level linguistic features in speech and their modulation by higher-level sentence/discourse constraints and language proficiency. The results indicate that:

    1. Acoustic and phoneme features do indeed make independent contributions to MEG responses in frontotemporal language regions (with a left-hemisphere bias for phoneme features).
    2. Brain responses to acoustic and phoneme features are enhanced when sentence/discourse constraints are low (i.e. when word entropy is high).
    3. While brain responses to phoneme features are enhanced when the language is comprehended (or word entropy is high), the opposite is observed for acoustic features.

    These results are taken to support widely held views on the nature of information flow during language processing. On the one hand, processing is hierarchical, consistent with finding 1 above. On the other hand, information flow between lower and high-levels of language processing is also flexible and interactive (finding 2) and modulated by behavioural goals (finding 3).

    This is a methodologically sophisticated study with useful findings that I think will be of interest to the burgeoning community investigating 'neural speech tracking' and also to the wider community interested in language processing and predictive coding. Moreover, the evidence appears convincing.

    I thought the impact was somewhat limited by the results presentation, which I think missed some key details and made the study somewhat hard to follow (but this issue can be addressed).

    Perhaps more major, I do wonder about the novelty of the study as each of the main findings has precedent in the literature. Finding 1 (e.g. Brodbeck, Simon et al.), Finding 2 (e.g. Broderick, Lalor et al.; Molinaro et al.), Finding 3 (e.g. Brodbeck, Simon et al., although here the manipulation of behavioural goals was through a cocktail party listening manipulation and there was no opposing modulation of acoustic vs phoneme level representations). Thus, while the study appears well executed, overall I am unsure how significant the advance is. Related to this point, the study's findings and theoretical interpretations (e.g. the brain as a hierarchical 'filter') are consistent with widely held views of language processing (at least within cognitive neuroscience) and so again I question the potential advance of the study.

    We thank the reviewer for bringing this up. While we started our work with the aim of replicating these patterns seen in the literature – which is especially important in the burgeoning area of neural tracking of speech and language – our key extension of these findings is that we show that phonemic features are encoded more strongly both in a comprehended language compared to an uncomprehended language and as a function of word-level statistical information, and that there is a tradeoff between the encoding of acoustic and linguistic features. As the reviewer mentions, there is a patchwork of consistent findings from very different experimental circumstances, but in order to have strong evidence for the “tradeoff” of hierarchical feature encoding, it is even more crucial to have a design in which features can be directly compared, as we do, and in which acoustic differences are carefully controlled in contrast to the presence of linguistic features and language comprehension.

    While our results are consistent with Molinaro et al. (2021) – as we also provide support for a cost-minimization perspective rather than the perception-facilitation perspective discussed in Molinaro et al. – it is important to note that Molinaro et al. only examined the tracking of acoustic features, specifically the speech envelope, using the phase-locking value, and did not examine the contribution of lower-level linguistic features. Secondly, Molinaro et al. used a condition-based experimental design, in contrast to our naturalistic stimulus approach. In our study, our aim was to investigate the dynamics of encoding both acoustic and linguistic features, and we used a multivariate linear regression method on low- and high-constraining words that occurred ‘naturally’ in our audiobook stimulus across languages. Our results revealed a trade-off between the encoding of acoustic and linguistic features that depended on the level of comprehension. Specifically, in the comprehended language, the predictability of the following word had a greater influence on the tracking of phoneme features than of acoustic features, while in the uncomprehended language this trend was reversed. To the best of our knowledge, Brodbeck et al. (2020) showed an effect of attention on the tracking of acoustic features only in a cocktail-party paradigm and did not investigate the encoding of linguistic features. Brodbeck et al. (2018) showed that linguistic features are represented only in attended speech, but did not explicitly compare the acoustic features as in the previous study. Both studies used mixed speech and investigated the effect of attention rather than comprehension. In our study, we investigated the effect of comprehension where both stimuli were attended. We found that linguistic features are represented even in the uncomprehended language, unlike in the unattended speech of Brodbeck et al. (2018), although more weakly than in the comprehended language. Additionally, one of the goals of this study was to investigate the effect of context on the representations of acoustic and phoneme-level features. The opposing modulation of acoustic and phonemic features in our study was driven by contextual information. However, as we also mention in the Discussion, we do not expect an effect of context in the uncomprehended language, so the modulation of acoustic features could be related to statistical chunking of the acoustic signal for frequent words, essentially reflecting recognition of single function words such as le, la, un, une.

    We have now revised the Discussion to clarify the advance of this study and how it adds to previous studies (revisions are highlighted in red in the revised manuscript).

  2. eLife assessment

    This study provides convincing evidence supporting the important finding that acoustic and linguistic features contribute to brain responses as people listen to speech. However, the innovation of the methodological advance relative to other papers in the subfield is not entirely clear.

  3. Reviewer #1 (Public Review):

    Here the authors set out to disentangle neural responses to acoustic and linguistic aspects of speech. Participants heard a short story, which could be in a language they understood or did not (French vs. Dutch stories, presented to Dutch listeners). Additional predictors included a combination of acoustic and linguistic factors: Acoustic, Phoneme Onsets, Phoneme Surprisal, Phoneme Entropy, and Word Frequency. Accuracy of reconstruction of the acoustic amplitude envelope was used as an outcome measure.

    The use of continuous speech and the use of comprehended vs. uncomprehended speech are both significant strengths of the approach. Overall, the analyses are largely appropriate to answer the questions posed.

    The reconstruction accuracies (e.g., R^2 values in Figure 1) seem perhaps lower than might be expected - some direct comparisons with prior literature would be welcome here. Specifically, the accuracies in Figure 1A are around .002-.003 whereas the range seen in some other papers is about an order of magnitude or more larger (e.g. Broderick et al. 2019 J Neurosci; Ding and Simon 2013 J Neurosci).

    One theoretical point relevant to this and similar studies concerns the use of acoustic envelope reconstruction accuracy as the dependent measure. On the one hand, reconstruction accuracy provides an objective measure of "success", and a satisfying link between stimulus and brain activity. On the other hand, as the authors point out, envelope reconstruction is probably not the primary goal of listeners in a conversation: comprehension is. Some discussion of the implications of envelope reconstruction accuracy might be useful in guiding interpretation of the current work, and importantly, helping the field as a whole grapple with this issue.

    Overall, the results support the authors' conclusions that acoustic edges and phoneme features are treated differently depending on whether a listener comprehends the language being spoken. In particular, phoneme features contribute to a greater degree when language is comprehended, whereas acoustic edges contribute similarly regardless of comprehension. These findings are important in part because of prior work suggesting that acoustic edges are critically important for "chunking" continuous speech into linguistic units; the current results re-center language units (phonemes) as critical to comprehension.

  4. Reviewer #2 (Public Review):

    In this study, the authors used an audiobook listening paradigm and encoding analysis of MEG to examine the independent contributions to MEG responses of putative acoustic and phoneme-level linguistic features in speech and their modulation by higher-level sentence/discourse constraints and language proficiency. The results indicate that:

    1. Acoustic and phoneme features do indeed make independent contributions to MEG responses in frontotemporal language regions (with a left-hemisphere bias for phoneme features).
    2. Brain responses to acoustic and phoneme features are enhanced when sentence/discourse constraints are low (i.e. when word entropy is high).
    3. While brain responses to phoneme features are enhanced when the language is comprehended (or word entropy is high), the opposite is observed for acoustic features.

    These results are taken to support widely held views on the nature of information flow during language processing. On the one hand, processing is hierarchical, consistent with finding 1 above. On the other hand, information flow between lower and high-levels of language processing is also flexible and interactive (finding 2) and modulated by behavioural goals (finding 3).

    This is a methodologically sophisticated study with useful findings that I think will be of interest to the burgeoning community investigating 'neural speech tracking' and also to the wider community interested in language processing and predictive coding. Moreover, the evidence appears convincing.

    I thought the impact was somewhat limited by the results presentation, which I think missed some key details and made the study somewhat hard to follow (but this issue can be addressed).

    Perhaps more major, I do wonder about the novelty of the study as each of the main findings has precedent in the literature. Finding 1 (e.g. Brodbeck, Simon et al.), Finding 2 (e.g. Broderick, Lalor et al.; Molinaro et al.), Finding 3 (e.g. Brodbeck, Simon et al., although here the manipulation of behavioural goals was through a cocktail party listening manipulation and there was no opposing modulation of acoustic vs phoneme level representations). Thus, while the study appears well executed, overall I am unsure how significant the advance is. Related to this point, the study's findings and theoretical interpretations (e.g. the brain as a hierarchical 'filter') are consistent with widely held views of language processing (at least within cognitive neuroscience) and so again I question the potential advance of the study.

  5. Reviewer #3 (Public Review):

    The manuscript focuses on three central questions (line 64), and having those spelt out explicitly and early on is very helpful. I organize my evaluation around these questions:

    "(1) whether phoneme-level features contribute to neural encoding even when acoustic contributions are carefully controlled, as a function of language comprehension":

    The manuscript finds that phoneme-level features based on language statistics have a much stronger effect in the native language than the foreign language. The result adds important convergent evidence to a body of work suggesting that such features can isolate brain responses associated with higher-order representations which relate to comprehension.

    (2) whether sentence- and discourse-level constraints on lexical information (operationalized as word entropy) impacted the encoding of acoustic and phoneme-level features":

    This is a really interesting question, but I have some potential concerns about the method used to analyze it. The Methods section could definitely benefit from a more explicit description (perhaps analogous to Table 8, which is very helpful), so I apologize if I misinterpreted the analysis. The manuscript says "TRFs including all phoneme features were estimated for each condition and language" (260), implying that separate TRFs were estimated for the high and low entropy conditions: One for only high entropy words, and one for only low entropy words. I don't understand how this was implemented, since the continuous speech/TRF paradigm does not allow neatly sorting words into bins (as could be done in trial-based designs). Instead, the response during each word is a mix of early responses to the current word and late responses to the previous word.

    My interpretation of the available description (260 ff.) is that two versions of each predictor were created, one for high entropy words setting the predictor to zero during low entropy words, and vice versa. Separate TRFs were then estimated for the low- and high-entropy predictor sets. If this is indeed the case, then I am hesitant to interpret the results, because such a high entropy set of predictors is not just predicting a response in high entropy words, it is equally predicting the absence of a response in low entropy words (and vice versa). This might lead to side effects in the estimated TRFs. Furthermore, such models would estimate responses without controlling for ongoing/overlapping responses to preceding words, which may be substantial (Figure 4 implies that condition changes approximately every 2 words).
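    If that interpretation is right, the predictor construction would look roughly like the sketch below (all variable names, shapes, and the median split are hypothetical illustrations of this reading of the Methods, not the authors' actual code):

    ```python
    # Sketch of the condition-split predictor scheme described above (hypothetical).
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples = 10_000
    surprisal = rng.random(n_samples)        # continuous phoneme-level predictor
    word_entropy = rng.random(n_samples)     # word entropy, expanded to the sample rate

    # One copy of the predictor is zeroed during low-entropy words,
    # the other during high-entropy words (median split assumed for illustration)
    is_high = word_entropy > np.median(word_entropy)
    surprisal_high = np.where(is_high, surprisal, 0.0)
    surprisal_low = np.where(is_high, 0.0, surprisal)

    # Both copies would then enter separate (or joint) TRF models; note that each
    # copy predicts a response in its own condition *and* the absence of a response
    # in the other condition, which is the concern raised above.
    ```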

    "(3) whether tracking of acoustic landmarks (viz., acoustic edges) was enhanced or suppressed as a function of comprehension."

    The analysis suggests that in French (foreign language), acoustic neural responses are enhanced compared to Dutch (native language). This is an interesting data-point, and linked to a theoretically interesting claim (that lower-order representations are suppressed when higher-order categories are activated). There is a potential qualification though. Dutch and French are different languages which are probably associated with different acoustic statistics. Furthermore, the audiobooks were most likely read by different speakers (I did not find this information in the Methods section - apologies if I missed it), which, again, might be associated with different acoustic properties. Differences in acoustic responses may thus also be due to confounded differences in the acoustic structure of the stimuli.