Exposing distinct subcortical components of the auditory brainstem response evoked by continuous naturalistic speech

Curation statements for this article:
  • Curated by eLife


    Summary: This manuscript describes a way of altering speech to make it more "peaky", with the goal of inducing stronger responses in the auditory brainstem. Recent work has employed naturalistic speech to investigate subcortical mechanisms of speech processing. However, previous methods were ill-equipped to tease apart the neural responses in different parts of the brainstem. The authors show that their speech manipulation improves on this: the peaky speech that they develop makes it possible to segregate different waves of the brainstem response. This development may allow further and more refined investigations of the contribution of different parts of the brainstem to speech processing, as well as to hearing deficits.

This article has been Reviewed by the following groups


Abstract

Speech processing is built upon encoding by the auditory nerve and brainstem, yet we know very little about how these processes unfold in specific subcortical structures. These structures are deep and respond quickly, making them difficult to study during ongoing speech. Recent techniques begin to address this problem, but yield temporally broad responses with consequently ambiguous neural origins. Here we describe a method that pairs re-synthesized “peaky” speech with deconvolution analysis of EEG recordings. We show that in adults with normal hearing, the method quickly yields robust responses whose component waves reflect activity from distinct subcortical structures spanning auditory nerve to rostral brainstem. We further demonstrate the versatility of peaky speech by simultaneously measuring bilateral and ear-specific responses across different frequency bands, and discuss important practical considerations such as talker choice. The peaky speech method holds promise as a tool for investigating speech encoding and processing, and for clinical applications.

Article activity feed

  1. Reviewer #3:

    This paper describes a novel technique for measuring several distinct subcortical components, using naturalistic speech instead of the more typical clicks and tone-pips. The benefits of using extended speech (e.g., stories) include simultaneous measurement of middle- and late-latency components automatically.

    The technique is of great interest with many potential use cases. The manipulation of the acoustics is reasonable (replacing voiced speech with click trains of the same pitch), does not degrade intelligibility, and reduces sound quality only in minor ways. The manipulation is also described clearly for others to implement.

    The authors also investigate several variations and generalizations of the technique, and their tradeoffs, inducing responses from specific tonotopic bands and ear-specific responses.

    The reliability of the ABR wave I and V responses is remarkable (especially given the previous results of the senior author using unprocessed speech); wave III is less so. Being able to record P0, Na, Pa, Nb, P1, N1, and P2 simultaneously shows promise for future clinical applications (and basic science). The practical importance of using a lower fundamental frequency (i.e., one typical of male speakers) is clearly established.

    The technique has some overlap with the Chirp spEECh of Miller et al., but with enough tangible additional benefits that it should be considered novel.

    The writing is very clear.

    Major Concerns:

    "wave III was clearly identifiable in 16 of the 22 subjects": Figure 1 indicates that the word "clearly" may be somewhat generous. It would be worthwhile to discuss wave III and its identifiability in more detail (perhaps its identifiability/non-universality could be compared with that of another less prominent peak in traditionally obtained ABRs?).

  2. Reviewer #2:

    General assessment:

    This manuscript presents an improved methodology for extracting distinct early auditory evoked potentials from the EEG response to continuous natural speech, including a novel method for obtaining simultaneous responses from different frequency bands. It is a clever approach and the first results are promising, but more rigorous evaluation of the method and critical evaluation of the results is needed. It could provide a valuable tool for investigating the effect of corticofugal modulation of the early auditory pathway during speech processing. However, the claims made of its use investigating speech encoding or clinical diagnosis seem too speculative and unspecific.

    General comments:

    1. Despite repeated claims, I don't think a convincing case is made here that this method can provide insight on how speech is processed in the early auditory pathway. The response is essentially a click-like response elicited by the glottal pulses in the stimulus; it averages out information related to dynamic variations in envelope and pitch that are essential for speech perception; at the same time, it is highly sensitive to sound features that do not affect speech perception. What reason is there to assume that these responses contain information that is specific or informative about speech processing?

    2. Similarly, the claim that the methodology can be used as a clinical application is not convincing. It is not made clear what pathology these responses can detect that current ABR methods cannot, or why. As explained in the Discussion, the response size is inherently smaller than standard ABRs because of the higher repetition rate of the glottal pulses, and the response may depend on more complex neural interactions that would be difficult to quantify. Do these features not make them less suitable for clinical use?

    3. It needs to be rigorously confirmed that the earliest responses are not contaminated or influenced by responses from later sources. There seems to be some coherent activity or offset in the baseline (before 0 ms), in particular with the lower filter cutoff. One way to test this might be to simulate a simple response by filtering and time-shifting the stimulus waveforms, adding these up plus realistic noise, and applying the deconvolution to see whether the input is accurately reproduced. It might be useful to see how the response latencies and amplitudes correlate with those of conventional click responses, and how they depend on stimulus level.

    4. The multiband responses show a variation of latency with frequency band that indicates a degree of cochlear frequency specificity. The latency functions reported here look similar to those obtained by Don et al 1993 for derived band click responses, but the actual numbers for the frequency-dependent delays (as estimated by eye from Figures 4, 6 and 7) seem shorter than those reported for wave V at 65 dB SPL (Don et al 1993 table II). The latency function would be better fitted by an exponential, as in Strelcyk et al 2009 (equation 1), than by a quadratic function; the fitted exponent could be directly compared to their reported value.

    5. The fact that differences between narrators lead to changes in the ABR response is to be expected, and was already reported in Maddox and Lee 2018. I don't understand why it needs to be examined and discussed at such length here. The space devoted to discussing the recording time also seems very long. Neither the abstract nor the introduction refers to these topics, and they seem to be side-issues that could be summarised and discussed much more briefly.
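The simulation check proposed in comment 3 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the sampling rate, the jittered pulse train, the damped-sinusoid "response" kernel, the noise level, and the helper name `jittered_pulse_train` are all invented for the example, and the deconvolution is a generic frequency-domain least-squares estimate (summed cross-spectrum divided by summed regressor power).

```python
import numpy as np

def jittered_pulse_train(n, rng, lo=60, hi=140):
    """Unit impulses separated by random intervals of lo..hi-1 samples (hypothetical stimulus)."""
    x = np.zeros(n)
    idx = np.cumsum(rng.integers(lo, hi, size=n // lo + 1))
    x[idx[idx < n]] = 1.0
    return x

fs = 10_000                      # assumed sampling rate (Hz)
n = 2 * fs                       # samples per simulated epoch
n_lags = 150                     # 15 ms analysis window
t = np.arange(n_lags) / fs
# Assumed "true" response: a damped sinusoid, chosen only for illustration.
true_kernel = np.sin(2 * np.pi * 300 * t) * np.exp(-t / 0.004)

rng = np.random.default_rng(0)
num = np.zeros(n // 2 + 1, dtype=complex)   # summed cross-spectrum
den = np.zeros(n // 2 + 1)                  # summed regressor power spectrum
for _ in range(20):                         # 20 simulated epochs
    x = jittered_pulse_train(n, rng)
    # Simulated recording: known response to each pulse plus white noise.
    y = np.convolve(x, true_kernel)[:n] + rng.normal(0.0, 1.0, n)
    X, Y = np.fft.rfft(x), np.fft.rfft(y)
    num += np.conj(X) * Y
    den += np.abs(X) ** 2

estimate = np.fft.irfft(num / den, n=n)
recovered = estimate[:n_lags]               # lags 0 .. 15 ms
prestim = estimate[-n_lags:]                # negative lags wrap to the end of the array

r = np.corrcoef(true_kernel, recovered)[0, 1]
prestim_rms = np.sqrt(np.mean(prestim ** 2))
print(f"correlation with true kernel: {r:.3f}, pre-zero RMS: {prestim_rms:.4f}")
```

If the recovered kernel tracks the simulated one and the pre-zero segment stays near the noise floor, baseline structure in real data is less likely to be a by-product of the deconvolution itself; a longer simulated kernel with later components would be needed to probe contamination from later sources.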

    L142-144. Is it possible to apply the pulse train regressor to the unaltered speech response? If so, does this improve the response, i.e. make it look more similar to the peaky speech response? It would be interesting to know whether improvement is due to the changed regressor or the stimulus modification or both.

    L208-211. What causes the difference between the effect of high-pass filtering and subtracting the common response? If they serve the same purpose, but have different results, this raises the question of which is more appropriate.

    L244. This seems a misinterpretation. The similarity between broadband and summated multiband responses indicates that the band filtered components in the multiband stimulus elicited responses that add linearly in the broadband response. It does not imply that the responses to the different bands originate from non-overlapping cochlear frequency regions.

    L339-342. Is this measure of SNR appropriate, when the baseline is artificially constructed by deconvolution and filtering? Perhaps noise level could be assessed by applying the deconvolution to a silent recording instead? It might also be useful to have a measure of the replicability of the response.
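A replicability measure and a noise-referenced SNR of the kind raised in the last comment could look something like the sketch below. Everything here is an assumption made for illustration: the function names, the synthetic "epoch responses", and the SNR formula (excess variance over a noise reference, in dB), which is one common form and not necessarily the one used in the manuscript. In practice the noise reference might come from deconvolving a silent recording, as the comment suggests.

```python
import numpy as np

def split_half_correlation(epoch_responses):
    """Correlate the average of even-numbered epochs with the average of odd-numbered epochs."""
    even = epoch_responses[0::2].mean(axis=0)
    odd = epoch_responses[1::2].mean(axis=0)
    return np.corrcoef(even, odd)[0, 1]

def snr_db(response, noise_reference):
    """10*log10((var(signal+noise) - var(noise)) / var(noise)), floored to avoid log of <= 0."""
    var_sn = np.var(response)
    var_n = np.var(noise_reference)
    return 10 * np.log10(max(var_sn - var_n, 1e-20) / var_n)

# Synthetic stand-ins (not real data): 40 per-epoch responses = template + noise,
# and 40 noise-only "responses" standing in for a deconvolved silent recording.
rng = np.random.default_rng(1)
t = np.arange(150) / 10_000
template = np.sin(2 * np.pi * 300 * t) * np.exp(-t / 0.004)
epochs = template + rng.normal(0.0, 0.5, size=(40, t.size))
silent = rng.normal(0.0, 0.5, size=(40, t.size))

r_split = split_half_correlation(epochs)   # replicability of the response
r_null = split_half_correlation(silent)    # chance-level reference
snr = snr_db(epochs.mean(axis=0), silent.mean(axis=0))
print(f"split-half r = {r_split:.2f} (null {r_null:.2f}), SNR = {snr:.1f} dB")
```

Reporting the null split-half value alongside the real one gives a sense of how much correlation filtering and deconvolution alone could produce.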

  3. Reviewer #1:

    Major issues:

    I have two major comments on the work.

    1. The authors motivate the work from the use of naturalistic speech, and the application of the developed method to investigate, for instance, speech-in-noise deficits. But they do not discuss how comprehensible the peaky speech in fact is. I would therefore like to see behavioural experiments that quantitatively compare speech-in-noise comprehension, for example SRTs, for the unaltered speech and the peaky speech. Without such a quantification, it is impossible to fully judge the usefulness of the reported method for further research and clinical applications.

    2. The neural responses to unaltered speech and to peaky speech are analysed by two different methods. For unaltered speech, the authors use the half-wave rectified waveform as the regressor. For peaky speech, however, the regressor is a series of spikes that are located at the timings of the glottal pulses. Due to this rather different analysis, it is impossible to know to which degree the differences in the neural responses to the two types of speech that the authors report are due to the different speech types, or due to the different analysis techniques. The authors should therefore use the same analysis technique for both types of speech. It might be most sensible to analyse the unaltered speech through a regressor with spikes at the glottal pulses as well. In addition, it would be good to see a comparison, say of an SNR, when the peaky speech is analysed through the half-wave rectified waveform and through the series of spikes. This would also further motivate the usage of the regressor with the series of spikes.
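To make the two regressors in comment 2 concrete, a minimal sketch is given below. The function names, the sampling rate, and the toy 100 Hz waveform with evenly spaced pulse times are hypothetical; in a real analysis the pulse times would come from the glottal-pulse estimates used in the re-synthesis.

```python
import numpy as np

def half_wave_rectify(audio):
    """Regressor 1: keep the positive part of the waveform, zero elsewhere."""
    return np.maximum(audio, 0.0)

def pulse_train_regressor(pulse_times_s, fs, n_samples):
    """Regressor 2: unit impulses at the (assumed known) glottal pulse times."""
    reg = np.zeros(n_samples)
    idx = np.round(np.asarray(pulse_times_s) * fs).astype(int)
    idx = idx[(idx >= 0) & (idx < n_samples)]
    reg[idx] = 1.0
    return reg

fs = 8_000                                   # assumed sampling rate (Hz)
n_samples = fs // 2                          # 0.5 s of toy audio
t = np.arange(n_samples) / fs
audio = np.sin(2 * np.pi * 100 * t)          # toy stand-in for a voiced segment
pulse_times = np.arange(50) * 0.01           # hypothetical pulses every 10 ms (100 Hz f0)

rectified = half_wave_rectify(audio)
impulses = pulse_train_regressor(pulse_times, fs, n_samples)
print(np.count_nonzero(rectified), np.count_nonzero(impulses))
```

Feeding both regressors for both speech types through the same deconvolution would give the two-by-two comparison that the comment asks for.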
