Perceptual gating of a brainstem reflex facilitates speech understanding in human listeners


Abstract

Navigating “cocktail party” situations by enhancing foreground sounds over irrelevant background information is typically considered from a cortico-centric perspective. However, subcortical circuits, such as the medial olivocochlear (MOC) reflex that modulates inner ear activity itself, have ample opportunity to extract salient features from the auditory scene prior to any cortical processing. To understand the contribution of auditory subcortical nuclei and the cochlea, physiological recordings were made along the auditory pathway while listeners differentiated non(sense)-words and words. Both naturally-spoken and intrinsically-noisy, vocoded speech—filtering that mimics processing by a cochlear implant—significantly activated the MOC reflex, whereas listening to speech-in-background noise instead revealed engagement of midbrain and cortical resources. An auditory periphery model reproduced these speech degradation-specific effects, providing a rationale for goal-directed gating of the MOC reflex to enhance representation of speech features in the auditory nerve. Our data reveal the co-existence of two strategies in the auditory system that may facilitate speech understanding in situations where the speech signal is either intrinsically degraded or masked by extrinsic auditory information.

Article activity feed

  1. ###Author Response

    Note from the authors:

    This is the authors' response to the reviewers' comments for the manuscript “Perceptual gating of a brainstem reflex facilitates speech understanding in humans” submitted to eLife via Preprint Review. We appreciate the time and effort the reviewers took to carefully review our work. We believe all comments and suggestions will improve the manuscript for future publication. All changes detailed in this response will be implemented in the next version of this manuscript.

    Reviewer #1: [...] Reviewer 1-Comment 1:

    1. An important aspect of assessing the efferent feedback through the CEOAEs and ABRs is to ensure that different stimuli have equal intensity. The authors write in the methodology that the speech stimuli were presented at 75 dB SPL. However, it is not stated if this applies to the speech stimuli only, such that the stimuli that include background noise would have a higher intensity, or to the net stimuli. If the intensity of the speech signals alone had been kept at 75 dB SPL while the background noise had been increased, this would render the net signal louder and influence the MOCR. In addition, it would have been better to determine the loudness of the signals according to frequency weighting of the human auditory system, especially regarding the vocoded speech, to ensure equal loudness. If that was not done, how can the authors control for differences in perceived loudness resulting from the different stimuli?

    Response to Reviewer 1-Comment 1:

    Controlling the stimulus level is a critical step when recording any type of OAE due to the potential activation of the middle ear muscle reflex (MEMR). High-intensity sounds delivered to an ear can evoke contractions of both the stapedius and the tensor tympani muscles, causing the ossicular chain to stiffen and the impedance of middle ear sound transmission to increase (Murata et al., 1986; Liberman & Guinan, 1998). As a result, OAE magnitude can be reduced by attenuated retrograde middle ear transmission driven by MEMR rather than MOCR activation (Lee et al., 2006). For this reason, we were particularly careful in determining the presentation level of our stimuli.

    As pointed out by the reviewer and stated in the Methods section (Experimental Protocol): “The speech tokens were presented at 75 dB SPL and the click stimulus at 75 dB p-p, therefore no MEMR contribution was expected given a minimum of 10 dB difference between MEMR thresholds and stimulus levels (ANSI S3.6-1996 standards for the conversion of dB SPL to dB HL)”. 75 dB SPL was indeed selected as the presentation level for all natural, noise-vocoded and speech-in-noise tokens. All tokens were root-mean-square (RMS) normalized, and the calibration system (a B&K G4 sound level meter with an IEC 60711 Ear Simulator RA 0045 563 microphone, BS EN 60645-3:2007; see the CEOAEs acquisition and analysis section) was set to A-weighting, which approximates the frequency sensitivity of human hearing. Therefore, the net signal (speech plus any masking noise) never exceeded 75 dBA. We acknowledge the lack of detail about the calibration procedure in the current manuscript and will add it to the Methods section in a future version.
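    For illustration, the sketch below shows, in Python, one way a token could be mixed at a target SNR and then RMS-normalized as a whole, so that adding a masker never raises the net presentation level. The function names and the digital target value are our own illustrative choices; the mapping from digital RMS to 75 dBA is assumed to be fixed by the hardware calibration described above, not by this code.

    ```python
    import numpy as np

    def rms(x):
        """Root-mean-square amplitude of a signal."""
        return np.sqrt(np.mean(x ** 2))

    def make_token(speech, noise=None, snr_db=0.0, target_rms=0.05):
        """Mix speech and noise at a given SNR, then normalise the NET signal so
        every token (natural, vocoded, or speech-in-noise) ends up with the same
        RMS. target_rms is an arbitrary digital value; its mapping to 75 dBA is
        set by the hardware calibration (sound level meter + ear simulator)."""
        if noise is not None:
            # scale the noise so that 20*log10(rms(speech)/rms(noise_scaled)) == snr_db
            noise = noise * (rms(speech) / rms(noise)) / (10 ** (snr_db / 20))
            token = speech + noise
        else:
            token = speech
        # normalise the mixture, not the speech alone, so adding a masker
        # never raises the overall presentation level above the calibrated target
        return token * (target_rms / rms(token))
    ```

    In this scheme the same normalization is applied to natural, vocoded and speech-in-noise tokens, which is what keeps the net signal at or below the single calibrated level.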

    Reviewer 1-Comment 2:

    1. Many of the p-values that show statistical significance are actually near the threshold of 0.05 (such as in the paragraph lines 147-181). This is particularly concerning due to the large number of statistical tests that were carried out. The authors state in the Methods section that they used the Bonferroni correction to account for multiple comparisons. This is in principle adequate, but the authors do not detail what number of multiple comparisons they used for the correction for each of the tests. This should be spelled out, so that the correction for multiple comparisons can be properly verified.

    Response to Reviewer 1-Comment 2:

    Bonferroni corrections were explicitly chosen as the multiple-comparisons adjustment for our post-hoc statistical analyses because they are highly conservative and protect against Type I errors. All p-values reported in our study are Bonferroni-corrected p-values for post-hoc comparisons. However, we agree that, for verification purposes, the number of comparisons for each statistical analysis should be clarified in the Methods section; this will be added to a future version of the manuscript.
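    As a minimal illustration of how the adjustment works (with made-up p-values, not those reported in the study), the Bonferroni procedure multiplies each raw p-value by the number of comparisons in the family (capped at 1), which is equivalent to testing each raw p-value against alpha divided by the number of comparisons. The sketch below uses statsmodels.

    ```python
    from statsmodels.stats.multitest import multipletests

    # Illustrative p-values only, not those from the study.
    raw_p = [0.012, 0.030, 0.048, 0.20]

    # Bonferroni: corrected p = min(raw p * n_comparisons, 1),
    # equivalent to comparing each raw p against alpha / n_comparisons.
    reject, p_corrected, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

    for p, p_adj, sig in zip(raw_p, p_corrected, reject):
        print(f"raw p = {p:.3f}  corrected p = {p_adj:.3f}  significant: {sig}")
    ```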

    Reviewer 1-Comment 3:

    1. Line 184-203: It is not clear what speech material is being discussed. Is it the noise vocoded speech, the speech in either type of background noise, or these data taken together?

    Response to Reviewer 1-Comment 3:

    Lines 184-203 correspond to “Auditory brainstem activity reflects changes in cochlear gain” in the Results section. Line 186 describes changes in ABR components during noise-vocoded speech: “Click-evoked ABRs—measured during simultaneous presentation of vocoded speech—showed task-engagement-specific effects similar to the effects observed for CEOAE measurements.” The subsequent three sentences refer to the same (noise-vocoded) condition, whereas the remaining sentences in the section refer to the speech-in-noise conditions. As pointed out by the reviewer, we did not specify which masked condition was meant in the sentence: “Conversely, although wave III was unchanged in both masked conditions for active vs. passive listening, wave V was significantly enhanced: [F (1, 26) = 5.67, p = 0.025 and F (1, 25) = 8.91, p = 0.006] when a lexical decision was required.” Here the two rANOVAs correspond to the masked conditions speech in babble noise and speech in speech-shaped noise, respectively. This will be rectified in a future version of the manuscript.

    Reviewer 1-Comment 4:

    1. Line 202-203: The authors write that "the ABR data suggest different brain mechanisms are tapped across the different speech manipulations in order to maintain iso-performance levels". It is not clear what evidence supports this conclusion. In particular, from Figure 1D, it appears plausible that the effects seen in the auditory brainstem may be entirely driven by the MOCR effect. To see this, please note that absence of statistical significance does not imply that there is no effect. In particular, although some differences between active and passive listening conditions are non-significant, this may be due to noise, which may mask significant effects. Importantly, where there are significant differences between the active and the passive scenario, they are in the same direction for the different measures (CEOAEs, Wave III, Wave V). Of course, that does not mean that nothing else might happen at the brainstem level, but the evidence for this is lacking.

    Response to Reviewer 1-Comment 4:

    Lines 202-203 also correspond to “Auditory brainstem activity reflects changes in cochlear gain” in the Results section. As suggested by the reviewer, the effects observed in the ABRs may be driven by the MOCR. We agree with this observation, and in lines 195-197 we explain that the decreased magnitude of ABR components is consistent with the reduced magnitude of CEOAEs measured during active listening in the vocoded condition, since a reduction in cochlear gain can reduce the activity of auditory nerve (AN) afferents synapsing in the cochlear nucleus (CN). However, we did not explain that this trend is also observed during passive listening to speech-in-noise, demonstrating that vocoded speech and speech-in-noise are processed differently at the level of the brainstem and midbrain. In a future version of the manuscript, we will restrict our interpretation to statistical comparisons in the Results and leave potential mechanisms for the Discussion section.

    Reviewer 1-Comment 5:

    1. The way the output from the computational model is analyzed appears to bias the results towards the author's preferred conclusion. In particular, the authors use the correlation between the simulated neural output for a degraded speech signal, say speech in noise, and the neural output to the speech signal in quiet with the efferent feedback activated. They then compute how this correlation changes when the degraded speech signal is processed by the computational model with or without efferent feedback. However, the way the correlation is computed clearly biases the results to favor processing by a model with efferent feedback.

    The result that the noise-vocoded speech has a higher correlation when processed with the efferent feedback on is therefore entirely expected, and not a revelation of the computational model. More surprising is the observation that, for speech in noise, the correlation value is larger without the efferent feedback. This could be due to the scaling of loudness of the acoustic input (see point 1), but more detail is needed to pin this down. In summary, the computational model unfortunately does not allow for a meaningful conclusion.

    Response to Reviewer 1-Comment 5:

    Claims of bias would be understandable had we used shuffled autocorrelograms (SACs) to compare the expression of temporal fine structure (TFS) cues for natural speech versus vocoded stimuli, since TFS cues reconstructed from the envelope of our vocoded stimuli would have differed dramatically from the original TFS cues in natural speech (Shamma and Lorenzi, 2013). However, there is no inherent reason for SAC analysis of envelope cues to be biased towards either the vocoded or the speech-in-noise condition, as both stimuli retain the original envelope cues from natural speech. Indeed, since the purpose of our simulations was to compare the relative effects of adding efferent feedback on the reconstruction of the stimulus’ envelope cues in the AN for the two degraded stimuli, SACs offered a targeted analysis tool to extract the relevant information with fewer intermediate steps and presumptions than either encoder models or automatic speech recognition systems.
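    For readers unfamiliar with the metric, the following is a minimal Python sketch of a shuffled autocorrelogram and of the correlation step described above. The function names, bin width and lag range are illustrative assumptions; this is not the exact analysis pipeline coupled to the MAP_BS simulations.

    ```python
    import numpy as np

    def shuffled_autocorrelogram(spike_trains, bin_width=50e-6, max_lag=25e-3):
        """Shuffled autocorrelogram (SAC): histogram of intervals between spikes
        drawn from DIFFERENT presentations of the same stimulus, which preserves
        stimulus-locked (envelope) structure while discarding within-train
        refractory effects. spike_trains: list of 1-D arrays of spike times (s)."""
        edges = np.arange(-max_lag, max_lag + bin_width, bin_width)
        counts = np.zeros(len(edges) - 1)
        for i, train_i in enumerate(spike_trains):
            for j, train_j in enumerate(spike_trains):
                if i == j:
                    continue  # the "shuffling": never pair a train with itself
                diffs = train_i[:, None] - train_j[None, :]
                counts += np.histogram(diffs.ravel(), bins=edges)[0]
        centres = 0.5 * (edges[:-1] + edges[1:])
        return counts, centres

    def envelope_preservation(sac_degraded, sac_reference):
        """Pearson correlation between the SAC for a degraded token (with or
        without efferent feedback) and the SAC for natural speech with feedback
        on; the change in this value when feedback is switched on is the quantity
        compared across vocoded and speech-in-noise conditions."""
        return np.corrcoef(sac_degraded, sac_reference)[0, 1]
    ```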

    We do agree with the reviewer that the results of our simulations for the vocoded condition may have been less unexpected than those for speech-in-noise, as the envelopes of vocoded stimuli closely resemble those of natural speech in the absence of a masking noise. However, our results also demonstrate that adding efferent feedback could generate negative correlation changes for a number of vocoded words: either at individual frequencies (low and high spontaneous rate AN fibres; see raw data) or on average across all frequencies tested (high spontaneous rate AN fibres only; Figure Supplement 3). This suggests that noise-vocoding speech (i.e. extracting the envelope from broader channel bandwidths while also scrambling spectrotemporal information within those channels) can disrupt envelope representation in the 1-2 kHz range of certain words enough that efferent feedback should not be automatically presumed able to rectify their envelope cue reconstruction in AN fibres.

    As for the speech-in-noise conditions, our intuition for the negative correlation changes observed is that the signal-to-noise ratios (SNRs) tested were not large enough to allow isolated extraction of the target signal’s envelope by expanding the dynamic range of AN fibres. As the test stimuli and their SNRs were acquired directly by finding iso-performance in the psychophysical portion of this study (and appropriately normalized as input for the MAP_BS model), we consider the results of the simulation to be indicative of the actual benefit or disadvantage that activating efferent feedback might have on envelope representation of vocoded or speech-in-noise stimuli in the AN, rather than artefacts of a poorly calibrated stimulus presentation level (see Responses to Reviewer 1-Comments 1 and 6 for more details about methodology). Although this result may be surprising when viewed in the context of physiological and modelling studies demonstrating efferent feedback’s anti-masking effect, our results may help to explain why MOCR anti-masking appears SNR- and stimulus-specific in numerous human studies (de Boer et al., 2012; Mertes et al., 2019).

    Reviewer 1-Comment 6:

    1. The experiment on the ERPs in relation to the speech onsets is not properly controlled. In particular, the different acoustics of the considered speech signals -- speech in quiet, vocoded speech, speech in background noise -- will cause differences in excitation within the cochlea which will then affect every subsequent processing stage, from the brainstem and on to the cortex, thereby leading to different ERPs. As an example, babble noise allows for 'dip listening', while with its flat envelope speech-shaped noise does not. Analyzing differences in the ERPs with the goal of relating these to something different than the purely acoustic differences, such as to attention, would require these acoustic differences to be controlled, which is not the case in the current results.

    Response to Reviewer 1-Comment 6:

    Our fundamental methodological strategy was not to compare or even control the acoustics of the signals (although we did this to some extent by normalizing the presentation level and long-term spectrum across all signals), but instead to maintain iso-performance across conditions and, in doing so, allow the identification of brain mechanisms underlying performance in a lexical decision task where speech intelligibility was manipulated.

    We do acknowledge the reviewer’s comment regarding acoustic differences across our speech signals. This is why in the Results section we describe that: “Early auditory cortical responses (P1 and N1) are largely driven by acoustic features of the stimulus (Getzmann et al., 2015; Grunwald et al., 2003)”. Therefore, our ERP analysis instead focuses on later, less stimulus-driven components such as P2, N400 and LPC: “Later ERP components, such as P2, N400 and the Late Positivity Complex (LPC), have been linked to speech- and task-specific, top-down (context-dependent) processes (Getzmann et al., 2015; Potts, 2004).”

    With regard to the reviewer’s example: “…babble noise allows for 'dip listening', while with its flat envelope speech-shaped noise does not”. We would argue that, in our specific listening conditions, “dip listening” did not offer a perceptual advantage over speech in speech-shaped noise because:

    1. A higher SNR was required in the babble-noise conditions to achieve the same level of performance as in the speech-shaped-noise manipulations.

    2. Listeners have fewer opportunities to exploit the spectral and temporal dips when listening to monosyllabic words (as used in our study) than when listening to sentences (Rosen, 2013).

    3. The dips in the signal are expected to decrease in both depth and frequency as the number of talkers in a babble masker increases (8-talker babble was used in our study), with no difference in masking effectiveness beyond 4-talker babble (Rosen et al., 2012).

    Overall, we believe that the modulated maskers effectively impaired speech intelligibility (Kwon and Turner, 2001), but the most effective masker was babble noise, confirming that speech is its own best masker (Miller, 1947).

    Reviewer #2: [...] Reviewer 2-Comment 1:

    1. A core premise of the experiment is that the non-invasive measures recorded in response to click sounds in one ear provide a direct measure of top-down modulation of responses to the speech sounds presented to the opposite ear. This is not acknowledged anywhere in the paper, and is simply not justifiable. The click and speech stimuli in the different ears will activate different frequency ranges and neural sources in the auditory pathway, as will the various noises added to the speech sounds. Furthermore, the click and speech sounds play completely different roles in the task, which makes identical top-down modulation illogical. The situation is further complicated by the fact that the clicks, speech and noise will each elicit MOCR activation in both ipsi- and contralateral ears via different crossed and uncrossed pathways, which implies different MOCR activation in the two ears.

    Response to Reviewer 2-Comment 1:

    We employed broadband clicks across all stimulus manipulations and listening conditions to activate the entire cochlea so that resulting OAEs could be used to measure modulation of cochlear gain by olivocochlear efferents.

    Historically, studies have presented clicks in one ear (to evoke OAEs) and a broadband noise suppressor in the other to monitor contralateral MOCR activation, demonstrating that click-evoked responses are suppressed consistently when subjects actively perform either auditory (Froehlich et al., 1993; Maison et al., 2001; Garinis et al., 2011) or visual tasks (Puel et al., 1988; Froehlich et al., 1990; Avan & Bonfils, 1992; Meric & Collet, 1994). Therefore, while we acknowledge that the presence of clicks may have made the task of discriminating vocoded words and words-in-noise more difficult, we would have expected to observe suppression of click-evoked OAEs for all stimulus manipulations, whether subjects were actively or passively listening to the speech stimuli, in order to minimize the impact of the irrelevant clicks. In contrast, we observed that contralateral suppression of CEOAEs was both stimulus- and task-dependent. Unlike natural and vocoded speech, active listening to speech-in-noise did not produce significant MOCR activation, while passive listening (equivalent to visual attention) generated an MOCR effect in the opposite direction to its active-listening analogue for all three speech manipulations.

    Despite spectrotemporal, level and task-difficulty similarities between the noise-vocoded and speech-in-noise manipulations, the stimulus-dependence of these results suggests that MOCR activation was controlled in a top-down manner according to the auditory scene presented. We speculate that this arises from improved peripheral processing of specific speech cues during active listening, whereas the opposite effects in passive listening are associated with attenuating auditory inputs to prioritize visual information. In line with this, we observed that introducing efferent feedback to our auditory periphery model differentially affected the auditory nerve output for the three most challenging speech manipulations: the resulting enhancement or deterioration of envelope-cue representation offers an explanation for the divergent patterns of MOCR gating for noise-vocoded speech and speech-in-noise.

    In summary, we predict that observed changes in CEOAE amplitudes in the contralateral ear will mirror cochlear gain inhibition in the ear processing speech. Bilateral descending control of the MOCR despite speech being presented monaurally is not unexpected for two reasons:

    1. Unlike simple pure-tone stimuli, speech activates both left and right auditory cortices even when presented unilaterally to either ear (Heggdal et al., 2019).

    2. Cortical gating of the MOCR in humans does not appear restricted to direct, ipsilaterally descending processes that control cochlear gain in the opposite ear; instead, it likely incorporates polysynaptic, decussating processes that affect cochlear gain in both ears (Khalfa et al., 2001).

    Together this evidence makes it difficult to envisage a case where unilaterally-presented speech does not influence top-down control of cochlear gain bilaterally.

    Reviewer 2-Comment 2:

    1. The vocoded conditions were recorded from a different group of participants than the masked speech conditions. Comparing between these two, which forms the essential point in this paper, is therefore highly confounded by inter-individual differences, which we know are substantial for these measures. More generally, the high variability of results in this research field should caution any strong conclusions based on comparing just these two experiments. A more useful approach would have been to perform the exact same task in the two experiments, to examine the reproducibility.

    Response to Reviewer 2-Comment 2:

    We ensured that the two populations tested across the three experiments were all normal-hearing adults assessed using the same criteria. They were also age- and gender-matched and were recruited from undergraduate courses at Macquarie University (and therefore presumably possessed similar literacy). However, we acknowledge this as an important issue and controlled for it, as far as we could, by:

    1. Ensuring that CEOAE SNRs were above a 6 dB minimum, which allowed for more reliable and replicable recordings within and between subjects (Goodman et al., 2013).

    2. Carefully analysing and selecting ABR waveforms above the residual noise. Residual noise was calculated by applying a weighted-average method based on Bayesian inference that weights individual sweeps proportionally to their estimated precision (Box & Tiao, 1973; see the sketch after this list). This helped preserve all trials, with no rejection required for artefacts. ABR waveforms with residual noise equal to or higher than the averaged signal were discarded.

    3. Ensuring that individual ERP components represented a reliable individual average by: a) removing noisy trials (trials, between -200 ms and 1.2 s from sound onset, with absolute amplitude values higher than 75 μV) and b) retaining 60-80% of total trials per condition.
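    To make point 2 concrete, below is a minimal Python sketch of one common formulation of precision-weighted averaging, in which blocks of sweeps are weighted by their inverse variance so that noisy epochs are down-weighted rather than rejected. The block size, variance estimator and residual-noise formula are illustrative assumptions and may differ from the exact implementation used in the study.

    ```python
    import numpy as np

    def bayesian_weighted_average(sweeps, block_size=250):
        """Inverse-variance (precision) weighted average of ABR sweeps.
        sweeps: array of shape (n_sweeps, n_samples). Sweeps are grouped into
        blocks; each block's mean waveform is weighted by the inverse of its
        within-block variance, so noisy blocks contribute less instead of being
        rejected. Returns the weighted average and a residual-noise estimate."""
        n_sweeps, n_samples = sweeps.shape
        n_blocks = n_sweeps // block_size
        blocks = sweeps[: n_blocks * block_size].reshape(n_blocks, block_size, n_samples)

        block_means = blocks.mean(axis=1)                    # (n_blocks, n_samples)
        block_var = blocks.var(axis=1, ddof=1).mean(axis=1)  # one variance per block

        weights = 1.0 / block_var
        weights /= weights.sum()
        weighted_avg = (weights[:, None] * block_means).sum(axis=0)

        # variance of each block mean ~ block_var / block_size, so the variance
        # of the inverse-variance weighted mean is 1 / sum(block_size / block_var)
        residual_noise = np.sqrt(1.0 / (block_size * (1.0 / block_var).sum()))
        return weighted_avg, residual_noise
    ```

    In this formulation the residual-noise estimate shrinks as quieter blocks accumulate, which is what allows waveforms to be accepted or discarded by comparing the averaged signal against its residual noise rather than by rejecting individual trials.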

    In addition, we assessed potential differences between experiments on common variables such as lexical performance during natural speech (see Results section), ABR components, and CEOAE magnitude changes relative to baseline during active and passive listening to natural speech (reported as part of the first author’s thesis dissertation: Hernandez Perez, H. (2018). Disentangling the Influence of Attention in the Auditory Efferent System during Speech Processing. Macquarie University, Department of Linguistics): “During active or passive listening of natural speech, no statistical differences were found between the populations assessed in the noise-vocoded and speech-in-noise experiments for: wave V-III amplitude ratio - Active listening [t (12) = 0.90, p=0.39], Passive listening: [t (23) = 1.58, p=0.13]; wave V - Active listening: [t (23) = 0.09, p=0.93]; Passive listening: [t (24) = -0.24, p=0.81]; CEOAE magnitude changes - Active listening [t (23) = -0.21, p=0.83]; Passive listening [t (24) = -0.36, p=0.72].”

    These results ruled out the possibility that the effects observed across the three experiments were due to intrinsic differences between the populations tested. This will be discussed in a future version of the manuscript and added as supplemental material.

    Reviewer 2-Comment 3:

    1. The interpretation presented here is essentially incompatible with the anti-masking model for the MOCR that first started off this field of research, in which the noise response is suppressed more than the signal, which is contradictory to the findings and model presented here, which suggest no role for the MOCR in improving speech in noise perception.

    Response to Reviewer 2-Comment 3:

    Physiological evidence for the MOCR anti-masking effect in animal models (Wiederhold, 1970; Winslow & Sachs, 1987; Guinan & Gifford, 1988; Kawase et al., 1993) has led to the hypothesis that the MOCR may play an important role in helping humans to perceive speech in noise (Giraud et al., 1997; Liberman & Guinan, 1998). The strictly non-invasive nature of human experiments has made measuring MOCR effects on OAE amplitudes the main technique for testing this anti-masking hypothesis. However, OAE inhibition (the MOCR-mediated reduction in OAE amplitude) has been reported as increased (Giraud et al., 1997; Mishra and Lutman, 2014), reduced (de Boer et al., 2012; Harkrider and Bowers, 2009) or unaffected (Stuart and Butler, 2012; Wagner et al., 2008) in participants with improved speech-in-noise perception. More recently, Mertes et al. (2019) suggested that the SNR used to explore speech-in-noise abilities might explain the contradictory results in the literature. The authors found that the MOCR only contributed to perception at the lowest SNR they tested (-12 dB), suggesting that the role of the MOCR in listening-in-noise may be highly dependent on the SNR, which in turn determines whether or not the MOCR provides a benefit for hearing in noise. Therefore, our human and modelling data not only expand but also challenge the classical MOCR anti-masking effect by suggesting that, in humans, this effect is not only SNR-specific (which we controlled) but also task-specific (i.e. whether participants are attending to the contralateral masker or not) and stimulus-dependent (i.e. an intrinsically noisy stimulus vs. a signal-in-noise). We acknowledge that we can discuss further how our data advance the current state of the MOCR anti-masking hypothesis in a future version of the manuscript.

    Reviewer 2-Comment 4:

    1. The analysis of measures becomes increasingly selective and lacking in detail as the paper progresses: numerous 'outliers' are removed from the ABR recordings, with very uneven numbers of outliers between conditions. ABRs were averaged across conditions with no explicit justification. The statistical analysis of the ABRs is flawed as it does not compare across conditions (vocoded vs masked) but only within each condition separately (active v passive) - from which no across-condition difference can be inferred. The model simulation includes only 3 out of 9 active conditions. For the cortical responses, again only 3 conditions are discussed, with little apparent relevance.

    Response to Reviewer 2-Comment 4:

    In regard to the reviewer’s comment “The analysis of measures becomes increasingly selective and lacking in detail as the paper progresses: numerous 'outliers' are removed from the ABR recordings, with very uneven numbers of outliers between conditions. ABRs were averaged across conditions with no explicit justification.”: during the analysis of the ABR measurements, we dealt not only with outliers but also with several missing data points (ABR components below the residual noise). The statistical analysis used to assess potential differences within ABR components was a repeated-measures ANOVA (rANOVA). This type of analysis is particularly restrictive when dealing with missing data points, because it only includes participants with all data available (2 conditions x 4 stimulus manipulations for the noise-vocoded experiment). This is why ABR component sample sizes appeared uneven across experiments.

    Regarding the reviewer’s comment “ABRs were averaged across conditions with no explicit justification”: our rANOVA had the following design: Factor 1 (Conditions: Active vs. Passive); Factor 2 (Stimuli: natural, 8-channel noise-vocoded (Voc8), etc.); and the Interaction (Conditions x Stimuli). ABR conditions were not simply averaged together; we only found a significant Conditions effect in the rANOVA, which collapses all stimulus manipulations into Active vs. Passive conditions. Therefore, it was only statistically valid to make inferences and potential interpretations about the Conditions main effect. This will be clarified in both the statistical design and the Results section of a future version of this manuscript.
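    As an illustration of why listwise deletion produces uneven sample sizes, the sketch below runs a two-way repeated-measures ANOVA with statsmodels on a hypothetical long-format table; the file name, column names and dependent variable are invented for the example, and only participants with a complete 2 x 4 cell structure can enter the model.

    ```python
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Hypothetical long-format data: one row per participant x condition x stimulus,
    # with columns 'subject', 'condition' (Active/Passive), 'stimulus' (Natural,
    # Voc16, Voc12, Voc8, ...) and a dependent variable such as 'wave_v_amplitude'.
    df = pd.read_csv("abr_components_long.csv")  # illustrative file name

    # A repeated-measures ANOVA needs a complete 2 (condition) x 4 (stimulus) cell
    # structure per participant, so participants with any missing ABR component
    # (e.g. a wave below the residual noise) are dropped (listwise deletion).
    cells_per_subject = df.groupby("subject").size()
    complete_subjects = cells_per_subject[cells_per_subject == 2 * 4].index
    df_complete = df[df["subject"].isin(complete_subjects)]

    model = AnovaRM(df_complete, depvar="wave_v_amplitude",
                    subject="subject", within=["condition", "stimulus"])
    res = model.fit()
    print(res.anova_table)  # main effects (Conditions, Stimuli) and their interaction
    ```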

    In regard to the reviewer’s comment “The statistical analysis of the ABRs is flawed as it does not compare across conditions (vocoded vs masked) but only within each condition separately (active v passive) - from which no across-condition difference can be inferred”: up to this point in our data analysis, we were only interested in within-speech-manipulation comparisons (similar to the CEOAE analysis, i.e. within noise-vocoded manipulations). We agree with the reviewer that a simple comparison between speech manipulations (noise-vocoded vs. masked speech) for the variables reflecting attentional changes (Active vs. Passive listening) could be useful to infer differences across experiments (noise-vocoded vs. speech-in-noise). This analysis will be added in a future version of the paper.

    Finally, regarding the comment “The model simulation includes only 3 out of 9 active conditions. For the cortical responses, again only 3 conditions are discussed, with little apparent relevance”: at this stage of our analysis, we wanted to understand the potential reasons why control of the cochlear gain appeared to depend on the way speech was degraded, i.e. noise-vocoding the speech signal vs. speech-in-noise. Since iso-performance was achieved at three task-difficulty levels, we chose to test how both the biophysical model and the auditory cortex (ERP components) would respond to the hardest, most challenging speech degradations (8-channel noise-vocoded speech, speech in babble noise at +5 dB SNR and speech in speech-shaped noise at +3 dB SNR), where differences in cochlear gain are most evident across experiments (see Figure 1B in the Results section). In these extreme conditions we hypothesized that both the model and auditory cortical activity would display the most obvious differences in the processing of the different speech degradations. We acknowledge the reviewer’s comment, and in a future version of this manuscript this line of thought will be more clearly described.

    Reviewer 2-Comment 5:

    1. The assumption that changes in non-invasive measures, which represent a selective, random, mixed and jumbled by-product of underlying physiological processes, can be linked causally to auditory function, i.e. that changes in these responses necessarily have a definable and directional functional correlate in perception, is very tenuous and needs to be treated with much more caution.

    Response to Reviewer 2-Comment 5:

    We acknowledge the reviewer’s view that caution is needed when interpreting non-invasive measures associated with human perception. However, the physiological measurements used in this study are not new to the field of auditory or speech perception; they are gold-standard methods for assessing auditory function in both animal and human models. The novelty of our approach lies in imposing attentional states (active vs. passive listening) while concurrently probing along the auditory pathway in order to gain a holistic understanding of MOCR-mediated changes during a speech comprehension task. The strength of our methodology arises from extensively and continuously monitoring both the attentional states and the quality of our physiological measurements.

    Reviewer #3: [...] Reviewer 3-Comment 1:

    1. However, I have several substantial concerns with the design, conceptualization, data analysis and interpretation of the results. I have had challenges to understand the hypotheses and rationale behind this study. A number of experimental paradigms have been employed, including peripheral/brainstem physiological measure, as well as cortical auditory responses during active versus 'passive' listening. Different noise conditions were tested but it is not clear to me what rationale was behind these stimulus choices. The authors claim that "our data comparing active and passive listening conditions highlight a categorical distinction between speech manipulation, a difference between processing a single, but degraded, auditory stream (vocoded speech) and parsing a complex acoustic scene to hear out a stream from multiple competing and spectrally similarly sounds" (lines 401-403). This seems like too much of a mouthful. I cannot see that the data support this pretty broad interpretation.

    Response to Reviewer 3-Comment 1:

    The main objective of this study was to examine the role of the auditory efferent system in active vs. passive listening tasks for three commonly employed speech manipulations. To address this, speech intelligibility was degraded in three ways: 1) noise-vocoding the speech signal; 2) adding babble noise (BN) to the speech signal at different SNRs; or 3) adding speech-shaped noise (SSN) to the speech signal at different SNRs. The reason for using noise-vocoded speech while contralaterally recording CEOAEs is that it allowed speech intelligibility to be manipulated without adding noise (a classical way of evoking the MOCR (Berlin et al., 1993; Norman & Thornton, 1993; Kalaiah et al., 2017b)). This avoided confounding purely stimulus-driven MOCR effects on CEOAE magnitude with attention-driven effects. Moreover, because the level of the speech spectrum decreases with increasing frequency, white noise (the most commonly used stimulus to evoke the MOCR in the literature) predominantly masks only the high-frequency components of the speech signal and is therefore not considered an efficient speech masker. By contrast, BN (besides representing a more ecological type of noise) and SSN (noise spectrally matched to the long-term average spectrum of the speech signal) have the same long-term average spectrum as speech. These noises were therefore able to mask the speech signal equally across frequencies.
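    For concreteness, the sketch below outlines a minimal noise (channel) vocoder in Python of the kind described above: the speech is filtered into bands, each band's temporal envelope is extracted and imposed on band-limited noise, and the bands are summed. The band spacing, filter order and carrier choice are illustrative assumptions rather than the study's exact vocoder parameters.

    ```python
    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert

    def noise_vocode(speech, fs, n_channels=8, f_lo=100.0, f_hi=8000.0):
        """Minimal channel vocoder: filter speech into n_channels bands, extract
        each band's temporal envelope, use it to modulate band-limited noise, and
        sum the bands. speech: 1-D float array; fs: sampling rate in Hz."""
        edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_channels + 1)
        rng = np.random.default_rng(0)
        vocoded = np.zeros_like(speech, dtype=float)
        for lo, hi in zip(edges[:-1], edges[1:]):
            sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
            band = sosfiltfilt(sos, speech)
            envelope = np.abs(hilbert(band))                   # band envelope
            carrier = sosfiltfilt(sos, rng.standard_normal(len(speech)))
            vocoded += envelope * carrier                      # envelope on noise
        # match the RMS of the original token so presentation level is unchanged
        return vocoded * np.sqrt(np.mean(speech ** 2) / np.mean(vocoded ** 2))
    ```

    A vocoder of this kind preserves the within-band envelope of the original token while discarding temporal fine structure, which is why intelligibility can be degraded without adding an external masker.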

    Reviewer 3-Comment 2:

    1. Despite maintaining iso-difficulty between vocoded vs speech-in-noise (SIN) conditions, the authors neither address (a) the fundamental differences in understanding vocoded vs. SIN speech nor (b) any theoretical basis for how the noise manifests in vocoded speech. If the tasks are indeed so obviously 'categorically' different - then it should not be surprising they engage different processing (the 'denoising' may not be comparable). I would prefer much more clearly defined and targeted hypotheses and a justification of the specific stimulus and paradigm choices to test such hypotheses. It appears to me that numerous measures have been obtained (reflecting in fact very different processes along the auditory pathway) and then it has been attempted to make up some coherent conclusions from these data - but the assumptions are not clear, the data are very complex and many aspects of the discussion are speculative. To me, the most interesting element is the reversal of the MOCR behavior in the attended vs ignored conditions. However, ignoring a stimulus is not a passive task! It would have been interesting to also see cortical unattended results.

    Response to Reviewer 3-Comment 2:

    The motivation behind this study arises from controversy in the literature regarding attentional effects at the level of both the cochlea (via the MOCR) and the brainstem. Differences in design between previous studies of attentional effects on CEOAEs have not only prevented direct comparisons among them but have also complicated the interpretation of their results. Most have implemented paradigms with large differences in arousal state [or alertness level (Eysenck, 2012)] and stimulus type between the active auditory task (e.g. speech stimuli presented while CEOAEs are recorded) and passive listening conditions (no task, with CEOAEs recorded in either no-noise or with-noise conditions) (Froehlich et al., 1990; Meric et al., 1994; Srinivasan et al., 2012). Our experimental paradigm addressed these issues in three main ways: 1) using the same stimuli for both active and passive listening conditions; 2) using a controlled visual scene across the experimental sessions; and 3) attempting to control for differences in alertness during the passive condition by asking subjects to watch an engaging cartoon movie. The homogeneity of the visual and auditory scenes across the experiments allowed the effects of attending to the speech on CEOAE magnitude to be disentangled from stimulus-driven effects.

    In addition, it was never assumed that the “passive listening” or “auditory-ignored” condition was a passive task. In this condition, subjects were asked to ignore the auditory stimuli and to watch a non-subtitled, stop-motion movie. To ensure participants’ attention during this condition, they were monitored with a video camera and were asked questions at the end of the session (e.g. What happened in the movie? How many characters were present?) (see Methods section). The aim of a passive or auditory-ignoring condition is to shift attentional resources away from the auditory scene and towards the visual scene. As shown in Figure Supplement 4, all ERP components were also obtained in the passive listening condition, and they are of smaller magnitude than the ERP components observed in the active listening conditions, demonstrating that cortical representation of the speech onset was enhanced in all active listening conditions.

    Reviewer 3-Comment 3:

    1. Overall, I'm struggling with this study that touches upon various concepts and paradigms (efferent feedback, active vs. passive listening, neural representation of listening effort, modeling of efferent signal processing, stream segregation, speech-in-noise coding, peripheral vs cortical representations...) where each of them in isolation already provides a number of challenges and has been discussed controversially. In my view, it would be more valuable to specify and clarify the research question and focus on those paradigms that can help verify or falsify the research hypotheses.

    Response to Reviewer 3-Comment 3:

    In our study, we sought to explore how active listening to degraded speech modulates CEOAE magnitudes (as a proxy for efferent/MOCR effects). Our specific research question was: does auditory attention modulate cochlear gain, via the auditory efferent system, in a task-dependent manner? Our hypothesis was that decreases in speech intelligibility raise auditory attention, and that this reduces cochlear gain (measured using CEOAEs).

    In particular, unlike previously published studies, we assessed auditory changes objectively and subjectively as part of a highly controlled experimental paradigm, maintaining constant performance across three experimental manipulations of speech intelligibility while minimizing the influence of MEMR activation and controlling for homogeneity of both visual and auditory scenes across conditions. We agree with the reviewer that, due to the complexity of our study, each section should be more explicit about its hypothesis and aims. This will be clarified in a future version of this manuscript.

  2. ###Reviewer #3:

    This is an interesting study addressing a very relevant and exciting topic. The study investigates the contribution of auditory subcortical nuclei and the cochleae using physiological recordings while listeners differentiated words in different noisy-speech conditions. It is a valuable approach to consider contiguous measures along the auditory pathway during a single behavioral measurement.

    However, I have several substantial concerns with the design, conceptualization, data analysis and interpretation of the results. I have had challenges to understand the hypotheses and rationale behind this study. A number of experimental paradigms have been employed, including peripheral/brainstem physiological measure, as well as cortical auditory responses during active versus 'passive' listening. Different noise conditions were tested but it is not clear to me what rationale was behind these stimulus choices. The authors claim that "our data comparing active and passive listening conditions highlight a categorical distinction between speech manipulation, a difference between processing a single, but degraded, auditory stream (vocoded speech) and parsing a complex acoustic scene to hear out a stream from multiple competing and spectrally similarly sounds" (lines 401-403). This seems like too much of a mouthful. I cannot see that the data support this pretty broad interpretation.

    Despite maintaining iso-difficulty between vocoded vs speech-in-noise (SIN) conditions, the authors neither address (a) the fundamental differences in understanding vocoded vs. SIN speech nor (b) any theoretical basis for how the noise manifests in vocoded speech. If the tasks are indeed so obviously 'categorically' different - then it should not be surprising they engage different processing (the 'denoising' may not be comparable). I would prefer much more clearly defined and targeted hypotheses and a justification of the specific stimulus and paradigm choices to test such hypotheses. It appears to me that numerous measures have been obtained (reflecting in fact very different processes along the auditory pathway) and then it has been attempted to make up some coherent conclusions from these data - but the assumptions are not clear, the data are very complex and many aspects of the discussion are speculative. To me, the most interesting element is the reversal of the MOCR behavior in the attended vs ignored conditions. However, ignoring a stimulus is not a passive task! It would have been interesting to also see cortical unattended results.

    Overall, I'm struggling with this study that touches upon various concepts and paradigms (efferent feedback, active vs. passive listening, neural representation of listening effort, modeling of efferent signal processing, stream segregation, speech-in-noise coding, peripheral vs cortical representations...) where each of them in isolation already provides a number of challenges and has been discussed controversially. In my view, it would be more valuable to specify and clarify the research question and focus on those paradigms that can help verify or falsify the research hypotheses.

  3. ###Reviewer #2:

    This is a highly ambitious study, combining a great number of physiological measures and behavioral conditions. The stated aim is to investigate the role of the descending auditory system in (degraded) speech perception. Unfortunately, the study was not designed with a clear a priori hypothesis, but instead collected a large number of measures, which were fitted together post-hoc into a particular interpretation, based on a selective subset of the data. Even more problematically, the experimental design is based on a fundamentally flawed premise, which undermines the validity of the interpretation. A final practical problem is that the most important comparison is made between conditions that were measured in separate experiments, with different participants. Given the notoriously poor reproducibility across studies of these measures in this research field (suggesting large inter-individual variations), this casts serious doubt on the interpretability of the observed difference.

    Specific comments:

    1. A core premise of the experiment is that the non-invasive measures recorded in response to click sounds in one ear provide a direct measure of top-down modulation of responses to the speech sounds presented to the opposite ear. This is not acknowledged anywhere in the paper, and is simply not justifiable. The click and speech stimuli in the different ears will activate different frequency ranges and neural sources in the auditory pathway, as will the various noises added to the speech sounds. Furthermore, the click and speech sounds play completely different roles in the task, which makes identical top-down modulation illogical. The situation is further complicated by the fact that the clicks, speech and noise will each elicit MOCR activation in both ipsi- and contralateral ears via different crossed and uncrossed pathways, which implies different MOCR activation in the two ears.

    2. The vocoded conditions were recorded from a different group of participants than the masked speech conditions. Comparing between these two, which forms the essential point in this paper, is therefore highly confounded by inter-individual differences, which we know are substantial for these measures. More generally, the high variability of results in this research field should caution any strong conclusions based on comparing just these two experiments. A more useful approach would have been to perform the exact same task in the two experiments, to examine the reproducibility.

    3. The interpretation presented here is essentially incompatible with the anti-masking model for the MOCR that first started off this field of research, in which the noise response is suppressed more than the signal, which is contradictory to the findings and model presented here, which suggest no role for the MOCR in improving speech in noise perception.

    4. The analysis of measures becomes increasingly selective and lacking in detail as the paper progresses: numerous 'outliers' are removed from the ABR recordings, with very uneven numbers of outliers between conditions. ABRs were averaged across conditions with no explicit justification. The statistical analysis of the ABRs is flawed as it does not compare across conditions (vocoded vs masked) but only within each condition separately (active v passive) - from which no across-condition difference can be inferred. The model simulation includes only 3 out of 9 active conditions. For the cortical responses, again only 3 conditions are discussed, with little apparent relevance.

    5. The assumption that changes in non-invasive measures, which represent a selective, random, mixed and jumbled by-product of underlying physiological processes, can be linked causally to auditory function, i.e. that changes in these responses necessarily have a definable and directional functional correlate in perception, is very tenuous and needs to be treated with much more caution.

  4. ###Reviewer #1:

    This preprint investigates neural mechanisms for processing degraded speech, in particular regarding efferent feedback. The authors thereby study two main types of speech degradations: noise vocoded speech and speech in background noise. Efferent feedback is assessed by recording click-evoked otoacoustic emissions as well as click-evoked brainstem responses, and the measurements are taken when the degraded speech is attended as well as when it is ignored. In addition, the authors also measure cortical responses to speech onsets. They find that these measures are affected by the two types of speech degradation in very different ways. In particular, for the noise vocoded speech, the click-evoked otoacoustic emissions are smaller when the speech is attended than when it is ignored. The opposite behaviour emerges when subjects listen to speech in background noise. The authors rationalise these different mechanisms through a computational model, which, as they show, can exhibit similar properties.

    Unfortunately, many of the obtained results suffer from a lack of proper controls, which renders them rather inconclusive. In addition, important details of the experimental methodology are not properly described.

    1. An important aspect of assessing the efferent feedback through the CEOAEs and ABRs is to ensure that different stimuli have equal intensity. The authors write in the methodology that the speech stimuli were presented at 75 dB SPL. However, it is not stated if this applies to the speech stimuli only, such that the stimuli that include background noise would have a higher intensity, or to the net stimuli. If the intensity of the speech signals alone had been kept at 75 dB SPL while the background noise had been increased, this would render the net signal louder and influence the MOCR. In addition, it would have been better to determine the loudness of the signals according to frequency weighting of the human auditory system, especially regarding the vocoded speech, to ensure equal loudness. If that was not done, how can the authors control for differences in perceived loudness resulting from the different stimuli?

    2. Many of the p-values that show statistical significance are actually near the threshold of 0.05 (such as in the paragraph lines 147-181). This is particularly concerning due to the large number of statistical tests that were carried out. The authors state in the Methods section that they used the Bonferroni correction to account for multiple comparisons. This is in principle adequate, but the authors do not detail what number of multiple comparisons they used for the correction for each of the tests. This should be spelled out, so that the correction for multiple comparisons can be properly verified.

    3. Line 184-203: It is not clear what speech material is being discussed. Is it the noise vocoded speech, the speech in either type of background noise, or these data taken together?

    4. Line 202-203: The authors write that "the ABR data suggest different brain mechanisms are tapped across the different speech manipulations in order to maintain iso-performance levels". It is not clear what evidence supports this conclusion. In particular, from Figure 1D, it appears plausible that the effects seen in the auditory brainstem may be entirely driven by the MOCR effect. To see this, please note that absence of statistical significance does not imply that there is no effect. In particular, although some differences between active and passive listening conditions are non-significant, this may be due to noise, which may mask significant effects. Importantly, where there are significant differences between the active and the passive scenario, they are in the same direction for the different measures (CEOAEs, Wave III, Wave V). Of course, that does not mean that nothing else might happen at the brainstem level, but the evidence for this is lacking.

    5. The way the output from the computational model is analyzed appears to bias the results towards the author's preferred conclusion. In particular, the authors use the correlation between the simulated neural output for a degraded speech signal, say speech in noise, and the neural output to the speech signal in quiet with the efferent feedback activated. They then compute how this correlation changes when the degraded speech signal is processed by the computational model with or without efferent feedback. However, the way the correlation is computed clearly biases the results to favor processing by a model with efferent feedback. The result that the noise-vocoded speech has a higher correlation when processed with the efferent feedback on is therefore entirely expected, and not a revelation of the computational model. More surprising is the observation that, for speech in noise, the correlation value is larger without the efferent feedback. This could be due to the scaling of loudness of the acoustic input (see point 1), but more detail is needed to pin this down. In summary, the computational model unfortunately does not allow for a meaningful conclusion.

    6. The experiment on the ERPs in relation to the speech onsets is not properly controlled. In particular, the different acoustics of the considered speech signals -- speech in quiet, vocoded speech, speech in background noise -- will cause differences in excitation within the cochlea which will then affect every subsequent processing stage, from the brainstem and on to the cortex, thereby leading to different ERPs. As an example, babble noise allows for 'dip listening', while with its flat envelope speech-shaped noise does not. Analyzing differences in the ERPs with the goal of relating these to something different than the purely acoustic differences, such as to attention, would require these acoustic differences to be controlled, which is not the case in the current results.

  5. ##Preprint Review

    This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 1 of the manuscript.

    ###Summary:

    The authors address a very important and timely research question, namely whether, and if so, how, efferent feedback contributes to the neural processing of degraded speech. However, the reviewers have identified significant problems with the experimental design and the data analysis, as well as with the conceptualization and the interpretation of the findings.