Independent mechanisms of temporal and linguistic cue correspondence benefiting audiovisual speech processing

Abstract

No abstract available

Article activity feed

  1. Reviewer #2:

    In this paper, Fiscella and colleagues report the results of behavioral experiments on auditory perception in healthy participants. The paper is clearly written, and the stimulus manipulations are well thought out and executed.

    In the first experiment, audiovisual speech perception was examined in 15 participants. Participants identified keywords in English sentences while viewing faces that were either dynamic or still, and either upright or rotated. To make the task more difficult, two irrelevant masking streams (one audiobook with a male talker, one audiobook with a female talker) were added to the auditory speech at different signal-to-noise ratios for a total of three simultaneous speech streams.

    The results of the first experiment were that both the visual face and the auditory voice influenced accuracy. Seeing the moving face of the talker resulted in higher accuracy than seeing a static face, and an upright moving face was better than a 90-degree rotated face, which in turn was better than an inverted moving face. In the auditory domain, performance was better when the masking streams were presented at lower levels (i.e., at higher signal-to-noise ratios).

    In the second experiment, 23 participants identified pitch modulations in auditory speech. The task was considerably more complicated than in the first experiment. First, participants had to learn an association between visual faces and auditory voices. Then, on each trial, they were presented with a static face that cued which auditory voice to attend to. Next, both target and distracter voices were presented, and participants searched for pitch modulations only in the target voice. At the same time, audiobook masking streams were presented, for a total of four simultaneous speech streams. In addition, participants performed a concurrent visual task, searching for a pink dot on the mouth of the visually presented face. The visual face matched either the target voice or the distracter voice, and the face was either upright or inverted.

    The result of the second experiment was that participants were somewhat more accurate (by 7%) at identifying pitch modulations when the visual face matched the target voice than when it did not.

    As I understand it, the main claim of the manuscript is as follows: For sentence comprehension in Experiment 1, both face matching (measured as the contrast of dynamic face vs. static face) and face rotation were influential. For pitch modulation in Experiment 2, only face matching (measured as the contrast of target-stream vs. distracter-stream face) was influential. This claim is summarized in the abstract as "Although we replicated previous findings that temporal coherence induces binding, there was no evidence for a role of linguistic cues in binding. Our results suggest that temporal cues improve speech processing through binding and linguistic cues benefit listeners through late integration."

    The claim for Experiment 2 is that face rotation was not influential. However, the authors provide no evidence to support this assertion, other than visual inspection (page 15, line 235): "However, there was no difference in the benefit due to the target face between the upright and inverted condition, and therefore no benefit of the upright face (Figure 2C)."

    In fact, the data provided suggest that the opposite may be true, as the improvement for upright faces (t = 6.6) was larger than the improvement for inverted faces (t = 3.9). An appropriate analysis to test this assertion would be to construct a linear mixed-effects model with fixed factors of face inversion and face matching, and then examine the interaction between these factors.
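    As a sketch, such a model could be fit in Python with statsmodels; the file name and column names below are hypothetical placeholders, not taken from the manuscript:

    ```python
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical trial-level data: one row per trial with columns
    #   subject  - participant ID (grouping factor for random intercepts)
    #   inverted - 0 = upright face, 1 = inverted face
    #   matched  - 0 = face matches the distracter voice, 1 = the target voice
    #   correct  - 1 = pitch modulation correctly identified, 0 otherwise
    trials = pd.read_csv("exp2_trials.csv")

    # Linear mixed-effects model with a by-subject random intercept.
    # The inverted:matched interaction coefficient tests whether the
    # face-matching benefit differs between upright and inverted faces.
    model = smf.mixedlm("correct ~ inverted * matched",
                        data=trials, groups=trials["subject"])
    result = model.fit()
    print(result.summary())
    ```

    (For binary accuracy data a logistic mixed model would arguably be more appropriate, but the linear version above matches the analysis suggested here.)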

    However, even if this analysis were conducted and the interaction were non-significant, that would not necessarily be strong support for the claim. As the adage has it, "absence of evidence is not evidence of absence". The problem here is that the effect is rather small (7% for face matching). Finding significant differences due to face inversion within the range of the 7% face-matching effect is difficult, but would likely be possible given a larger sample size, assuming that the effect size found with the current sample holds (t = 6.6 vs. t = 3.9).

    In contrast, in Experiment 1 the range is very large (improvement from ~40% for the static face to ~90% for the dynamic face), making it much easier to find a significant effect of inversion.

    One null model would be to assume that the proportional decrease in accuracy due to inversion is similar for speech perception and pitch-modulation detection (within the face-matching effect) and to predict the difference. In Experiment 1, inverting the face at 0 dB reduced accuracy from ~90% to ~80%, a proportional decrease of roughly 10%. Applying this to the 7% face-matching benefit found in Experiment 2 predicts a benefit of ~6.3% for inverted faces vs. 7% for upright faces. The authors could perform a power calculation to determine the sample size necessary to detect an effect of this magnitude.
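    As a sketch of such a power calculation, treating the interaction as a paired, within-subject contrast and assuming (as a placeholder, since the manuscript does not report it) a 5-percentage-point standard deviation for the per-subject difference in benefits:

    ```python
    from statsmodels.stats.power import TTestPower

    # Null-model prediction: face-matching benefit of ~7.0% (upright)
    # vs. ~6.3% (inverted), i.e. a 0.7-percentage-point interaction effect.
    # The 5-point within-subject SD is an assumed placeholder value.
    effect_size = 0.7 / 5.0  # Cohen's d for a paired (one-sample) contrast

    n = TTestPower().solve_power(effect_size=effect_size,
                                 alpha=0.05, power=0.8,
                                 alternative="two-sided")
    print(f"Participants needed: {n:.0f}")  # ~400 under these assumptions
    ```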

    Other Comments

    When reporting the results of linear mixed-effects models or other regression models, it is important to report the magnitude of each effect, measured as the actual values of the model coefficients. This allows readers to understand the relative amplitude of different factors on a common scale. For Experiment 1, the only values provided are measures of statistical significance, which are not good measures of effect size.

    The duration of the pitch modulations in Experiment 2 is not clear. It would help the reader to provide a supplemental figure showing the speech envelopes of the four simultaneous speech streams along with the location and duration of the pitch modulations in the target and distracter streams.
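    A minimal sketch of such a figure, assuming the four streams are available as WAV files (the file names are hypothetical) and using the Hilbert transform to extract each broadband envelope:

    ```python
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import hilbert

    # Hypothetical file names for the four simultaneous streams.
    streams = ["target.wav", "distracter.wav",
               "masker_male.wav", "masker_female.wav"]

    fig, axes = plt.subplots(len(streams), 1, sharex=True, figsize=(8, 6))
    for ax, name in zip(axes, streams):
        rate, audio = wavfile.read(name)
        if audio.ndim > 1:                     # mix stereo down to mono
            audio = audio.mean(axis=1)
        envelope = np.abs(hilbert(audio.astype(float)))  # amplitude envelope
        t = np.arange(len(audio)) / rate
        ax.plot(t, envelope, linewidth=0.5)
        ax.set_ylabel(name, fontsize=8)
    axes[-1].set_xlabel("Time (s)")
    # Pitch-modulation onsets/durations would be overlaid, e.g. with ax.axvspan().
    fig.savefig("stream_envelopes.png", dpi=200)
    ```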

    If the pitch modulations were brief, it should be possible to calculate reaction time as an additional dependent measure. If the pitch modulations in the target and distracter streams occurred at different times, this would also allow more accurate categorization of the responses as correct or incorrect by defining a response window. For instance, if a pitch modulation occurred in both streams and the participant responded "yes", the timing of the modulations and of the response could dissociate a false positive to the distracter-stream modulation from a response to the target-stream modulation.
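    A sketch of such a scoring rule, with illustrative window bounds that are assumptions rather than values from the manuscript:

    ```python
    def score_response(press_time, target_onsets, distracter_onsets,
                       window=(0.2, 1.2)):
        """Attribute a 'yes' response to the modulation it plausibly follows.

        press_time        - response time (s) from trial onset
        target_onsets     - onsets (s) of modulations in the target stream
        distracter_onsets - onsets (s) of modulations in the distracter stream
        window            - assumed (min, max) lag between modulation and press
        """
        def in_window(onsets):
            return any(window[0] <= press_time - t <= window[1] for t in onsets)

        if in_window(target_onsets):
            return "hit"                     # attributable to the target stream
        if in_window(distracter_onsets):
            return "distracter_false_alarm"  # captured by the distracter stream
        return "other_false_alarm"           # no nearby modulation in either stream

    # Example: a press 0.7 s after a target modulation counts as a hit.
    print(score_response(3.1, target_onsets=[2.4], distracter_onsets=[1.0]))
    ```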

    It is not clear from the Methods, but it seems that the results shown are only for trials in which a single distracter was presented in the target stream. A standard analysis would be to use signal detection theory to examine response patterns across all of the different conditions.
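    For instance, per-condition hit and false-alarm counts could be converted to d' and criterion; a minimal sketch, using a log-linear correction so that rates of 0 or 1 do not produce infinite z-scores:

    ```python
    from scipy.stats import norm

    def sdt_measures(hits, misses, false_alarms, correct_rejections):
        """Return (d', criterion) from trial counts in one condition."""
        # Log-linear correction: add 0.5 to each count (1 to each total).
        hit_rate = (hits + 0.5) / (hits + misses + 1)
        fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
        d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)
        criterion = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))
        return d_prime, criterion

    # Example with made-up counts:
    print(sdt_measures(hits=40, misses=10, false_alarms=5, correct_rejections=45))
    ```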

    In selective attention experiments, the stimulus is usually identical between conditions while only the task instructions vary. Here, the stimulus and the task both differ between Experiments 1 and 2, making it difficult to claim that "linguistic" vs. "temporal" processing is the only difference between the experiments.

    At a more conceptual level, it seems problematic to assume that inverting the face dissociates linguistic from temporal processing. For instance, a computer face-recognition algorithm whose only job was to measure the timing of mouth movements (temporal processing) might operate by first identifying the face from the eyes, nose, and mouth in vertical order. Inverting the face would disrupt the algorithm, and hence "temporal processing", invalidating the assumption that face inversion is a pure manipulation of "linguistic processing".

  2. Reviewer #1:

    Using two behavioral experiments, the authors partially replicate the known effect that rotated faces decrease the benefit of visual speech for auditory speech processing.

    As reported by the authors, Experiment 1 suffers from a design flaw: a temporal drift occurred over the course of the experiment. This undermines the reliability of the results, and the experiment should be properly calibrated and redone. There is, moreover, a well-known literature on the topic.

    Experiment 2 should be discussed in the context of previously reported divided-attention tasks, so as to clarify how and whether this is a novel observation.

    Additionally:

    -The question being addressed is narrow and ill-construed: numerous authoritative statements in the introduction should reference existing work. For instance, seminal Bayesian models of perception (and of audiovisual speech processing in particular) should be attributed to Dominic Massaro. Statements such as "studies fail to distinguish between binding and late integration" are surprising, considering that the fields of multisensory integration and audiovisual speech processing have traditionally centered on exactly these issues. To name a few researchers in the audiovisual speech domain: the work of Ruth Campbell, Ken Grant, and Jean-Luc Schwartz has contributed substantially to refining debates on the role of attentional resources in audiovisual speech processing, using behavioral, neuropsychological, and neuroimaging methods. In light of additional statements of the kind "The importance of temporal coherence for binding has not previously been established for speech", I would highly recommend that the authors conduct a thorough literature search on their topic (some possible references are listed below as a start).

    -What the authors understand by "linguistic cues" should be better defined. For instance, the inverted-face manipulation was originally aimed at dissociating whether visemic processing depends on face recognition (i.e., on holistic processing) or on featural processing; it does not constitute, as the authors suggest, a test of whether viseme recognition is a linguistic process per se.

    Some references:

    -Alsius, A., Möttönen, R., Sams, M. E., Soto-Faraco, S., & Tiippana, K. (2014). Effect of attentional load on audiovisual speech perception: evidence from ERPs. Frontiers in Psychology, 5, 727.

    -Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A., & Ghazanfar, A. A. (2009). The natural statistics of audiovisual speech. PLoS Computational Biology, 5(7), e1000436.

    -Jordan, T. R., & Bevan, K. (1997). Seeing and hearing rotated faces: Influences of facial orientation on visual and audiovisual speech recognition. Journal of Experimental Psychology: Human Perception and Performance, 23(2), 388.

    -Grant, K. W., & Seitz, P. F. (2000). The use of visible speech cues for improving auditory detection of spoken sentences. The Journal of the Acoustical Society of America, 108(3), 1197-1208.

    -Grant, K. W., Van Wassenhove, V., & Poeppel, D. (2004). Detection of auditory (cross-spectral) and auditory-visual (cross-modal) synchrony. Speech Communication, 44(1-4), 43-53.

    -Schwartz, J. L., Berthommier, F., & Savariaux, C. (2002). Audio-visual scene analysis: evidence for a "very-early" integration process in audio-visual speech perception. In Seventh International Conference on Spoken Language Processing.

    -Schwartz, J. L., Berthommier, F., & Savariaux, C. (2004). Seeing to hear better: evidence for early audio-visual interactions in speech identification. Cognition, 93(2), B69-B78.

    -Tiippana, K., Andersen, T. S., & Sams, M. (2004). Visual attention modulates audiovisual speech perception. European Journal of Cognitive Psychology, 16(3), 457-472.

    -van Wassenhove, V. (2013). Speech through ears and eyes: interfacing the senses with the supramodal brain. Frontiers in Psychology, 4, 388.

    -van Wassenhove, V., Grant, K. W., & Poeppel, D. (2007). Temporal window of integration in auditory-visual speech perception. Neuropsychologia, 45(3), 598-607.

  3. Summary: Seeing a speaker's face enhances speech comprehension. This fascinating observation has nourished decades of research, yet the behavioral and neural underpinnings of audiovisual speech integration remain to be elucidated.

    In this study, the authors suggest that speech accuracy is influenced by seeing the talker's real face (moving and upright faces being better than static and rotated or inverted faces, respectively) and that speech comprehension may benefit more from matching voices and faces. Both reviewers noted that the work presents insufficient conceptual framing and that the manuscript needs a better review of the existing literature to situate the study. Several methodological and statistical concerns were also raised, the majority of which are detailed by Reviewer #2.