Auditory detection is modulated by theta phase of silent lip movements
Curation statements for this article:
Curated by eLife
Summary: The reviewers agreed that the paradigm proposed in this work is elegant, and the question timely and important. However, as detailed below, they highlighted several concerns about analysis choices and the interpretation of the data. While some of these can be addressed, it was felt that a major drawback of the present manuscript is that the behaviour and EEG are obtained separately and any links are hence only circumstantial.
This article has been Reviewed by the following groups
Abstract
Audiovisual speech perception relies, among other things, on our expertise to map a speaker’s lip movements with speech sounds. This multimodal matching is facilitated by salient syllable features that align lip movements and acoustic envelope signals in the 4 - 8 Hz theta band. Although non-exclusive, the predominance of theta rhythms in speech processing has been firmly established by studies showing that neural oscillations track the acoustic envelope in the primary auditory cortex. Equivalently, theta oscillations in the visual cortex entrain to lip movements, and the auditory cortex is recruited during silent speech perception. These findings suggest that neuronal theta oscillations may play a functional role in organising information flow across visual and auditory sensory areas. We presented silent speech movies while participants performed a pure tone detection task to test whether entrainment to lip movements directs the auditory system and drives behavioural outcomes. We showed that auditory detection varied depending on the ongoing theta phase conveyed by lip movements in the movies. In a complementary experiment presenting the same movies while recording participants’ electro-encephalogram (EEG), we found that silent lip movements entrained neural oscillations in the visual and auditory cortices with the visual phase leading the auditory phase. These results support the idea that the visual cortex entrained by lip movements filtered the sensitivity of the auditory cortex via theta phase synchronisation.
Article activity feed
-
Reviewer #3:
In the paper titled "Auditory detection is modulated by theta phase of silent lip movements", the authors investigate visual entrainment to lip movements using behavioral (Exp. 1) and non-invasive physiological (EEG; Exp. 2) measures.
In the first experiment participants engage in the detection of a brief tone embedded in noise. Critically, the tone appears whilst subjects are viewing a silent movie clip. Tones are timed with respect to the phase of the theta rhythm prevalent in the lip action trajectory (and its relation to the original audio track). Each trial includes 0, 1 or 2 tones and subjects provide a speeded response when a tone is detected. Tones are also presented either during the first half of the clip or the second half (or both, or neither). This latter timing parameter is designed to probe the possibility of an increasing degree of entrainment to visual lip movement as the clip evolves. In the second experiment, the findings of Experiment 1 are complemented by an analysis of visual entrainment and its impact on auditory sources, using EEG and source estimation on data obtained while observers viewed the same silent movie clips passively. The paper is well written, the premise is clear, and the findings are interesting and timely. In what follows I outline some questions and concerns that come to mind when assessing the validity of the interpretation of the findings. These span the experimental and stimulus design as well as the analysis choices made.
The behavioral procedure suggests that the tones were pseudo-randomly positioned with respect to the quantified theta phase of the lip movement. It would be interesting to understand whether any care was taken to exhaustively sample different phases of the theta rhythm of interest in the lip movement. It might be important, therefore, to demonstrate that phases were equivalently sampled by chance in the first- and second-half trials and over the different clips. An inset in Figure 1 would make a good spot to present descriptive statistics of target positioning (as a function of phase).
Second, and somewhat related: wouldn't it make more sense to quantify accuracy based on phase bins? This way no division into subpopulations would be required, since each individual could be aligned to their best phase. The methods leave it somewhat unclear whether this was a possibility in terms of the stimulus design (i.e., were enough phases sampled in the stimulus/tone timing; see previous point).
In addition, the subject-mean phase of correctly detected targets provides little insight into the periodic nature of performance. Analyzing whether there is a periodic modulation of the pattern of responses over phase would provide richer, more nuanced evidence for the claims.
It would be important and interesting to learn whether the first and second halves of the trial have the same theta-band MI profile between lip movement and audio track. Currently, the characterization of MI was done on the whole movie clips. This is crucial for the interpretation of both Experiment 1 and Experiment 2.
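The per-half comparison suggested here could be sketched as follows. This is a minimal, hypothetical illustration: the signals are simulated, the histogram-based MI estimator and all parameter choices (bin count, sampling rate, theta frequency) are arbitrary assumptions, not the authors' pipeline.

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Histogram-based mutual information (in bits) between two 1-D signals."""
    c_xy, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = c_xy / c_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x * p_y)[nz])))

# Simulated 4-s clip at 100 Hz: lip aperture and audio envelope share a 6 Hz rhythm.
rng = np.random.default_rng(0)
fs, dur = 100, 4.0
t = np.arange(int(fs * dur)) / fs
lip = np.sin(2 * np.pi * 6 * t) + 0.2 * rng.standard_normal(t.size)
env = np.sin(2 * np.pi * 6 * t - 0.5) + 0.2 * rng.standard_normal(t.size)

# MI computed separately for each half of the clip, plus a shuffled baseline.
half = t.size // 2
mi_first = mutual_info(lip[:half], env[:half])
mi_second = mutual_info(lip[half:], env[half:])
mi_shuffled = mutual_info(lip, rng.permutation(env))  # chance-level reference
```

Comparing `mi_first` and `mi_second` against the shuffled baseline (and against each other, per clip) would directly answer whether the audiovisual coupling differs between halves.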
The distinction between the first and second half -- indicating that entrainment takes time to build up -- is somewhat overstated in the context of this paper, seeing that the literature suggests that entrainment is fully established by 0.5 s (among others, the authors themselves say so in the TiNS piece). Other processes, such as calibration to a given speaker, might take longer, and those might justify (or account for?) the result showing that early vs. late targets differ in the degree to which the phase of the lip action affects performance.
Important details over the stimuli need to be clarified:
Did every clip introduce a new speaker to the subject? If so, does time on clip also amount to degree of familiarity with the speaker?
Did each clip have the same degree of MI between audio and lip movement, or were some clips better (more pronounced) than others when considering the link between lip movement and audio? Would it make sense to add these measures as covariates in the analysis?
Is the same target timing used for the same clip for all subjects? Or are the tones truly randomly placed and matched onto clips, such that a given clip could appear with tones at different times for different subjects?
At the risk of somewhat repeating point #2 above -- within the analysis the following should be considered:
- The authors establish that in the second half there are, in fact, two subpopulations in the sample. Wouldn't this post hoc grouping factor, which isn't obviously motivated, be better described by properly delineating performance as a function of phase? I can readily understand that the authors might not have a clear hypothesis about what the better phase for performing on an irrelevant tone probe might be. Nonetheless, if a periodic process is entraining performance, then once a best phase is identified, adjacent phase bins should demonstrate this circular relationship. This would allow for a direct quantification of ALL data together after aligning performance to the best phase bin, per subject.
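The per-subject realignment proposed above could be sketched like this; the data are simulated, and the bin count, subject count, and effect size are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)
N_BINS = 8
EDGES = np.linspace(-np.pi, np.pi, N_BINS + 1)

def accuracy_by_phase(phases, hits):
    """Hit rate in each of N_BINS phase bins."""
    idx = np.clip(np.digitize(phases, EDGES) - 1, 0, N_BINS - 1)
    return np.array([hits[idx == b].mean() for b in range(N_BINS)])

def align_to_best(curve):
    """Circularly shift a subject's curve so their best bin sits at the centre."""
    return np.roll(curve, N_BINS // 2 - int(np.argmax(curve)))

aligned = []
for _ in range(20):                      # 20 simulated subjects
    pref = rng.uniform(-np.pi, np.pi)    # each with their own preferred phase
    phases = rng.uniform(-np.pi, np.pi, 400)
    hits = rng.random(400) < 0.5 + 0.3 * np.cos(phases - pref)
    aligned.append(align_to_best(accuracy_by_phase(phases, hits)))

# Grand average across subjects after realignment.
grand = np.mean(aligned, axis=0)
```

Note the selection-bias caveat: aligning to the empirically best bin inflates that bin by construction, so the diagnostic evidence for periodicity is the smooth circular fall-off in the adjacent bins, not the height of the peak itself.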
Finally, the following points pertain mostly to the contextualization of this work and the discussion:
While the authors discuss at least two mechanisms for how entrainment effects grow by the second part of the clip, it would be nice to relate the concrete reading of this effect to cognitive processes that may evolve within these timescales. In other words, learning that tracking takes 0.5 s, or that visual inputs to frontal cortex take a given time scale to exert impact on auditory sensory regions, is just another description of the finding. What might these time scales buy me as a speaker and as a listener? What processes might be reflected by arriving at these states of synchrony and top-down control for speech comprehension?
The post hoc description of the subpopulations' preferred phases is interesting and could relate to the entrainment literature (from Spaak 2014 in vision through Hickok 2015 in audition, and others). Might the authors speculate on what part of speech is characterized by one phase vs. another?
An additional point related to the authors' conjecture in the discussion of this topic: there are recent papers by Assaneo et al. (Poeppel as PI; Nat Neurosci, 2019) that show bimodal behavior in a spontaneous synchronization task (motor to auditory), which was found to be related to morphological differences in frontal-to-auditory white-matter pathways, functional differences, AND better learning in a statistical learning paradigm. How do the two sets of bimodal populations interact? The authors' discussion of the motor cortex suggests they would.
Methods section:
The paper by and large is well written. An exception to this is the methods section. Currently, the methods do not comply with best practices that would render the work reproducible by others.
-
Reviewer #2:
This study performs behavioral assessment of the impact of watching lip movements on tone detection in noise and EEG recordings from passive observers of the same movies. The basic paradigm is that listeners watch a silent movie of lip movements (selected to be at ~theta rate) while listening for tone bursts that occur most commonly twice in a trial (early and late). The key findings are that perceptual sensitivity is higher when tones are in the second half of the trial, when hits align at a particular phase angle of the visual stimuli. Brain signals were also observed to entrain through the course of the trial. The authors conclude that visual modulation of auditory excitability explains these effects.
The stimulus design is elegant, and the findings, if taken at face value, are a nice demonstration that visual stimuli can modulate auditory perception in a temporally specific manner. However, I have concerns with the interpretation of the data, while also feeling to some extent that these findings are expected: stimulating AC with a speech envelope modulates speech perception (Wilsch et al., 2018), silent speech modulates human auditory cortex (Calvert 1999), and visual stimuli modulated at theta rates directly entrain auditory cortical phase in animals (Atilgan et al., 2018), as do audiovisual speech stimuli in humans (Zion-Golumbic et al., 2013). This study is a further piece of evidence along these lines, but it's hard to be certain of a causal relationship when the behaviour and neurophysiology are in different listeners. I also have some concerns about the current interpretation, some of which are addressable with additional analysis.
I'm not convinced that the authors have sufficiently ruled out the possibility that the first tone causes a phase reset in AC that causes detected second tones to be entrained to a particular stimulus phase. In theory this should be easily addressed by looking at the 1-tone trials where the tone is in the second half of the stimulus. These data are in the supplemental material but are not particularly reassuring: while the d' is higher for the second tone, the phase angles are uniformly distributed across participants, in contrast to the clustering observed in the 2-tone data. This finding calls into question the causal link between the phase relationship and performance. The authors note that there are relatively few trials (50% of those available in the 2-tone data); the contribution that this plays could be addressed by subsampling half the trials from the 2-tone dataset and re-estimating the phase modulation, to establish whether the single-tone condition is any different. Another analysis that could be enlightening/reassuring would be to compute the phase of the hits to tone 2 relative to the onset of tone 1, using the modulation rate of the clip (or 6 Hz, if clips were selected to be that anyway).
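The subsampling control suggested here could be sketched as follows, using the mean resultant vector length as the phase-clustering statistic. Simulated von Mises phases stand in for the real 2-tone hit phases, and trial counts and the concentration parameter are illustrative assumptions.

```python
import numpy as np

def resultant_length(phases):
    """Mean resultant vector length R in [0, 1]; higher = tighter phase clustering."""
    return float(np.abs(np.mean(np.exp(1j * phases))))

rng = np.random.default_rng(2)
two_tone_phases = rng.vonmises(mu=0.0, kappa=2.0, size=400)  # clustered hit phases
r_full = resultant_length(two_tone_phases)

# Repeatedly subsample half the trials (matching the 1-tone trial count)
# to see how much R can vary from reduced trial count alone.
r_sub = np.array([
    resultant_length(rng.choice(two_tone_phases, size=200, replace=False))
    for _ in range(1000)
])
lo, hi = np.percentile(r_sub, [2.5, 97.5])
```

If the observed 1-tone R falls below `lo`, the weaker clustering in the 1-tone condition cannot be explained by trial count alone, which would strengthen the phase-reset concern raised above.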
I would like to see the distribution of the tones with respect to the phase of the lip movement (all tones, not just hits), to be reassured that there is nothing inherent in the movies that causes the phase alignment.
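A uniformity check of this kind could be sketched with a Rayleigh test, contrasting the phases of all presented tones against the phases of hits only. The data below are simulated, and the closed-form p-value approximation is the standard one for the Rayleigh statistic.

```python
import numpy as np

def rayleigh_test(phases):
    """Rayleigh test for non-uniformity of circular data.
    Returns (R, p): resultant length and approximate p-value."""
    n = phases.size
    r = np.abs(np.mean(np.exp(1j * phases)))
    z = n * r**2
    # Standard finite-n approximation (e.g. Zar, Biostatistical Analysis).
    p = np.exp(-z) * (1 + (2*z - z**2) / (4*n)
                      - (24*z - 132*z**2 + 76*z**3 - 9*z**4) / (288*n**2))
    return float(r), float(min(max(p, 0.0), 1.0))

rng = np.random.default_rng(3)
all_tone_phases = rng.uniform(-np.pi, np.pi, 1000)  # ALL tones: should look uniform
hit_phases = rng.vonmises(0.0, 3.0, 300)            # hits only: clustered

r_all, p_all = rayleigh_test(all_tone_phases)
r_hit, p_hit = rayleigh_test(hit_phases)
```

A non-significant result for `all_tone_phases` alongside a significant one for `hit_phases` would rule out the worry that tone placement itself, rather than perception, drives the apparent phase alignment.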
The neurophysiology does not demonstrate a significant increase in entrainment from early to late windows, only that there is a different phase angle. Doesn't this also call into question the conclusion that performance is better in the second half due to better entrainment? While the phase in the second half might be 'more efficient', if the entrainment is equivalent shouldn't there be a behavioural relationship in both cases? This is where performing both behaviour and EEG simultaneously (or at least in the same listeners) may have proved enlightening.
-
Reviewer #1:
In this manuscript, the authors report on two separate experiments designed to understand the relationship between lip-movement induced theta phase and auditory processing. In the first experiment, subjects detected tones embedded in noise while viewing silent videos. The results demonstrate that tone detection performance improved when tones are presented later relative to earlier in a trial. It was also demonstrated that correct detection, for tones that occurred later in the trial, was systematically linked with the phase of the theta oscillatory activity conveyed by the lip movements. In the second experiment EEG was recorded while participants viewed the silent videos and performed an emotion judgement task. Theta phase coupling was demonstrated between auditory and visual areas such that oscillations in the visual cortex preceded those in the auditory cortex.
The authors conclude that these results demonstrate that lip movements directly affect the excitability of the auditory cortex. However, due to the indirect nature of the reported effects, I do not believe this conclusion is justified. I elaborate on this concern below:
In experiment 1, the main finding that performance is better later in the trial could arise from many factors including non-specific attentional effects.
The analysis reported at the bottom of page 5 (comparing vector lengths for hits vs. misses) is critical to the argument, but the results are inconclusive (a significant interaction, but subsequent comparisons not quite significant -- likely because the experiment is underpowered?).
In Experiment 2, the task performed by the listeners might have biased them towards speech imagery, leading to the pattern of effects observed. Indeed, the observed involvement of the left hemisphere may be consistent with the involvement of speech imagery. This would render the observed link between visual and auditory cortices somewhat trivial and not new (such links have been reported in many previous studies, as acknowledged by the authors).
Most importantly, the authors do not provide any direct evidence that the auditory effects observed in Experiment 2 are related to those observed in experiment 1.
Other comments:
For the analyses in Figure 2A, were the number of trials over which the analysis is conducted adjusted for "first tone" vs "second tone"? Since the hit rate is higher for the second tone, there may be a concern that including more trials in the analysis would result in better SNR and hence a more robust effect.
In Experiment 2 the analysis is focused on phase effects. Can you report whether there are any power differences in the delta band in the "early" vs "later" time windows?
Line 176, the authors write "these results established that entrainment of theta lip activity increased in time". It is not clear to me which aspect of the results supports this statement.
Line 405: "any lag between visual and auditory stimuli onsets was later compensated...". I could not find mention of this elsewhere (i.e. how lags were compensated, how large they were). This is critical for interpreting the results and therefore should be described in detail.
Lines 430-437: why did you choose to quantify the envelope in this way rather than just taking the wide-band envelope?
Figure S3 is important and should be in the main text.
Line 473 "auditory pure tones"
The description in lines 478-481 doesn't make sense. It is unclear how loudness reported in line 480 (91dB SPL; incidentally this is very loud) relates to the later reported value of 72dB SPL.
Line 485 "embedded"
Please clarify whether in your loudness adjustment procedure you were adjusting the loudness of the tone, the noise or the SNR (and thus keeping the overall loudness of the stimulus fixed)
Line 537 "preceding"