An oscillating computational model can track pseudo-rhythmic speech by using linguistic predictions

Sanne ten Oever
Andrea E Martin

Curated by eLife

Evaluation Summary:

The topic is highly interesting and provides new insights into the ongoing debate about the role of oscillations and predictability in speech recognition. The manuscript is of broad interest to readers in the field of speech recognition and neuronal oscillations. In particular, the authors provide a computational model that, in addition to feedforward acoustic input, incorporates linguistic predictions as feedback, allowing a fixed oscillator to process non-isochronous speech. The model is tested extensively by applying it to a linguistic corpus, EEG and behavioral data. It explains variations in speech duration based on linguistic predictability, as well as the recently reported phase dependency of speech perception, supporting the authors' claims. The reviewers agreed that this study provides new insights into the current debate about the role of neural oscillations and top-down predictability in speech recognition, and that it represents an important contribution to the field of language neurobiology. Although they thought that the results support the authors' conclusions, the reviewers each raised a number of questions about the modelling and stated that greater clarity is needed in describing it.

(This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1, Reviewer #2 and Reviewer #3 agreed to share their names with the authors.)

Abstract

Neuronal oscillations putatively track speech in order to optimize sensory processing. However, it is unclear how isochronous brain oscillations can track pseudo-rhythmic speech input. Here we propose that oscillations can track pseudo-rhythmic speech when considering that speech time is dependent on content-based predictions flowing from internal language models. We show that temporal dynamics of speech are dependent on the predictability of words in a sentence. A computational model including oscillations, feedback, and inhibition is able to track pseudo-rhythmic speech input. As the model processes, it generates temporal phase codes, which are a candidate mechanism for carrying information forward in time. The model is optimally sensitive to the natural temporal speech dynamics and can explain empirical data on temporal speech illusions. Our results suggest that speech tracking does not have to rely only on the acoustics but could also exploit ongoing interactions between oscillations and constraints flowing from internal language models.

Version published to 10.7554/elife.68066 on eLife
Aug 2, 2021
eLife
May 28, 2021

Reviewer #1 (Public Review):

In this study, the authors investigate the relationship between temporal predictability and linguistic predictability during speech listening.
They first show that language statistics and linguistic predictability influence the duration of linguistic units (syllables and words) in naturally spoken speech. Specifically, they provide evidence that words with higher word frequency have shorter durations, and that linguistic units are lengthened when the next linguistic unit is not strongly predictable.
Then, they test a new computational model (STiMCON), which suggests that brain-speech tracking reflects a mechanism that is not only following the acoustics of speech, but also generates a temporal phase code that is sensitive to linguistic constraints.

The manuscript addresses a very relevant and timely question: the origins of the pseudo-rhythmicity of speech, and the role of neural oscillations in the tracking of this pseudo-rhythmic input. The premise behind this manuscript is important for neuroscientists interested in the neurobiology of language. The first results show that pseudo-rhythmicity in speech is in part a consequence of top-down predictions flowing from an internal model of language, which I think is in itself a very exciting finding. The proposed STiMCON model is interesting, but some more information is needed in the manuscript to fully understand it.

Reviewer #2 (Public Review):

Ten Oever and Martin investigate whether including linguistic predictability in an oscillatory model of speech segmentation can explain how neuronal oscillations can process quasi-periodic speech. First, they analyze data from a speech corpus to show that word frequency, number of syllables and number of characters relate to syllable duration. They show that linguistic predictability affects the onset timing of words. More specifically, highly predictable words (based on a corpus-pretrained RNN) had a shorter word onset time. They introduce a model (STiMCON) that includes three layers, oscillations, connectivity between the layers, and inhibition. With the model they show how linguistic predictability can align neuronal excitability to process quasi-rhythmic speech in an oscillator model. They first run STiMCON on several speech materials (short constructed sentences, word pairs, corpus material from the CGN) and then use it to model a previously published behavioral and EEG experiment (Ten Oever & Sack, 2015). Here they also compare their model with other models that have fewer components, and the comparison suggests that the oscillation and the feedback were crucial components.

The topic is highly interesting and provides new insights into the current debate about the role of oscillations and top-down predictability in speech recognition. The authors provide extensive material. The manuscript is well written and the model is sophisticated. However, given the complexity and density of the manuscript, more clarity in describing the modeling would be useful. This includes adding methodological details and justifying choices (particularly for the model application to the EEG study).

The results support the authors' conclusions. However, I have several questions/comments:

(1) In the model the predictability is computed at the word level, while the oscillator operates at the syllable level. The authors show different duration effects for syllables within words, likely related to predictability. Is there any consequence of this mismatch of scales?

(2) Furthermore, could the authors clarify whether and how they think the model mechanism differs from top-down phase reset (e.g. l. 41)? It seems that the excitability cycle at the intermediate word level is shifted away from alignment with the 4 Hz oscillator through the linguistic feedback from layer l+1. Would that indicate a phase reset at the word-level layer through the feedback?

(3) The model shows how linguistic predictability can affect neuronal excitability in an oscillatory model, thereby improving the processing of non-isochronous speech. I do not fully understand the claim that linguistic predictability makes the processing (at the word level) more isochronous, and why such isochronicity is crucial.

(4) The authors showed that word frequency affects the duration of a word. The RNN model, however, relates the predictability of a word (output) to the duration of the previous word W-1 (l. 187). Wouldn't one expect from Fig. 1B that the duration of the predicted word itself is affected? How are these two effects related?

Reviewer #3 (Public Review):

The authors have presented a unique perspective on the nature of oscillatory tracking of speech perception. A critical question in this field is how such a mechanism could handle natural deviations from rhythmicity in speech. While some researchers have sought to show the flexible and dynamic nature of oscillations, the authors here propose that even a sinusoidal mechanism can handle pseudo-rhythmicity so long as it is driven by linguistic constraints. To that end, they show first that (at least in Dutch) the length of words is modulated by the linguistic predictability of the word. Then, they build a simple model which combines a sine-wave oscillation (fixed amplitude and phase) with these linguistic predictions to generate a phase code of both perception and predictability. The authors show that this model can explain variation in timing in naturalistic speech as well as explain surprising findings on the bias of perceptual content by oscillatory phase.

I find the paper to be an important contribution to the field and well thought out. I have only a few thoughts and comments.

The use of an RNN model to estimate the internal language model is particularly effective. While the authors acknowledge that an RNN is unlikely to capture all of the complexities of the human internal language model, I found the choice to use a simple architecture as a statistical extractor to be a nice use of a tool that can sometimes be overly convoluted.

An important question is how the authors relate these findings to the Giraud and Poeppel 2012 proposal, which really focuses on the syllable. Would you alter the hypothesis to focus on the word level? Or remain at the syllable level and speed up and slow down the oscillator depending on the predictability of each word? It would be interesting to hear the authors' thoughts on how to manage the juxtaposition of syllable and word processing in this framework.

The authors describe the STiMCON model as having an oscillator with frequency set to the average stimulus rate of the sentence. But how an oscillator can achieve this on its own (without the hand of its overlords) is unclear, particularly given a pseudo-rhythmic input. The authors freely accept this limitation. However, it is worth noting that the ability of an oscillator mechanism to do this in a pseudo-rhythmic context is more complicated than it might seem, particularly once we consider that the stimulus rate might change from the beginning to the end of a sentence and across an entire discourse.

The "I eat very nice cake" analysis clearly demonstrates in a simple and didactic way the fundamental behaviors of the model: that predictions of the internal model can be read out in a phase code and that deviations from rhythmicity can yield more rhythmic behavior in the brain. I applaud the authors for demonstrating the behaviors in this very simple case first before moving to the more complex naturalistic case.
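To make the phase-code idea concrete, here is a toy sketch. It is not the STiMCON equations: the function names, the threshold of 0.9, the feedback values and the 4 Hz rate are all illustrative assumptions. It only shows the general principle that, when a fixed sine-wave excitability cycle is summed with a constant top-down feedback term, a more strongly predicted item reaches threshold at an earlier phase of the cycle.

```python
import math

def first_active_phase(feedback, threshold=0.9, steps=3600):
    """Scan one cycle, starting at the trough (-pi/2), and return the first
    phase in radians at which sin(phase) + feedback reaches the threshold.
    Returns None if the threshold is never reached in the cycle.
    All parameter values here are hypothetical."""
    for i in range(steps):
        phase = -math.pi / 2 + 2 * math.pi * i / steps
        if math.sin(phase) + feedback >= threshold:
            return phase
    return None

def latency_ms(phase, freq_hz=4.0):
    """Convert a phase (measured as above) to time since the trough, in ms."""
    return (phase + math.pi / 2) / (2 * math.pi * freq_hz) * 1000.0

# Higher feedback stands in for a more predictable word (assumed values).
for fb in (0.0, 0.4, 0.8):
    ph = first_active_phase(fb)
    print(f"feedback={fb:.1f}  phase={ph:.3f} rad  latency={latency_ms(ph):.1f} ms")
```

In this sketch the oscillator itself never changes frequency or phase; only the point in the cycle at which an item becomes active moves, which is the sense in which phase can carry information about predictability.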

The analysis of the naturalistic dataset shows a nice correlation between the estimated time shifts predicted by the model and the true naturalistic deviations. However, I find it surprising that there is so little deviation across the parameters of the oscillator (Figure 6A). What should we take from the fact that an oscillator aligned in anti-phase with the stimulus (which would presumably show the phase code only at stimulus offsets) still shows a near equal correlation with true timing deviations? Furthermore, while the R2 shows that the predictions of the model co-vary with the true values, I'm curious to know how accurately they are predicted overall (in terms of mean squared error, for example). Does the model account for deviations from rhythmicity of the right magnitude?
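The distinction between co-variation and accuracy can be illustrated with a small sketch. The numbers below are made up for illustration and are not taken from the manuscript; the helper names are likewise hypothetical. The example shows that predictions which are perfectly correlated with the true values, but ten times too small, give a squared correlation of 1 while still having a large mean squared error.

```python
def pearson_r2(pred, true):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov * cov / (var_p * var_t)

def mse(pred, true):
    """Mean squared error between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

# Hypothetical timing deviations in milliseconds (invented numbers).
true_shift = [-40.0, -10.0, 0.0, 20.0, 50.0]
# Hypothetical predictions: correct ordering, but ten times too small.
pred_shift = [0.1 * t for t in true_shift]

r2 = pearson_r2(pred_shift, true_shift)
err = mse(pred_shift, true_shift)
print(f"squared correlation = {r2:.3f}, MSE = {err:.1f} ms^2")
```

A correlation-based measure is insensitive to such a scaling error, which is why an error measure in the original units is a useful complement.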

Lastly, it is unclear to what extent the oscillator is necessary to find this relative time shift. A model comparison between the predictions of the STiMCON and the RNN predictions on their own (à la Figure 3) would help to show how much the addition of the oscillation improves our predictions. Perhaps this is what is meant by the "non-transformed R2" but this is unclear.

Figure 7 shows a striking result demonstrating how the model can be used to explain an interesting finding: that the phase of an oscillation can bias perception towards da or ga. The initial papers consider this result to be explained by delays in onset between visual and auditory stimuli, whereas this result explains it in terms of the statistical likelihood of each syllable. It is a nice reframing which helps me to better understand the previous result.

Version published to 10.1101/2020.12.07.414425v2 on bioRxiv
Mar 3, 2021
Version published to 10.1101/2020.12.07.414425v1 on bioRxiv
Dec 7, 2020

On the generative mechanisms underlying the cortical tracking of natural speech: a position paper

This article has 2 authors:
1. Edmund Lalor
2. Aaron Nidiffer
This article has no evaluations. Latest version May 20, 2025

Temporal regularity does not drive neural and behavioural tracking of musical phrases.

This article has 3 authors:
1. Zofia Anna Hołubowska
2. Xiangbin Teng
3. Pauline Larrouy-Maestri
This article has no evaluations. Latest version Jun 3, 2025