Cortical Representations of Speech in a Multitalker Auditory Scene
This article has been reviewed by the following groups
Listed in
- Evaluated articles (PREreview)
Abstract
The ability to parse a complex auditory scene into perceptual objects is facilitated by a hierarchical auditory system. Successive stages in the hierarchy transform an auditory scene of multiple overlapping sources, from peripheral tonotopically based representations in the auditory nerve, into perceptually distinct auditory-object-based representations in the auditory cortex. Here, using magnetoencephalography recordings from men and women, we investigate how a complex acoustic scene consisting of multiple speech sources is represented in distinct hierarchical stages of the auditory cortex. Using systems-theoretic methods of stimulus reconstruction, we show that the primary-like areas in the auditory cortex contain dominantly spectrotemporal-based representations of the entire auditory scene. Here, both attended and ignored speech streams are represented with almost equal fidelity, and a global representation of the full auditory scene with all its streams is a better candidate neural representation than that of individual streams being represented separately. We also show that higher-order auditory cortical areas, by contrast, represent the attended stream separately and with significantly higher fidelity than unattended streams. Furthermore, the unattended background streams are more faithfully represented as a single unsegregated background object rather than as separated objects. Together, these findings demonstrate the progression of the representations and processing of a complex acoustic scene up through the hierarchy of the human auditory cortex.
SIGNIFICANCE STATEMENT Using magnetoencephalography recordings from human listeners in a simulated cocktail party environment, we investigate how a complex acoustic scene consisting of multiple speech sources is represented in separate hierarchical stages of the auditory cortex. We show that the primary-like areas in the auditory cortex use a dominantly spectrotemporal-based representation of the entire auditory scene, with both attended and unattended speech streams represented with almost equal fidelity. We also show that higher-order auditory cortical areas, by contrast, represent an attended speech stream separately from, and with significantly higher fidelity than, unattended speech streams. Furthermore, the unattended background streams are represented as a single undivided background object rather than as distinct background objects.
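As a rough illustration of the stimulus-reconstruction approach referred to in the abstract, the sketch below maps time-lagged MEG sensor data to a speech envelope with ridge regression and scores fidelity as the correlation between the reconstructed and actual envelopes. The variable names, lag range, and regularization are illustrative assumptions only, not the authors' implementation.

```python
# Rough, hypothetical sketch of linear stimulus reconstruction (a backward model):
# time-lagged MEG sensor data are mapped to a speech envelope by ridge regression,
# and fidelity is scored as the correlation between the reconstructed and actual
# envelopes. Shapes, lag range, and regularization are illustrative only.
import numpy as np

def lagged_design(meg, max_lag):
    """Stack time-lagged copies of every MEG channel into one design matrix."""
    n_times, n_chans = meg.shape
    cols = []
    for lag in range(max_lag + 1):
        shifted = np.zeros((n_times, n_chans))
        shifted[lag:] = meg[:n_times - lag]
        cols.append(shifted)
    return np.hstack(cols)

def fit_decoder(meg, envelope, max_lag=50, ridge=1.0):
    """Estimate decoder weights mapping lagged MEG data to the speech envelope."""
    X = lagged_design(meg, max_lag)
    XtX = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ envelope)

def reconstruction_fidelity(meg, envelope, weights, max_lag=50):
    """Correlation between the reconstructed and the actual speech envelope."""
    recon = lagged_design(meg, max_lag) @ weights
    return np.corrcoef(recon, envelope)[0, 1]
```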
Article activity feed
This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/7620977.
Systems JC, OHSU - PREreview of "Cortical Representations of Speech in a Multi-talker Auditory Scene"
This is a preprint journal club review of Cortical Representations of Speech in a Multi-talker Auditory Scene by Krishna C Puvvada, Jonathan Z Simon. The preprint was originally posted on bioRxiv on April 10, 2017 (DOI: https://doi.org/10.1101/124750). The authors have responded to this review, and you can find the comments on bioRxiv. The article is now published in The Journal of Neuroscience (DOI: 10.1523/JNEUROSCI.0938-17.2017).
Review
Dear authors,
Thank you for posting your work as a preprint on BioRxiv. We discussed your work at our latest (preprint) systems neuroscience journal club at the Oregon Health & Science University. Below is a summary of our feedback containing our main remarks, points of discussion, and suggestions.
This work explores the encoding and decoding properties of streaming auditory objects in primary and secondary regions of the human auditory cortex. Listeners were tasked with selectively attending to one of three overlapping speech streams while MEG activity was recorded. The novel aspect of the stimulus paradigm is the presence of three concurrent streams instead of two. Two main questions were explored: first, is the attended stream selectively represented in primary and/or secondary regions of the auditory cortex? Second, if that is the case, are the other two streams represented as a combined background, or separately as two distinct background streams?
To answer these questions, two approaches were used: encoding and decoding strategies. The first suggestion we have about the work is purely organizational. We believe it would help the flow of the manuscript if the encoding and decoding analyses were described in the same order in the methods, the results (including figures), and the discussion sections.
Regarding the encoding models, we think that it would be helpful to describe the degrees of freedom of each model. We found it hard to evaluate whether the newly proposed early-late model describes the data better than the summation model because it truly captures key neuronal properties of auditory streaming, or simply because it has a larger number of parameters. While cross-validation was employed during model fitting and addresses some of these concerns, a validation using an out-of-sample set for prediction would provide a more definitive assessment of the potential existence and extent of overfitting.
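To illustrate the kind of check we have in mind, here is a minimal sketch: both encoding models are fit on training trials only and then compared by prediction accuracy on trials that were never used for fitting or for selecting hyperparameters, so a model with more parameters cannot win simply by overfitting. The design matrices, variable names, and ridge regularization below are our own illustrative assumptions, not a description of your pipeline.

```python
# Hypothetical sketch: fit each encoding model on training trials only, then
# compare prediction accuracy on held-out trials never used for fitting or for
# hyperparameter selection, so extra parameters alone cannot inflate a model's
# apparent advantage. Design matrices and names are illustrative placeholders.
import numpy as np

def fit_linear(X_train, y_train, ridge=1.0):
    """Ridge-regularized least squares; more columns in X means more parameters."""
    XtX = X_train.T @ X_train + ridge * np.eye(X_train.shape[1])
    return np.linalg.solve(XtX, X_train.T @ y_train)

def heldout_score(X_test, y_test, weights):
    """Prediction correlation on data that played no role in fitting."""
    prediction = X_test @ weights
    return np.corrcoef(prediction, y_test)[0, 1]

# X_sum_* : design matrices of the summation model (fewer predictors)
# X_el_*  : design matrices of the early-late model (more predictors)
# y_*     : measured MEG response, split into training and test trials
# score_sum = heldout_score(X_sum_test, y_test, fit_linear(X_sum_train, y_train))
# score_el  = heldout_score(X_el_test,  y_test, fit_linear(X_el_train,  y_train))
```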
While we appreciate that showing examples of the raw data together with the encoding and decoding model predictions contributes invaluably to the clarity and transparency of the manuscript, we believe that Figures 1 and 2 do not provide enough information for that to happen. In particular, Figure 2 would benefit from more labeling and perhaps even a clearer way of displaying its message.
In addition, it would be nice to see what the different models look like. What do the filters look like for the two encoding models? How do they compare? In line with the results of this study, one would expect the late component of the early-late model to be larger, but without seeing the models it is hard to know for sure. Showing a representation of the model parameters would certainly add value to this work.
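One straightforward way to do this would be to plot the fitted filters of each model as a function of lag. The sketch below is only a suggestion of such a display; the reshaping, sampling rate, and labels are placeholders rather than details of your models.

```python
# Hypothetical sketch of displaying fitted encoding-model filters: reshape the
# weight vector into (lags x predictors) and plot each predictor's temporal
# kernel against lag in milliseconds. Shapes, sampling rate, and labels are
# placeholders, not details of the actual models.
import numpy as np
import matplotlib.pyplot as plt

def plot_filters(weights, n_lags, n_predictors, fs, labels):
    """Plot each predictor's temporal kernel as a function of lag."""
    kernels = weights.reshape(n_lags, n_predictors)
    lags_ms = np.arange(n_lags) / fs * 1000.0
    for k in range(n_predictors):
        plt.plot(lags_ms, kernels[:, k], label=labels[k])
    plt.xlabel("Lag (ms)")
    plt.ylabel("Filter weight (a.u.)")
    plt.legend()
    plt.show()
```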
The subsequent figures in the paper are considerably clearer and accompany the text in the results section well. However, we wondered what each point in the plots represents. Is there one data point per listener? Are there multiple data points per listener depending on the attended stimulus? It would really help if you could mention this at least in the captions. Perhaps adding color coding would also help the reader better understand the results.
Finally, you mentioned that the 85-ms boundary was fit on a per-subject basis. Would it be possible to show a plot of those values to see how variable this boundary is across subjects? We also wondered what would happen to the model if you used the median of these values.
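As a sketch of what we mean (with invented function and variable names, not a prescription for your analysis), one could plot the per-subject boundary estimates against the group median and then refit each subject's model at that median value to see how much prediction accuracy changes.

```python
# Hypothetical sketch: plot the per-subject boundary estimates against the group
# median, then refit every subject's model at that median boundary to see how
# much prediction accuracy changes. All names are invented for illustration.
import numpy as np
import matplotlib.pyplot as plt

def compare_boundaries(subject_boundaries_ms, refit_fn):
    """Visualize boundary variability and rescore models at the group median.

    subject_boundaries_ms : per-subject fitted boundaries, in milliseconds
    refit_fn              : callable(boundary_ms) -> per-subject accuracies
    """
    boundaries = np.asarray(subject_boundaries_ms, dtype=float)
    median_boundary = np.median(boundaries)

    plt.scatter(np.arange(len(boundaries)), boundaries)
    plt.axhline(median_boundary, linestyle="--")
    plt.xlabel("Subject")
    plt.ylabel("Fitted early/late boundary (ms)")
    plt.show()

    return refit_fn(median_boundary)
```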
Thanks again for posting this work as a preprint. We really enjoyed discussing it at our JC and we hope these comments will help make the work even better.
Thank you,
Daniela Saderi (on behalf of the Systems JC, OHSU)