Modulation statistics allow robust prediction of speech recognition accuracy across many words, voices, and natural background sounds

Abstract

Although humans excel at speech recognition, recognition accuracy can vary widely due to differences in background environments as well as the speaker’s voice quality, intonation, and pitch. Predicting when speech recognition will succeed or fail, however, remains an ongoing challenge in hearing research. Here we characterize recognition abilities across a wide range of natural conditions using digits spoken by many male and female talkers of multiple ages with 33 unique backgrounds. Across this diverse set of sounds, speech recognition is most strongly influenced by the spectrum and modulation statistics of the noise. Yet, articulatory features of the speech, including fundamental and formant frequencies, show categorically distinct modulatory effects on accuracy across age, gender, and words. We then show that a low-dimensional model of sound, based on computations in the auditory midbrain, accounts for participants’ single-trial recognition behavior across voices, words, and backgrounds. Thus, speech-in-noise perception across extremely diverse natural conditions depends largely on a simple set of spectrotemporal statistics likely encoded by central neural populations.
