Whisper automatic speech recognition and humans are similarly sensitive to accents, but use sentence context differently

Abstract

Despite advancements in speech recognition technology, questions remain about model generalizability and the extent to which models mirror human comprehension. These questions are addressed by comparing OpenAI's Whisper model with 75 human transcribers on 300 English sentences (20 speakers; half female, half male; half US-accented, half Spanish-accented). Sentences ended in 100 target words: 1/3 high-predictability sentences (The farmer milked the cows), 1/3 low-predictability (The farmer milked the nose), and 1/3 low-predictability+nonword (The marmer milked the nose). Target-word transcription error rates were examined both in sentences and in isolation (recordings excised from the same sentences). Errors decreased with increasing model size but remained heightened for Spanish-accented relative to US-accented speech, a pattern roughly similar to human transcribers. Both models and humans benefited from higher sentence predictability. However, sentence context affected models and humans differently: humans benefited only from high-predictability sentences, while models benefited from any sentence context. Humans numerically outperformed the (best-performing) large model on isolated words and showed different error patterns, suggesting Whisper may have seen a restricted distribution of short utterances in training or may need lengthier acoustic context than humans. More varied training data and better context awareness, including speaker identity and pragmatic context, might improve ASR generalization.
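The following is a minimal sketch, not the authors' code, of the kind of evaluation the abstract describes: transcribing the same target word in sentence context and in isolation across Whisper model sizes, then checking whether the target word was recovered. The audio file names and target word are hypothetical placeholders; the sketch assumes the open-source openai-whisper Python package (pip install openai-whisper) and a simple token-match criterion rather than the study's actual scoring procedure.

```python
# Sketch of a target-word transcription check across Whisper model sizes.
# File names and the target word below are hypothetical placeholders.
import re
import whisper

SENTENCE_WAV = "farmer_milked_cows_sentence.wav"  # hypothetical full-sentence recording
ISOLATED_WAV = "cows_isolated.wav"                # hypothetical excised target-word recording
TARGET_WORD = "cows"

def contains_target(transcript: str, target: str) -> bool:
    """Return True if the target word appears in the transcript,
    ignoring case and punctuation."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return target.lower() in tokens

for size in ["tiny", "base", "small", "medium", "large"]:
    model = whisper.load_model(size)
    in_context = model.transcribe(SENTENCE_WAV)["text"]
    isolated = model.transcribe(ISOLATED_WAV)["text"]
    print(
        f"{size:>6}: "
        f"in-sentence hit={contains_target(in_context, TARGET_WORD)}, "
        f"isolated hit={contains_target(isolated, TARGET_WORD)}"
    )
```

Aggregating such per-item hits over many target words, speakers, and accent conditions would yield the kind of error rates the abstract compares against human transcribers.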
