Beyond word error rate: Multidimensional evaluation of ASR performance for digital speech biomarkers
Abstract
Automatic speech recognition (ASR) is increasingly used in digital health, yet its reliability for populations with atypical speech, such as people with dementia, is not well characterised beyond word error rate (WER). We evaluated eight ASR systems on speech from adults with dementia and healthy older adults using WER, part-of-speech sequence agreement (POS-sqWER), part-of-speech distribution mismatch (POS-MDE), sentence-embedding cosine distance, and stutter detection error (SDE). Mixed-effects models with Tukey-corrected contrasts were used for comparison. Performance was consistently poorer for dementia speech across all metrics. AWS and CrisperWhisper showed relatively strong lexical, semantic, and syntactic fidelity, whereas Google and Meta exhibited lower accuracy. Other systems showed intermediate performance with rankings varying by metric. POS-MDE revealed syntactic distortions not captured by WER or POS-sqWER. SDE performance was low across systems. Multidimensional evaluation reveals clinically relevant linguistic distortions obscured by WER alone, supporting the need for population- and task-specific ASR validation in digital health research.
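As an illustration of the baseline metric the abstract critiques (not code from the paper itself), WER is conventionally computed as the word-level edit distance between reference and hypothesis transcripts, normalised by reference length. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

As the abstract argues, a single substitution that preserves WER can still change part of speech or meaning (e.g. "cat" → "bat" yields WER = 1/3 but alters the semantics), which is why the complementary POS- and embedding-based metrics are proposed.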