Measuring Reliability in Locally-deployed Language Model Dysarthric Speech Assessments
Abstract
Speech is a rich and non-invasive source of clinical information, potentially providing digital biomarkers for neurological disorders such as Parkinson’s disease (PD). Impaired articulation and reduced intelligibility are among the most pervasive PD symptoms, motivating research into automated, objective quantification of speech deficits. This study investigated whether metrics derived from automatic speech recognition (ASR) and large language models (LLMs) can quantify speech intelligibility and describe clinical severity. Recordings of fixed read text from patients with PD and healthy controls (HC) were transcribed and evaluated using conventional ASR error measures (such as Word Error Rate and Character Error Rate), a proposed Mistral-based LLM intelligibility score, and typo metrics derived from BERT (Bidirectional Encoder Representations from Transformers). Group-level discriminability between PD (N = 16) and HC (N = 21) was low (Mann–Whitney p > 0.05; a Random Forest classifier achieved a Receiver Operating Characteristic Area Under the Curve of 0.66 under leave-one-subject-out evaluation), indicating that transcript-level features alone offer limited classification ability. Importantly, the LLM-derived intelligibility score demonstrated excellent repeatability across five runs (intraclass correlation coefficient, ICC = 0.97; Cronbach’s α = 0.98) and correlated strongly with the ASR error metrics. Both LLM and ASR measures correlated significantly (p < 0.05; Spearman’s rank correlation) with common clinical rating scales (the Hoehn and Yahr scale and the Unified Parkinson’s Disease Rating Scale), whereas the BERT typo parameters did not. These findings support the use of LLMs as tools for generating reference-free intelligibility scores that reflect disease severity.
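
As a minimal illustrative sketch (not the authors' code), the reference-based ASR error metrics and the Spearman correlation analysis summarized above could be computed in Python as follows. The jiwer and scipy libraries are one possible choice, and the transcripts and per-subject values are hypothetical placeholders.

    # Minimal sketch, not the authors' pipeline: reference-based ASR error
    # metrics on a fixed read text, plus Spearman's rank correlation between
    # a speech metric and a clinical severity rating. All data are invented.
    from jiwer import wer, cer
    from scipy.stats import spearmanr

    reference = "the quick brown fox jumps over the lazy dog"   # fixed read text
    hypothesis = "the quick brown fox jump over lazy dog"       # ASR transcript

    word_error_rate = wer(reference, hypothesis)   # word-level edit distance / #words
    char_error_rate = cer(reference, hypothesis)   # character-level edit distance / #chars
    print(f"WER = {word_error_rate:.2f}, CER = {char_error_rate:.2f}")

    # Hypothetical per-subject values: an intelligibility/error metric versus
    # a clinical rating (e.g., UPDRS). Spearman's rho tests for a monotonic
    # association, as reported in the abstract.
    metric = [0.12, 0.30, 0.08, 0.45, 0.22]
    updrs = [14, 33, 10, 41, 25]
    rho, p = spearmanr(metric, updrs)
    print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")

The same per-subject layout extends naturally to repeatability analysis: repeating the LLM scoring over several runs and feeding the run-by-subject table to an intraclass correlation routine (e.g., pingouin.intraclass_corr) would yield the ICC reported above.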
