Measuring Reliability in Locally-deployed Language Model Dysarthric Speech Assessments
Abstract
Speech is a rich and non-invasive source of clinical information, potentially providing digital biomarkers for neurological disorders such as Parkinson’s disease (PD). Impaired articulation and reduced intelligibility are among the most pervasive PD symptoms, motivating research into automated, objective quantification of speech deficits. This study investigated whether metrics derived from automatic speech recognition (ASR) and large language models (LLMs) can quantify speech intelligibility and describe clinical severity. Recordings of fixed read text from patients with PD and healthy controls (HC) were transcribed and evaluated using conventional ASR error measures (such as Word Error Rate and Character Error Rate), a proposed Mistral-based LLM intelligibility score, and typo metrics derived from BERT (Bidirectional Encoder Representations from Transformers). Group-level discriminability between PD (N = 16) and HC (N = 21) was low (Mann–Whitney p > 0.05; a Random Forest classifier achieved a Receiver Operating Characteristic Area Under the Curve of 0.66 under leave-one-subject-out evaluation), indicating that transcript-level features alone offer limited classification ability. Importantly, the LLM-derived intelligibility score demonstrated excellent repeatability across five runs (intraclass correlation coefficient, ICC = 0.97; Cronbach’s α = 0.98) and correlated strongly with the ASR error metrics. Both LLM and ASR measures correlated significantly (p < 0.05; Spearman’s rank correlation) with common clinical rating scales (the Hoehn and Yahr scale and the Unified Parkinson’s Disease Rating Scale), whereas the BERT typo parameters did not. These findings support the use of LLMs as tools for generating reference-free intelligibility scores that reflect disease severity.
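
As a minimal illustrative sketch (not the authors' code), the reference-based ASR error metrics and the Spearman correlation analysis summarized above could be computed in Python as follows. The jiwer and scipy libraries are one possible choice, and the transcripts and per-subject values are hypothetical placeholders.

    # Minimal sketch, not the authors' pipeline: reference-based ASR error
    # metrics on a fixed read text, plus Spearman's rank correlation between
    # a speech metric and a clinical severity rating. All data are invented.
    from jiwer import wer, cer
    from scipy.stats import spearmanr

    reference = "the quick brown fox jumps over the lazy dog"   # fixed read text
    hypothesis = "the quick brown fox jump over lazy dog"       # ASR transcript

    word_error_rate = wer(reference, hypothesis)   # word-level edit distance / #words
    char_error_rate = cer(reference, hypothesis)   # character-level edit distance / #chars
    print(f"WER = {word_error_rate:.2f}, CER = {char_error_rate:.2f}")

    # Hypothetical per-subject values: an intelligibility/error metric versus
    # a clinical rating (e.g., UPDRS). Spearman's rho tests for a monotonic
    # association, as reported in the abstract.
    metric = [0.12, 0.30, 0.08, 0.45, 0.22]
    updrs = [14, 33, 10, 41, 25]
    rho, p = spearmanr(metric, updrs)
    print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")

The same per-subject layout extends naturally to repeatability analysis: repeating the LLM scoring over several runs and feeding the run-by-subject table to an intraclass correlation routine (e.g., pingouin.intraclass_corr) would yield the ICC reported above.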
