Psychiatric Voice Biomarkers: Methodological flaws in pediatric populations
Abstract
Introduction
Psychiatric assessments rely on patient self-reports, clinician observations, and standardized scales, while objective technological tools are not yet reliable enough for clinical use. Voice may serve as a biomarker in several scenarios, including differential diagnosis, assessment of symptom severity, and prediction of suicidality. However, its use depends on accurate automatic speech recognition (ASR). Current gold-standard open-source ASR systems are trained mainly on adult speech and perform poorly on children's speech, limiting their application in pediatric psychiatry.
Methods
We benchmarked two open-source ASR models, NVIDIA Parakeet and Whisper-small, on the Ohio Child Speech Corpus (303 children, ages 4–9), using the reference human transcripts provided with the dataset. Audio was standardized to each model’s expected sampling rate. No model fine-tuning or adaptation was performed. For each utterance, we computed word error rate (WER) and character error rate (CER), and assessed semantic fidelity using Sentence Mover’s Distance (SMD) and BERTScore F1. Metrics were summarized overall, stratified by single-year age bins (4, 5, 6, 7, 8, 9), and also grouped into two broader categories: younger children (ages 4–6) and older children (ages 7–9). We compared WER, CER, SMD, and BERTScore F1 between the two age groups and evaluated age trends using nonparametric statistical tests.
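As a rough illustration of the two error metrics and the two-group age split, the sketch below computes WER and CER from Levenshtein edit distance and averages WER per age group. It is a minimal sketch, not the authors' pipeline: the function names, the whitespace tokenization, the absence of text normalization (casing, punctuation), and the per-utterance averaging are all assumptions, and the example utterances are invented.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcript is empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def age_group(age):
    """Two-group split from the Methods: ages 4-6 versus ages 7-9."""
    return "younger" if age <= 6 else "older"


def mean_wer_by_group(records):
    """records: iterable of (age, reference, hypothesis); returns mean per-utterance WER per group."""
    scores = defaultdict(list)
    for age, reference, hypothesis in records:
        scores[age_group(age)].append(wer(reference, hypothesis))
    return {group: sum(vals) / len(vals) for group, vals in scores.items()}


# Invented example utterances, for illustration only.
example_records = [
    (5, "the dog runs", "the dog run"),
    (8, "i like apples", "i like apples"),
]
group_means = mean_wer_by_group(example_records)
```

Because the error count is divided by the reference length, both metrics can exceed 1.0 when the hypothesis contains many insertions.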
Results
Both models showed significant age effects: younger children had markedly higher word error rates (WER >40%) and character error rates (CER >30%) than older children (WER ∼30%, CER ∼20%). Sentence Mover’s Distance improved with age, while BERTScore F1 remained stable. Despite these age-related improvements, overall transcription accuracy was low.
Discussion
Commonly used open-source ASR systems are currently inadequate for transcribing pediatric audio, particularly for younger children. To build clinically translatable tools, it is essential to collect child-specific data through structured speech paradigms and to fine-tune models on it.