What is the retest reliability of computationally extractable speech and language markers?
Abstract
Speech is a signal rich in information about cognitive and affective states, which can be of high clinical utility for detecting and monitoring mental health conditions. Numerous studies have employed natural language processing (NLP) and AI-based language models to derive potential markers of psychological and neurocognitive states from spontaneous speech. However, only a few studies have investigated the test-retest reliability of commonly used features, a basic psychometric property crucial to clinical applications. In the present study, we use a crowdsourcing approach to test the reliability of a comprehensive set of speech and language markers across three speech elicitation tasks (free speech, picture and cartoon descriptions) and four time points, using the intra-class correlation coefficient (ICC). We also explore the underlying factor structure of the feature space through an exploratory factor analysis (EFA). Results indicate that acoustic-prosodic features exhibit high test-retest reliability across all sessions. In contrast, semantic measures (e.g., semantic similarity, information density, and perplexity), speech quantity metrics, and syntactic complexity exhibit low reliability, even when the stimulus materials used for speech elicitation were kept identical. Although semantic features showed strong within-subject variability, EFA across the feature space revealed a latent factor specifically comprising BERT-based semantic features with a moderate-to-high ICC of 0.76. There was limited evidence that free speech yielded lower ICCs than the other tasks. Demographic, emotional, and physical state factors contributed negligibly to ICC variance, indicating that these external factors had minimal impact on the consistency of the acoustic and semantic features. Overall, we find that acoustic-prosodic and text-based features have markedly different psychometric properties: the latter show low test-retest reliability individually, although semantic features form intercorrelated clusters that are more stable and capture a substantial share of the variance.
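As a minimal sketch of the kind of test-retest analysis described above, the example below computes an ICC for a single speech feature measured over repeated sessions using the pingouin library. The data, column names, feature (mean F0), and choice of ICC variant (two-way random effects, absolute agreement) are illustrative assumptions, not the study's actual code or data.

```python
# Hypothetical sketch: test-retest reliability (ICC) of one speech feature
# across repeated sessions. All values and names below are illustrative.
import pandas as pd
import pingouin as pg

# Long-format data: one row per participant x session, holding the value of
# a single extracted feature (e.g., mean F0 from a picture-description task).
df = pd.DataFrame({
    "participant": ["p01"] * 4 + ["p02"] * 4 + ["p03"] * 4,
    "session":     [1, 2, 3, 4] * 3,
    "mean_f0_hz":  [118.2, 121.5, 119.8, 120.1,
                    205.4, 201.9, 208.3, 203.7,
                    162.0, 158.6, 160.9, 161.4],
})

# Treat sessions as repeated "raters" of each participant.
icc = pg.intraclass_corr(
    data=df,
    targets="participant",   # subjects whose consistency we assess
    raters="session",        # repeated measurement occasions
    ratings="mean_f0_hz",    # the extracted speech feature
)

# ICC2 = two-way random effects, absolute agreement, single measurement.
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```

The same procedure would be repeated per feature and per elicitation task; features whose single-measure ICC stays high across sessions (as reported here for acoustic-prosodic measures) are the better candidates for longitudinal clinical monitoring.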