Adaptive Baseline Calibration for Voice Stress Assessment in Speech Disfluency Monitoring
Abstract
Voice stress assessment systems commonly apply fixed thresholds to acoustic features (jitter, shimmer, F0 variability) to classify them into stress levels. We show that fixed thresholds produce highly skewed stress score distributions when applied to diverse speakers: in the SEP-28K dataset, 61.4% of clips are scored as high-stress (≥0.8). Given the informal podcast recording context, this is likely an artifact of inter-speaker vocal variability rather than genuine stress variation. We propose an adaptive baseline algorithm that uses Welford's online algorithm for per-speaker calibration, followed by exponential moving average tracking. Applied to 14,645 clips with valid pitch estimates, the adaptive approach produces a more symmetric distribution (μ=0.530, σ=0.162) with substantially fewer extreme scores. Because ground-truth stress labels are unavailable, we evaluate calibration quality by distribution shape rather than classification accuracy, a limitation shared by most voice stress analysis systems. We additionally report that YIN-based pitch detection achieves a 98.1% F0 extraction rate on SEP-28K, compared to 12.1% with naive autocorrelation; successful F0 extraction is a prerequisite for reliable voice stress features. We discuss implications for pediatric speech applications, where children's vocal characteristics (F0 range 250–400 Hz) differ substantially from those of adults and make fixed thresholds particularly problematic. The adaptive baseline algorithm is implemented in DisfluoSDK, an on-device framework for speech disfluency monitoring.
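
As a rough illustration of the two-phase idea described in the abstract (Welford calibration followed by exponential moving average tracking), the following minimal Python sketch maintains a per-speaker baseline for a single acoustic feature. It is not the DisfluoSDK implementation. The class name `AdaptiveBaseline`, the warm-up length `calibration_clips`, the smoothing factor `alpha`, and the logistic mapping from z-score to a score in [0, 1] are all assumptions made for this example; the abstract does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class AdaptiveBaseline:
    """Hypothetical per-speaker baseline for one feature (e.g. jitter)."""
    calibration_clips: int = 10  # assumed warm-up length, not from the paper
    alpha: float = 0.05          # assumed EMA smoothing factor
    n: int = 0                   # clips seen during calibration
    mean: float = 0.0
    m2: float = 0.0              # Welford running sum of squared deviations
    var: float = 0.0

    def update(self, x: float) -> None:
        if self.n < self.calibration_clips:
            # Phase 1: Welford's online mean/variance update.
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
            if self.n > 1:
                self.var = self.m2 / (self.n - 1)  # sample variance
        else:
            # Phase 2: exponentially weighted mean and variance,
            # so the baseline can drift slowly with the speaker.
            delta = x - self.mean
            self.mean += self.alpha * delta
            self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)

    def score(self, x: float) -> float:
        """Map a feature value to [0, 1] relative to the speaker's baseline.

        The logistic squashing of the z-score is an assumption for this
        sketch; a value at the baseline mean maps to 0.5.
        """
        if self.n < 2 or self.var <= 0.0:
            return 0.5  # not enough data: return a neutral score
        z = (x - self.mean) / math.sqrt(self.var)
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))


# Usage: calibrate on a speaker's first clips, then score and keep tracking.
baseline = AdaptiveBaseline(calibration_clips=5)
for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
    baseline.update(value)
neutral = baseline.score(3.0)  # value at the speaker's own mean
```

Because each value is scored against the speaker's own running mean and spread, a speaker with naturally high jitter or F0 variability is not pushed toward the top of the scale the way a fixed threshold would push them.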