Exploring Intonational Contours in Human and Synthetic Speech: An F0-Based Study of Venezuelan Spanish
Abstract
Comparisons between human and synthetic speech reveal that prosodic differences manifest primarily in how pitch develops over time rather than in static acoustic properties. This study investigates how human and synthetic intonation differ in Venezuelan Spanish by analyzing the temporal organization of fundamental frequency (F0) contours in declarative and interrogative sentences. The analysis is based on data from the HABLA corpus, which contains utterances produced by human speakers paired with multiple synthetic realizations derived from identical sentence materials, allowing controlled comparisons across voice types. Time-normalized F0 contours were extracted and analyzed using generalized additive mixed models to capture differences in contour shape and temporal organization. In addition, linear mixed-effects models were used to examine complementary token-level acoustic measures, including pitch range, global variability, and slope-based metrics. Across both sentence modalities, the results show that synthetic speech reproduces general intonational configurations compatible with sentence modality. However, systematic differences emerge in the temporal implementation of these patterns. Human speech exhibits greater internal differentiation and localized modulation across the utterance, particularly in linguistically salient sentence-final regions. Synthetic speech, by contrast, displays smoother and more temporally regularized contours, with reduced local variability and attenuated slope changes. Global measures of pitch range and whole-utterance variability show considerable overlap between the two voice types, indicating that the divergence is not driven by global pitch amplitude or aggregate dispersion. Instead, the differences are localized in the temporal organization of F0 across the utterance.
These findings highlight the importance of contour-based, time-sensitive approaches for assessing prosodic realism in synthetic speech.
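As an illustration of the kind of measures named above, the following is a minimal sketch of time normalization and token-level F0 metrics. It is not the pipeline used in the study: the function names, the 20-point grid, the semitone conversion, and the choice of mean absolute slope as the slope-based metric are all assumptions made for the example.

```python
import numpy as np

def time_normalize(times, f0_hz, n_points=20):
    """Resample voiced F0 samples onto n_points equally spaced normalized time steps.

    Hypothetical helper: unvoiced frames (NaN or non-positive F0) are dropped
    and the remaining samples are linearly interpolated on a 0..1 time axis.
    """
    times = np.asarray(times, dtype=float)
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = np.isfinite(f0_hz) & (f0_hz > 0)
    t, f = times[voiced], f0_hz[voiced]
    t_norm = (t - t[0]) / (t[-1] - t[0])
    grid = np.linspace(0.0, 1.0, n_points)
    return grid, np.interp(grid, t_norm, f)

def to_semitones(f0_hz, ref_hz):
    """Convert F0 in Hz to semitones relative to ref_hz."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

def token_measures(grid, f0_st):
    """Token-level summaries: pitch range, global variability, mean absolute slope."""
    slopes = np.diff(f0_st) / np.diff(grid)  # semitones per unit of normalized time
    return {
        "range_st": float(np.max(f0_st) - np.min(f0_st)),
        "sd_st": float(np.std(f0_st, ddof=1)),
        "mean_abs_slope": float(np.mean(np.abs(slopes))),
    }

# Usage on a synthetic rising contour with one unvoiced frame (illustrative data only).
times = np.linspace(0.0, 1.2, 50)
f0 = 200.0 * 2.0 ** (times / 1.2)
f0[10] = np.nan
grid, contour = time_normalize(times, f0, n_points=20)
measures = token_measures(grid, to_semitones(contour, contour[0]))
```

Two contours can share the same `range_st` and `sd_st` while differing in how the slope is distributed over normalized time, which is why the time-normalized contour itself, rather than these summaries alone, is the object a contour-based model would be fitted to.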