Voice perception across languages: Reduced similarity between speakers, stable identity within speakers
Abstract
Purpose: Despite the central role of voice perception in social interaction, the mechanisms underlying cross-linguistic voice similarity judgments remain underexplored. This study examined how voice similarity is perceived within and across languages, using bilingual speech to isolate language-driven effects while controlling for anatomical variation.

Method: English-speaking listeners with and without Cantonese proficiency completed a perceptual similarity rating task involving Cantonese–English bilingual speakers. Listeners rated the similarity of voice pairs on a nine-point scale, including both within-language and cross-language comparisons. Multidimensional scaling (MDS) was applied to the dissimilarity ratings to model perceptual voice space, and acoustic–perceptual correlations were used to relate perceptual dimensions to acoustic properties of the speech signal.

Results: Cross-language voice pairs were perceived as significantly more dissimilar than within-language pairs, particularly for same-speaker items, indicating a language mismatch effect. No robust language familiarity effect was observed between listener groups. Importantly, same-speaker pairs were consistently judged as more similar than different-speaker pairs even across languages, a finding further supported by MDS, which revealed a stable underlying perceptual structure in which speakers generally clustered with themselves across languages. Perceptual dimensions were primarily associated with pitch, vocal tract size, and harmonics-to-noise spectral characteristics, with their relative weighting modulated by language context.

Conclusions: The findings indicate that while language mismatch reduces perceived similarity between speakers, the vocal identity of a speaker remains perceptually stable across languages. These results refine the prototype model of voice perception by demonstrating dynamic, context-sensitive weighting of acoustic cues in bilingual voice processing.
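For readers unfamiliar with the analysis pipeline described in the Method section, the sketch below illustrates the general approach of deriving a perceptual voice space from pairwise similarity ratings via non-metric MDS. This is not the authors' code; the matrix of ratings, the number of voices, and the linear conversion from a nine-point similarity scale to dissimilarities are all illustrative assumptions.

```python
# Illustrative sketch (not the authors' pipeline): recovering a perceptual voice
# space from pairwise similarity ratings with non-metric multidimensional scaling.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

n_voices = 8  # hypothetical number of voice stimuli (speaker x language items)

# Placeholder symmetric matrix of mean similarity ratings on a 1-9 scale;
# real data would be averaged listener ratings for each voice pair.
sim = rng.uniform(1, 9, size=(n_voices, n_voices))
sim = (sim + sim.T) / 2
np.fill_diagonal(sim, 9)

# Convert similarities to dissimilarities (assumption: simple inversion of the scale).
dissim = 9 - sim

# Non-metric MDS on the precomputed dissimilarity matrix; two dimensions for illustration.
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)

print("Stress:", round(mds.stress_, 3))
print("Perceptual coordinates:\n", coords)
```

In such an analysis, each resulting dimension can then be correlated with acoustic measures (e.g., fundamental frequency or formant-based estimates of vocal tract size) to interpret the perceptual space, which is the spirit of the acoustic–perceptual correlations reported in the Results.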