Prosodic cues strengthen human-AI voice boundaries: Listeners do not easily perceive human speakers and AI clones as the same person

Abstract

Previous studies concluded that listeners struggle to discriminate AI from human voices, but these studies used monotone-like speech and did not examine prosodic expressiveness, a key advantage of human over AI speakers. This study explores whether prosodic expressiveness facilitates human-AI voice discrimination. We recorded human prosodic speech with confident and doubtful expressions, trained AI models to replicate these prosodic patterns, had the AI models generate new sentences, and then had human speakers produce equivalent prosodic expressions for the same sentences. In Experiment 1, 48 listeners rated humanlikeness and perceived confidence for 11,808 audio samples; AI speech was consistently rated as less humanlike regardless of prosody. We then selected 768 audio samples (AI × human speaker, confident × doubtful prosody) for Experiment 2, in which 80 listeners completed an identity discrimination task, judging whether two audio clips came from the same speaker. Bayesian modeling revealed near-ceiling performance for human-human and AI-AI pairs, with inconsistent prosodies decreasing accuracy by roughly 7%, whereas listeners did not readily categorize a human voice and its AI clone as sharing the same identity (~54% accuracy when prosody matched, dropping to ~36% when it was inconsistent). We also found nonlinear effects of Wav2Vec2 acoustic distance on performance, with accuracy-reaction time synchronization supporting direct matching mechanisms over prototype-based processing. Unlike the other conditions, human-AI/AI-human pairs showed a distinctive pattern in which larger acoustic distances prompted listeners to rely less on acoustic distance cues. Our study suggests that concerns about voice clone detection may have been raised prematurely; it also addresses current gaps in understanding how within-speaker identity variation influences identity processing.
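
The abstract refers to a Wav2Vec2-based acoustic distance between paired clips but does not describe how it was computed. The sketch below shows one plausible way such a distance could be derived; the checkpoint name, 16 kHz mono input, mean-pooled hidden states, cosine distance, and the example file names are all assumptions for illustration, not the authors' pipeline.

```python
# Minimal sketch (not the authors' method): a Wav2Vec2-based acoustic distance
# between two utterances, assuming 16 kHz mono WAV files and the off-the-shelf
# "facebook/wav2vec2-base-960h" checkpoint.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-base-960h"  # assumed checkpoint, not specified in the abstract
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def embed(path: str) -> torch.Tensor:
    """Return one utterance-level embedding by mean-pooling Wav2Vec2 frame features."""
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0)  # downmix to mono
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)  # (dim,)

def acoustic_distance(path_a: str, path_b: str) -> float:
    """Cosine distance between the mean-pooled embeddings of two clips."""
    a, b = embed(path_a), embed(path_b)
    return 1.0 - torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# Hypothetical usage: distance between a human recording and its AI clone.
# print(acoustic_distance("human_confident_01.wav", "ai_confident_01.wav"))
```

In this sketch, a larger cosine distance corresponds to a larger acoustic gap between the two voices, which is the kind of continuous predictor the nonlinear effects described above would be modeled against.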
