Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings

Abstract

AI-generated voice clones are important tools in language learning, audiobooks, and assistive technology, but they often struggle to replicate key prosodic features such as dynamic F0 variation. The impact of these differences on speech perception remains underexplored. To address this, we conducted two behavioural tasks evaluating listeners’ ratings of naturalness and similarity for human speech, three AI voice clones (ElevenLabs, StyleTTS-2, XTTS-v2), and a 30% F0 variation condition. ElevenLabs was rated comparably to human speech, while StyleTTS-2 and XTTS-v2 received lower ratings. Reduced F0 variation also led to lower ratings, suggesting that prosody is key to perceived naturalness and similarity. Listener ratings were further influenced by speaker accent and sex, but not by experience with AI tools. These findings suggest that prosodic features and speaker-specific characteristics may drive the varying performance of AI voice clones. This manuscript has been accepted at Interspeech 2025.
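
The abstract does not specify how the 30% F0 variation condition was constructed. One plausible reading is that each utterance's F0 contour was compressed to 30% of its original deviation around the speaker's mean F0 before resynthesis. The sketch below illustrates that interpretation using the WORLD vocoder via pyworld; the library choice, function name, compression method, and file names are assumptions for illustration, not the authors' stated pipeline.

```python
# Hypothetical sketch: compress an utterance's F0 contour to 30% of its
# original variation around the speaker mean, then resynthesise with the
# WORLD vocoder (pyworld). This is an assumed manipulation, not the
# procedure described in the paper.
import numpy as np
import pyworld as pw        # WORLD vocoder bindings
import soundfile as sf

def compress_f0_variation(in_path, out_path, factor=0.3):
    x, fs = sf.read(in_path)              # float64 samples by default
    if x.ndim > 1:                        # keep it simple: use first channel
        x = x[:, 0]
    x = np.ascontiguousarray(x, dtype=np.float64)

    f0, t = pw.harvest(x, fs)             # F0 contour (0 Hz marks unvoiced frames)
    sp = pw.cheaptrick(x, f0, t, fs)      # spectral envelope
    ap = pw.d4c(x, f0, t, fs)             # aperiodicity

    voiced = f0 > 0
    mean_f0 = f0[voiced].mean()
    f0_mod = f0.copy()
    # Pull voiced frames toward the mean so only 30% of the original
    # deviation remains; unvoiced frames stay at 0 Hz.
    f0_mod[voiced] = mean_f0 + factor * (f0[voiced] - mean_f0)

    y = pw.synthesize(f0_mod, sp, ap, fs)
    sf.write(out_path, y, fs)

# Hypothetical file names for illustration only.
compress_f0_variation("speaker_utt.wav", "speaker_utt_f0_30.wav")
```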
