Does Speech Prosody Shape Social Perception Equally for AI and Human Voices? A 16-Dimension Rating Study


Abstract

AI can now generate humanlike prosodic patterns, but whether these cues influence social perception in the same way as those of human voices remains unknown. We recruited 40 native Chinese speakers to rate human and AI-cloned voices producing statements in a confident or doubtful tone of voice (prosody). Participants rated 320 utterances on 16 dimensions, ranging from acoustic properties to social impressions of the speaker, using 7-point scales. Human voices received significantly higher ratings than AI voices on most dimensions, including humanlikeness, animateness, and emotional richness; the exceptions were speed and nasality, on which AI voices scored higher. Principal component analysis (PCA) identified two core dimensions, “social appeal” and “vocal expressiveness”, on both of which human voices consistently outperformed AI voices. Regression analyses showed that confident prosody enhanced ratings for both voice sources. Voice source × confidence interactions further revealed that the rating increase from doubtful to confident prosody was larger for AI voices than for human voices, particularly on social perception dimensions. However, the PCA-derived components revealed a critical asymmetry: vocal expressiveness significantly predicted social appeal for human voices, but this expressiveness-to-appeal mapping was completely absent for AI voices, indicating that improvements on individual dimensions did not translate into gains in overall social preference. These findings suggest that listeners categorize AI as an out-group, which limits the application of human voice perceptual mechanisms even when AI voices exhibit humanlike expressiveness. Implications for social robotics are discussed, including how prosodic design should differ between scenarios in which virtual agents serve informational roles and those in which they serve interpersonal roles.
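To make the voice source × confidence interaction concrete, the following minimal sketch computes it as a difference of differences between cell means in a 2 × 2 design. This is an illustration under stated assumptions, not the authors' analysis: the ratings are invented, the function names are hypothetical, and the study's regression models may have been specified differently (for example, with random effects for participants and items).

```python
from statistics import mean

# Hypothetical 7-point ratings, invented for illustration only.
# These are NOT data from the study.
ratings = {
    ("human", "confident"): [6, 5, 6, 5],
    ("human", "doubtful"): [5, 4, 5, 4],
    ("ai", "confident"): [4, 5, 4, 5],
    ("ai", "doubtful"): [2, 3, 2, 3],
}


def cell_means(data):
    """Average rating for each (voice source, prosody) cell."""
    return {cell: mean(values) for cell, values in data.items()}


def confidence_gain(means, source):
    """Rating increase from doubtful to confident prosody for one source."""
    return means[(source, "confident")] - means[(source, "doubtful")]


def interaction(means):
    """Difference of differences: AI gain minus human gain.

    In a saturated 2 x 2 model with dummy coding, this equals the
    coefficient on the source x confidence product term.
    """
    return confidence_gain(means, "ai") - confidence_gain(means, "human")


means = cell_means(ratings)
print("human gain:", confidence_gain(means, "human"))
print("AI gain:", confidence_gain(means, "ai"))
print("interaction:", interaction(means))
```

In this invented example, human voices are rated higher in every cell, yet the AI voices gain more from confident prosody, so the interaction is positive. That is the qualitative pattern the abstract describes: a main effect favoring human voices alongside a larger confidence effect for AI voices.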
