Semantic content outperforms speech prosody in predicting affective experience in naturalistic settings


Abstract

Humans possess the remarkable ability to recognize affective states (e.g., emotions) from speech prosody (i.e., voice acoustics). Algorithms are now trained to perform the same task at scale and are increasingly deployed commercially. These algorithms have typically been trained on speech data featuring enacted (e.g., by professional actors) or observed (e.g., externally rated) affect expressions collected in controlled lab settings. However, they are deployed to recognize subjective affective experiences from speech in real-world settings. This discrepancy between the controlled, often idealized speech data used for algorithm training and the naturalistic speech encountered during real-world deployment raises the question of whether such algorithms can reliably detect subjective affective experiences in everyday settings. Here, we investigate whether experienced affect can be predicted from naturalistic speech samples collected via smartphones. In two field studies (experimental Study 1: N = 409; observational Study 2: N = 687), we collected 25,403 speech samples from participants along with their self-reported affective experiences. Machine learning analyses suggest that prosody conveys only limited affective information (r_md = .17) and is outperformed by semantic content (r_md = .33) captured by word embeddings from a large language model. Our findings challenge the generalizability of prosody-based emotion recognition technologies to naturalistic settings and underscore the importance of incorporating semantic content in the algorithmic recognition of subjective affective experiences from speech.
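The semantic-content analysis described above lends itself to a simple illustration. The following is a minimal sketch, not the authors' pipeline: it assumes speech has already been transcribed to text, uses an off-the-shelf sentence-embedding model (the model name and the placeholder transcripts and ratings are illustrative), and evaluates a linear model by correlating cross-validated predictions with self-reported affect, mirroring the r metric reported in the abstract.

```python
# Sketch: predict self-reported affect (e.g., valence) from the semantic
# content of transcribed speech, using sentence embeddings plus ridge
# regression, scored by cross-validated Pearson correlation.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from scipy.stats import pearsonr

# Placeholder data: transcribed speech samples and matching self-reports.
transcripts = [
    "I had a great walk in the park this morning.",
    "Work was stressful and I feel completely drained.",
    "Dinner with friends was really fun tonight.",
    "I'm worried about the deadline tomorrow.",
]
valence = np.array([0.8, -0.6, 0.7, -0.4])  # illustrative ratings

# Encode semantic content as fixed-length embeddings (model name illustrative).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(transcripts)

# Out-of-sample predictions from a linear model on the embeddings.
model = Ridge(alpha=1.0)
pred = cross_val_predict(model, X, valence, cv=2)

# Evaluate as in the abstract: correlate predicted and reported affect.
r, _ = pearsonr(valence, pred)
print(f"cross-validated r = {r:.2f}")
```

In practice, one would fit such a model on thousands of samples with participant-grouped cross-validation (so the same speaker never appears in both train and test folds); the prosody baseline would substitute acoustic features for the text embeddings in the same framework.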
