Evaluating the Performance of Large Language Models in Identifying Human Facial Emotions: GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet

Abstract

Background. Evaluating the social and emotional capabilities of large language models (LLMs), such as their ability to recognize human facial emotions, is critical as their role in human-computer interaction (HCI) expands, particularly in healthcare applications. Facial expressions convey affective and clinical information useful for detecting emotions, contextualizing language, understanding interpersonal dynamics, and identifying potential mental health and neurocognitive disorders. However, how accurately LLMs interpret facial expressions remains unclear.

Methods. We evaluated the agreement and accuracy of three leading LLMs, GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet, using the NimStim dataset, a benchmark of 672 facial expressions (calm, angry, happy, fear, sad, neutral, surprise, and disgust) posed by 43 diverse human actors, yielding 2,016 model-based emotion estimates.

Results. All three models demonstrated substantial to almost perfect agreement with ground-truth labels. Happy expressions had the highest agreement, while fear had the lowest, owing to frequent misclassification as surprise. GPT-4o and Gemini had the highest accuracy, with the lower bound of the 95% CI exceeding 0.80, whereas Claude performed more poorly. Accuracy did not differ significantly by actor sex or race. GPT-4o and Gemini matched human performance for overall facial emotion recognition; GPT-4o surpassed human performance for calm/neutral and surprise recognition, and Gemini surpassed it for surprise recognition.

Conclusion. As generative AI models increasingly mediate HCI and expand into healthcare, evaluating their socioemotional comprehension is crucial. This study found that LLMs perform strongly relative to ground truth and comparably to human judges in recognizing prototypical facial expressions, with GPT-4o and Gemini showing especially strong performance. It lays the groundwork for evaluating the socioemotional capabilities of LLMs and highlights the need to address existing gaps before safe deployment in future HCI and clinical settings.
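As a rough, hypothetical illustration of the kind of analysis summarized above (not the authors' actual pipeline), the Python sketch below scores one model's emotion labels against ground-truth labels, computing Cohen's kappa, overall accuracy with a bootstrap 95% CI, and per-emotion accuracy. The variable names (EMOTIONS, y_true, y_pred) and the synthetic stand-in data are illustrative assumptions, not values from the study.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical stand-in data: 672 ground-truth labels and one model's predictions.
EMOTIONS = ["calm", "angry", "happy", "fear", "sad", "neutral", "surprise", "disgust"]
rng = np.random.default_rng(0)
y_true = rng.choice(EMOTIONS, size=672)
# Simulate a model that agrees with ground truth ~85% of the time.
y_pred = np.where(rng.random(672) < 0.85, y_true, rng.choice(EMOTIONS, size=672))

# Overall agreement (Cohen's kappa) and accuracy against ground truth.
kappa = cohen_kappa_score(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

# Nonparametric bootstrap for a 95% CI on overall accuracy.
n = len(y_true)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    boot.append(accuracy_score(y_true[idx], y_pred[idx]))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Per-emotion accuracy (e.g., to spot fear being confused with surprise).
for emo in EMOTIONS:
    mask = y_true == emo
    print(f"{emo:>8}: {accuracy_score(y_true[mask], y_pred[mask]):.2f}")

print(f"kappa={kappa:.2f}, accuracy={acc:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```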
