Evaluating the Performance of Large Language Models in Identifying Human Facial Emotions: GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet

Abstract

Background. Evaluating the social and emotional capabilities of large language models (LLMs), such as their ability to recognize human facial emotions, is critical as their role in human-computer interaction (HCI) expands, particularly in healthcare applications. Facial expressions convey affective and clinical information useful for detecting emotions, contextualizing language, understanding interpersonal dynamics, and identifying potential mental health and neurocognitive disorders. However, how accurately LLMs interpret facial expressions remains unclear.

Methods. We evaluated the agreement and accuracy of three leading LLMs, GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet, using the NimStim dataset, a benchmark of 672 facial expressions (calm, angry, happy, fear, sad, neutral, surprise, and disgust) posed by 43 diverse human actors, yielding 2,016 model-based emotion estimates.

Results. All three models demonstrated substantial to almost perfect agreement with ground-truth labels. Happy expressions had the highest agreement, while fear had the lowest, owing to frequent misclassification as surprise. GPT-4o and Gemini had the highest accuracy, with the lower bound of the 95% CI exceeding 0.80, whereas Claude performed more poorly. Accuracy did not differ significantly by actor sex or race. GPT-4o and Gemini matched human performance for overall facial emotion recognition; GPT-4o surpassed human performance for calm/neutral and surprise recognition, and Gemini surpassed it for surprise recognition.

Conclusion. As generative AI models increasingly mediate HCI and expand into healthcare, evaluating their socioemotional comprehension is crucial. This study found that LLMs perform strongly relative to ground truth and comparably to human judges in recognizing prototypical facial expressions, with GPT-4o and Gemini showing especially strong performance. It lays the groundwork for evaluating the socioemotional capabilities of LLMs and highlights the need to address existing gaps before safe deployment in future HCI and clinical settings.
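As a rough, hypothetical illustration of the kind of analysis summarized above (not the authors' actual pipeline), the Python sketch below scores one model's emotion labels against ground-truth labels, computing Cohen's kappa, overall accuracy with a bootstrap 95% CI, and per-emotion accuracy. The variable names (EMOTIONS, y_true, y_pred) and the synthetic stand-in data are illustrative assumptions, not values from the study.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical stand-in data: 672 ground-truth labels and one model's predictions.
EMOTIONS = ["calm", "angry", "happy", "fear", "sad", "neutral", "surprise", "disgust"]
rng = np.random.default_rng(0)
y_true = rng.choice(EMOTIONS, size=672)
# Simulate a model that agrees with ground truth ~85% of the time.
y_pred = np.where(rng.random(672) < 0.85, y_true, rng.choice(EMOTIONS, size=672))

# Overall agreement (Cohen's kappa) and accuracy against ground truth.
kappa = cohen_kappa_score(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

# Nonparametric bootstrap for a 95% CI on overall accuracy.
n = len(y_true)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    boot.append(accuracy_score(y_true[idx], y_pred[idx]))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Per-emotion accuracy (e.g., to spot fear being confused with surprise).
for emo in EMOTIONS:
    mask = y_true == emo
    print(f"{emo:>8}: {accuracy_score(y_true[mask], y_pred[mask]):.2f}")

print(f"kappa={kappa:.2f}, accuracy={acc:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```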
