Language Models and the Missing Qualia in Face Evaluation

Ziqian Cui
Weisa Wu
Shuai Chang
Yanhong Chen
Fei Zhao
Junting Hu
Ke Zhou
Ming Meng

Abstract

Human cognition is grounded in direct sensory experience, whereas recent large language models (LLMs) rely primarily on indirect, language-derived symbolic computation, fueling intense debate about the limits of artificial intelligence. Here, we examine how much of the social cognition involved in face evaluation can be achieved from language-derived knowledge, either with direct visual experience (humans) or without it (LLMs). We constructed 2,304 text-described faces by factorially combining ten structural features. Human participants (N = 25) were asked to imagine each face and rate 12 traits on 1–9 scales, and five LLMs (Claude 3.5, GPT-4, GPT-3.5, Kimi, and ERNIE) performed the same task with identical instructions via API calls. Representational similarity analyses showed high human–LLM correspondence and strong cross-model homogeneity. Linear mixed-effects models further confirmed that LLMs’ ratings aligned with the human consensus and captured agreement on facial structural traits. However, despite this consensus-level alignment, humans showed dominant participant-by-stimulus variance for social traits, whereas LLMs showed little corresponding idiosyncratic variance. XGBoost attributions indicated shared reliance on expression and skin texture, while humans additionally showed domain-sensitive cue reweighting, a pattern absent in LLMs. Notably, higher visual imagery ability in humans predicted greater idiosyncratic variance and weaker residual alignment with LLMs. Together, these results suggest that language statistics can approximate the shared scaffold of face evaluation, whereas experience-linked imagery contributes structured, individual-specific variability that current LLMs largely fail to capture, highlighting the key roles of direct sensory experience and visual imagery in human social cognition.
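For readers unfamiliar with representational similarity analysis, the comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not state the distance metric or correlation used, so Euclidean distance and Spearman correlation are assumptions here, and the function names (`rdm_upper`, `rsa_similarity`) and the toy ratings are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def rdm_upper(ratings):
    """Upper triangle of a dissimilarity matrix: one Euclidean distance
    per unordered pair of stimuli (assumed metric, not from the paper)."""
    return [dist(ratings[i], ratings[j])
            for i, j in combinations(range(len(ratings)), 2)]


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def rsa_similarity(ratings_a, ratings_b):
    """Spearman correlation between the two raters' dissimilarity structures."""
    return pearson(ranks(rdm_upper(ratings_a)), ranks(rdm_upper(ratings_b)))


# Toy data (made up): 4 described faces x 3 traits on a 1-9 scale.
human_consensus = [[7, 3, 5], [2, 8, 4], [6, 4, 5], [1, 9, 3]]
llm_ratings = [[8, 2, 6], [3, 7, 4], [7, 3, 6], [2, 8, 3]]

similarity = rsa_similarity(human_consensus, llm_ratings)
print(f"human-LLM RSA similarity: {similarity:.3f}")
```

Because the comparison operates on rank-ordered pairwise distances rather than raw ratings, two raters can score high even if one uses a systematically shifted or compressed part of the scale; only the relative geometry of the stimuli matters.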

Version published to 10.31234/osf.io/4hu5x_v1 on OSF Preprints
Apr 7, 2026

Related articles

Multimodal large language models converge on the human-like geometry of abstract emotion

This article has 7 authors:
1. Huiguang He
2. Changde Du
3. Yizhuo Lu
4. Zhongyu Huang
5. Yi Sun
6. Zisen Zhou
7. Shaozheng Qin
This article has no evaluations. Latest version: Apr 2, 2026

"Perceptual salience, not affective meaning, drives facial expression detection"

This article has 4 authors:
1. Erika Bucci
2. Giacomo Handjaras
3. Giada Lettieri
4. Luca Cecchetti
This article has no evaluations. Latest version: Apr 3, 2026

How Can Hallucinatory Biases Be Effectively Audited and Mitigated in Vision-Language Models? A Unified Theoretical and Empirical Framework Across GPT-4o, Grok 3, and Claude Sonnet 4.5

This article has 1 author:
1. Amirali Ghajari
This article has no evaluations. Latest version: Apr 8, 2026
