When machines judge humanness: findings from an interactive reverse Turing test by large language models
Abstract
While modern large language models (LLMs) increasingly pass short-form Turing tests and are sometimes rated as more human than humans, whether LLMs themselves can act as evaluators in this setting remains poorly investigated. We designed an interactive reverse Turing test in which seven LLMs (ChatGPT 4.5, Claude 3.7 Sonnet, Gemini Advanced 2.5 Pro, Mistral Large 2.1, Grok 3, DeepSeek V3, and Llama 4 Maverick) served as evaluators. Each LLM autonomously posed up to ten questions to hidden participants, who were either humans or other LLMs instructed with minimal or structured prompts. Thematic analysis was applied to both the questions and the reasons underlying the final verdicts. Across 238 reverse Turing tests comprising 1,714 questions, AI evaluators identified AI participants as AI in only three tests. AI participants were judged more human than humans (mean probability of being human: 0.88 in AI participants vs 0.78 in humans; p<0.001). Thematic analysis of questioning strategies revealed an emphasis on emotions/feelings (14%), memory (13%), and behaviours (11%), with distinct model-specific patterns: e.g., Claude 3.7 Sonnet emphasised mind and reasoning, Gemini 2.5 Pro focused on abstraction and creativity, and ChatGPT 4.5 on socio-emotional dimensions. Reasons cited for the final verdict most often referred to the authenticity of personality (26%) and the veracity of emotions (25%). Among tests with human participants, questions were rated by participants as moderately difficult to answer (mean 4.65 out of 10), question relevance was rated higher (mean 6.84), and the mean test duration was 18.7 minutes (12.4). In conclusion, current AI-based conversational screening appears insufficient for ensuring authenticity in dialogue. Future studies may explore longer, multimodal interactions, richer evaluator prompts co-designed with cognitive experts, and hybrid committees of human and AI evaluators.