When machines judge humanness: findings from an interactive reverse Turing test by large language models
Abstract
While modern large language models (LLMs) increasingly pass short-form Turing tests and are sometimes rated as more human than humans, whether LLMs themselves can act as evaluators in this setting remains poorly investigated. We designed an interactive reverse Turing test in which seven LLMs (ChatGPT 4.5, Claude 3.7 Sonnet, Gemini Advanced 2.5 Pro, Mistral Large 2.1, Grok 3, DeepSeek V3, and Llama 4 Maverick) served as evaluators. Each LLM autonomously posed up to ten questions to hidden participants, who were either humans or other LLMs instructed with minimal or structured prompts. Thematic analysis was applied to both the questions and the reasons underlying the final verdicts. Across 238 reverse Turing tests comprising 1,714 questions, AI evaluators identified AI participants as AI in only three tests. AI participants were judged more human than humans (mean probability of being human: 0.88 in AI participants vs 0.78 in humans; p<0.001). Thematic analysis of questioning strategies revealed an emphasis on emotions/feelings (14%), memory (13%), and behaviours (11%), with distinct model-specific patterns: e.g., Claude 3.7 Sonnet emphasised mind and reasoning, Gemini 2.5 Pro focused on abstraction and creativity, and ChatGPT 4.5 on socio-emotional dimensions. Reasons cited for the final verdict most often referred to the authenticity of personality (26%) and the veracity of emotions (25%). Among tests with human participants, questions were rated by participants as moderately difficult to answer (mean 4.65 out of 10), question relevance was rated higher (mean 6.84), and the mean test duration was 18.7 minutes (12.4). In conclusion, current AI-based conversational screening appears insufficient for ensuring authenticity in dialogue. Future studies may explore longer, multimodal interactions, richer evaluator prompts co-designed with cognitive experts, and hybrid committees of human and AI evaluators.