Measuring What AI Models Know and Whether They Know What They Don't Know: A Three-Run, Blind, Cross-Domain Benchmark of Six Leading Large Language Models

Abstract

Accurate self-knowledge (the capacity to know what one does not know) may be as important as raw capability in deployed artificial intelligence systems. We report a pre-registered, three-run, blinded evaluation of six frontier large language models (Claude, ChatGPT, Gemini, Grok, DeepSeek, and Perplexity) on ten cross-domain questions spanning thermodynamics, limnology, biostatistics, cardiovascular physiology, evolutionary biology, atmospheric chemistry, Bayesian statistics, evolutionary anthropology, fluid dynamics, and game theory. Each question required not only a correct conclusion but a complete multi-criterion reasoning chain; partial credit was awarded for incomplete but directionally correct reasoning. The same protocol was administered across three independent runs (n = 180 total iterations). Two outcome measures were collected: (1) independently scored answer quality (0.0 / 0.5 / 1.0 per question per run) and (2) self-audit accuracy, the model's own verdict on its performance against disclosed ground truth. Mean answer quality spanned a wide range (7.17–9.17/10; Cohen's d, Claude vs. Perplexity, = 6.93), although the omnibus test did not reach significance (Kruskal-Wallis H = 8.49, p = 0.13). Critically, self-audit reliability varied even more dramatically: Claude achieved near-perfect self-calibration (Spearman ρ = 0.90, mean discrepancy +0.33), Grok exhibited extreme systematic deflation (mean discrepancy −7.5), DeepSeek showed run-to-run incoherence (discrepancies of +0.5, −3.5, and +2.5 across identical-quality answers), and Perplexity lapsed into hallucination in one domain across two runs. Four of the ten questions achieved universal (100%) success; four produced partial success rates of 72–83%, revealing structurally embedded knowledge gaps resistant to correction by ground-truth exposure alone. Token verbosity showed no significant correlation with accuracy (Pearson r = 0.28, p = 0.15), while Gemini achieved the highest token efficiency (6.16 accuracy points per 1,000 tokens).
The results suggest that current LLMs possess stratified epistemic competencies: a reliable, consistently answered core of common-curriculum reasoning, surrounded by domains where mechanism-level specificity remains systematically incomplete. Self-audit performance diverges sharply from answer quality and constitutes an independent dimension of capability that merits evaluation in its own right.
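The two calibration metrics reported in the abstract, mean discrepancy (self-audit total minus externally scored total) and Spearman rank correlation between self-audit and external scores, can be sketched as follows. This is an illustrative reconstruction only: the per-question scores below are hypothetical, not the study's data, and the study's exact scoring pipeline is not specified here.

```python
def average_ranks(xs):
    """Assign 1-based ranks, averaging over tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def spearman(a, b):
    # Spearman rho = Pearson correlation of the rank-transformed scores
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical 10-question run: external scores in {0.0, 0.5, 1.0}
external   = [1.0, 1.0, 0.5, 1.0, 0.0, 1.0, 0.5, 1.0, 1.0, 0.5]
self_audit = [1.0, 1.0, 0.5, 1.0, 0.5, 1.0, 0.5, 1.0, 1.0, 1.0]

# Positive discrepancy = self-inflation; negative = deflation (cf. Grok's −7.5)
mean_discrepancy = sum(self_audit) - sum(external)
rho = spearman(external, self_audit)
print(f"discrepancy {mean_discrepancy:+.1f}, Spearman rho {rho:.2f}")
# prints: discrepancy +1.0, Spearman rho 0.83
```

On this toy run the model is mildly self-inflating but well-ordered; a well-calibrated model in the study's sense (e.g. Claude, ρ = 0.90, discrepancy +0.33) would show both a rho near 1 and a discrepancy near 0.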
