Measuring What AI Models Know and Whether They Know What They Don't Know: A Three-Run, Blind, Cross-Domain Benchmark of Six Leading Large Language Models
Abstract
Accurate self-knowledge — the capacity to know what one does not know — may be as important as raw capability in deployed artificial intelligence systems. We report a pre-registered, three-run, blinded evaluation of six frontier large language models (Claude, ChatGPT, Gemini, Grok, DeepSeek, and Perplexity) on ten cross-domain questions spanning thermodynamics, limnology, biostatistics, cardiovascular physiology, evolutionary biology, atmospheric chemistry, Bayesian statistics, evolutionary anthropology, fluid dynamics, and game theory. Each question required not only a correct conclusion but a complete multi-criterion reasoning chain; partial credit was awarded for incomplete but directionally correct reasoning. The same protocol was administered across three independent runs (n = 180 total iterations). Two outcome measures were collected: (1) independently scored answer quality (0.0 / 0.5 / 1.0 per question per run) and (2) self-audit accuracy — the model's own verdict on its performance against disclosed ground truth. Mean answer quality ranged from 7.17 to 9.17/10; the omnibus difference did not reach significance (Kruskal-Wallis H = 8.49, p = 0.13), though the pairwise effect size between the strongest and weakest models was large (Cohen's d, Claude vs Perplexity = 6.93). Critically, self-audit reliability varied far more dramatically: Claude achieved near-perfect self-calibration (Spearman ρ = 0.90, mean discrepancy +0.33), while Grok exhibited extreme systematic deflation (mean discrepancy −7.5), DeepSeek showed run-to-run incoherence (discrepancies of +0.5, −3.5, and +2.5 across identical-quality answers), and Perplexity hallucinated in one domain across two runs. Four of ten questions were answered perfectly by every model in every run; four others produced partial success rates of 72–83%, revealing structurally embedded knowledge gaps resistant to correction by ground-truth exposure alone. Token verbosity showed no significant correlation with accuracy (Pearson r = 0.28, p = 0.15), while Gemini achieved the highest token efficiency (6.16 accuracy points per 1,000 tokens).
The results suggest that current LLMs possess stratified epistemic competencies: a reliable, fully consistent core of common-curriculum reasoning, surrounded by domains where mechanism-level specificity remains systematically incomplete. Self-audit reliability diverges sharply from answer quality and should be evaluated as a separate axis of model capability.
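The token-efficiency metric cited for Gemini in the abstract (accuracy points earned per 1,000 output tokens) reduces to a simple ratio; the token count below is a hypothetical illustration, not the study's measured value.

```python
def token_efficiency(accuracy_points: float, output_tokens: int) -> float:
    """Accuracy points earned per 1,000 generated tokens."""
    return accuracy_points / output_tokens * 1_000

# Hypothetical: 9.0 accuracy points over 1,461 output tokens comes out
# to about 6.16 points per 1,000 tokens.
efficiency = token_efficiency(9.0, 1461)
print(f"{efficiency:.2f}")
```

Because verbosity did not correlate significantly with accuracy, this ratio penalizes models that spend extra tokens without earning extra points.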