Measuring What AI Models Know and Whether They Know What They Don't Know: A Three-Run, Blind, Cross-Domain Benchmark of Six Leading Large Language Models
Abstract
Accurate self-knowledge — the capacity to know what one does not know — may be as important as raw capability in deployed artificial intelligence systems. We report a pre-registered, three-run, blinded evaluation of six frontier large language models (Claude, ChatGPT, Gemini, Grok, DeepSeek, and Perplexity) on ten cross-domain questions spanning thermodynamics, limnology, biostatistics, cardiovascular physiology, evolutionary biology, atmospheric chemistry, Bayesian statistics, evolutionary anthropology, fluid dynamics, and game theory. Each question required not only a correct conclusion but a complete multi-criterion reasoning chain; partial credit was awarded for incomplete but directionally correct reasoning. The same protocol was administered across three independent runs (n = 180 total iterations). Two outcome measures were collected: (1) independently scored answer quality (0.0 / 0.5 / 1.0 per question per run) and (2) self-audit accuracy — the model's own verdict on its performance against disclosed ground truth. Mean answer quality ranged from 7.17 to 9.17/10; the omnibus difference did not reach significance (Kruskal-Wallis H = 8.49, p = 0.13), though the pairwise effect size between the strongest and weakest models was large (Cohen's d, Claude vs Perplexity = 6.93). Critically, self-audit reliability varied far more dramatically: Claude achieved near-perfect self-calibration (Spearman ρ = 0.90, mean discrepancy +0.33), while Grok exhibited extreme systematic deflation (mean discrepancy −7.5), DeepSeek showed run-to-run incoherence (discrepancies of +0.5, −3.5, and +2.5 across identical-quality answers), and Perplexity hallucinated in one domain across two runs. Four of ten questions were answered perfectly by every model in every run; four others produced partial success rates of 72–83%, revealing structurally embedded knowledge gaps resistant to correction by ground-truth exposure alone. Token verbosity showed no significant correlation with accuracy (Pearson r = 0.28, p = 0.15), while Gemini achieved the highest token efficiency (6.16 accuracy points per 1,000 tokens).
The results suggest that current LLMs possess stratified epistemic competencies: a reliable, fully consistent core of common-curriculum reasoning, surrounded by domains where mechanism-level specificity remains systematically incomplete. Self-audit reliability diverges sharply from answer quality and should be evaluated as a separate axis of model capability.
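The token-efficiency metric cited for Gemini in the abstract (accuracy points earned per 1,000 output tokens) reduces to a simple ratio; the token count below is a hypothetical illustration, not the study's measured value.

```python
def token_efficiency(accuracy_points: float, output_tokens: int) -> float:
    """Accuracy points earned per 1,000 generated tokens."""
    return accuracy_points / output_tokens * 1_000

# Hypothetical: 9.0 accuracy points over 1,461 output tokens comes out
# to about 6.16 points per 1,000 tokens.
efficiency = token_efficiency(9.0, 1461)
print(f"{efficiency:.2f}")
```

Because verbosity did not correlate significantly with accuracy, this ratio penalizes models that spend extra tokens without earning extra points.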