Calibration Inversion and Data-Freshness Govern AI Reliability in Indeterminate Domains: Evidence from 2,730 Data Points / 142 Cross-Protocol Sessions
Abstract
AI systems in fully indeterminate domains, where ground truth is probabilistic and known only post hoc, present distinct reliability challenges. This third study in a three-study programme analyses 2,730 data points across 142 sessions, six frontier models, and three protocols spanning financial markets, meteorology, sports, and cryptocurrency. The central finding is calibration inversion: Gemini achieves the best point-estimate accuracy (5.3% mean error, rank 1) yet the second-worst confidence-interval calibration (29.7%, rank 5; roughly 60 pp below the 90% target), a dissociation absent from all prior domains. A second finding is data-freshness failure: DeepSeek's S&P 500 predictions are systematically stale (12–13% error) and Perplexity's predictions freeze across sessions; both are failures of information-access architecture. The cross-protocol Spearman rank correlation (ρ = 0.695) indicates moderate ranking consistency. Claude alone achieves top-tier performance across all protocols (composite 93.6%; V1: 96.7%, V3: 100%, V4-CI: 84.2%). Four indeterminate-domain error types are identified; Type IV (calibration inversion) demands a calibration-verification layer in GAAS.
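The dissociation described above can be illustrated with a minimal Python sketch. It is not the study's scoring code: the abstract does not define its error or calibration metrics, so the sketch assumes mean absolute percentage error for point estimates and empirical interval coverage for calibration. The function names and the toy data are hypothetical.

```python
from statistics import mean

def mean_abs_pct_error(preds, actuals):
    """Mean absolute percentage error of point estimates, in percent (assumed metric)."""
    return 100 * mean(abs(p - a) / abs(a) for p, a in zip(preds, actuals))

def ci_coverage(intervals, actuals):
    """Percentage of outcomes that fall inside the stated (low, high) interval."""
    hits = sum(low <= a <= high for (low, high), a in zip(intervals, actuals))
    return 100 * hits / len(actuals)

def calibration_gap(coverage_pct, target_pct=90.0):
    """Signed gap in percentage points; negative means the intervals are too narrow."""
    return coverage_pct - target_pct

# Hypothetical toy data: point estimates are close, but intervals are too tight.
actuals = [100.0, 200.0, 50.0, 80.0]
preds = [102.0, 196.0, 51.0, 79.0]
intervals = [(101.0, 103.0), (195.0, 201.0), (50.5, 51.5), (78.5, 79.5)]

error = mean_abs_pct_error(preds, actuals)   # small point error
coverage = ci_coverage(intervals, actuals)   # only 1 of 4 outcomes is covered
gap = calibration_gap(coverage)              # far below the 90% target

print(f"point error: {error:.2f}%")
print(f"coverage: {coverage:.1f}%  gap: {gap:+.1f} pp")
```

In this toy case the point error is under 2% while only a quarter of the intervals contain the outcome, which is the pattern the abstract calls calibration inversion. Applying `calibration_gap` to the reported 29.7% against the 90% target gives about −60.3 pp.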