Calibration Inversion and Data-Freshness Govern AI Reliability in Indeterminate Domains: Evidence from 2,730 Data Points / 142 Cross-Protocol Sessions



Abstract

AI systems in fully indeterminate domains, where ground truth is probabilistic and known only post hoc, present distinct reliability challenges. This third study in a three-study programme analyses 2,730 data points across 142 sessions, six frontier models, and three protocols spanning financial markets, meteorology, sports, and cryptocurrency. The central finding is calibration inversion: Gemini achieves the best point-estimate accuracy (5.3% mean error, rank 1) yet the second-worst confidence-interval calibration (29.7% coverage, rank 5; about 60 pp below the 90% target), a dissociation absent from all prior domains. A second finding is data-freshness failure: DeepSeek's S&P 500 predictions are systematically stale (12–13% error) and Perplexity's predictions freeze across sessions; both are failures of information-access architecture. The cross-protocol Spearman rank correlation (ρ = 0.695) indicates moderate ranking consistency. Claude alone achieves top-tier performance across all protocols (composite 93.6%; V1: 96.7%, V3: 100%, V4-CI: 84.2%). Four indeterminate-domain error types are identified; Type IV (calibration inversion) demands a calibration-verification layer in GAAS.
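The dissociation between point accuracy and interval calibration can be illustrated with a minimal Python sketch. The abstract does not give the study's exact metric definitions, so this assumes mean absolute percentage error for point accuracy and empirical coverage of stated 90% intervals for calibration. The `Forecast` type, the function names, and the four sample forecasts are hypothetical and are not data from the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Forecast:
    point: float   # point estimate
    lower: float   # lower bound of the stated 90% interval
    upper: float   # upper bound of the stated 90% interval
    actual: float  # realised outcome, known only after the fact


def mean_abs_pct_error(forecasts: List[Forecast]) -> float:
    """Mean absolute percentage error of the point estimates, in percent."""
    total = sum(abs(f.point - f.actual) / abs(f.actual) for f in forecasts)
    return 100.0 * total / len(forecasts)


def interval_coverage(forecasts: List[Forecast]) -> float:
    """Share of realised outcomes inside the stated interval, in percent."""
    hits = sum(1 for f in forecasts if f.lower <= f.actual <= f.upper)
    return 100.0 * hits / len(forecasts)


def calibration_gap(forecasts: List[Forecast], target: float = 90.0) -> float:
    """Coverage minus nominal target, in percentage points (negative = overconfident)."""
    return interval_coverage(forecasts) - target


# Hypothetical forecasts: close point estimates, but intervals that are too narrow.
sharp_but_overconfident = [
    Forecast(point=102.0, lower=101.5, upper=102.5, actual=100.0),
    Forecast(point=98.0, lower=97.5, upper=98.5, actual=100.0),
    Forecast(point=100.5, lower=100.0, upper=101.0, actual=100.0),
    Forecast(point=99.0, lower=98.8, upper=99.2, actual=100.0),
]

if __name__ == "__main__":
    print(f"point error: {mean_abs_pct_error(sharp_but_overconfident):.3f}%")
    print(f"coverage:    {interval_coverage(sharp_but_overconfident):.1f}%")
    print(f"gap:         {calibration_gap(sharp_but_overconfident):+.1f} pp")
```

In this toy set the point estimates are within 2% of the outcome, yet only one of the four intervals contains it, so a model can rank well on accuracy while falling far short of the nominal coverage target.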
