The Reliability Chasm: AI Accuracy Across Three Reasoning Domains

Abstract

Frontier AI systems achieve 93% accuracy on formal reasoning benchmarks, yet the fastest-growing uses of AI concern indeterminate futures: market movements, sports outcomes, medical symptoms. Here we show, across three epistemically distinct problem domains and 7,950 individually scored data points from six frontier models, that AI reliability degrades sharply and predictably as question type shifts from formal to indeterminate. The governing variable is ρ̂, the error correlation between ensemble components. Compute scaling yields ρ̂ = 0.80: retrying does not help because errors are structural and shared. GAAS role separation (Generator–Auditor–Adversary–Synthesizer) reduces this to ρ̂ = 0.19, a four-fold improvement on identical compute. Formal-domain accuracy reaches 93.0% single-agent and 98.7% with the GAAS architecture. Semi-determinate expert synthesis falls to 79.2%, with causal-hierarchy errors in 13 of 18 evaluations. Indeterminate futures reach only 66.0%, with a calibration inversion: the highest-accuracy model produces 90% confidence intervals that capture actual outcomes just 29.7% of the time. These results demonstrate that intelligence and reliability are empirically orthogonal precisely where AI is deployed most consequentially.
