The Reliability Chasm: AI Accuracy Across Three Reasoning Domains

Abstract

Frontier AI systems achieve 93% accuracy on formal reasoning benchmarks, yet the fastest-growing uses of AI concern indeterminate futures: market movements, sports outcomes, medical symptoms. Here we show, across three epistemically distinct problem domains and 7,950 individually scored data points from six frontier models, that AI reliability degrades sharply and predictably as question type shifts from formal to indeterminate. The governing variable is ρ̂, the error correlation between ensemble components. Compute scaling yields ρ̂ = 0.80: retrying does not help because errors are structural and shared. GAAS role separation (Generator–Auditor–Adversary–Synthesizer) reduces this to ρ̂ = 0.19, a four-fold improvement on identical compute. Formal-domain accuracy reaches 93.0% single-agent and 98.7% with the GAAS architecture. Semi-determinate expert synthesis falls to 79.2%, with causal-hierarchy errors in 13 of 18 evaluations. Indeterminate futures reach only 66.0%, with a calibration inversion: the highest-accuracy model produces 90% confidence intervals that capture actual outcomes just 29.7% of the time. These results demonstrate that intelligence and reliability are empirically orthogonal precisely where AI is deployed most consequentially.
