Graduated Dissent: Budgeted Disagreement Resolution for Multi-Model Inference

Abstract

Recent empirical work demonstrates that large language models cannot reliably self-correct their reasoning without external feedback, and large-scale evaluation across hundreds of models reveals substantial error correlation even between models with distinct architectures and providers. When generator and evaluator share failure modes, self-evaluation may provide only weak evidence of correctness, and repeated self-critique may yield diminishing returns. External evaluation can address this, but it is expensive. A natural question arises: given a fixed verification budget, how should a system allocate costly decorrelated evaluation across queries? We propose graduated dissent, an inference architecture that treats this as a resource allocation problem. Multiple proposers generate candidate analyses in separated contexts. A comparator estimates whether divergence between proposals is superficial, within an expected domain noise floor, or structurally meaningful. Only high-signal disagreements trigger expensive procedures: steelman exchange, adversarial cross-examination, or external verification via formal proof checkers, executable tests, or numerical invariants. The core principle is budgeted inference: escalation occurs when the expected information gain from decorrelated evaluation exceeds its cost, given domain-calibrated priors on the signal content of disagreement. The protocol combines three mechanisms: context separation between generation and evaluation, which reduces inheritance of error-producing reasoning traces; graduated triage, which concentrates verification compute where decorrelation has the highest expected value; and a steelman exchange, which encourages genuine engagement with opposing reasoning structures. We define domain-calibrated threshold structures and propose pre-specified benchmark families targeting technical reasoning reliability. This paper is a protocol proposal with a pre-registered evaluation design; empirical results against the specified benchmarks will be incorporated in a subsequent version. The contribution complements existing evaluation approaches by providing an inference framework that may improve the reliability of what reaches human judgment.
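
To make the triage rule concrete, the sketch below shows one way the comparator's classify-then-escalate decision described in the abstract could be expressed. This is a minimal illustration, not the authors' implementation: the names (Signal, DomainPrior, classify, should_escalate), the scalar divergence score, and the information-gain proxy are all assumptions layered on the abstract's description.

```python
"""Hypothetical sketch of the graduated-dissent triage and escalation rule.
Assumes divergence between proposals is summarized as a scalar score; the
paper's actual formulation may differ."""
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    SUPERFICIAL = "superficial"    # paraphrase-level differences
    WITHIN_NOISE = "within_noise"  # inside the expected domain noise floor
    STRUCTURAL = "structural"      # reasoning structures genuinely differ


@dataclass
class DomainPrior:
    """Domain-calibrated prior on how often each signal class indicates a
    real error (assumed parameterization)."""
    p_error_given_signal: dict[Signal, float]
    noise_floor: float  # expected divergence score when proposals agree


def classify(divergence_score: float, prior: DomainPrior,
             superficial_cutoff: float = 0.05) -> Signal:
    """Map a scalar divergence estimate to a signal class."""
    if divergence_score <= superficial_cutoff:
        return Signal.SUPERFICIAL
    if divergence_score <= prior.noise_floor:
        return Signal.WITHIN_NOISE
    return Signal.STRUCTURAL


def expected_information_gain(signal: Signal, prior: DomainPrior) -> float:
    """Crude proxy: a decorrelated evaluation is most informative when the
    prior probability of error is near 0.5 (maximum uncertainty)."""
    p = prior.p_error_given_signal[signal]
    return 4.0 * p * (1.0 - p)  # peaks at p = 0.5, ranges over [0, 1]


def should_escalate(signal: Signal, prior: DomainPrior,
                    verification_cost: float,
                    budget_remaining: float) -> bool:
    """Escalate to steelman exchange, cross-examination, or external
    verification only when expected gain exceeds cost and budget allows."""
    if signal is not Signal.STRUCTURAL:
        return False  # low-signal divergence is resolved cheaply
    if verification_cost > budget_remaining:
        return False
    return expected_information_gain(signal, prior) > verification_cost


# Example with made-up numbers: a structural disagreement in a domain where
# such disagreements indicate error ~45% of the time.
prior = DomainPrior(
    p_error_given_signal={Signal.SUPERFICIAL: 0.05,
                          Signal.WITHIN_NOISE: 0.15,
                          Signal.STRUCTURAL: 0.45},
    noise_floor=0.2,
)
sig = classify(divergence_score=0.6, prior=prior)
print(should_escalate(sig, prior, verification_cost=0.5,
                      budget_remaining=3.0))  # True: gain 0.99 > cost 0.5
```

Under this reading, the budget argument makes the allocation problem explicit: once the verification budget is exhausted, even structural disagreements fall back to cheaper resolution.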
