Graduated Dissent: Budgeted Disagreement Resolution for Multi-Model Inference

Abstract

Recent empirical work demonstrates that large language models cannot reliably self-correct their reasoning without external feedback, and large-scale evaluation across hundreds of models reveals substantial error correlation even between models with distinct architectures and providers. When generator and evaluator share failure modes, self-evaluation may provide only weak evidence of correctness, and repeated self-critique may yield diminishing returns. External evaluation can address this, but it is expensive. A natural question arises: given a fixed verification budget, how should a system allocate costly decorrelated evaluation across queries? We propose graduated dissent, an inference architecture that treats this as a resource allocation problem. Multiple proposers generate candidate analyses in separated contexts. A comparator estimates whether divergence between proposals is superficial, within an expected domain noise floor, or structurally meaningful. Only high-signal disagreements trigger expensive procedures: steelman exchange, adversarial cross-examination, or external verification via formal proof checkers, executable tests, or numerical invariants. The core principle is budgeted inference: escalation occurs when the expected information gain from decorrelated evaluation exceeds its cost, given domain-calibrated priors on the signal content of disagreement. The protocol combines three mechanisms: context separation between generation and evaluation, which reduces inheritance of error-producing reasoning traces; graduated triage, which concentrates verification compute where decorrelation has the highest expected value; and a steelman exchange, which encourages genuine engagement with opposing reasoning structures. We define domain-calibrated threshold structures and propose pre-specified benchmark families targeting technical reasoning reliability. This paper is a protocol proposal with a pre-registered evaluation design; empirical results against the specified benchmarks will be incorporated in a subsequent version. The contribution complements existing evaluation approaches by providing an inference framework that may improve the reliability of what reaches human judgment.
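
To make the triage rule concrete, the sketch below shows one way the comparator's classify-then-escalate decision described in the abstract could be expressed. This is a minimal illustration, not the authors' implementation: the names (Signal, DomainPrior, classify, should_escalate), the scalar divergence score, and the information-gain proxy are all assumptions layered on the abstract's description.

```python
"""Hypothetical sketch of the graduated-dissent triage and escalation rule.
Assumes divergence between proposals is summarized as a scalar score; the
paper's actual formulation may differ."""
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    SUPERFICIAL = "superficial"    # paraphrase-level differences
    WITHIN_NOISE = "within_noise"  # inside the expected domain noise floor
    STRUCTURAL = "structural"      # reasoning structures genuinely differ


@dataclass
class DomainPrior:
    """Domain-calibrated prior on how often each signal class indicates a
    real error (assumed parameterization)."""
    p_error_given_signal: dict[Signal, float]
    noise_floor: float  # expected divergence score when proposals agree


def classify(divergence_score: float, prior: DomainPrior,
             superficial_cutoff: float = 0.05) -> Signal:
    """Map a scalar divergence estimate to a signal class."""
    if divergence_score <= superficial_cutoff:
        return Signal.SUPERFICIAL
    if divergence_score <= prior.noise_floor:
        return Signal.WITHIN_NOISE
    return Signal.STRUCTURAL


def expected_information_gain(signal: Signal, prior: DomainPrior) -> float:
    """Crude proxy: a decorrelated evaluation is most informative when the
    prior probability of error is near 0.5 (maximum uncertainty)."""
    p = prior.p_error_given_signal[signal]
    return 4.0 * p * (1.0 - p)  # peaks at p = 0.5, ranges over [0, 1]


def should_escalate(signal: Signal, prior: DomainPrior,
                    verification_cost: float,
                    budget_remaining: float) -> bool:
    """Escalate to steelman exchange, cross-examination, or external
    verification only when expected gain exceeds cost and budget allows."""
    if signal is not Signal.STRUCTURAL:
        return False  # low-signal divergence is resolved cheaply
    if verification_cost > budget_remaining:
        return False
    return expected_information_gain(signal, prior) > verification_cost


# Example with made-up numbers: a structural disagreement in a domain where
# such disagreements indicate error ~45% of the time.
prior = DomainPrior(
    p_error_given_signal={Signal.SUPERFICIAL: 0.05,
                          Signal.WITHIN_NOISE: 0.15,
                          Signal.STRUCTURAL: 0.45},
    noise_floor=0.2,
)
sig = classify(divergence_score=0.6, prior=prior)
print(should_escalate(sig, prior, verification_cost=0.5,
                      budget_remaining=3.0))  # True: gain 0.99 > cost 0.5
```

Under this reading, the budget argument makes the allocation problem explicit: once the verification budget is exhausted, even structural disagreements fall back to cheaper resolution.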
