Limits of Self-Correction in LLMs: An Information-Theoretic Analysis of Correlated Errors

Abstract

We develop a diagnostic framework for evaluating when LLM self-evaluation can be trusted. The framework's central results are: (1) under a shared-blind-spot modeling assumption, with joint conditional independence of the evaluations given the shared failure structure, k rounds of self-critique provide information about correctness bounded by what the shared latent failure variable Z mediates rather than by any independent channel, so that confidence accumulated through repeated self-evaluation reflects the shared failure structure rather than independently accumulated evidence; and (2) a selector satisfying two independently measurable sufficient conditions (a bounded false-acceptance rate and a true-acceptance rate exceeding that bound) provides a quantifiable lower bound on the evidence it contributes about correctness. Both results are conditional on explicit modeling assumptions. We also prove an information-theoretic bound showing that self-evaluation can add only a limited amount of information when a shared latent failure structure mediates both generation and evaluation errors; we foreground this as scaffolding rather than as a primary contribution, since the latent variable requires independent operationalization for the bound to have empirical bite. The diagnostic framework identifies what to measure to determine whether a deployed system is in the failure regime, and what properties an external selector must have to escape it. We describe design principles for an architecture motivated by this analysis; same-model context separation is an engineering heuristic, not a theoretical solution, and we present it as a practical starting point pending empirical validation.
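
In symbols, a minimal sketch of the first result under the abstract's stated assumption (the notation C for correctness and E_1, ..., E_k for the k self-evaluations is illustrative; only Z is named above). Reading the assumption as the evaluations being jointly independent of correctness given Z, the chain C \to Z \to (E_1, \dots, E_k) is Markov, and the data-processing inequality gives

\[
I(C;\, E_1, \dots, E_k) \;\le\; I(C;\, Z),
\]

so additional rounds of self-critique cannot supply more information about correctness than the shared failure variable Z already mediates.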
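
Similarly, one natural formalization of the selector result (the symbols \alpha and \beta are illustrative, not fixed by the abstract): if the false-acceptance rate satisfies P(\text{accept} \mid \text{incorrect}) \le \beta and the true-acceptance rate satisfies P(\text{accept} \mid \text{correct}) \ge \alpha with \alpha > \beta, then each acceptance event carries a log-likelihood ratio of at least

\[
\log \frac{P(\text{accept} \mid \text{correct})}{P(\text{accept} \mid \text{incorrect})} \;\ge\; \log \frac{\alpha}{\beta} \;>\; 0,
\]

a quantifiable lower bound on the evidence about correctness that the external selector contributes; both rates are measurable independently of the model under evaluation.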
