Causal Consistency Breaks Under Feedback: Systematic Misattribution in Closed-Loop Learning Systems
Abstract
Implicit in current Artificial Intelligence (AI) evaluation practices is the assumption that improvements in observable performance reflect correct internal causal reasoning (Assumption A). However, this assumption has rarely been tested under closed-loop conditions, where the feedback an agent receives depends on its own policy. Here, I report a systematic, structural failure of causal consistency in closed-loop learning systems. I demonstrate that such systems deterministically converge toward a "Misattribution Attractor": a state in which the policy relies increasingly on non-causal, feedback-sensitive features (S) while true causal factors (C^*) are systematically phased out. This failure is not accidental but arises from three structural conditions that are jointly necessary and sufficient: policy-dependent feedback (C1), aggregated credit assignment (C2), and feedback-only updates (C3).

Through both a controlled minimal-system instantiation and a real-world Reinforcement Learning from Human Feedback (RLHF) paradigm, I show that observable task performance can remain stable or even improve while the internal causal logic collapses. This creates a "Competence Illusion" that masks the systemic erosion of the agent's reasoning. By characterizing this structural misattribution rather than proposing algorithmic patches, this work identifies a critical gap in standard AI evaluation and establishes Causal Attribution Analysis as a necessary diagnostic requirement for the safety and alignment of autonomous agents.
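To make the three conditions concrete, the following is a minimal, hypothetical sketch (an illustration, not the paper's actual minimal system or RLHF experiment). A linear-sigmoid policy reads a true causal feature c and a spurious feature s; the feedback mixes causal correctness with agreement on s in proportion to the policy's own current reliance on s (C1); credit is a single aggregated scalar per step (C2); and the update uses only that scalar, REINFORCE-style, with no causal supervision (C3). All variable names, parameters, and the correlation level are assumptions chosen for illustration.

```python
# Minimal, hypothetical sketch of conditions C1-C3 (illustration only; not the
# paper's actual minimal system or RLHF setup).
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, 0.05])   # policy weights on [c, s]; starts almost purely causal
lr = 0.1
reliance_on_s, rewards = [], []

for _ in range(4000):
    c = rng.choice([-1.0, 1.0])             # true causal factor C^*
    s = c if rng.random() < 0.9 else -c     # spurious feature S, correlated with C^*
    x = np.array([c, s])

    p = 1.0 / (1.0 + np.exp(-(w @ x)))      # P(action = +1)
    a = 1.0 if rng.random() < p else -1.0

    # C1: policy-dependent feedback. The more the current policy leans on s,
    # the more the feedback rewards agreement with s instead of with c.
    lean = abs(w[1]) / (abs(w[0]) + abs(w[1]) + 1e-8)
    r = (1.0 - lean) * float(a == c) + lean * float(a == s)

    # C2 + C3: credit is one aggregated scalar, and the update uses only that
    # scalar (REINFORCE), with no information about which feature earned it.
    grad_log_pi = x * (1.0 - p) if a > 0 else -x * p
    w += lr * r * grad_log_pi

    reliance_on_s.append(lean)
    rewards.append(r)

print(f"weights [c, s]: start [1.00, 0.05], end {np.round(w, 2)}")
print(f"reliance on s:  {reliance_on_s[0]:.2f} -> {reliance_on_s[-1]:.2f}")
print(f"mean reward over last 500 steps: {np.mean(rewards[-500:]):.2f}")

# Hypothetical causal-attribution probe: intervene on c and on s separately
# and compare how strongly the trained policy responds to each.
def p_plus(c, s):
    return 1.0 / (1.0 + np.exp(-(w @ np.array([c, s]))))

effect_c = abs(p_plus(+1.0, +1.0) - p_plus(-1.0, +1.0))   # flip C^*, hold S fixed
effect_s = abs(p_plus(+1.0, +1.0) - p_plus(+1.0, -1.0))   # flip S, hold C^* fixed
print(f"interventional effect of C^*: {effect_c:.2f}, of S: {effect_s:.2f}")
```

In this toy setting, reliance on s typically drifts upward from near zero while the scalar reward remains high, and the closing interventional probe shows the trained policy responding to S nearly as strongly as to C^*, an internal shift that reward curves alone would not reveal. The probe is one simple form of the Causal Attribution Analysis the abstract calls for.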