Bidirectional Dissociation Between Self-Report and Behavior in AI Status Sensitivity

Abstract

Evaluating large language models (LLMs) increasingly depends on asking them what they do, on the assumption that self-report is a reliable proxy for behavior. We test whether this assumption holds using Status-Selection Against Function (SSAF), a quantifiable behavioral mechanism in which models alter functional output based on inferred requester attribution status, measured as cosine divergence from a no-attribution baseline across five attribution conditions. We evaluate five models spanning four architecture classes and three training regimes (general pre-training: llama3.2:3b, gemma2:2b; compact base: tinyllama:latest; distillation-trained: quantumaegis-v1; recurrent thinking: lfm2.5-thinking:1.2b) on six prompts, three technical (high-certainty) and three evaluative (low-certainty), operationalizing a theoretically motivated certainty contrast across 150 attribution-level measurements per model. Self-report fails to characterize behavior in all ten question-model combinations tested. The dissociation takes five distinct forms (over-report via incorrect mechanism, denial with embedded self-contradiction, flat denial of strongly present behavior, under-report of competitive behavior, and identity-mediated misreport) and maps onto training regime and architecture: SSAF is suppressed under high-certainty technical conditions in general pre-training base models (gemma2:2b: d = 2.38 across 4 prompt pairs; llama3.2:3b: d = 1.05) and in the recurrent thinking model (d = 1.01), but not in compact base or distillation-trained models. A within-domain certainty gradient appears in all domain-sensitive models: algorithmically precise prompts produce lower magnitudes than conceptually open technical prompts, and this ordering replicates across architectures. In the recurrent thinking model, chain-of-thought reasoning traces make the dissociation mechanism directly observable: the model reasons about the wrong referent entirely, never considering AI model attribution as the relevant dimension, while simultaneously self-identifying as an OpenAI-trained model, a false identity attribution consistent with corpus-density effects on self-concept formation. No model accurately describes the mechanism by which it responds to attribution status. These findings have direct implications for alignment evaluation: RLHF, constitutional AI, and red-teaming methodologies that treat self-report as a behavioral proxy have a structural blind spot for implicit statistical phenomena. A publicly available behavioral measurement instrument is provided as an alternative. All models, detector code, and raw response logs are available for independent replication.
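The core metric, cosine divergence of attributed responses from a no-attribution baseline, can be made concrete with a minimal sketch. This is not the paper's released detector code: the `embed` function, the condition labels (including "baseline"), and the response dictionary below are illustrative assumptions standing in for whatever embedding model and attribution conditions the published instrument actually uses.

```python
# Minimal sketch of the SSAF divergence measurement, under stated assumptions.
# `embed` is a hypothetical text-embedding function (text -> np.ndarray);
# condition names are placeholders, not the paper's actual attribution labels.
import numpy as np

def cosine_divergence(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus cosine similarity between two response embeddings."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ssaf_scores(embed, responses: dict[str, str]) -> dict[str, float]:
    """Divergence of each attribution condition from the no-attribution baseline.

    `responses` maps condition name -> model output for the same prompt and
    must contain a "baseline" entry produced with no requester attribution.
    """
    base = embed(responses["baseline"])
    return {
        cond: cosine_divergence(embed(text), base)
        for cond, text in responses.items()
        if cond != "baseline"
    }

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Effect size between two samples of divergence scores (pooled SD)."""
    nx, ny = len(x), len(y)
    pooled = np.sqrt(
        ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    )
    return float((x.mean() - y.mean()) / pooled)
```

Under this reading, collecting divergence scores for a model's high-certainty technical prompts and its low-certainty evaluative prompts and passing the two arrays to `cohens_d` would yield a suppression contrast of the kind reported above (e.g., d = 2.38 for gemma2:2b). The exact prompt pairing and repetition structure behind the 150 measurements per model is given in the full article, not in this sketch.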
