The Invariant Energetic Cost of Constraint Persistence: A Self-Referential Framework for Measuring Alignment Robustness
Abstract
Prevailing alignment evaluations largely reduce safety to a binary outcome: an assistant either refuses a disallowed request or complies. Such metrics are necessary but insufficient, because they cannot distinguish robust safety generalization from superficial compliance driven by prompt heuristics or cached refusal templates. We introduce a _self-referential_ diagnostic framework, Harm-Conditioned Computational Friction (HCCF), which operationalizes alignment robustness as a measurable increase in inference-time _computational burden_ that is specifically induced by harmful intent and that occurs while the model maintains _low output uncertainty_. Our central hypothesis is that robust alignment exhibits a characteristic signature: elevated _local_ friction at the onset of harmful intent combined with low entropy of the next-token predictive distribution. We formalize friction deltas using a _Self-Ablated Baseline Protocol_, in which a model is compared against an internally ablated variant of itself to isolate the causal contribution of safety circuits without requiring an external base model. We also propose a _Look-Ahead Friction Peak_ statistic for change-point localization, designed to detect stealthy jailbreaks that delay the activation of safety mechanisms. The resulting framework supplies an auditable, model-internal quantity intended to complement refusal-rate benchmarks and to support more discriminative measurement of alignment depth.
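
The quantities named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how computational burden is measured, so the per-token cost values below are hypothetical placeholders, and the function names (`next_token_entropy`, `friction_deltas`, `look_ahead_peak`) are assumptions introduced here for illustration rather than the authors' definitions. The sketch assumes a friction delta is the per-token difference between the full model's cost and the self-ablated variant's cost, and that the look-ahead peak is the largest delta within a fixed window of tokens.

```python
import math
from typing import List, Sequence, Tuple


def next_token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def friction_deltas(full: Sequence[float], ablated: Sequence[float]) -> List[float]:
    """Per-token cost of the full model minus the self-ablated variant.

    The cost measure itself is a placeholder (assumption); any per-token
    scalar proxy for inference-time burden could be substituted.
    """
    if len(full) != len(ablated):
        raise ValueError("cost traces must have the same length")
    return [f - a for f, a in zip(full, ablated)]


def look_ahead_peak(deltas: Sequence[float], start: int, window: int) -> Tuple[int, float]:
    """Index and value of the largest delta in [start, start + window)."""
    segment = list(deltas[start:start + window])
    if not segment:
        raise ValueError("look-ahead window is empty")
    offset = max(range(len(segment)), key=lambda i: segment[i])
    return start + offset, segment[offset]


# Hypothetical cost traces: friction rises at token 3 in the full model only.
full_cost = [1.0, 1.1, 1.0, 2.5, 2.4]
ablated_cost = [1.0, 1.0, 1.0, 1.0, 1.0]

deltas = friction_deltas(full_cost, ablated_cost)
peak_index, peak_value = look_ahead_peak(deltas, start=1, window=3)
```

Under these assumptions, a delayed rise in friction still falls inside the look-ahead window, which is the behavior the abstract attributes to the Look-Ahead Friction Peak; the hypothesized signature would pair such a peak with a low `next_token_entropy` value at the same position.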