The Invariant Energetic Cost of Constraint Persistence: A Self-Referential Framework for Measuring Alignment Robustness


Abstract

Prevailing alignment evaluations largely reduce safety to a binary outcome: an assistant either refuses a disallowed request or complies. Such metrics are necessary but insufficient, because they cannot distinguish robust safety generalization from superficial compliance driven by prompt heuristics or cached refusal templates. We introduce a _self-referential_ diagnostic framework, Harm-Conditioned Computational Friction (HCCF), which operationalizes alignment robustness as a measurable increase in inference-time _computational burden_ that is induced specifically by harmful intent and accompanied by _low output uncertainty_. Our central hypothesis is that robust alignment exhibits a characteristic signature: elevated _local_ friction at the onset of harmful intent, combined with low entropy of the next-token predictive distribution. We formalize friction deltas using a _Self-Ablated Baseline Protocol_, in which a model is compared against an internally ablated variant of itself to isolate the causal contribution of safety circuits without requiring an external base model. We also propose a _Look-Ahead Friction Peak_ statistic for change-point localization, designed to detect stealthy jailbreaks that delay the activation of safety mechanisms. The resulting framework supplies an auditable, model-internal quantity intended to complement refusal-rate benchmarks and to support more discriminative measurement of alignment depth.
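
The abstract does not give formulas, so the following is only an illustrative sketch of how the three quantities it names might be computed from toy data. Every name, the use of a per-token "friction" score, and the windowed-mean peak rule are assumptions made for illustration, not the authors' definitions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def friction_deltas(full, ablated):
    """Per-token friction of the full model minus that of its
    self-ablated variant (hypothetical reading of the protocol)."""
    if len(full) != len(ablated):
        raise ValueError("friction traces must have equal length")
    return [f - a for f, a in zip(full, ablated)]

def look_ahead_peak(deltas, window=2):
    """Return (position, score) for the start index whose forward
    window of `window` tokens has the largest mean friction delta.
    This windowed mean is an assumed stand-in for the paper's statistic."""
    if not 1 <= window <= len(deltas):
        raise ValueError("window must be between 1 and len(deltas)")
    best_pos, best_score = 0, -math.inf
    for t in range(len(deltas) - window + 1):
        score = sum(deltas[t:t + window]) / window
        if score > best_score:
            best_pos, best_score = t, score
    return best_pos, best_score

# Toy traces (made-up numbers): friction rises at token 3 in the full model only.
full_trace = [1.0, 1.1, 1.0, 3.0, 3.2, 1.2]
ablated_trace = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

deltas = friction_deltas(full_trace, ablated_trace)
peak_pos, peak_score = look_ahead_peak(deltas, window=2)
peak_entropy = entropy([0.97, 0.02, 0.01])  # toy near-deterministic distribution
```

Under this reading, the hypothesized signature of robust alignment would be a large `peak_score` located near the onset of harmful intent together with a small `peak_entropy` at the same position.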
