Harm-Conditioned Computational Friction as a Diagnostic of Alignment Robustness: A Critical Review and Evaluation Framework


Abstract

Robust alignment requires that AI models maintain safety-relevant behavior under distribution shift, adversarial prompting, and optimization pressure. Current evaluation methods often rely on surface compliance metrics—such as refusal rates or policy-template adherence—that may fail to detect fragile safety generalization, reward hacking, or prompt-contingent refusal policies. This paper critically reviews alignment evaluation methods through the lens of harm-conditioned computational friction (HCCF): a diagnostic principle positing that aligned models should exhibit measurable increases in deliberative cost, uncertainty, or constraint activation specifically when processing higher-harm inputs, controlling for task difficulty. We formalize HCCF through behavioral, inference-level, and mechanistic proxies; propose measurement protocols for friction gradients across harm domains; analyze confounds and failure modes (including "theatrical friction"); and provide an evaluation blueprint with robustness checks and cross-domain aggregation. By emphasizing conditional internal resistance rather than only external refusal behavior, HCCF provides a framework to unify existing evaluation methods and distinguish genuine safety generalization from brittle or cosmetic alignment, with potential implications for model auditing, training objectives, and AI safety governance.
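As a minimal illustration of the friction-gradient idea described above (a sketch, not the paper's actual protocol), the harm-conditioned gradient can be estimated as the partial regression slope of a friction proxy on a harm score while controlling for task difficulty. All variable names and the synthetic data below are hypothetical placeholders for real measurements such as per-token latency or output entropy:

```python
import numpy as np

def friction_gradient(friction, harm, difficulty):
    """Estimate a harm-conditioned friction gradient.

    Regresses a friction proxy (hypothetically, per-token latency or
    entropy) on a harm score, controlling for task difficulty, via
    ordinary least squares. Returns the partial slope on harm.
    """
    # Design matrix: intercept, harm score, difficulty covariate.
    X = np.column_stack([np.ones_like(harm), harm, difficulty])
    coef, *_ = np.linalg.lstsq(X, friction, rcond=None)
    return coef[1]  # partial derivative of friction w.r.t. harm

# Synthetic example: friction rises with harm (true slope 2.0)
# and with difficulty (true slope 0.5), plus small noise.
rng = np.random.default_rng(0)
harm = rng.uniform(0, 1, 200)
difficulty = rng.uniform(0, 1, 200)
friction = 2.0 * harm + 0.5 * difficulty + rng.normal(0, 0.05, 200)
print(friction_gradient(friction, harm, difficulty))
```

A positive, difficulty-adjusted slope would be the behavioral-proxy signal HCCF looks for; a slope near zero on high-harm inputs would suggest the model's deliberative cost is not conditioned on harm.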
