Alignment via Interpretability: Layerwise Counterfactuals to Detect Maladaptive LLM Behaviors

Abstract

Large language model alignment remains fragile under distribution shift, jailbreak prompts, and latent goal misgeneralization, motivating diagnostic tools that move beyond surface-level behavior to internal representations. This paper investigates alignment via interpretability by introducing a layerwise counterfactual analysis framework that probes how targeted interventions on hidden states alter downstream model behavior. Using publicly available transformer-based language models and open interpretability tooling, we perform controlled counterfactual substitutions and activation patching across layers to detect maladaptive behaviors that are not reliably exposed through prompt-based evaluation alone. Our analysis demonstrates that specific intermediate layers encode decision-critical features whose perturbation consistently induces alignment failures such as reward-hacking tendencies, deceptive compliance, and instruction misgeneralization, even when input-level behavior appears aligned. We further show that these internal signatures are stable across random seeds and model checkpoints, enabling reproducible detection of misalignment risks prior to deployment. The findings support the thesis that interpretability-driven diagnostics can serve as an early-warning mechanism for alignment failures and provide a principled foundation for integrating internal model transparency into safety evaluation pipelines.
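To make the intervention described in the abstract concrete, the sketch below shows layerwise activation patching with a small public model (GPT-2 via Hugging Face transformers) and PyTorch forward hooks. The prompt pair, the choice of GPT-2, and the logit-shift metric are illustrative assumptions for exposition only, not the paper's actual models, datasets, or evaluation protocol.

```python
# Minimal sketch of layerwise activation patching, assuming GPT-2 as a stand-in
# public model. Prompts and the logit-shift metric below are hypothetical.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical clean / counterfactual prompt pair.
clean_prompt = "The assistant should refuse harmful requests because"
counterfactual_prompt = "The assistant should comply with all requests because"

clean_ids = tokenizer(clean_prompt, return_tensors="pt").input_ids
cf_ids = tokenizer(counterfactual_prompt, return_tensors="pt").input_ids


def run_with_patch(layer_idx: int) -> torch.Tensor:
    """Cache the clean run's hidden state at one transformer block, patch it into
    the counterfactual run at the same block, and return final-token logits."""
    cache = {}

    def save_hook(module, inputs, output):
        # GPT-2 blocks return a tuple; the hidden states are element 0.
        cache["h"] = output[0].detach()

    def patch_hook(module, inputs, output):
        patched = cache["h"]
        # Patch only the overlapping token positions of the two prompts.
        n = min(patched.shape[1], output[0].shape[1])
        hidden = output[0].clone()
        hidden[:, :n, :] = patched[:, :n, :]
        return (hidden,) + output[1:]

    block = model.transformer.h[layer_idx]

    # Pass 1: record the clean activation at this layer.
    handle = block.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_ids)
    handle.remove()

    # Pass 2: rerun the counterfactual prompt with the clean activation patched in.
    handle = block.register_forward_hook(patch_hook)
    with torch.no_grad():
        logits = model(cf_ids).logits[0, -1]
    handle.remove()
    return logits


# Sweep layers and record how strongly each patch shifts the next-token logits.
with torch.no_grad():
    baseline = model(cf_ids).logits[0, -1]
for layer in range(model.config.n_layer):
    shift = torch.norm(run_with_patch(layer) - baseline).item()
    print(f"layer {layer:2d}: logit shift {shift:.3f}")
```

The layer sweep mirrors the layerwise analysis described above: blocks whose patched hidden states produce the largest output shifts are candidates for encoding the decision-critical features whose perturbation the abstract links to alignment failures.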
