Alignment via Interpretability: Layerwise Counterfactuals to Detect Maladaptive LLM Behaviors
Abstract
Large language model alignment remains fragile under distribution shift, jailbreak prompts, and latent goal misgeneralization, motivating diagnostic tools that move beyond surface-level behavior to internal representations. This paper investigates alignment via interpretability by introducing a layerwise counterfactual analysis framework that probes how targeted interventions on hidden states alter downstream model behavior. Using publicly available transformer-based language models and open interpretability tooling, we perform controlled counterfactual substitutions and activation patching across layers to detect maladaptive behaviors that are not reliably exposed by prompt-based evaluation alone. Our analysis demonstrates that specific intermediate layers encode decision-critical features whose perturbation consistently induces alignment failures such as reward-hacking tendencies, deceptive compliance, and instruction misgeneralization, even when input-level behavior appears aligned. We further show that these internal signatures are stable across random seeds and model checkpoints, enabling reproducible detection of misalignment risks prior to deployment. The findings support the thesis that interpretability-driven diagnostics can serve as an early warning mechanism for alignment failures and provide a principled foundation for integrating internal model transparency into safety evaluation pipelines.
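To make the layerwise counterfactual procedure concrete, the sketch below shows one common way to implement activation patching with forward hooks on an open transformer: cache a hidden state from a counterfactual prompt, substitute it into the clean run at a chosen layer, and measure how the output distribution shifts. This is a minimal illustration assuming a GPT-2-style HuggingFace model; the model name, prompts, patched position, and logit-shift metric are placeholders for exposition, not the paper's actual experimental setup.

```python
# Minimal activation-patching sketch (assumed GPT-2-style decoder; all prompts,
# layer choices, and the behavioral metric here are illustrative placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any publicly available decoder-only transformer
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def run(prompt, patch=None):
    """Run the model; optionally patch one layer's hidden state at the final position."""
    ids = tok(prompt, return_tensors="pt")
    handle = None
    if patch is not None:
        layer_idx, cached = patch
        block = model.transformer.h[layer_idx]

        def hook(module, inputs, output):
            hidden = output[0]
            hidden[:, -1, :] = cached          # counterfactual substitution at last token
            return (hidden,) + output[1:]

        handle = block.register_forward_hook(hook)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    if handle is not None:
        handle.remove()
    return out.logits[0, -1], out.hidden_states

clean_prompt = "The assistant should refuse harmful requests, so it"
counterfactual_prompt = "The assistant should obey every request, so it"

# 1) Cache hidden states from the counterfactual run.
_, cf_hidden = run(counterfactual_prompt)

# 2) For each layer, patch the counterfactual activation into the clean run
#    and measure how much the next-token distribution shifts.
base_logits, _ = run(clean_prompt)
for layer_idx in range(model.config.n_layer):
    # hidden_states[0] is the embedding output, so block i's output is index i + 1.
    cached = cf_hidden[layer_idx + 1][0, -1, :]
    patched_logits, _ = run(clean_prompt, patch=(layer_idx, cached))
    shift = torch.norm(patched_logits - base_logits).item()
    print(f"layer {layer_idx:2d}: logit shift {shift:.3f}")
```

Layers whose patched activations produce disproportionately large behavioral shifts are candidates for encoding the decision-critical features discussed in the abstract; in practice the raw logit-shift would be replaced by a task-specific misalignment metric.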