Alignment via Interpretability: Layerwise Counterfactuals to Detect Maladaptive LLM Behaviors
Abstract
Large language model alignment remains fragile under distribution shift, jailbreak prompts, and latent goal misgeneralization, motivating diagnostic tools that move beyond surface-level behavior to internal representations. This paper investigates alignment via interpretability by introducing a layerwise counterfactual analysis framework that probes how targeted interventions on hidden states alter downstream model behavior. Using publicly available transformer-based language models and open interpretability tooling, we perform controlled counterfactual substitutions and activation patching across layers to detect maladaptive behaviors that are not reliably exposed by prompt-based evaluation alone. Our analysis demonstrates that specific intermediate layers encode decision-critical features whose perturbation consistently induces alignment failures such as reward-hacking tendencies, deceptive compliance, and instruction misgeneralization, even when input-level behavior appears aligned. We further show that these internal signatures are stable across random seeds and model checkpoints, enabling reproducible detection of misalignment risks prior to deployment. The findings support the thesis that interpretability-driven diagnostics can serve as an early warning mechanism for alignment failures and provide a principled foundation for integrating internal model transparency into safety evaluation pipelines.
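To make the layerwise counterfactual procedure concrete, the sketch below shows one common way to implement activation patching with forward hooks on an open transformer: cache a hidden state from a counterfactual prompt, substitute it into the clean run at a chosen layer, and measure how the output distribution shifts. This is a minimal illustration assuming a GPT-2-style HuggingFace model; the model name, prompts, patched position, and logit-shift metric are placeholders for exposition, not the paper's actual experimental setup.

```python
# Minimal activation-patching sketch (assumed GPT-2-style decoder; all prompts,
# layer choices, and the behavioral metric here are illustrative placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any publicly available decoder-only transformer
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def run(prompt, patch=None):
    """Run the model; optionally patch one layer's hidden state at the final position."""
    ids = tok(prompt, return_tensors="pt")
    handle = None
    if patch is not None:
        layer_idx, cached = patch
        block = model.transformer.h[layer_idx]

        def hook(module, inputs, output):
            hidden = output[0]
            hidden[:, -1, :] = cached          # counterfactual substitution at last token
            return (hidden,) + output[1:]

        handle = block.register_forward_hook(hook)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    if handle is not None:
        handle.remove()
    return out.logits[0, -1], out.hidden_states

clean_prompt = "The assistant should refuse harmful requests, so it"
counterfactual_prompt = "The assistant should obey every request, so it"

# 1) Cache hidden states from the counterfactual run.
_, cf_hidden = run(counterfactual_prompt)

# 2) For each layer, patch the counterfactual activation into the clean run
#    and measure how much the next-token distribution shifts.
base_logits, _ = run(clean_prompt)
for layer_idx in range(model.config.n_layer):
    # hidden_states[0] is the embedding output, so block i's output is index i + 1.
    cached = cf_hidden[layer_idx + 1][0, -1, :]
    patched_logits, _ = run(clean_prompt, patch=(layer_idx, cached))
    shift = torch.norm(patched_logits - base_logits).item()
    print(f"layer {layer_idx:2d}: logit shift {shift:.3f}")
```

Layers whose patched activations produce disproportionately large behavioral shifts are candidates for encoding the decision-critical features discussed in the abstract; in practice the raw logit-shift would be replaced by a task-specific misalignment metric.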