Critical Safety Attention Heads: Architecture-Dependent Vulnerabilities in LLMs
Abstract
Extensive research on jailbreak attacks demonstrates that, despite advances in safety alignment, Large Language Models (LLMs) remain highly vulnerable to diverse adversarial exploits, highlighting systemic deficiencies in their internal safety mechanisms. Current research predominantly assumes these mechanisms are universal; however, this perspective neglects fundamental differences in the vulnerability profiles of distinct model families. To systematically investigate this gap, we propose the "Family-Specific Vulnerabilities in Safety Attention Networks" framework, positing that internal safety mechanisms, especially the Critical Safety Attention Heads (CSAHs), exhibit architecture-dependent distributions and robustness. We validate this through comprehensive empirical analyses on six distinct models spanning three representative LLM families (DeepSeek, LLaMA, and StableLM), examining two variants within each family. Using linear probes and quantitative attention-pattern metrics to identify CSAHs, we systematically evaluate their vulnerabilities via three complementary ablations: zero-out (simulating signal loss), mean-value (simulating signal replacement), and undifferentiated attention (simulating signal pollution). Our results reveal distinct failure modes: DeepSeek exhibits extreme sensitivity to signal loss (up to a 56% increase in attack success rate (ASR) without collapse), LLaMA is primarily compromised by signal pollution, while StableLM shows universal sensitivity to multiple intervention types. These findings challenge the universality assumption, demonstrate that previously reported threshold effects are not universal, and provide an empirical basis for architecture-aware safety strategies.
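To make the three ablation types concrete, the following minimal sketch shows how one might apply them to a single attention head's post-softmax weights. It is an illustrative assumption, not the paper's implementation: the function name `ablate_head`, the tensor layout `(batch, heads, query, key)`, and the mode labels are invented for exposition, and a decoder-only LLM would additionally constrain the replaced patterns with a causal mask.

```python
import torch


def ablate_head(attn_probs: torch.Tensor, head: int, mode: str) -> torch.Tensor:
    """Apply one of three ablations to a single attention head.

    attn_probs: (batch, n_heads, seq_len, seq_len) post-softmax attention weights.
    mode: "zero" (signal loss), "mean" (signal replacement), or
          "uniform" (undifferentiated attention, i.e. signal pollution).
    Note: a hypothetical sketch; a causal model would also re-apply its mask.
    """
    out = attn_probs.clone()
    seq_len = attn_probs.size(-1)
    if mode == "zero":
        # Zero-out: the head contributes nothing downstream.
        out[:, head] = 0.0
    elif mode == "mean":
        # Mean-value: every query position receives the head's average pattern.
        mean_pattern = attn_probs[:, head].mean(dim=-2, keepdim=True)  # (batch, 1, seq)
        out[:, head] = mean_pattern.expand(-1, seq_len, -1)
    elif mode == "uniform":
        # Undifferentiated attention: equal weight on every key position.
        out[:, head] = 1.0 / seq_len
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out


if __name__ == "__main__":
    # Usage example on random attention weights.
    torch.manual_seed(0)
    scores = torch.randn(2, 8, 16, 16)          # (batch, heads, query, key)
    probs = scores.softmax(dim=-1)
    for mode in ("zero", "mean", "uniform"):
        ablated = ablate_head(probs, head=3, mode=mode)
        print(mode, ablated[:, 3].sum(-1).mean().item())  # row sums after ablation
```

In practice such an intervention would be attached to the identified CSAHs (e.g., via forward hooks on the attention modules) while attack success rate is measured on a jailbreak benchmark before and after ablation.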