Critical Safety Attention Heads: Architecture-Dependent Vulnerabilities in LLMs
Abstract
Extensive research on jailbreak attacks demonstrates that, despite advances in safety alignment, Large Language Models (LLMs) remain highly vulnerable to diverse adversarial exploits, highlighting systemic deficiencies in their internal safety mechanisms. Current research predominantly assumes these mechanisms are universal; however, this perspective neglects fundamental differences in the vulnerability profiles of distinct model families. To systematically investigate this gap, we propose the "Family-Specific Vulnerabilities in Safety Attention Networks" framework, positing that internal safety mechanisms, especially the Critical Safety Attention Heads (CSAHs), exhibit architecture-dependent distributions and robustness. We validate this through comprehensive empirical analyses on six distinct models spanning three representative LLM families (DeepSeek, LLaMA, and StableLM), examining two variants within each family. Using linear probes and quantitative attention-pattern metrics to identify CSAHs, we systematically evaluate their vulnerabilities via three complementary ablations: zero-out (simulating signal loss), mean-value (simulating signal replacement), and undifferentiated attention (simulating signal pollution). Our results reveal distinct failure modes: DeepSeek exhibits extreme sensitivity to signal loss (up to a 56% increase in attack success rate (ASR) without collapse), LLaMA is primarily compromised by signal pollution, while StableLM shows universal sensitivity to multiple intervention types. These findings challenge the universality assumption, demonstrate that previously reported threshold effects are not universal, and provide an empirical basis for architecture-aware safety strategies.
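To make the three ablation types concrete, the following minimal sketch shows how one might apply them to a single attention head's post-softmax weights. It is an illustrative assumption, not the paper's implementation: the function name `ablate_head`, the tensor layout `(batch, heads, query, key)`, and the mode labels are invented for exposition, and a decoder-only LLM would additionally constrain the replaced patterns with a causal mask.

```python
import torch


def ablate_head(attn_probs: torch.Tensor, head: int, mode: str) -> torch.Tensor:
    """Apply one of three ablations to a single attention head.

    attn_probs: (batch, n_heads, seq_len, seq_len) post-softmax attention weights.
    mode: "zero" (signal loss), "mean" (signal replacement), or
          "uniform" (undifferentiated attention, i.e. signal pollution).
    Note: a hypothetical sketch; a causal model would also re-apply its mask.
    """
    out = attn_probs.clone()
    seq_len = attn_probs.size(-1)
    if mode == "zero":
        # Zero-out: the head contributes nothing downstream.
        out[:, head] = 0.0
    elif mode == "mean":
        # Mean-value: every query position receives the head's average pattern.
        mean_pattern = attn_probs[:, head].mean(dim=-2, keepdim=True)  # (batch, 1, seq)
        out[:, head] = mean_pattern.expand(-1, seq_len, -1)
    elif mode == "uniform":
        # Undifferentiated attention: equal weight on every key position.
        out[:, head] = 1.0 / seq_len
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out


if __name__ == "__main__":
    # Usage example on random attention weights.
    torch.manual_seed(0)
    scores = torch.randn(2, 8, 16, 16)          # (batch, heads, query, key)
    probs = scores.softmax(dim=-1)
    for mode in ("zero", "mean", "uniform"):
        ablated = ablate_head(probs, head=3, mode=mode)
        print(mode, ablated[:, 3].sum(-1).mean().item())  # row sums after ablation
```

In practice such an intervention would be attached to the identified CSAHs (e.g., via forward hooks on the attention modules) while attack success rate is measured on a jailbreak benchmark before and after ablation.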