The Kernel Blindness Hypothesis: Investigating OS-Level Detectability of LLM Safety Mechanisms

Ata Kilic
Baris Celiktas

Abstract

As Large Language Models (LLMs) are integrated into critical infrastructure, externally auditing their safety mechanism activation without internal access becomes essential. This paper investigates whether LLM safety refusals produce a detectable computational footprint at the operating system level across multi-modal side channels. We frame this as a Safety Mechanism Activation Detection problem, employing high-resolution eBPF kernel tracing and GPU telemetry to classify model outputs as Refusal or Compliance. Initial experiments on Meta Llama 3.1 8B yielded high predictive accuracy (AUC ≥ 0.93). However, a rigorous seven-phase analysis revealed this performance was driven entirely by a confounding variable: response length. To isolate genuine safety-mechanism-activation signatures, we applied Length-Controlled Matching, retaining only response pairs with identical token counts. When response length was strictly controlled, the predictive power of kernel and GPU features collapsed to random chance (AUC ≈ 0.50), despite concurrent white-box analysis proving that the model’s internal representations clearly distinguish safety contexts (AUC 0.84-0.89). We therefore propose the Kernel Blindness hypothesis: under strictly bounded compute environments (single-GPU, 4-bit quantized, non-batched dense transformer inference), the semantic intent of neural operations is indistinguishable at the kernel level. This phenomenon is rigorously confirmed across three major 7-8B model architectures (Llama 3.1, Gemma, and Mistral), demonstrating that reinforcement learning from human feedback (RLHF) safety alignment produces no externally detectable computational signatures under these specific constraints. This overarching negative result highlights a fundamental limitation in black-box monitoring for edge deployments and emphasizes the need for grey-box auditing approaches.
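
The length-controlled matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout (`n_tokens`, `feature`), the function names, and the greedy one-to-one pairing are all assumptions made for illustration, and the data is a synthetic toy in which the side-channel feature depends only on response length.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical record layout: token count plus one side-channel feature
# (for example, a syscall count). Not taken from the paper.
Sample = Dict[str, float]


def length_controlled_pairs(
    refusals: Sequence[Sample], compliances: Sequence[Sample]
) -> List[Tuple[Sample, Sample]]:
    """Pair each refusal with an unused compliance of identical token count."""
    by_length = defaultdict(list)
    for sample in compliances:
        by_length[sample["n_tokens"]].append(sample)

    pairs = []
    for refusal in refusals:
        candidates = by_length[refusal["n_tokens"]]
        if candidates:  # refusals without a same-length partner are dropped
            pairs.append((refusal, candidates.pop()))
    return pairs


def auc(positive: Sequence[float], negative: Sequence[float]) -> float:
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in positive:
        for n in negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive) * len(negative))


def make(n_tokens: int) -> Sample:
    # Toy assumption: the feature is driven purely by response length.
    return {"n_tokens": n_tokens, "feature": 2.0 * n_tokens}


refusals = [make(20), make(30), make(40)]
compliances = [make(40), make(30), make(200), make(300)]

# Before matching: compliance responses are longer on average, so the
# length-driven feature separates the classes.
auc_raw = auc(
    [c["feature"] for c in compliances], [r["feature"] for r in refusals]
)

# After matching: only same-length pairs remain and the separation vanishes.
pairs = length_controlled_pairs(refusals, compliances)
auc_matched = auc(
    [c["feature"] for _, c in pairs], [r["feature"] for r, _ in pairs]
)

print(f"AUC before matching: {auc_raw:.2f}")
print(f"AUC after matching:  {auc_matched:.2f}")
```

In this toy data the AUC falls from about 0.83 to 0.50 once only identical-length pairs are kept, which mirrors the pattern the abstract reports: a feature that merely tracks response length loses all predictive power under matching. A feature carrying a genuine safety-activation signal would be expected to stay above 0.50 after the same step.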

Version published to 10.21203/rs.3.rs-9190463/v1 on Research Square
Mar 24, 2026
