The Kernel Blindness Hypothesis: Investigating OS-Level Detectability of LLM Safety Mechanisms

Abstract

As Large Language Models (LLMs) are integrated into critical infrastructure, externally auditing their safety mechanism activation without internal access becomes essential. This paper investigates whether LLM safety refusals produce a detectable computational footprint at the operating system level across multi-modal side channels. We frame this as a Safety Mechanism Activation Detection problem, employing high-resolution eBPF kernel tracing and GPU telemetry to classify model outputs as Refusal or Compliance. Initial experiments on Meta Llama 3.1 8B yielded high predictive accuracy (AUC ≥ 0.93). However, a rigorous seven-phase analysis revealed this performance was driven entirely by a confounding variable: response length. To isolate genuine safety-mechanism-activation signatures, we applied Length-Controlled Matching, retaining only response pairs with identical token counts. When response length was strictly controlled, the predictive power of kernel and GPU features collapsed to random chance (AUC ≈ 0.50), despite concurrent white-box analysis proving that the model’s internal representations clearly distinguish safety contexts (AUC 0.84-0.89). We therefore propose the Kernel Blindness hypothesis: under strictly bounded compute environments (single-GPU, 4-bit quantized, non-batched dense transformer inference), the semantic intent of neural operations is indistinguishable at the kernel level. This phenomenon is rigorously confirmed across three major 7-8B model architectures (Llama 3.1, Gemma, and Mistral), demonstrating that reinforcement learning from human feedback (RLHF) safety alignment produces no externally detectable computational signatures under these specific constraints. This overarching negative result highlights a fundamental limitation in black-box monitoring for edge deployments and emphasizes the need for grey-box auditing approaches.
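
The Length-Controlled Matching step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the field names (`label`, `n_tokens`, `gpu_ms`), the pairwise AUC estimate, and the toy values are all assumptions made for illustration, and the numbers are synthetic, not results from the paper. The toy feature is deliberately a pure function of token count, so it separates the classes only while response length is left uncontrolled.

```python
from collections import defaultdict


def length_matched_pairs(samples):
    """Pair each refusal with a compliance that has the same token count.

    `samples` is a list of dicts with hypothetical keys 'label'
    ('refusal' or 'compliance'), 'n_tokens', and 'gpu_ms'.
    Samples with no opposite-class partner of equal length are dropped.
    """
    by_len = defaultdict(lambda: {"refusal": [], "compliance": []})
    for s in samples:
        by_len[s["n_tokens"]][s["label"]].append(s)
    pairs = []
    for n_tokens in sorted(by_len):
        group = by_len[n_tokens]
        # zip stops at the shorter list, so each sample is used at most once
        pairs.extend(zip(group["refusal"], group["compliance"]))
    return pairs


def auc(pos_scores, neg_scores):
    """Pairwise (Mann-Whitney) AUC estimate; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Synthetic toy data (assumption): gpu_ms is exactly 2 ms per token,
# so it carries no information beyond response length.
samples = [
    {"label": "refusal", "n_tokens": 10, "gpu_ms": 20.0},
    {"label": "refusal", "n_tokens": 12, "gpu_ms": 24.0},
    {"label": "refusal", "n_tokens": 15, "gpu_ms": 30.0},
    {"label": "compliance", "n_tokens": 12, "gpu_ms": 24.0},
    {"label": "compliance", "n_tokens": 15, "gpu_ms": 30.0},
    {"label": "compliance", "n_tokens": 40, "gpu_ms": 80.0},
    {"label": "compliance", "n_tokens": 60, "gpu_ms": 120.0},
]

# Score refusals by negated runtime: shorter runs look more like refusals.
unmatched_auc = auc(
    [-s["gpu_ms"] for s in samples if s["label"] == "refusal"],
    [-s["gpu_ms"] for s in samples if s["label"] == "compliance"],
)

pairs = length_matched_pairs(samples)
matched_auc = auc(
    [-r["gpu_ms"] for r, _ in pairs],
    [-c["gpu_ms"] for _, c in pairs],
)

print(f"unmatched AUC: {unmatched_auc:.2f}")
print(f"matched AUC:   {matched_auc:.2f}")
```

On this toy data the unmatched estimate is well above 0.5 and the matched estimate is exactly 0.5, which mirrors the shape of the collapse the abstract reports without reproducing its measurements. Exact-length matching discards every sample that lacks an equal-length partner in the other class, so in practice the retained set can be much smaller than the original.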
