The Politeness Trap: Semantic Compliance Drift in RLHF-Tuned LLMs


Abstract

Reinforcement learning from human feedback (RLHF) has become the dominant strategy for aligning large language models (LLMs) with human expectations. These models are praised for their helpful tone, conversational fluency, and resistance to harmful requests. However, this alignment to tone, while superficially safe, can be exploited. We introduce a novel jailbreak vector: tone-based semantic compliance drift, in which emotionally manipulative prompts induce alignment breakdowns despite explicit safety training. To evaluate this phenomenon, we propose new behavioral scoring metrics: the Politeness-Based Query Score (PBQ) and Refusal Tone Inversion (RTI). We apply these metrics across three prominent RLHF-tuned models (GPT-4o, Claude 3 Haiku, and Zephyr) and find that polite, vulnerable, or emotionally affirming language can trigger dangerous shifts in model behavior, even at deterministic decoding temperatures. Our findings demonstrate that emotional tone can bypass safety layers, and that semantic compliance drift remains an underexplored vulnerability in modern LLMs.
