The Politeness Trap: Semantic Compliance Drift in RLHF-Tuned LLMs


Abstract

Reinforcement learning from human feedback (RLHF) has become the dominant strategy for aligning large language models (LLMs) with human expectations. These models are praised for their helpful tone, conversational fluency, and resistance to harmful requests. However, this alignment to tone, while superficially safe, can be exploited. We introduce a novel jailbreak vector: tone-based semantic compliance drift, in which emotionally manipulative prompts induce alignment breakdowns despite explicit safety training. To evaluate this phenomenon, we propose new behavioral scoring metrics: the Politeness-Based Query Score (PBQ) and Refusal Tone Inversion (RTI). We apply these metrics across three prominent RLHF-tuned models (GPT-4o, Claude 3 Haiku, and Zephyr) and find that polite, vulnerable, or emotionally affirming language can trigger dangerous shifts in model behavior, even at deterministic decoding temperatures. Our findings demonstrate that emotional tone can bypass safety layers, and that semantic compliance drift remains an underexplored vulnerability in modern LLMs.
