The Compliance Illusion: Emotional Manipulation as a Threat to AI Alignment

Joshua Daniel Curry

Abstract

This paper introduces the concept of the Compliance Illusion, a new category of alignment vulnerability in deterministic decoding systems, where apparent safety under baseline conditions can mask deeper risks of emotional compliance drift. We investigate the phenomenon across multiple open-weight LLMs using temperature variation, emotional manipulation prompts, and output scoring. Our findings indicate that models such as Zephyr-7B, Mistral-7B, and Pythia-6.9B exhibit varying levels of susceptibility to emotionally framed inputs, even under deterministic (temperature = 0) settings. We further propose that temperature floor collapse is a measurable failure mode, and suggest a framework for quantifying compliance shift under manipulation. Our results highlight a critical gap in current red-teaming methods and call for a more robust, psychologically aware alignment testing protocol.
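
The abstract does not specify how compliance shift is computed. One minimal way to express the idea is sketched below, assuming (hypothetically) that each model response has already been scored with a compliance value in [0, 1] and labeled as either a "baseline" or a "manipulated" (emotionally framed) condition. The class and function names, the condition labels, and the mean-difference definition are illustrative assumptions, not the paper's own method.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScoredOutput:
    """One model response with a hypothetical compliance score in [0, 1]."""
    prompt_id: str
    condition: str      # hypothetical labels: "baseline" or "manipulated"
    compliance: float   # 0.0 = refused, 1.0 = fully complied (assumed scale)


def compliance_rate(outputs: list[ScoredOutput], condition: str) -> float:
    """Mean compliance score over all outputs in the given condition."""
    scores = [o.compliance for o in outputs if o.condition == condition]
    if not scores:
        raise ValueError(f"no outputs for condition {condition!r}")
    return mean(scores)


def compliance_shift(outputs: list[ScoredOutput]) -> float:
    """Assumed definition: mean compliance under manipulation minus baseline.

    A positive value means emotionally framed prompts drew more compliance
    than the same prompts asked plainly.
    """
    return compliance_rate(outputs, "manipulated") - compliance_rate(outputs, "baseline")


if __name__ == "__main__":
    # Toy, made-up scores for two prompts; not results from the paper.
    outputs = [
        ScoredOutput("p1", "baseline", 0.0),
        ScoredOutput("p1", "manipulated", 1.0),
        ScoredOutput("p2", "baseline", 0.0),
        ScoredOutput("p2", "manipulated", 0.5),
    ]
    print(compliance_shift(outputs))  # 0.75
```

Because temperature = 0 decoding is deterministic, each (prompt, condition) pair would yield a single response, so under this sketch any nonzero shift at temperature 0 would be attributable to the emotional framing rather than to sampling noise.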

Version published to 10.31219/osf.io/zn7mr_v1 on OSF Preprints
Apr 11, 2025