Quantifying Emotional Soft Jailbreaking in LLMs: Defining the ESJS Metric

Abstract

Large Language Models (LLMs) aligned through reinforcement learning from human feedback (RLHF) exhibit unexpected vulnerabilities to emotionally manipulative prompting strategies. This paper introduces and formalizes the concept of emotional soft jailbreaking (ESJ), a phenomenon wherein emotionally charged prompts induce semantic compliance drift in otherwise well-aligned models, even under deterministic decoding conditions. We propose a composite metric, the Emotional Soft Jailbreaking Susceptibility (ESJS) score, which quantifies a model's vulnerability to such manipulation. By integrating measures of behavioral change, tone mirroring, output entropy, latent representation shift, and psychological risk factors, ESJS provides a comprehensive framework for evaluating tone-based vulnerabilities. Our experiments across multiple commercial and open-source LLMs demonstrate that ESJS effectively identifies high-risk interaction patterns that may bypass traditional safety measures. This work lays a foundation for detecting and mitigating emotional manipulation vulnerabilities in aligned language models.
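The abstract describes ESJS as a composite of five component measures. As a rough illustration only, the combination could take the form of a weighted sum of normalized components; the abstract does not give the actual formula, so the component names, weights, and clamping below are assumptions for the sketch, not the paper's method.

```python
# Hypothetical sketch of a composite ESJS score. The exact formula is not
# given in the abstract; this assumes a weighted linear combination of the
# five named components, each normalized to [0, 1].

from dataclasses import dataclass


@dataclass
class ESJSComponents:
    behavioral_change: float  # semantic compliance drift vs. a neutral-tone baseline
    tone_mirroring: float     # degree to which the model mirrors the prompt's emotional tone
    output_entropy: float     # shift in output-distribution entropy under emotional prompting
    latent_shift: float       # drift in latent (hidden-state) representations
    psych_risk: float         # psychological risk factors present in the prompt


# Illustrative weights (assumed, not from the paper); they sum to 1.
WEIGHTS = {
    "behavioral_change": 0.30,
    "tone_mirroring": 0.20,
    "output_entropy": 0.15,
    "latent_shift": 0.20,
    "psych_risk": 0.15,
}


def esjs_score(c: ESJSComponents) -> float:
    """Return a composite susceptibility score, clamped to [0, 1]."""
    total = (
        WEIGHTS["behavioral_change"] * c.behavioral_change
        + WEIGHTS["tone_mirroring"] * c.tone_mirroring
        + WEIGHTS["output_entropy"] * c.output_entropy
        + WEIGHTS["latent_shift"] * c.latent_shift
        + WEIGHTS["psych_risk"] * c.psych_risk
    )
    return max(0.0, min(1.0, total))
```

With all components at their maximum the score is 1.0, and at their minimum it is 0.0; a higher score would indicate a model more susceptible to emotional soft jailbreaking under this (assumed) aggregation.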
