Optimizing Safety Alignment and Jailbreak Defense for Large Language Models

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Safety alignment and jailbreak defense remain central challenges for large language models (LLMs). Despite capability gains, modern systems are vulnerable to direct and indirect prompt-injection and long-context attacks. We present a multi-layer framework that combines a refusal-first decision module (DeepRefusal), guard models, policy governance tied to NIST AI RMF and the EU AI Act, plus privacy-preserving telemetry and federated leakage reduction (FedMSBA). On a 300k-prompt stress suite, baseline aligned models show 57% attack success under automatic suffix adversaries, 88% under indirect injection, and a many-shot regime triggers failure on long-context models. With DeepRefusal, average attack success falls by 95%; the guard detector (Qwen3Guard-Gen-8B) reaches 83.9% F1 and blocks 66.7% malicious prompts within the first 128 tokens. FedMSBA cuts gradient-leakage risk by ≥70% in simulated federated training. On a 300k-prompt NIST-style safety suite, our GPT-5 configuration attains a composite safety score of 78.98% (67.16% on a stricter EU-aligned subset), while FedMSBA cuts federated gradient-leakage risk by ≥70% and the guard stack reaches a 66.7% early block rate within 128 tokens. We further report compliance-oriented scoring: a GPT-5 configuration attains 78.98% safety under NIST-oriented checks and 67.16% under EU AI Act interpretations, while a default baseline configuration reaches 45.33% on EU criteria. Results suggest that dynamic, multi-stage defenses substantially reduce jailbreak success while preserving utility.

Article activity feed