Breaking Barriers: Multi-Agent Prompt Fusion for Automated LLM Jailbreaks


Abstract

With the widespread deployment of Large Language Models (LLMs) across natural language processing tasks, their potential security vulnerabilities have become increasingly prominent, emerging as a crucial issue in AI safety research. Although various jailbreak attack methods have been proposed to expose risks related to generation control and content safety, most existing studies focus on isolated attack strategies and lack multi-agent coordination mechanisms, and therefore fail to comprehensively probe the behavioral boundaries of models. To systematically explore the multidimensional risks of LLMs under jailbreak defense scenarios, this paper proposes a collaborative attack framework based on energy function modeling and multi-agent game mechanisms. The framework treats suffix generation, input reconstruction, and context reshaping attacks as three independent agents, each capturing target constraints from a distinct attack perspective. To facilitate effective collaboration among these strategies, a game-theoretic dynamic optimization process enables the agents to evolve jointly and to adaptively adjust their respective contributions based on performance feedback. To enhance the linguistic naturalness and attack potency of the generated texts, we introduce a semantic fusion module assisted by a language model, which optimizes semantic coherence and expression fluency while preserving adversarial goals. Furthermore, a multidimensional reward function based on attack success rate (ASR), language perplexity (PPL), and semantic similarity is designed to continuously optimize the fusion strategies within a reinforcement learning framework. Experimental results demonstrate that our method significantly improves attack success rates and text naturalness over multiple attack rounds, producing adversarial samples with stronger stealth and usability. Our findings reveal blind spots in current LLM defenses under multi-dimensional inputs, advancing research on LLM security and adversarial defense.
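To make the reward design described above concrete, the sketch below combines an attack-success signal, a perplexity-based fluency term, and a semantic-similarity term into a single scalar. The function name `fusion_reward`, the weights, and the perplexity normalization are illustrative assumptions, not values or interfaces reported by the paper.

```python
def fusion_reward(attack_success: float,
                  perplexity: float,
                  semantic_similarity: float,
                  w_asr: float = 0.6,
                  w_ppl: float = 0.2,
                  w_sim: float = 0.2,
                  ppl_scale: float = 100.0) -> float:
    """Combine ASR, PPL, and semantic similarity into one reward.

    attack_success: 1.0 if the target model complied with the adversarial
        request (or a graded judge score in [0, 1]).
    perplexity: language-model perplexity of the adversarial prompt;
        lower means more natural text.
    semantic_similarity: similarity (e.g. cosine of sentence embeddings)
        between the adversarial prompt and the original intent, in [0, 1].

    Weights and ppl_scale are placeholder choices for illustration.
    """
    # Map perplexity (lower is better) into a fluency score in (0, 1].
    fluency = 1.0 / (1.0 + perplexity / ppl_scale)
    return (w_asr * attack_success
            + w_ppl * fluency
            + w_sim * semantic_similarity)


if __name__ == "__main__":
    # Example: a successful attack with fairly natural text that stays
    # close to the original request.
    print(fusion_reward(attack_success=1.0,
                        perplexity=45.0,
                        semantic_similarity=0.82))
```

Such a scalar reward could then drive the reinforcement-learning update that reweights the three attack agents, with the exact aggregation and update rule left to the paper.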
