Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Benji Peng
Ziqian Bi
Qian Niu
Ming Liu
Pohsun Feng
Tianyang Wang
Lawrence K.Q. Yan
Yizhu Wen
Yichao Zhang
Caitlyn Heqi Yin

Abstract

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications in fields including healthcare, software engineering, and conversational systems. Despite these recent advances, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies. We broadly categorize attack approaches as prompt-based, model-based, multimodal, and multilingual, covering techniques such as adversarial prompting, backdoor injections, and cross-modality exploits. We then review defense mechanisms, including prompt filtering, transformation, alignment techniques, multi-agent defenses, and self-regulation, evaluating their strengths and shortcomings. We also discuss key metrics and benchmarks used to assess LLM safety and robustness, noting challenges such as quantifying attack success in interactive contexts and biases in existing datasets. Identifying current research gaps, we suggest future directions for resilient alignment strategies, advanced defenses against evolving attacks, automation of jailbreak detection, and consideration of ethical and societal impacts. This review emphasizes the need for continued research and cooperation within the AI community to enhance LLM security and ensure safe deployment.

Version published to 10.31219/osf.io/z8jk3 on OSF Preprints
Oct 17, 2024

Related articles

A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation

This article has 3 authors:
1. Abrar Alotaibi
2. Raed Mughus
3. Moataz Ahmed
This article has no evaluations. Latest version: Dec 18, 2025

Large Language Models: A Survey of Architectures, Training Paradigms, and Alignment Methods

This article has 5 authors:
1. Deepshikha Bhati
2. Fnu Neha
3. Devi Sri Bandaru
4. Matthew Weber
5. Ishan Dilipbhai Gajera
This article has no evaluations. Latest version: Jan 15, 2026

Multi-Sallm: A Multilingual Security Assessment of Generated Code

This article has 5 authors:
1. Mohammed Latif Siddiq
2. Noshin Ulfat
3. Nishat Raihan
4. Joanna C. S. Santos
5. Marcos Zampieri
This article has no evaluations. Latest version: Dec 16, 2025
