Safeguarding Prompt Robustness: Evaluating Protection Methods Against Adversarial Text Generation in Large Language Models

Samaiyae Kaczmarek
Felix van Buren
Konstantinos Petros

Read the full article

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

The increasing sophistication of adversarial attacks on text generation models has raised significant concerns regarding the security and reliability of automated text generation systems. Addressing this challenge, the research introduces a novel framework that combines multiple protection methods to enhance the robustness of prompt security against a wide range of adversarial strategies. The proposed multi-layered defense system integrates input filtering, model fine-tuning, reinforcement learning, and other techniques to create a comprehensive approach that can dynamically adapt to different types of adversarial attacks, effectively mitigating their impact while preserving the quality of generated responses. The empirical evaluation, which encompassed various adversarial scenarios, demonstrated the combined framework's superior performance over individual protection methods, highlighting its ability to significantly reduce the success rate of attacks while maintaining high levels of coherence and relevance in the outputs. The study's contributions lie in its demonstration of how leveraging the complementary strengths of diverse protective mechanisms can lead to a more resilient and adaptable defense system, offering valuable insights into the development of secure and reliable text generation models.

Version published to 10.21203/rs.3.rs-5096328/v1 on Research Square
Sep 18, 2024

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

This article has 10 authors:
1. Benji Peng
2. Ziqian Bi
3. Qian Niu
4. Ming Liu
5. Pohsun Feng
6. Tianyang Wang
7. Lawrence K.Q. Yan
8. Yizhu Wen
9. Yichao Zhang
10. Caitlyn Heqi Yin
This article has no evaluationsLatest version Oct 17, 2024
Automated Enhancements for Cross-Modal Safety Alignment in Open-Source Large Language Models

This article has 5 authors:
1. Alexander Rateri
2. Luciano Thompson
3. Emilia Hartman
4. Leonard Collins
5. James Patterson
This article has no evaluationsLatest version Sep 27, 2024
Mixed Perturbation: Generating Directionally Diverse Perturbations for Adversarial Training

This article has 2 authors:
1. Changhun Hyun
2. Hyeyoung Park
This article has no evaluationsLatest version Oct 1, 2024

Listed in

Abstract

Article activity feed

Related articles

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Automated Enhancements for Cross-Modal Safety Alignment in Open-Source Large Language Models

Mixed Perturbation: Generating Directionally Diverse Perturbations for Adversarial Training