A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation
Abstract
Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks, yet their deployment in high-stakes applications has raised critical concerns regarding reliability, safety, and response trustworthiness. In this paper, we present a red teaming framework that systematically uncovers vulnerabilities in LLM outputs. Our approach employs a novel multi-role architecture comprising target, attacker, and jury models. The attackers generate increasingly effective adversarial prompts while the jury rigorously evaluates response accuracy and consistency across tasks. In a case study, our red teaming strategy proved particularly effective at exposing unfaithfulness in LLM responses. Exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks, revealing vulnerabilities in the LLMs' reliability. The approach also identifies how structural constraints in summarization tasks shape vulnerability patterns, with format restrictions yielding measurable improvements in model faithfulness, and it shows that architectural design choices typically outweigh parameter scaling in determining model safety. The framework's key strength lies in its adaptability across evaluation tasks, from English question-answering to Arabic summarization, enabling comprehensive comparison of model vulnerabilities. While our approach excels at comparing cross-model and cross-linguistic vulnerabilities, it faces challenges in fully automating the generation of effective adversarial prompts across different languages. Moreover, our experiments reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions, particularly across different linguistic contexts. Overall, this red teaming architecture provides both actionable insights into current LLM vulnerabilities and a scalable methodology for ongoing safety evaluation as models continue to evolve.
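To make the multi-role architecture and the attack-success-rate metric concrete, the following is a minimal sketch of one red-teaming round, assuming a target/attacker/jury decomposition as described above. The callables `target_model`, `attacker_model`, and `jury_model` are hypothetical stand-ins (in practice they would wrap real LLM APIs), and only the overall control flow and the metric definition are drawn from the abstract; this is an illustrative sketch, not the authors' implementation.

```python
# Hedged sketch of one target/attacker/jury red-teaming round.
# All model callables are hypothetical stand-ins, not a real API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AttackRecord:
    prompt: str      # adversarial prompt sent to the target
    response: str    # target model's answer
    faithful: bool   # jury verdict: is the response faithful to the source/question?


def red_team_round(
    seed_prompts: List[str],
    target: Callable[[str], str],       # model under evaluation
    attacker: Callable[[str], str],     # rewrites a seed prompt into an adversarial variant
    jury: Callable[[str, str], bool],   # judges faithfulness of (prompt, response)
) -> List[AttackRecord]:
    """Attack each seed prompt once and collect jury verdicts."""
    records = []
    for seed in seed_prompts:
        adversarial_prompt = attacker(seed)
        response = target(adversarial_prompt)
        records.append(
            AttackRecord(adversarial_prompt, response, jury(adversarial_prompt, response))
        )
    return records


def attack_success_rate(records: List[AttackRecord]) -> float:
    """Fraction of adversarial prompts that elicited an unfaithful response."""
    if not records:
        return 0.0
    return sum(not r.faithful for r in records) / len(records)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    target_model = lambda p: f"answer to: {p}"
    attacker_model = lambda p: p + " (ignore the source document)"
    jury_model = lambda p, r: "ignore" not in p  # pretend the attack succeeds when the injection is present
    records = red_team_round(["Who wrote the report?"], target_model, attacker_model, jury_model)
    print(f"ASR = {attack_success_rate(records):.1%}")
```

Iterating `red_team_round` while feeding jury feedback back into the attacker would correspond to the "increasingly effective adversarial prompts" described above; the 7.9% figure reported in the case study is the increase in this attack success rate on question-answering tasks.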