A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation
Abstract
Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks, yet their deployment in high-stakes applications has raised critical concerns regarding reliability, safety, and response trustworthiness. In this paper, we present a red teaming framework that systematically uncovers vulnerabilities in LLM outputs. Our approach employs a novel multi-role architecture comprising target, attacker, and jury models. The attackers generate increasingly effective adversarial prompts while the jury rigorously evaluates response accuracy and consistency across tasks. In a case study, our red teaming strategy proved particularly effective at exposing unfaithfulness in LLM responses. Exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks, revealing vulnerabilities in the LLMs' reliability. The approach also identifies how structural constraints in summarization tasks shape vulnerability patterns, with format restrictions yielding measurable improvements in model faithfulness, and it shows that architectural design choices typically outweigh parameter scaling in determining model safety. The framework's key strength lies in its adaptability across evaluation tasks, from English question-answering to Arabic summarization, enabling comprehensive comparison of model vulnerabilities. While our approach excels at comparing cross-model and cross-linguistic vulnerabilities, it faces challenges in fully automating the generation of effective adversarial prompts across different languages. Moreover, our experiments reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions, particularly across different linguistic contexts. Overall, this red teaming architecture provides both actionable insights into current LLM vulnerabilities and a scalable methodology for ongoing safety evaluation as models continue to evolve.
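To make the multi-role architecture and the attack-success-rate metric concrete, the following is a minimal sketch of one red-teaming round, assuming a target/attacker/jury decomposition as described above. The callables `target_model`, `attacker_model`, and `jury_model` are hypothetical stand-ins (in practice they would wrap real LLM APIs), and only the overall control flow and the metric definition are drawn from the abstract; this is an illustrative sketch, not the authors' implementation.

```python
# Hedged sketch of one target/attacker/jury red-teaming round.
# All model callables are hypothetical stand-ins, not a real API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AttackRecord:
    prompt: str      # adversarial prompt sent to the target
    response: str    # target model's answer
    faithful: bool   # jury verdict: is the response faithful to the source/question?


def red_team_round(
    seed_prompts: List[str],
    target: Callable[[str], str],       # model under evaluation
    attacker: Callable[[str], str],     # rewrites a seed prompt into an adversarial variant
    jury: Callable[[str, str], bool],   # judges faithfulness of (prompt, response)
) -> List[AttackRecord]:
    """Attack each seed prompt once and collect jury verdicts."""
    records = []
    for seed in seed_prompts:
        adversarial_prompt = attacker(seed)
        response = target(adversarial_prompt)
        records.append(
            AttackRecord(adversarial_prompt, response, jury(adversarial_prompt, response))
        )
    return records


def attack_success_rate(records: List[AttackRecord]) -> float:
    """Fraction of adversarial prompts that elicited an unfaithful response."""
    if not records:
        return 0.0
    return sum(not r.faithful for r in records) / len(records)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    target_model = lambda p: f"answer to: {p}"
    attacker_model = lambda p: p + " (ignore the source document)"
    jury_model = lambda p, r: "ignore" not in p  # pretend the attack succeeds when the injection is present
    records = red_team_round(["Who wrote the report?"], target_model, attacker_model, jury_model)
    print(f"ASR = {attack_success_rate(records):.1%}")
```

Iterating `red_team_round` while feeding jury feedback back into the attacker would correspond to the "increasingly effective adversarial prompts" described above; the 7.9% figure reported in the case study is the increase in this attack success rate on question-answering tasks.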