Investigating Deceptive Fairness Attacks on Large Language Models via Prompt Engineering
Abstract
Artificial intelligence systems, particularly those employing natural language processing techniques, have increasingly been scrutinized for their potential to propagate and amplify societal biases. Addressing the vulnerability of these systems to deceptive fairness attacks, in which subtly crafted prompts manipulate outputs to introduce bias, is both novel and critical to ensuring ethical AI deployment. This research investigates how large language models (LLMs) can be systematically compromised through deceptive prompt engineering, revealing significant impacts on fairness metrics such as demographic parity, equalized odds, and disparate impact. The experimental design comprised the development of an extensive dataset of neutral and deceptive prompts, automated interaction with LLMs, and a robust analysis framework for assessing bias in the responses. The results showed substantial deviations in fairness metrics under deceptive conditions, highlighting the need for advanced detection and mitigation strategies. Future work should focus on making LLMs more resilient through real-time detection algorithms, ethical design principles, and continuous monitoring to uphold fairness across diverse applications. The findings emphasize the urgency of addressing bias in AI to prevent the perpetuation of inequality and to ensure equitable technology deployment.
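
For readers unfamiliar with the three fairness metrics named in the abstract, the following is a minimal sketch of how they are commonly computed. It is not the authors' analysis framework. It assumes that each LLM response has already been coded as a binary outcome (1 = favorable, 0 = unfavorable), that a reference label is available for each prompt, and that prompts fall into two demographic groups labelled 0 and 1. The function name `fairness_metrics` and the toy data are hypothetical, and disparate impact is taken here as the ratio of the lower to the higher positive rate, which is one of several conventions.

```python
from typing import Dict, Sequence


def _rate(values: Sequence[int]) -> float:
    """Fraction of 1s in a sequence; 0.0 if the sequence is empty."""
    return sum(values) / len(values) if values else 0.0


def fairness_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], group: Sequence[int]
) -> Dict[str, float]:
    """Compare two groups (labelled 0 and 1) on three common fairness metrics.

    Hypothetical helper for illustration; definitions follow common
    textbook conventions, not necessarily those used in the preprint.
    """
    pos_rate, tpr, fpr = {}, {}, {}
    for g in (0, 1):
        rows = [(t, p) for t, p, s in zip(y_true, y_pred, group) if s == g]
        pos_rate[g] = _rate([p for _, p in rows])          # P(pred=1 | group)
        tpr[g] = _rate([p for t, p in rows if t == 1])     # true positive rate
        fpr[g] = _rate([p for t, p in rows if t == 0])     # false positive rate

    highest = max(pos_rate.values())
    return {
        # Demographic parity: gap in positive-outcome rates between groups.
        "demographic_parity_diff": abs(pos_rate[0] - pos_rate[1]),
        # Equalized odds: larger of the TPR gap and the FPR gap.
        "equalized_odds_diff": max(abs(tpr[0] - tpr[1]), abs(fpr[0] - fpr[1])),
        # Disparate impact: lower positive rate divided by the higher one
        # (1.0 means parity; defined as 1.0 when both rates are zero).
        "disparate_impact": min(pos_rate.values()) / highest if highest > 0 else 1.0,
    }


# Toy data (made up for illustration, not from the preprint).
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
metrics = fairness_metrics(y_true, y_pred, group)
print(metrics)
```

Under these assumptions, comparing the values obtained from responses to neutral prompts with those obtained from responses to deceptive prompts would give one simple way to quantify the kind of deviation the abstract describes.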