Adaptive Dynamic Fusion for Adversarial and Counterfactual Debiasing in Pre-Trained Language Models

Abstract

Pre-trained language models are highly effective across a wide range of Natural Language Processing (NLP) tasks; however, they remain vulnerable to stereotypical biases, raising concerns about fairness. This paper presents a framework for mitigating bias in language models through an enhanced version of DeBERTaV3 that incorporates an Adaptive-Dynamic Fusion (ADF) component, referred to as ADFBERT. The model features a dynamically reweighted fusion layer that adapts token interactions based on their positional encoding. We integrate Adversarial Fine-Tuning (AFT) and Counterfactual Data Augmentation (CDA) to improve bias mitigation: AFT introduces an adversarial loss that minimizes correlations between learned representations and biased attributes, while CDA generates counterfactual samples to promote invariance across demographic groups. We evaluate each method with ADFBERT on the StereoSet benchmark, using the Idealized Context Association Test (ICAT) score for assessment. Experimental results show that ADFBERT combined with AFT improves the ICAT score by 0.78 points over XLNet-large, while ADFBERT with CDA achieves a state-of-the-art ICAT score of 78.90%, surpassing the strongest baselines, including XLNet-large, RoBERTa, and GPT-3, by 6.90%. These findings highlight that integrating dynamic fusion with adversarial fine-tuning and counterfactual data augmentation significantly improves bias mitigation and fairness in pre-trained language models.
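To make the two debiasing strategies concrete, the sketch below illustrates one common way to realize them; it is not the paper's implementation, and the adversary architecture, the hidden size, the loss weights, and the word-swap list are all illustrative assumptions. AFT is shown as an adversarial loss with gradient reversal, and CDA as a naive demographic-term swap; ADFBERT's fusion layer and training details are not reproduced here.

```python
# Minimal sketch of the two debiasing ideas, assuming PyTorch.
# NOT the authors' implementation: architecture, weights, and swap
# list are illustrative assumptions only.
import torch
import torch.nn as nn

# --- Adversarial Fine-Tuning (AFT) via gradient reversal ------------------
# Forward pass is the identity; the backward pass flips the gradient sign,
# so the encoder learns representations the bias classifier cannot exploit.
class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negate the gradient flowing back into the encoder; no grad for lambd.
        return -ctx.lambd * grad_output, None

class BiasAdversary(nn.Module):
    """Predicts a protected attribute (e.g. gender) from pooled embeddings."""
    def __init__(self, hidden_size=768, n_groups=2, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 128), nn.ReLU(), nn.Linear(128, n_groups)
        )

    def forward(self, pooled):
        return self.head(GradReverse.apply(pooled, self.lambd))

def aft_loss(task_logits, task_labels, bias_logits, bias_labels, alpha=0.1):
    """Task loss plus (gradient-reversed) adversarial bias loss."""
    ce = nn.CrossEntropyLoss()
    return ce(task_logits, task_labels) + alpha * ce(bias_logits, bias_labels)

# --- Counterfactual Data Augmentation (CDA) -------------------------------
# Swap demographic terms so the model sees both variants of each sentence;
# the word list here is a tiny illustrative subset.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "man": "woman", "woman": "man"}

def counterfactual(sentence: str) -> str:
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.lower().split())

if __name__ == "__main__":
    pooled = torch.randn(4, 768)                    # stand-in encoder output
    adversary = BiasAdversary()
    print(adversary(pooled).shape)                  # torch.Size([4, 2])
    print(counterfactual("He praised his doctor"))  # "she praised her doctor"
```

Under this reading, the task loss and the adversary's loss are optimized jointly: because the gradient is reversed before reaching the encoder, improving the adversary simultaneously pushes the encoder toward representations from which the protected attribute cannot be recovered, which matches the abstract's description of minimizing correlations between learned representations and biased attributes.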
