Fine-grained Debiasing for Large Language Models via Bias Intensity and Probability Decoupling
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities but often inherit and even amplify social biases present in their training data. Existing debiasing approaches, particularly those based on human preference alignment such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), typically treat bias as a binary attribute, overlooking nuanced differences in bias intensity. Moreover, optimizing solely for the probability gap between preferred (less biased) and rejected (more biased) responses can lead to an undesirable phenomenon in which the probabilities of both biased and neutral responses increase simultaneously.

To address these limitations, we propose a novel fine-grained debiasing framework for LLMs featuring two key innovations. First, we introduce a method to quantify bias intensity using a multi-model evaluation committee and integrate this fine-grained signal into the DPO objective, resulting in Bias-Intensity Weighted DPO (BIW-DPO). This enables the model to apply differentiated penalties based on the severity of bias. Second, we propose a Probability Decoupling Regularization (PDR) term that dynamically suppresses the probabilities of both preferred and rejected responses according to the perceived bias level, effectively preventing the coupled escalation of biased outputs.

Extensive experiments on both English and Chinese bias benchmarks (BBQ, CBBQ, GenderAlign) demonstrate that our integrated approach, DPO-FGD, achieves substantial bias reduction compared to standard DPO while mitigating performance degradation on general-capability benchmarks (MMLU, GSM8K, MT-Bench). Our analysis further confirms the effectiveness of fine-grained bias intensity modeling and highlights the critical role of decoupling response probabilities in robust debiasing.
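To make the two ingredients concrete, the following is a minimal per-example sketch of how a bias-intensity weight and a decoupling penalty could be combined with the standard DPO loss. The abstract gives no formulas, so the weighting scheme (a scalar `bias_intensity` in [0, 1] scaling the DPO term), the PDR form (a penalty on the sum of both responses' log-probabilities), and the hyperparameter names `beta` and `lam` are all illustrative assumptions, not the paper's actual objective.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def biw_dpo_pdr_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej,
                     bias_intensity, beta=0.1, lam=0.05):
    """Hypothetical per-example BIW-DPO loss with a PDR-style penalty.

    logp_*     : policy log-probabilities of the preferred / rejected response
    ref_logp_* : reference-model log-probabilities of the same responses
    bias_intensity : committee-assigned severity score in [0, 1] (assumed form)
    """
    # Standard DPO margin: implicit reward gap between preferred and rejected.
    margin = (logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej)

    # Bias-intensity weighting: more severe bias pairs get a larger penalty.
    dpo_term = -bias_intensity * math.log(sigmoid(beta * margin))

    # Decoupling penalty (assumed form): minimizing this term pushes DOWN the
    # log-probabilities of both responses, scaled by the perceived bias level,
    # so preferred and rejected probabilities cannot rise together unchecked.
    pdr_term = lam * bias_intensity * (logp_pref + logp_rej)

    return dpo_term + pdr_term
```

Under this sketch, two pairs with the same preference margin but different severity scores receive different gradients, and a pair whose responses both become more probable is penalized more by the PDR term.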