Fine-grained Debiasing for Large Language Models via Bias Intensity and Probability Decoupling
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities but often inherit and even amplify social biases present in their training data. Existing debiasing approaches, particularly those based on human preference alignment such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), typically treat bias as a binary attribute, overlooking nuanced differences in bias intensity. Moreover, optimizing solely for the probability gap between preferred (less biased) and rejected (more biased) responses can lead to an undesirable phenomenon in which the probabilities of both biased and neutral responses increase simultaneously.

To address these limitations, we propose a novel fine-grained debiasing framework for LLMs featuring two key innovations. First, we introduce a method to quantify bias intensity using a multi-model evaluation committee and integrate this fine-grained signal into the DPO objective, resulting in Bias-Intensity Weighted DPO (BIW-DPO). This enables the model to apply differentiated penalties based on the severity of bias. Second, we propose a Probability Decoupling Regularization (PDR) term that dynamically suppresses the probabilities of both preferred and rejected responses according to the perceived bias level, effectively preventing the coupled escalation of biased outputs.

Extensive experiments on both English and Chinese bias benchmarks (BBQ, CBBQ, GenderAlign) demonstrate that our integrated approach, DPO-FGD, achieves substantial bias reduction compared to standard DPO while mitigating performance degradation on general-capability benchmarks (MMLU, GSM8K, MT-Bench). Our analysis further confirms the effectiveness of fine-grained bias intensity modeling and highlights the critical role of decoupling response probabilities in robust debiasing.
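To make the two ingredients concrete, the following is a minimal per-example sketch of how a bias-intensity weight and a decoupling penalty could be combined with the standard DPO loss. The abstract gives no formulas, so the weighting scheme (a scalar `bias_intensity` in [0, 1] scaling the DPO term), the PDR form (a penalty on the sum of both responses' log-probabilities), and the hyperparameter names `beta` and `lam` are all illustrative assumptions, not the paper's actual objective.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def biw_dpo_pdr_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej,
                     bias_intensity, beta=0.1, lam=0.05):
    """Hypothetical per-example BIW-DPO loss with a PDR-style penalty.

    logp_*     : policy log-probabilities of the preferred / rejected response
    ref_logp_* : reference-model log-probabilities of the same responses
    bias_intensity : committee-assigned severity score in [0, 1] (assumed form)
    """
    # Standard DPO margin: implicit reward gap between preferred and rejected.
    margin = (logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej)

    # Bias-intensity weighting: more severe bias pairs get a larger penalty.
    dpo_term = -bias_intensity * math.log(sigmoid(beta * margin))

    # Decoupling penalty (assumed form): minimizing this term pushes DOWN the
    # log-probabilities of both responses, scaled by the perceived bias level,
    # so preferred and rejected probabilities cannot rise together unchecked.
    pdr_term = lam * bias_intensity * (logp_pref + logp_rej)

    return dpo_term + pdr_term
```

Under this sketch, two pairs with the same preference margin but different severity scores receive different gradients, and a pair whose responses both become more probable is penalized more by the PDR term.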