Noise-Robust Preference Alignment for Large Language Models via Confidence Estimation and Adaptive Optimization
Abstract
Preference alignment is essential for steering language models toward human intentions, yet synthetic preference data often contains noise that hinders generalization. To address this issue, we introduce a noise-robust alignment framework that enhances model resilience to imperfect training data. The approach integrates a Preference Confidence Estimation module, which assigns reliability scores to preference samples, and an Adaptive Robust Optimization strategy that incorporates these scores into the learning process. This design allows the model to emphasize reliable signals and reduce the impact of noisy supervision. Experiments across dialogue, summarization, and instruction-following benchmarks show consistent improvements over existing alignment methods. Further analysis confirms the complementary effects of the two modules and their robustness under varying noise conditions, highlighting the framework’s ability to promote stable and accurate preference learning.
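To make the high-level idea concrete, the sketch below shows one plausible way reliability scores could be folded into a preference-learning objective: a DPO-style pairwise loss in which each sample's contribution is scaled by its estimated confidence. The abstract does not specify the paper's actual objective or weighting scheme, so the function name, the `confidence` input, and the normalization are illustrative assumptions rather than the authors' method.

```python
# Hypothetical sketch: confidence-weighted DPO-style preference loss.
# The exact objective used in the paper is not given in the abstract;
# this only illustrates how per-sample reliability scores could
# down-weight likely-noisy preference pairs during optimization.
import torch
import torch.nn.functional as F


def confidence_weighted_preference_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # reference-model log-probs, shape (B,)
    ref_rejected_logps: torch.Tensor,     # reference-model log-probs, shape (B,)
    confidence: torch.Tensor,             # assumed per-pair reliability in [0, 1], shape (B,)
    beta: float = 0.1,
) -> torch.Tensor:
    """Illustrative DPO-style loss with per-sample confidence weighting."""
    # Implicit reward margin between the chosen and rejected responses.
    logits = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    per_pair_loss = -F.logsigmoid(logits)
    # Scale each pair by its confidence so unreliable pairs contribute less.
    weights = confidence.clamp(0.0, 1.0)
    return (weights * per_pair_loss).sum() / weights.sum().clamp_min(1e-8)
```

Under this assumed formulation, a pair judged unreliable by the confidence estimator contributes proportionally less gradient, which is one simple way to realize the "emphasize reliable signals, reduce noisy supervision" behavior described above.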