DDPO: Diversity-Driven Preference Optimization for Machine Translation Enhancing Robustness and Generalization
Abstract
Large Language Models (LLMs) have advanced Machine Translation (MT), but fine-tuning often struggles with data scarcity, especially in low-resource settings. Preference Optimization (PO) methods, such as DPO and CRPO, learn from preference data. However, existing PO approaches primarily select the "best" candidates based on reward and confidence, often overlooking diversity among candidate translations. This can lead models to learn similar error types, limiting generalization and robustness. To address this, we propose Diversity-Driven Preference Optimization (DDPO), a novel method that integrates diversity into preference sample selection. DDPO selects the dispreferred translation ($y_l$) not only for its lower reward and confidence but, crucially, for its maximal semantic or syntactic diversity from the preferred translation ($y_w$). This provides richer, more informative learning signals, compelling the model to learn robust preference boundaries. Experiments on ALMA-7B and NLLB-1.3B, using FLORES-200 for preference-data construction and evaluating on WMT21/22 test sets across 10 translation directions, consistently demonstrate that DDPO significantly outperforms state-of-the-art baselines, including CRPO, across all automated metrics (KIWI22, COMET22, XCOMET, KIWI-XL). This establishes DDPO as a more effective and robust approach for fine-tuning MT models, achieving superior translation quality and enhanced generalization with modest computational overhead.
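To make the selection criterion concrete, the sketch below illustrates one way a DDPO-style preference pair could be constructed from a pool of candidate translations. It is only an illustration under assumptions: the `Candidate` fields, the `dissimilarity` callback, the weight `alpha`, and the additive trade-off between low reward/confidence and diversity are hypothetical, since the abstract specifies only that $y_w$ is the preferred candidate and $y_l$ is chosen for lower reward/confidence and maximal diversity from $y_w$.

```python
# Hypothetical sketch of DDPO-style preference-pair selection (not the authors' code).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    text: str
    reward: float      # e.g., a quality-estimation score for the candidate
    confidence: float  # e.g., the model's (normalized) log-probability


def select_preference_pair(
    candidates: List[Candidate],
    dissimilarity: Callable[[str, str], float],
    alpha: float = 1.0,  # assumed weight trading off diversity vs. low quality
) -> Tuple[Candidate, Candidate]:
    """Return (y_w, y_l): y_w is the best candidate; y_l is a low-reward,
    low-confidence candidate that is maximally diverse from y_w."""
    # Preferred translation: highest reward, ties broken by confidence.
    y_w = max(candidates, key=lambda c: (c.reward, c.confidence))

    # Dispreferred translation: favour candidates that are both worse
    # (lower reward + confidence) and more dissimilar to y_w.
    def dispreferred_score(c: Candidate) -> float:
        return alpha * dissimilarity(y_w.text, c.text) - (c.reward + c.confidence)

    y_l = max((c for c in candidates if c is not y_w), key=dispreferred_score)
    return y_w, y_l
```

In practice, `dissimilarity` could be, for instance, one minus the cosine similarity of sentence embeddings (semantic) or a normalized token-level edit distance (syntactic); the paper's exact diversity measure and scoring rule may differ.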