Fluent vs. Non-fluent Data Augmentation in Knowledge Distillation for Machine Translation for Low-Resource Languages
Abstract
This paper investigates the impact of different data augmentation techniques proposed in the literature for sequence-level knowledge distillation (KD) of neural machine translation models for low-resource languages. In KD, a smaller and faster student model learns to replicate the behaviour of a larger, more powerful teacher model by training on synthetic data generated by the teacher. We compare standard sequence-level KD, which uses a single beam-search output from the teacher, with two data augmentation strategies that expand the synthetic corpus the teacher generates for distillation. The first strategy, Multi-task Learning Data Augmentation (MaTiLDA), perturbs the teacher's outputs to produce non-fluent variants of the target. The second uses Multi-Hypothesis KD (MHD) as a data augmentation method, enriching the distillation corpus with multiple translations for each source sentence and thereby capturing a broader range of the teacher's output distribution. Our experiments confirm that both strategies outperform standard distillation. However, we find that combining MaTiLDA with MHD data yields suboptimal results, primarily because of the noise introduced by the diverse synthetic corpus. Our study analyses the properties of each augmentation method and offers insights into the conditions that improve knowledge distillation and student model performance in low-resource scenarios.
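To make the two augmentation strategies concrete, the following is a minimal illustrative sketch, not the paper's implementation: it generates a sequence-level KD corpus from a teacher model, optionally returning an n-best list per source sentence (in the spirit of MHD), and applies a toy token drop/swap perturbation as a stand-in for MaTiLDA-style non-fluent variants. The teacher checkpoint name, beam size, and perturbation probabilities are placeholder assumptions.

```python
# Illustrative only: sequence-level KD corpus generation with optional
# multi-hypothesis output and a toy non-fluent perturbation.
import random
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

TEACHER_NAME = "Helsinki-NLP/opus-mt-en-de"  # placeholder teacher checkpoint
tokenizer = AutoTokenizer.from_pretrained(TEACHER_NAME)
teacher = AutoModelForSeq2SeqLM.from_pretrained(TEACHER_NAME)


def distill_translations(sources, num_hypotheses=1, num_beams=5):
    """Translate each source with the teacher.

    num_hypotheses=1 mimics standard sequence-level KD (single beam output);
    num_hypotheses>1 yields an n-best list, as in multi-hypothesis distillation.
    """
    corpus = []
    for src in sources:
        inputs = tokenizer(src, return_tensors="pt")
        outputs = teacher.generate(
            **inputs,
            num_beams=num_beams,
            num_return_sequences=num_hypotheses,
        )
        hyps = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        corpus.extend((src, hyp) for hyp in hyps)
    return corpus


def perturb(target, drop_prob=0.1, swap_prob=0.1):
    """Create a non-fluent variant of a teacher output by randomly dropping
    and swapping tokens (a toy stand-in for MaTiLDA-style perturbations)."""
    tokens = [t for t in target.split() if random.random() > drop_prob]
    for i in range(len(tokens) - 1):
        if random.random() < swap_prob:
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)


if __name__ == "__main__":
    sources = ["The cat sat on the mat.", "Small models benefit from distillation."]
    kd_pairs = distill_translations(sources, num_hypotheses=3)  # MHD-style corpus
    augmented = [(src, perturb(tgt)) for src, tgt in kd_pairs]  # non-fluent variants
    for pair in kd_pairs + augmented:
        print(pair)
```

In this sketch the student would be trained on `kd_pairs` (standard or multi-hypothesis distillation), with the perturbed pairs in `augmented` used as an auxiliary signal rather than as direct translation targets; the exact training setup in the paper may differ.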