Fluent vs. Non-fluent Data Augmentation in Knowledge Distillation for Machine Translation for Low-Resource Languages
Abstract
This paper investigates the impact of different data augmentation techniques proposed in the literature for sequence-level knowledge distillation (KD) of neural machine translation models for low-resource languages. In KD, a smaller and faster student model learns to replicate the behaviour of a larger, more powerful teacher model by training on synthetic data generated by the teacher. We compare standard sequence-level KD, which uses a single beam-search output from the teacher, with two data augmentation strategies that expand the synthetic corpus the teacher generates for distillation. The first strategy, Multi-task Learning Data Augmentation (MaTiLDA), perturbs the teacher's outputs to produce non-fluent variants of the target. The second uses Multi-Hypothesis KD (MHD) as a data augmentation method, enriching the distillation corpus with multiple translations for each source sentence and thereby capturing a broader range of the teacher's output distribution. Our experiments confirm that both strategies outperform standard distillation. However, we find that combining MaTiLDA with MHD data yields suboptimal results, primarily because of the noise introduced by the diverse synthetic corpus. Our study analyses the properties of each augmentation method and offers insights into the conditions that improve knowledge distillation and student model performance in low-resource scenarios.
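To make the two augmentation strategies concrete, the following is a minimal illustrative sketch, not the paper's implementation: it generates a sequence-level KD corpus from a teacher model, optionally returning an n-best list per source sentence (in the spirit of MHD), and applies a toy token drop/swap perturbation as a stand-in for MaTiLDA-style non-fluent variants. The teacher checkpoint name, beam size, and perturbation probabilities are placeholder assumptions.

```python
# Illustrative only: sequence-level KD corpus generation with optional
# multi-hypothesis output and a toy non-fluent perturbation.
import random
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

TEACHER_NAME = "Helsinki-NLP/opus-mt-en-de"  # placeholder teacher checkpoint
tokenizer = AutoTokenizer.from_pretrained(TEACHER_NAME)
teacher = AutoModelForSeq2SeqLM.from_pretrained(TEACHER_NAME)


def distill_translations(sources, num_hypotheses=1, num_beams=5):
    """Translate each source with the teacher.

    num_hypotheses=1 mimics standard sequence-level KD (single beam output);
    num_hypotheses>1 yields an n-best list, as in multi-hypothesis distillation.
    """
    corpus = []
    for src in sources:
        inputs = tokenizer(src, return_tensors="pt")
        outputs = teacher.generate(
            **inputs,
            num_beams=num_beams,
            num_return_sequences=num_hypotheses,
        )
        hyps = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        corpus.extend((src, hyp) for hyp in hyps)
    return corpus


def perturb(target, drop_prob=0.1, swap_prob=0.1):
    """Create a non-fluent variant of a teacher output by randomly dropping
    and swapping tokens (a toy stand-in for MaTiLDA-style perturbations)."""
    tokens = [t for t in target.split() if random.random() > drop_prob]
    for i in range(len(tokens) - 1):
        if random.random() < swap_prob:
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)


if __name__ == "__main__":
    sources = ["The cat sat on the mat.", "Small models benefit from distillation."]
    kd_pairs = distill_translations(sources, num_hypotheses=3)  # MHD-style corpus
    augmented = [(src, perturb(tgt)) for src, tgt in kd_pairs]  # non-fluent variants
    for pair in kd_pairs + augmented:
        print(pair)
```

In this sketch the student would be trained on `kd_pairs` (standard or multi-hypothesis distillation), with the perturbed pairs in `augmented` used as an auxiliary signal rather than as direct translation targets; the exact training setup in the paper may differ.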