Variability in Low-Resource Machine Translation Evaluation: Authentic vs. LLM-Generated Training Corpora
Abstract
The evaluation of machine translation (MT) systems often relies on a single metric and a single test dataset, an approach that can yield misleading system comparisons and premature conclusions regarding translation quality. A further complicating factor is the presence of translationese in test data, i.e., linguistic features specific to translated texts, which can significantly influence both human and automatic assessments, particularly when present in the source language. In this paper, we examine the variability of MT evaluation results across datasets for English–Galician, Spanish–Galician, and Portuguese–Galician. We investigate whether the variability observed in prior research can be replicated by training our own models from scratch, explore the feasibility of generating synthetic training corpora with large language models (LLMs), and assess whether results obtained with authentic data can be reproduced with synthetic corpora. Additionally, we examine the relationship between dataset variability and the translationese effect. Our findings provide new insights into the influence of dataset composition on MT evaluation, the utility of LLM-generated corpora, and the challenges of evaluating translation quality in low-resource settings.
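As a minimal sketch of the multi-metric, multi-dataset evaluation the abstract argues for, the snippet below scores one system's output on several test sets with two automatic metrics (BLEU and chrF) using the sacrebleu library. The test-set labels and file paths are hypothetical placeholders, not artifacts of this study, and the setup assumes plain-text hypothesis and reference files with one segment per line.

import sacrebleu

# Hypothetical test sets for an English–Galician system (placeholder paths).
test_sets = {
    "flores": ("flores.hyp.gl", "flores.ref.gl"),
    "news":   ("news.hyp.gl", "news.ref.gl"),
}

def read_lines(path):
    # Read one translation segment per line.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

for name, (hyp_path, ref_path) in test_sets.items():
    hyps = read_lines(hyp_path)
    refs = [read_lines(ref_path)]  # sacrebleu expects a list of reference streams
    bleu = sacrebleu.corpus_bleu(hyps, refs)
    chrf = sacrebleu.corpus_chrf(hyps, refs)
    print(f"{name}: BLEU={bleu.score:.1f} chrF={chrf.score:.1f}")

Comparing such per-dataset scores side by side is one simple way to surface the evaluation variability discussed in the paper: a system that ranks first on one test set or metric may not do so on another.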