Measuring Lexical Distance between Parallel Corpora: The Case of AI-Generated News Translation
Abstract
Since the University of Warwick’s news translation project in the mid-2000s, it has been a truism that journalists rarely translate whole articles but instead compose stories using texts in other languages as one source among others. However, the development of AI-based machine translation has brought about a shift in journalistic practices. Increasingly, multilingual news agencies are using these tools to produce similar stories in multiple languages. One consequence has been that researchers can now compile parallel corpora of translated stories. This article proposes a method to characterize such corpora by measuring the distance between source and target texts, a method it applies to stories published in English and French on the website SwissInfo.ch. It describes the mechanics of corpus-building, article vectorization, and the creation of a lexical substitution list that makes measurement possible. It then proposes three measures – Euclidean, Jaccard, and cosine – which have complementary strengths and weaknesses. The value of these measurement tools is heuristic: they make it possible to identify patterns that can be investigated using other methods more familiar to news translation researchers, such as interviews or direct observation.
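The abstract names three distance measures without defining them. Below is a minimal sketch of how such measures might be computed over bag-of-words article vectors; the function names and toy data are illustrative assumptions, not the article's actual pipeline, which also involves corpus-building and a lexical substitution list not reproduced here.

```python
# Illustrative sketch only: vectors and token lists are hypothetical stand-ins
# for articles that have already been tokenized and mapped onto a shared
# vocabulary (e.g. via a lexical substitution list).
import math

def euclidean_distance(a, b):
    """Euclidean distance between two term-count vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(tokens_a, tokens_b):
    """1 minus the Jaccard similarity of the two articles' vocabularies."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return 1 - len(set_a & set_b) / len(set_a | set_b)

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Toy example: counts over a shared vocabulary for a source article and its
# target-language counterpart (hypothetical data).
source_vec = [2, 1, 0, 3]
target_vec = [1, 1, 1, 3]

print(euclidean_distance(source_vec, target_vec))
print(cosine_distance(source_vec, target_vec))
print(jaccard_distance(["parliament", "vote", "law"],
                       ["parliament", "ballot", "law"]))
```

The three measures respond to different properties of the texts: Euclidean distance is sensitive to raw frequency differences, cosine distance to the relative distribution of terms regardless of article length, and Jaccard distance only to vocabulary overlap, which is one way to read the abstract's claim that they have complementary strengths and weaknesses.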