Measuring Lexical Distance between Parallel Corpora: The Case of AI-Generated News Translation
Abstract
Since the University of Warwick’s news translation project in the mid-2000s, it has been a truism that journalists rarely translate whole articles but instead compose stories using texts in other languages as one source among others. However, the development of AI-based machine translation has brought about a shift in journalistic practices. Increasingly, multilingual news agencies are using these tools to produce similar stories in multiple languages. One consequence has been that researchers can now compile parallel corpora of translated stories. This article proposes a method to characterize such corpora by measuring the distance between source and target texts, a method it applies to stories published in English and French on the website SwissInfo.ch. It describes the mechanics of corpus-building, article vectorization, and the creation of a lexical substitution list that makes measurement possible. It then proposes three measures – Euclidean, Jaccard, and cosine – which have complementary strengths and weaknesses. The value of these measurement tools is heuristic: they make it possible to identify patterns that can be investigated using other methods more familiar to news translation researchers, such as interviews or direct observation.
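The abstract names three distance measures without defining them. Below is a minimal sketch of how such measures might be computed over bag-of-words article vectors; the function names and toy data are illustrative assumptions, not the article's actual pipeline, which also involves corpus-building and a lexical substitution list not reproduced here.

```python
# Illustrative sketch only: vectors and token lists are hypothetical stand-ins
# for articles that have already been tokenized and mapped onto a shared
# vocabulary (e.g. via a lexical substitution list).
import math

def euclidean_distance(a, b):
    """Euclidean distance between two term-count vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(tokens_a, tokens_b):
    """1 minus the Jaccard similarity of the two articles' vocabularies."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return 1 - len(set_a & set_b) / len(set_a | set_b)

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Toy example: counts over a shared vocabulary for a source article and its
# target-language counterpart (hypothetical data).
source_vec = [2, 1, 0, 3]
target_vec = [1, 1, 1, 3]

print(euclidean_distance(source_vec, target_vec))
print(cosine_distance(source_vec, target_vec))
print(jaccard_distance(["parliament", "vote", "law"],
                       ["parliament", "ballot", "law"]))
```

The three measures respond to different properties of the texts: Euclidean distance is sensitive to raw frequency differences, cosine distance to the relative distribution of terms regardless of article length, and Jaccard distance only to vocabulary overlap, which is one way to read the abstract's claim that they have complementary strengths and weaknesses.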