Creating Datasets of Parallel Sentences in Low-Resource Languages Using AI

Balzhan Abduali
Marek Milosz
Ualsher Tukeyev

Abstract

This study addresses data scarcity for low-resource languages, focusing on a methodology for creating a corpus of parallel sentences in two low-resource languages. The lack of large-scale, high-quality bilingual datasets significantly hinders the development of neural machine translation systems for such languages. In this work, a comparative analysis of AI systems for generating a parallel corpus is conducted on a test dataset, with selection criteria based on accessibility (free to use), translation quality, and efficiency. One AI system was selected according to these criteria, and its performance in generating parallel data was assessed. As an example, a sizable Kyrgyz-Kazakh parallel corpus was created. However, error analysis revealed that approximately 0.5% of the translations contained inaccuracies, highlighting the need for further post-editing and model refinement. This study contributes to resource development for low-resource language pairs and provides practical insights into the efficient creation of parallel corpora using modern AI systems.

Version published to 10.20944/preprints202505.0556.v1
May 8, 2025

Related articles

Variability in Low-Resource Machine Translation Evaluation: Authentic vs. LLM-Generated Training Corpora

This article has 3 authors:
1. Sofía García González
2. German Rigau Claramunt
3. Jose Ramom Pichel Campos
This article has no evaluations. Latest version: Jan 21, 2026

Neural Machine Translation and Multilingual NLP: A Survey of Methods, Architectures, and Applications

This article has 3 authors:
1. Yao Yuna
2. Junhao Song
3. Jing Qiao
This article has no evaluations. Latest version: Jan 6, 2026

Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research

This article has 10 authors:
1. Derguene Mbaye
2. Tatiana D. P. Mbengue
3. Madoune R. Seye
4. Moussa Diallo
5. Mamadou L. Ndiaye
6. Dimitri S. Adjanohoun
7. Djiby Sow
8. Cheikh S. Wade
9. Jean-Claude B. Munyaka
10. Jerome Chenal
This article has no evaluations. Latest version: Jan 15, 2026
