Creating Datasets of Parallel Sentences in Low-Resource Languages Using AI

Abstract

This study addresses data scarcity for low-resource languages, focusing on a methodology for creating a corpus of parallel sentences in two low-resource languages. The lack of large-scale, high-quality bilingual datasets significantly hinders the development of neural machine translation systems for such languages. In this work, several AI systems for generating a parallel corpus are compared on a test dataset, using three selection criteria: accessibility (free to use), translation quality, and efficiency. One AI system was selected according to these criteria, and its performance in generating parallel data was assessed. As an example, a sizable Kyrgyz-Kazakh parallel corpus was created. Error analysis revealed that approximately 0.5% of the translations contained inaccuracies, highlighting the need for further post-editing and model refinement. This study contributes to resource development for low-resource language pairs and provides practical insights into the efficient creation of parallel corpora using modern AI systems.
