GinLish Corpus v0.1.0 - Development and Evaluation of Low-Resource Tagin-English Parallel Corpus
Abstract
This paper introduces GinLish Corpus v0.1.0, the first Tagin-English parallel corpus, addressing a critical gap in resources for the definitely endangered Tagin language, a Tani language of the Sino-Tibetan family spoken in Arunachal Pradesh, India. Prior to this work, few, if any, online resources appear to have been available for the Tagin language, which makes this corpus uniquely valuable. Our dataset comprises 35,000 meticulously collected and aligned English and Tagin sentence pairs. We use this corpus to conduct a Neural Machine Translation (NMT) study comparing Recurrent Neural Network (RNN) and Transformer architectures, evaluated using BLEU, METEOR, chrF, and TER scores. Our results show that the RNN model performed best, with BLEU scores of 26.07 for English-to-Tagin and 25.12 for Tagin-to-English translation. In contrast, the Transformer model underperformed, with BLEU scores of 22.14 for English-to-Tagin and 20.28 for Tagin-to-English translation. Our findings lay the groundwork for future research in Tagin language technology, aiding language preservation and extending the reach of NMT to extremely low-resource languages.
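To make the headline metric concrete, the following is a minimal sketch of corpus-level BLEU: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It is not the evaluation script used in this work. It assumes whitespace tokenization, a single reference per sentence, and no smoothing, and the function names `ngrams` and `corpus_bleu` and the English placeholder sentences are hypothetical. Scores from standard toolkits may differ because of tokenization and smoothing choices.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU on a 0-100 scale (illustrative sketch).

    Assumes one reference per hypothesis and whitespace tokenization.
    """
    matches = [0] * max_n  # clipped n-gram matches, summed over the corpus
    totals = [0] * max_n   # hypothesis n-gram counts, summed over the corpus
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts = ngrams(h, n)
            r_counts = ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


if __name__ == "__main__":
    # Placeholder English sentences, not drawn from the corpus.
    refs = ["the cat sat on the mat"]
    hyps = ["the cat sat on the mat today"]
    print(f"BLEU = {corpus_bleu(hyps, refs):.2f}")
```

Aggregating match counts over the whole test set before taking the geometric mean, rather than averaging per-sentence scores, follows the usual corpus-level definition of BLEU and avoids zero scores on short sentences that lack any 4-gram match.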