Towards Multilingual Machine Translation for Low-Resource South Asian Languages: A Transformer-Based Approach on English–Urdu–Kashmiri

Abstract

Machine translation remains a significant challenge for low-resource languages, particularly in linguistically diverse regions such as South Asia. To address the scarcity of translation resources for Kashmiri and Urdu, a high-quality multilingual parallel corpus was constructed. The corpus comprises over 26,000 English sentences, manually translated and aligned with their Urdu and Kashmiri equivalents. The dataset was further expanded by adding Urdu translations and by integrating a publicly available English–Kashmiri corpus of 16,000 samples. The NLLB-200 multilingual model was then fine-tuned on this dataset for English–Kashmiri and English–Urdu translation. For English–Urdu, the fine-tuned model achieved BLEU scores of 35.6 and 36.13 in separate runs. For English–Kashmiri, BLEU scores rose from zero-shot baselines of 6.28 and 5.68 to 17.9 and 18.29 after fine-tuning, a substantial improvement in translation performance. Qualitative evaluation further indicates improved fluency and grammatical accuracy in the model outputs. Statistical analysis and visualizations provide additional support for these findings. The resulting resources and methodology can inform the development of future multilingual and bidirectional machine translation systems between English and underrepresented languages such as Kashmiri and Urdu.
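To put the reported English–Kashmiri improvement in concrete terms, the short sketch below computes the absolute and relative BLEU gains from the figures given in the abstract. It assumes that the baseline and fine-tuned scores are listed in the same run order (6.28 with 17.9, and 5.68 with 18.29), which the abstract does not state explicitly; the function name `bleu_gains` is purely illustrative.

```python
# BLEU scores as reported in the abstract for English–Kashmiri.
# Assumption: the two lists are in the same run order, so values pair by position.
zero_shot = [6.28, 5.68]
fine_tuned = [17.9, 18.29]


def bleu_gains(before, after):
    """Return (absolute gain in BLEU points, relative gain as a fraction)."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


gains = [bleu_gains(b, a) for b, a in zip(zero_shot, fine_tuned)]

for run, (abs_gain, rel_gain) in enumerate(gains, start=1):
    print(f"run {run}: +{abs_gain:.2f} BLEU ({rel_gain:.0%} relative)")
```

Under that pairing, fine-tuning adds roughly 11.6 and 12.6 BLEU points in the two runs, which is close to a threefold increase over the zero-shot scores in each case.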
