Towards High-Quality Machine Translation for Kokborok: A Low-Resource Tibeto-Burman Language of Northeast India

Badal Nyalang
Biman Debbarma

Abstract

We present KokborokMT, a high-quality neural machine translation (NMT) system for Kokborok (ISO 639-3: trp), a Tibeto-Burman language spoken primarily in Tripura, India, by approximately 1.5 million speakers. Despite its status as an official language of Tripura, Kokborok has remained severely under-resourced in the NLP community: prior machine translation attempts have been limited to systems trained on small Bible-derived corpora, with BLEU scores below 7. We fine-tune the NLLB-200-distilled-600M model on a multi-source parallel corpus of 36,052 sentence pairs: 9,284 professionally translated sentences from the SMOL dataset, 1,769 Bible-domain sentences from WMT shared task data, and 24,999 synthetic back-translated pairs generated via Gemini Flash from Tatoeba English source sentences. We introduce trp_Latn as a new language token for Kokborok in the NLLB framework. Our best system achieves BLEU scores of 17.30 (en→trp) and 38.56 (trp→en) on held-out test sets, a substantial improvement over prior published results. Human evaluation by three annotators yields a mean adequacy of 3.74/5 and a mean fluency of 3.70/5, with substantial agreement between trained evaluators (κ = 0.67). We will release the model and code publicly under CC-BY-4.0 upon acceptance.
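
As an illustration of the agreement statistic reported above, the following is a minimal sketch of Cohen's kappa for two raters. The abstract does not say which kappa variant the authors used or how the three annotators were paired, so the choice of Cohen's kappa is an assumption. The function name `cohen_kappa` and the ratings are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical 1-5 adequacy scores, invented for illustration only.
scores_a = [4, 3, 5, 4, 2, 4, 3, 5]
scores_b = [4, 3, 4, 4, 2, 5, 3, 5]
kappa = cohen_kappa(scores_a, scores_b)
print(f"kappa = {kappa:.2f}")
```

Values between 0.61 and 0.80 are conventionally read as "substantial" agreement (the Landis and Koch scale), which matches the wording used for κ = 0.67. For ordinal 1-5 scales a weighted kappa is also common, and the unweighted form here treats every disagreement as equally severe.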

Version published to 10.20944/preprints202603.2322.v1
Mar 31, 2026

Related articles

Natural Language Processing in the Era of Large Language Models: Foundations, Integration, and Low-Resource Frontiers

This article has 1 author:
1. Monisha Gottam
This article has no evaluations
Latest version Mar 6, 2026

A BLEU-Based Comparative Analysis of Human and ChatGPT 4.0 Translation in Kumpulan Lagu dan Cerita Anak-anak Dwibahasa

This article has 2 authors:
1. Amon Bernabas Tenis
2. Adi Sytrisno
This article has no evaluations
Latest version Mar 24, 2026

Language Twin: A Shared-State Architecture for Terminology-Consistent Document Translation with Human Edit Propagation—A Pilot Study

This article has 1 author:
1. Elliott Ahn (Seok-hyun)
This article has no evaluations
Latest version Mar 30, 2026
