A Transformer-Based Language Model for Nyishi, a Low-Resource Language of Northeast India
Abstract
NyishiBERT is a foundational transformer-based language model developed specifically for Nyishi (njz-Latn), a Sino-Tibetan language of Northeast India. Built on the ModernBERT-Base architecture, the model was trained on a low-resource corpus of only 55,870 sentences sourced from the WMT25 shared task, achieving a test perplexity of 20.78. Downstream performance was evaluated on a sentiment classification task constructed via label projection and high-confidence filtering. Results demonstrate that the learned representations effectively support classification even under weak supervision. The model, training code, and evaluation datasets are publicly released to provide a foundational baseline for future research on the Tani language subgroup.
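As context for the perplexity figure reported above: for an encoder-only model such as ModernBERT, perplexity is conventionally derived from the masked-language-modeling cross-entropy on held-out text. The sketch below illustrates this standard procedure with the Hugging Face `transformers` library; the checkpoint name, masking probability, and placeholder test sentences are assumptions for illustration, not the paper's released artifacts, and the paper's exact evaluation protocol may differ.

```python
import math
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

# Hypothetical checkpoint; the released NyishiBERT weights may use another name.
MODEL_NAME = "answerdotai/ModernBERT-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()
# Standard 15% random masking; results are stochastic across runs.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

test_sentences = ["placeholder held-out sentence"]  # substitute the Nyishi test set

total_loss, n_batches = 0.0, 0
with torch.no_grad():
    for sent in test_sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        batch = collator([{k: v[0] for k, v in enc.items()}])
        out = model(**batch)  # loss = cross-entropy over masked positions only
        total_loss += out.loss.item()
        n_batches += 1

# Perplexity as the exponential of the mean masked-LM cross-entropy.
print(f"MLM perplexity: {math.exp(total_loss / n_batches):.2f}")
```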