A Transformer-Based Language Model for Nyishi, a Low-Resource Language of Northeast India

Abstract

NyishiBERT is a foundational transformer-based language model developed for Nyishi (njz-Latn), a Sino-Tibetan language of Northeast India. Built on the ModernBERT-Base architecture, the model was trained on a severely low-resource corpus of 55,870 sentences sourced from the WMT25 shared task, achieving a test perplexity of 20.78. Downstream performance was evaluated on a sentiment classification task constructed via label projection and high-confidence filtering. The results demonstrate that the learned representations support classification effectively even under weak supervision. The model, training code, and evaluation datasets are publicly released as a baseline for future research on the Tani language subgroup.
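As a rough illustration of the pretraining setup the abstract describes, the sketch below pretrains a ModernBERT-Base masked language model from scratch on a one-sentence-per-line corpus with Hugging Face transformers and reports test perplexity as exp(mean eval loss), the metric quoted above. The file names, sequence length, batch size, epoch count, learning rate, and reuse of the stock ModernBERT tokenizer are all assumptions for illustration, not the paper's actual configuration (the authors may, for instance, train a Nyishi-specific tokenizer).

```python
"""Hedged sketch of the pretraining recipe described in the abstract.

Assumptions (not from the paper): file names, sequence length, batch size,
epoch count, learning rate, and reuse of the stock ModernBERT tokenizer.
"""
import math

from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical corpus layout: one Nyishi sentence per line, already split.
dataset = load_dataset(
    "text",
    data_files={"train": "nyishi_train.txt", "test": "nyishi_test.txt"},
)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard masked-language-modeling objective (15% of tokens masked).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Train from a freshly initialized ModernBERT-Base rather than fine-tuning
# the English checkpoint, matching the "from scratch" low-resource setting.
config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_config(config)

args = TrainingArguments(
    output_dir="nyishibert",
    per_device_train_batch_size=32,   # hypothetical
    num_train_epochs=40,              # hypothetical
    learning_rate=5e-4,               # hypothetical
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)
trainer.train()

# Perplexity = exp(mean cross-entropy) on held-out text; the abstract
# reports this metric (20.78) for the released model.
eval_loss = trainer.evaluate()["eval_loss"]
print(f"test perplexity: {math.exp(eval_loss):.2f}")
```

The downstream sentiment evaluation would then fine-tune or probe these pretrained representations on the projected, high-confidence labels; that step is omitted from this sketch.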
