NE-BERT: A Multilingual Language Model for 9 Northeast Indian Languages

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse languages, yet critically underrepresented low-resource languages remain marginalized [1]. We present NE-BERT, a domain-specific multilingual encoder model trained on approximately 8.3 million sentences spanning 9 Northeast Indian languages and 2 anchor languages (Hindi, English), covering a linguistically diverse region with minimal representation in existing multilingual models [2]. By employing weighted data sampling and a custom SentencePiece Unigram tokenizer [3], NE-BERT outperforms IndicBERT [4] across all evaluated languages, achieving an average perplexity 2.85 lower. Our tokenizer demonstrates superior efficiency on ultra-low-resource languages, with 1.60× better tokenization fertility than mBERT [5]. We address critical vocabulary fragmentation in extremely low-resource languages such as Pnar (1,002 sentences) and Kokborok (2,463 sentences) through aggressive upsampling strategies. We release NE-BERT under CC-BY-4.0 to support NLP research and digital inclusion for Northeast Indian communities.
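The abstract mentions weighted data sampling, a custom SentencePiece Unigram tokenizer, and a tokenization fertility metric. The sketch below is not the authors' released code; it illustrates how such a pipeline is commonly assembled with the sentencepiece library, assuming exponentially smoothed (temperature-style) sampling weights for upsampling and measuring fertility as subword pieces per whitespace-separated word. The file names, vocabulary size, smoothing exponent, and character coverage are illustrative assumptions, not values reported in the paper.

    # Minimal sketch (not the released NE-BERT code): weighted sampling,
    # SentencePiece Unigram training, and a tokenization-fertility check.
    import sentencepiece as spm

    def sampling_weights(sentence_counts, alpha=0.3):
        """Exponentially smoothed sampling probabilities, a common way to
        upsample low-resource languages; alpha=0.3 is an assumed value."""
        total_sentences = sum(sentence_counts.values())
        smoothed = {lang: (n / total_sentences) ** alpha
                    for lang, n in sentence_counts.items()}
        norm = sum(smoothed.values())
        return {lang: w / norm for lang, w in smoothed.items()}

    # Sentence counts for the two smallest languages come from the abstract;
    # the Hindi count is a placeholder to show how smoothing boosts small languages.
    counts = {"pnar": 1_002, "kokborok": 2_463, "hindi": 2_000_000}
    print(sampling_weights(counts))

    # Train a Unigram-LM tokenizer on the upsampled multilingual corpus
    # (file name, vocab_size, and character_coverage are assumptions).
    spm.SentencePieceTrainer.train(
        input="ne_corpus_upsampled.txt",   # one sentence per line
        model_prefix="ne_bert_unigram",
        model_type="unigram",
        vocab_size=32_000,
        character_coverage=0.9999,         # keep rare Indic-script characters
    )

    sp = spm.SentencePieceProcessor(model_file="ne_bert_unigram.model")

    def fertility(sentences):
        """Average subword pieces per whitespace-separated word;
        lower values indicate less vocabulary fragmentation."""
        pieces = sum(len(sp.encode(s, out_type=str)) for s in sentences)
        words = sum(len(s.split()) for s in sentences)
        return pieces / max(words, 1)

    # Example: fertility on a held-out sample (hypothetical file path).
    held_out = open("pnar_heldout.txt", encoding="utf-8").read().splitlines()
    print(f"fertility = {fertility(held_out):.2f}")

Comparing this fertility value against the same measurement for an existing tokenizer such as mBERT's would reproduce the kind of efficiency comparison the abstract reports, though the exact evaluation protocol used by the authors is not specified here.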
