NE-BERT: A Multilingual Language Model for 9 Northeast Indian Languages

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse languages, yet critically underrepresented low-resource languages remain marginalized [1]. We present NE-BERT, a domain-specific multilingual encoder model trained on approximately 8.3 million sentences spanning 9 Northeast Indian languages and 2 anchor languages (Hindi, English), covering a linguistically diverse region with minimal representation in existing multilingual models [2]. By employing weighted data sampling and a custom SentencePiece Unigram tokenizer [3], NE-BERT outperforms IndicBERT [4] across all evaluated languages, achieving an average perplexity 2.85 lower. Our tokenizer demonstrates superior efficiency on ultra-low-resource languages, with 1.60× better tokenization fertility than mBERT [5]. We address critical vocabulary fragmentation in extremely low-resource languages such as Pnar (1,002 sentences) and Kokborok (2,463 sentences) through aggressive upsampling strategies. We release NE-BERT under CC-BY-4.0 to support NLP research and digital inclusion for Northeast Indian communities.
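The abstract mentions weighted data sampling, a custom SentencePiece Unigram tokenizer, and a tokenization fertility metric. The sketch below is not the authors' released code; it illustrates how such a pipeline is commonly assembled with the sentencepiece library, assuming exponentially smoothed (temperature-style) sampling weights for upsampling and measuring fertility as subword pieces per whitespace-separated word. The file names, vocabulary size, smoothing exponent, and character coverage are illustrative assumptions, not values reported in the paper.

    # Minimal sketch (not the released NE-BERT code): weighted sampling,
    # SentencePiece Unigram training, and a tokenization-fertility check.
    import sentencepiece as spm

    def sampling_weights(sentence_counts, alpha=0.3):
        """Exponentially smoothed sampling probabilities, a common way to
        upsample low-resource languages; alpha=0.3 is an assumed value."""
        total_sentences = sum(sentence_counts.values())
        smoothed = {lang: (n / total_sentences) ** alpha
                    for lang, n in sentence_counts.items()}
        norm = sum(smoothed.values())
        return {lang: w / norm for lang, w in smoothed.items()}

    # Sentence counts for the two smallest languages come from the abstract;
    # the Hindi count is a placeholder to show how smoothing boosts small languages.
    counts = {"pnar": 1_002, "kokborok": 2_463, "hindi": 2_000_000}
    print(sampling_weights(counts))

    # Train a Unigram-LM tokenizer on the upsampled multilingual corpus
    # (file name, vocab_size, and character_coverage are assumptions).
    spm.SentencePieceTrainer.train(
        input="ne_corpus_upsampled.txt",   # one sentence per line
        model_prefix="ne_bert_unigram",
        model_type="unigram",
        vocab_size=32_000,
        character_coverage=0.9999,         # keep rare Indic-script characters
    )

    sp = spm.SentencePieceProcessor(model_file="ne_bert_unigram.model")

    def fertility(sentences):
        """Average subword pieces per whitespace-separated word;
        lower values indicate less vocabulary fragmentation."""
        pieces = sum(len(sp.encode(s, out_type=str)) for s in sentences)
        words = sum(len(s.split()) for s in sentences)
        return pieces / max(words, 1)

    # Example: fertility on a held-out sample (hypothetical file path).
    held_out = open("pnar_heldout.txt", encoding="utf-8").read().splitlines()
    print(f"fertility = {fertility(held_out):.2f}")

Comparing this fertility value against the same measurement for an existing tokenizer such as mBERT's would reproduce the kind of efficiency comparison the abstract reports, though the exact evaluation protocol used by the authors is not specified here.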
