AssameseRoBERTa: A Monolingual Language Model for Low-Resource Assamese NLP
Abstract
We present AssameseRoBERTa, a monolingual language model trained from scratch on 1.6 million Assamese sentences comprising approximately 77 million tokens. Despite being trained on a modest corpus compared to mainstream language models, our model achieves substantial improvements over existing multilingual baselines. AssameseRoBERTa obtains a perplexity of 1.57 on in-domain text and 5.93 on unseen text, a 7.7× improvement over the previous best Assamese-specific model, and it outperforms multilingual models such as mBERT and MuRIL by wide margins. Our results demonstrate that dedicated monolingual models can effectively address the challenges of low-resource language processing, particularly for morphologically rich languages like Assamese. We release our model and training methodology to facilitate further research in Northeast Indian language technologies.
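As a minimal sketch of how the reported perplexity could be reproduced for a RoBERTa-style masked language model, the snippet below computes pseudo-perplexity (masking each token in turn and scoring the true token). This is not the authors' released evaluation code; the checkpoint name "assamese-roberta" and the example sentence are placeholders.

```python
# Hedged sketch: pseudo-perplexity for a masked LM with Hugging Face transformers.
# "assamese-roberta" is a hypothetical checkpoint name, not the paper's actual release.
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("assamese-roberta")  # placeholder checkpoint
model = AutoModelForMaskedLM.from_pretrained("assamese-roberta")
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask each token in turn, average the negative log-likelihood of the
    true token, and exponentiate (Salazar et al.-style pseudo-perplexity)."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, len(ids) - 1):  # skip the <s> and </s> special tokens
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("অসমীয়া ভাষা এটা সুন্দৰ ভাষা।"))  # example Assamese sentence
```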