AssameseRoBERTa: A Monolingual Language Model for Low-Resource Assamese NLP
Abstract
We present AssameseRoBERTa, a monolingual language model trained from scratch on 1.6 million Assamese sentences comprising approximately 77 million tokens. Despite being trained on a modest corpus compared to mainstream language models, our model achieves substantial improvements over existing multilingual baselines. AssameseRoBERTa obtains a perplexity of 1.57 on in-domain text and 5.93 on unseen text, a 7.7× improvement over the previous best Assamese-specific model, and it outperforms multilingual models such as mBERT and MuRIL by wide margins. Our results demonstrate that dedicated monolingual models can effectively address the challenges of low-resource language processing, particularly for morphologically rich languages like Assamese. We release our model and training methodology to facilitate further research in Northeast Indian language technologies.
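As a minimal sketch of how the reported perplexity could be reproduced for a RoBERTa-style masked language model, the snippet below computes pseudo-perplexity (masking each token in turn and scoring the true token). This is not the authors' released evaluation code; the checkpoint name "assamese-roberta" and the example sentence are placeholders.

```python
# Hedged sketch: pseudo-perplexity for a masked LM with Hugging Face transformers.
# "assamese-roberta" is a hypothetical checkpoint name, not the paper's actual release.
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("assamese-roberta")  # placeholder checkpoint
model = AutoModelForMaskedLM.from_pretrained("assamese-roberta")
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask each token in turn, average the negative log-likelihood of the
    true token, and exponentiate (Salazar et al.-style pseudo-perplexity)."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, len(ids) - 1):  # skip the <s> and </s> special tokens
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("অসমীয়া ভাষা এটা সুন্দৰ ভাষা।"))  # example Assamese sentence
```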