AssameseRoBERTa: A Monolingual Language Model for Low-Resource Assamese NLP

Abstract

We present AssameseRoBERTa, a monolingual language model trained from scratch on 1.6 million Assamese sentences comprising approximately 77 million tokens. Despite being trained on a relatively modest corpus compared to mainstream language models, our model achieves substantial improvements over existing multilingual baselines. AssameseRoBERTa obtains a perplexity of 1.57 on in-domain text and 5.93 on unseen text, a 7.7× improvement over the previous best Assamese-specific model, and it outperforms multilingual models such as mBERT and MuRIL by wide margins. Our results demonstrate that dedicated monolingual models can effectively address the challenges of low-resource language processing, particularly for morphologically rich languages such as Assamese. We release our model and training methodology to facilitate further research in Northeast Indian language technologies.
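
As a rough illustration of how perplexity figures like those above could be measured for a RoBERTa-style masked language model, the sketch below computes pseudo-perplexity by masking one token at a time and scoring it with the model. The checkpoint path is a placeholder and the scoring convention is an assumption; the abstract does not specify the paper's exact evaluation protocol.

```python
# Minimal sketch: pseudo-perplexity for a masked LM (RoBERTa-style models
# have no left-to-right likelihood, so each token is masked in turn and
# scored independently). The model path below is hypothetical.
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "path/to/assamese-roberta"  # placeholder, not a released checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    log_probs = []
    # Mask one position at a time, skipping the special tokens at both ends.
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        # Log-probability the model assigns to the true token at the masked slot.
        log_prob = torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
        log_probs.append(log_prob)
    # Exponentiated negative mean log-probability over masked positions.
    return math.exp(-sum(log_probs) / len(log_probs))
```

Pseudo-perplexity is one common convention for scoring masked language models; the paper itself may compute perplexity differently.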
