Kren-M: Meghalaya's First Foundational AI Model for the Khasi Language

Badal Nyalang

Abstract

We present Kren-M, Meghalaya’s first foundational AI model and the first generative language model built in Northeast India for an indigenous language. This 2.6-billion-parameter bilingual (Khasi–English) LLM, based on Gemma-2-2B, incorporates a custom tokenizer with 2,135 added tokens, yielding 30–36% tokenization efficiency gains for Khasi and Garo. Continued pre-training on 5.43 million rigorously cleaned Khasi sentences (proprietary) is followed by instruction tuning on 33,034 high-quality examples. Key technical fixes eliminate auto-translation, instruction echoing, and infinite generation. Kren-M delivers fluent, task-aware bilingual chat and translation. The model and all checkpoints are publicly released on Hugging Face as MWirelabs/Kren-M. To support Northeast Indian language research, we additionally released one of the largest public Assamese and Mizo corpora, along with the first public Garo corpus.
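As an illustration of how a tokenization efficiency gain of this kind might be computed, the sketch below compares token counts before and after adding tokens to a vocabulary. The abstract does not say how efficiency is measured, so the metric here (relative reduction in token count) is an assumption. The greedy longest-match tokenizer, the toy vocabularies, and the sample string are hypothetical stand-ins, not the Gemma-2 tokenizer or the 2,135 tokens actually added to Kren-M.

```python
def greedy_tokenize(text, vocab):
    """Toy greedy longest-match tokenizer (hypothetical stand-in, not Gemma's).

    At each position, take the longest vocabulary entry that matches;
    if none matches, fall back to a single character.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


def efficiency_gain(baseline_count, extended_count):
    """Assumed metric: relative reduction in the number of tokens."""
    return 1.0 - extended_count / baseline_count


# Illustrative vocabularies only; the real added tokens are not listed in the abstract.
base_vocab = {"ka", "ng", "ia"}
extended_vocab = base_vocab | {"khublei", "shibun"}

sample = "khublei shibun"  # illustrative Khasi phrase
baseline = greedy_tokenize(sample, base_vocab)
extended = greedy_tokenize(sample, extended_vocab)
gain = efficiency_gain(len(baseline), len(extended))

print(f"baseline tokens: {len(baseline)}, extended tokens: {len(extended)}")
print(f"relative reduction: {gain:.1%}")
```

On a real corpus, the same ratio would be computed over the total token counts produced by the original and extended tokenizers, which is presumably closer to how a figure such as 30–36% is obtained.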

Version published to 10.21203/rs.3.rs-8144118/v1 on Research Square, Nov 19, 2025
