Kren-M: Meghalaya's First Foundational AI Model for the Khasi Language

Abstract

We present Kren-M, Meghalaya’s first foundational AI model and the first generative language model built in Northeast India for an indigenous language. This 2.6-billion-parameter bilingual (Khasi–English) LLM, based on Gemma-2-2B, incorporates a custom tokenizer with 2,135 added tokens, yielding 30–36% tokenization efficiency gains for Khasi and Garo. Continued pre-training on 5.43 million rigorously cleaned Khasi sentences (proprietary) is followed by instruction tuning on 33,034 high-quality examples. Key technical fixes eliminate failure modes such as unwanted auto-translation, instruction echoing, and infinite generation. Kren-M delivers fluent, task-aware bilingual chat and translation. The model and all checkpoints are publicly released on Hugging Face as MWirelabs/Kren-M. To support Northeast Indian language research, we additionally release one of the largest public Assamese and Mizo corpora, along with the first public Garo corpus.
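Since the abstract states the model is published on Hugging Face as MWirelabs/Kren-M, a minimal sketch of loading and querying it with the standard Transformers API is given below. This assumes the repository exposes ordinary causal-LM weights and tokenizer files; the prompt wording and generation settings are illustrative, not taken from the paper, and an instruction-tuned checkpoint may additionally expect a specific chat template.

```python
# Minimal sketch: load the released checkpoint with Hugging Face Transformers.
# Assumes standard causal-LM artifacts in the MWirelabs/Kren-M repository;
# prompt format and generation parameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MWirelabs/Kren-M"  # repository named in the abstract
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example bilingual translation-style prompt (hypothetical format).
prompt = "Translate to Khasi: Good morning."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```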