Kren-M: Meghalaya's First Foundational AI Model for the Khasi Language
Abstract
We present Kren-M, Meghalaya’s first foundational AI model and the first generative language model built in Northeast India for an indigenous language. This 2.6-billion-parameter bilingual (Khasi–English) LLM is based on Gemma-2-2B and uses a custom tokenizer extended with 2,135 added tokens, which yields 30–36% tokenization efficiency gains for Khasi and Garo. Continued pre-training on a proprietary corpus of 5.43 million rigorously cleaned Khasi sentences is followed by instruction tuning on 33,034 high-quality examples. Key technical fixes eliminate auto-translation, instruction echoing, and infinite generation. Kren-M delivers fluent, task-aware bilingual chat and translation. The model and all checkpoints are publicly released on Hugging Face as MWirelabs/Kren-M. To support Northeast Indian language research, we have also released public Assamese and Mizo corpora that are among the largest available, along with the first public Garo corpus.
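The abstract does not define how the tokenization efficiency gain is measured. One common reading, assumed here, is the relative reduction in total token count when the same text is encoded with the extended tokenizer instead of the base one. The sketch below illustrates that arithmetic only; the function name `token_reduction` and the per-sentence counts are hypothetical and are not taken from the paper.

```python
def token_reduction(base_counts, extended_counts):
    """Return the relative reduction in total tokens as a fraction.

    base_counts[i] and extended_counts[i] are the token counts of the
    same sentence under the base and the extended tokenizer.
    """
    if len(base_counts) != len(extended_counts):
        raise ValueError("both lists must cover the same sentences")
    base_total = sum(base_counts)
    extended_total = sum(extended_counts)
    if base_total <= 0:
        raise ValueError("base token total must be positive")
    return (base_total - extended_total) / base_total


# Made-up token counts for three sentences, for illustration only.
# These are NOT measurements from Kren-M or Gemma-2-2B.
base = [20, 30, 50]
extended = [14, 20, 33]

gain = token_reduction(base, extended)
print(f"token reduction: {gain:.1%}")
```

With real tokenizers, the two lists would come from encoding an identical held-out Khasi or Garo sample with each tokenizer and recording the length of each encoding.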