NE-OCR: Unified Optical Character Recognition for 10 Languages of Northeast India
Abstract
We present NE-OCR, a unified optical character recognition model for 10 Northeast Indian languages (represented across 12 language-script pairs spanning 4 scripts), along with Hindi and English as anchor languages. NE-OCR is built on a Vision Transformer backbone (ViTSTR-Base, 86M parameters) and trained on approximately 1.34 million text-image pairs constructed from native-language corpora. On a held-out benchmark of 24,000 test samples (2,000 per language-script pair), NE-OCR achieves a mean Character Accuracy (ChA) of 94.99%, peaking at 98.85% on Khasi, with an inference latency of 17.2 ms per image on an A40 GPU, the fastest among all evaluated systems. We benchmark against four baseline systems: EasyOCR, Tesseract 5, TrOCR-large-printed, and Chandra. NE-OCR outperforms all baselines across 9 Northeast Indian language-script pairs and performs competitively on the English and Hindi anchor languages. We additionally present a qualitative analysis of DeepSeek OCR 2 and Chandra as representatives of the vision-language model (VLM) paradigm, showing that VLMs fail on unseen scripts by hallucinating document structure rather than producing recognition errors. Model weights are publicly available under CC-BY-4.0.