NE-OCR: Unified Optical Character Recognition for 10 Languages of Northeast India

Abstract

We present NE-OCR, a unified optical character recognition model for 10 Northeast Indian languages, represented across 12 language-script pairs spanning 4 scripts, along with Hindi and English as anchor languages. NE-OCR is built on a Vision Transformer backbone (ViTSTR-Base, 86M parameters) and trained on approximately 1.34 million text-image pairs constructed from native-language corpora. On a held-out benchmark of 24,000 test samples (2,000 per language-script pair), NE-OCR achieves a mean Character Accuracy (ChA) of 94.99%, peaking at 98.85% on Khasi, with an inference latency of 17.2 ms per image on an A40 GPU, the fastest among all evaluated systems. We benchmark against four baseline systems: EasyOCR, Tesseract 5, TrOCR-large-printed, and Chandra. NE-OCR outperforms all baselines on 9 Northeast Indian language-script pairs and performs competitively on the English and Hindi anchor languages. We additionally present a qualitative analysis of DeepSeek OCR 2 and Chandra as representatives of the vision-language model (VLM) paradigm, showing that on unseen scripts VLMs fail by hallucinating document structure rather than by producing ordinary recognition errors. Model weights are publicly available under CC-BY-4.0.
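The abstract reports Character Accuracy (ChA) without giving its formula. A minimal sketch follows, assuming the common convention ChA = 100 × (1 − edit distance / reference length), clipped at zero; this definition and the function names `levenshtein`, `character_accuracy`, and `mean_character_accuracy` are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(pred: str, ref: str) -> float:
    """ChA in percent, ASSUMED to be 100 * (1 - CER), clipped at 0."""
    if not ref:
        return 100.0 if not pred else 0.0
    cer = levenshtein(pred, ref) / len(ref)
    return max(0.0, 1.0 - cer) * 100.0


def mean_character_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Unweighted mean of per-sample ChA over (prediction, reference) pairs."""
    scores = [character_accuracy(p, r) for p, r in pairs]
    return sum(scores) / len(scores)
```

Python strings iterate over Unicode code points, so this sketch counts errors per code point. For scripts with combining marks, a grapheme-cluster-based count may give different numbers; which unit the authors use is not stated in the abstract.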
