Enhancing Hindi OCR Accuracy Using Color-Driven Sub-Character Segmentation
Abstract
Hindi script contains numerous sub-characters whose combinations form complex conjuncts, with shapes that differ substantially from those of their individual components. Conventional OCR models trained on grayscale images often fail to capture this internal structure, leading to persistent errors on conjunct characters. We introduce a two-stage OCR framework designed to make this structure explicit: (a) a segmentation model, trained entirely on synthetic data, that predicts color-coded sub-character layouts from grayscale text; and (b) an OCR model, trained on the same synthetic colored images, that reads text directly from the colorized outputs predicted by the first stage. This approach provides explicit sub-character cues without requiring any manually colored real data. The proposed framework achieves 98.85% character accuracy and 96.12% word accuracy on the Mozhi benchmark. On a challenging unseen dataset of 1,000 words containing 58.1% conjunct characters, it attains 96.62% character accuracy, a 16.6% improvement over a fine-tuned grayscale baseline. These results demonstrate that incorporating sub-character structure through color-coded supervision substantially improves robustness in Hindi OCR.
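The abstract reports character and word accuracy without defining them. The sketch below shows one common way such figures are computed, and it is an assumption rather than the authors' evaluation code: character accuracy is taken as one minus the total Levenshtein distance divided by the total reference length, counted over Unicode code points, and word accuracy as the fraction of exact matches. The function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(refs: list[str], hyps: list[str]) -> float:
    """Assumed definition: 1 - (total edit distance / total reference length)."""
    total_chars = sum(len(r) for r in refs)
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 1.0 - total_errors / total_chars


def word_accuracy(refs: list[str], hyps: list[str]) -> float:
    """Assumed definition: fraction of predictions that match the reference exactly."""
    correct = sum(r == h for r, h in zip(refs, hyps))
    return correct / len(refs)


if __name__ == "__main__":
    # A dropped virama (U+094D) breaks the conjunct in the prediction,
    # which is the kind of error the abstract attributes to grayscale models.
    references = ["नमस्ते"]
    predictions = ["नमसते"]
    print(character_accuracy(references, predictions))
    print(word_accuracy(references, predictions))
```

Counting code points means a single missing virama costs one edit even though it changes the whole rendered conjunct; an evaluation based on grapheme clusters would weight the same error differently, so reported numbers depend on this choice.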