SinFUND and SinOCR: Benchmarks for Sinhala Handwritten OCR and Template-Free Form Understanding
Abstract
We introduce SinFUND and SinOCR, two novel datasets with benchmarks designed to enhance Optical Character Recognition (OCR) and template-free form understanding for Sinhala, a low-resource language spoken by approximately 16 million people in Sri Lanka. SinFUND is the first fully annotated dataset of 100 diverse, manually filled Sinhala forms, marking a significant advancement in form understanding research for Sinhala. SinOCR is the most extensive dataset for Sinhala OCR available to date, consisting of 100,000 images, including 1,135 handwritten texts and texts printed in 200 different Sinhala fonts, representing a broad range of writing styles.

This study outlines the development and annotation processes that ensure the high quality and practical applicability of these datasets in OCR and form understanding tasks. We also introduce novel baseline models optimized for these datasets that demonstrate state-of-the-art performance in recognizing and interpreting Sinhala text, whether printed or handwritten. These models combine a language-independent document understanding framework with a multilingual language model, enhancing the efficiency and accuracy of processing multilingual forms.

The benchmarks and datasets presented have the potential to significantly support digital transformation in Sri Lanka by improving administrative efficiency and document accessibility in various sectors. Furthermore, our framework is scalable and adaptable, offering potential applications for similar OCR and form understanding tasks in other low-resource languages.