Literally Reading behind the Lines: A benchmark for OCR on Cluttered Printed Documents

Abstract

Document clutter is an under-explored problem in stored documents. It arises from accidental spilling or smudging of liquids such as sauces, inks, or tea on modern documents, or is naturally present in historical and legal documents. Such clutter leads to information loss during Optical Character Recognition (OCR) because cluttered letters or words become unreadable. In this paper, we introduce ClutterOCRBench, a dataset of 1080 document images with and without clutter, created through a three-step process that yields 100% correct ground truth despite the unreadability of some content. In the first step, we print the 1080 pages, covering 12 domains, and scan the printed pages directly. In the second step, we manually add 10 different types of clutter, such as paint, coffee, and mud, to the printed pages at five levels of degradation. The cluttered pages are scanned in the same orientation as in the first step, which ensures that the sentence-level boxes in the clean images are aligned with those in the cluttered images. In the third step, we manually transcribe the text in the clean documents and reuse the transcriptions as ground truth for the aligned cluttered documents. We provide a comprehensive comparison of the latest OCR and Vision Language Models for text extraction from cluttered documents. After fine-tuning on the proposed dataset, the best models achieve a 14% reduction in character error rate (CER) and a 7% reduction in word error rate (WER) on the ClutterOCRBench test set.
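The abstract reports results in terms of CER and WER. For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed as edit distance normalized by reference length, at the character and word level respectively; this is an illustrative implementation, not the authors' evaluation code, and the example strings are hypothetical.

```python
# Illustrative sketch of CER/WER (not the paper's evaluation code):
# edit distance between OCR output and ground truth, normalized by
# the length of the reference, over characters (CER) or words (WER).

def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return levenshtein(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)

# Hypothetical example: OCR output where clutter corrupted two words.
print(cer("the quick brown fox", "the qu1ck brwn fox"))  # ~0.105
print(wer("the quick brown fox", "the qu1ck brwn fox"))  # 0.5
```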
