SPARTAN – Automated Table Detection and Extraction from Documents using Advanced OpenCV Heuristics and OCR Techniques
Abstract
The rapid growth of born-digital PDF documents has amplified the demand for fast, precise tabular data extraction at industrial scale. State-of-the-art deep-learning approaches achieve high accuracy, but at the cost of substantial computational complexity, data-hungry training and black-box opacity, which limits their real-world deployment. In this paper, we introduce SPARTAN (Structured Parsing and Relevant Table Analysis), an entirely open-source, heuristic-based pipeline for high-fidelity table detection and extraction that requires no model training and no GPU. SPARTAN combines lightweight OpenCV image-processing modules (column whitespace analysis, boundary- and text-based region segmentation, and line-segment cell parsing) with a modular OCR layer and optional post-processing hooks for LLM-driven schema mapping. We evaluated SPARTAN on more than 20K pages of PCN-480 (Product Change Notification and Product Discontinuance Notification), scientific papers, certificates and datasheets, and report 0.94 precision, 0.91 recall and 0.93 F1-score, with 96.7% OCR character accuracy, an average processing time of 4.2 seconds per page, and a peak of 1.2 GB RAM on the most demanding PDFs. SPARTAN outperformed Tabula, Deepdoctection, TabbyPDF and EMb-TTBF in both accuracy and speed. Its transparent rules cope with borderless, nested and merged-cell layouts that typically defeat classical heuristics, without incurring the resource cost of end-to-end neural pipelines. SPARTAN’s CLI-driven, swap-in-swap-out architecture supports domain tuning, edge deployment and wrapping as a cloud-scalable REST service, making it a practical bridge between brittle rule systems and heavyweight AI for document-understanding pipelines.
The work demonstrates that well-crafted, modernized heuristics combined with high-quality OCR can match or even outperform deep-learning approaches while remaining within reach of small and medium enterprises, reopening a path to cost-efficient, explainable PDF table extraction.
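To make the idea of column whitespace analysis concrete, the following is a minimal sketch of one common way to do it, a vertical projection profile over a binarized page region. It is not SPARTAN's actual implementation: the function names `column_gaps` and `split_columns` and the `min_gap_width` and `max_ink` parameters are hypothetical, and plain Python lists stand in for the binarized image array that an OpenCV thresholding step would normally produce.

```python
def column_gaps(binary_rows, min_gap_width=3, max_ink=0):
    """Return (start, end) x-ranges (end exclusive) of vertical whitespace gaps.

    binary_rows: list of equal-length rows, where 1 = ink pixel, 0 = background.
    min_gap_width and max_ink are illustrative tuning knobs (assumptions).
    """
    if not binary_rows:
        return []
    width = len(binary_rows[0])
    # Vertical projection profile: number of ink pixels in each pixel column.
    profile = [sum(row[x] for row in binary_rows) for x in range(width)]

    gaps = []
    start = None
    for x, ink in enumerate(profile):
        if ink <= max_ink:
            # Inside a (nearly) empty run; remember where it began.
            if start is None:
                start = x
        elif start is not None:
            # The empty run just ended; keep it only if it is wide enough.
            if x - start >= min_gap_width:
                gaps.append((start, x))
            start = None
    # Handle an empty run that extends to the right edge.
    if start is not None and width - start >= min_gap_width:
        gaps.append((start, width))
    return gaps


def split_columns(binary_rows, min_gap_width=3):
    """Return (start, end) x-ranges of candidate columns between the gaps."""
    if not binary_rows:
        return []
    width = len(binary_rows[0])
    columns = []
    cursor = 0
    for gap_start, gap_end in column_gaps(binary_rows, min_gap_width):
        if gap_start > cursor:
            columns.append((cursor, gap_start))
        cursor = gap_end
    if cursor < width:
        columns.append((cursor, width))
    return columns


# Tiny synthetic "page": two text columns separated by a 4-pixel gap.
page = [
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0],
]
print(split_columns(page))  # [(0, 3), (7, 11)]
```

The one-pixel gap at x = 9 is ignored because it is narrower than `min_gap_width`, which is how a sketch like this avoids splitting on the spaces between characters. A real pipeline would tune that threshold to the page resolution and font size.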