SPARTAN – Automated Table Detection and Extraction from Documents using Advanced OpenCV Heuristics and OCR Techniques



Abstract

The rapid growth of born-digital PDF documents has increased the demand for fast, precise tabular data extraction at industrial scale. State-of-the-art deep-learning approaches achieve high accuracy, but at the cost of substantial computational complexity, data-hungry training and black-box opacity, which limits their deployment in the real world. In this paper, we introduce SPARTAN (Structured Parsing and Relevant Table Analysis), a fully open-source, heuristic-based pipeline that delivers high-fidelity table detection and extraction with no model training or GPU requirements. SPARTAN combines lightweight OpenCV image-processing modules (column whitespace analysis, boundary- and text-based region segmentation, and line-segment cell parsing) with a modular OCR layer and optional post-processing hooks for LLM-driven schema mapping. We evaluated SPARTAN on more than 20K pages of PCN-480 (Product Change Notification and Product Discontinuance Notification), scientific papers, certificates and datasheets, and report 0.94 precision, 0.91 recall and 0.93 F1-score, with 96.7% OCR character accuracy, an average processing time of 4.2 seconds per page, and a peak of 1.2 GB RAM on the most demanding PDFs. SPARTAN outperformed Tabula, Deepdoctection, TabbyPDF and EMb-TTBF in accuracy and speed. Its transparent rules handle borderless, nested and merged-cell layouts that often defeat classical heuristics, without incurring the resource cost of end-to-end neural pipelines. SPARTAN's CLI-driven, swap-in-swap-out architecture supports domain tuning, edge deployment and wrapping as a cloud-scalable REST service, making it a practical bridge between brittle rule systems and heavyweight AI for document-understanding pipelines.
This work demonstrates that well-crafted, modernized heuristics, combined with high-quality OCR, can match or even outperform deep-learning approaches while remaining within reach of small and medium enterprises, reopening a path to cost-efficient, explainable PDF table extraction.
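
To make the idea of column whitespace analysis concrete, the following is a minimal sketch of one common way to do it: a vertical projection profile over a binarized page, where wide runs of ink-free pixel columns are treated as column separators. It is not SPARTAN's implementation; the function names (`column_gaps`, `text_columns`), the `min_gap` threshold and the list-of-lists image representation are assumptions made for illustration. In an OpenCV-based pipeline the binary image would typically come from a thresholding step instead.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end) pixel-column indices, end exclusive


def column_gaps(binary: List[List[int]], min_gap: int = 3) -> List[Range]:
    """Find runs of at least `min_gap` pixel columns that contain no ink.

    `binary` is a hypothetical binarized page: rows of 0 (background)
    and 1 (ink). `min_gap` is an illustrative threshold, not a value
    taken from the paper.
    """
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection profile: ink-pixel count per pixel column.
    profile = [sum(row[x] for row in binary) for x in range(width)]

    gaps: List[Range] = []
    start = None  # start index of the current ink-free run, if any
    for x, ink in enumerate(profile):
        if ink == 0 and start is None:
            start = x
        elif ink != 0 and start is not None:
            if x - start >= min_gap:
                gaps.append((start, x))
            start = None
    # Close a run that extends to the right edge of the page.
    if start is not None and width - start >= min_gap:
        gaps.append((start, width))
    return gaps


def text_columns(binary: List[List[int]], min_gap: int = 3) -> List[Range]:
    """Return the pixel-column ranges lying between the wide gaps.

    Gaps narrower than `min_gap` stay inside a text column, so ordinary
    inter-character spacing does not split a column.
    """
    width = len(binary[0]) if binary else 0
    columns: List[Range] = []
    cursor = 0
    for gap_start, gap_end in column_gaps(binary, min_gap):
        if gap_start > cursor:
            columns.append((cursor, gap_start))
        cursor = gap_end
    if cursor < width:
        columns.append((cursor, width))
    return columns


if __name__ == "__main__":
    # Two ink blocks separated by a 4-pixel gap, plus a 2-pixel right margin.
    row = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0]
    page = [row, row]
    print(column_gaps(page))   # [(3, 7)]
    print(text_columns(page))  # [(0, 3), (7, 12)]
```

A fixed `min_gap` is the simplest choice; a practical heuristic would more likely scale it with page resolution or estimated character width.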
