SPARTAN – Automated Table Detection and Extraction from Documents using Advanced OpenCV Heuristics and OCR Techniques

Shlok Nandurbarkar
Archana Chaudhari
Rahesha Mulla


Abstract

The rapid growth of born-digital PDF documents has increased the demand for fast, precise extraction of tabular data at industrial scale. State-of-the-art deep-learning approaches achieve high accuracy, but at the cost of substantial computational complexity, data-hungry training and black-box opacity, which limits their real-world deployment. In this paper, we introduce SPARTAN (Structured Parsing and Relevant Table Analysis), a fully open-source, heuristic-based pipeline for high-fidelity table detection and extraction that requires no model training and no GPU. SPARTAN combines lightweight OpenCV image-processing modules (column whitespace analysis, boundary- and text-based region segmentation, and line-segment cell parsing) with a modular OCR layer and optional post-processing hooks for LLM-driven schema mapping. We evaluated SPARTAN on more than 20K pages of PCN-480 (Product Change Notification and Product Discontinuance Notification), scientific papers, certificates and datasheets. It achieved 0.94 precision, 0.91 recall and 0.93 F1-score, with 96.7% OCR character accuracy, an average processing time of 4.2 seconds per page, and a peak RAM usage of 1.2 GB on the most demanding PDFs. SPARTAN outperformed Tabula, Deepdoctection, TabbyPDF and EMb-TTBF in both accuracy and speed. Its transparent rules handle borderless, nested and merged-cell layouts that typically defeat classical heuristics, without incurring the resource cost of end-to-end neural pipelines. SPARTAN’s CLI-driven, swappable-module architecture supports domain tuning, edge deployment and wrapping as a cloud-scalable REST service, making it a practical bridge between brittle rule systems and heavyweight AI for document-understanding pipelines.
The work shows that well-crafted, modernized heuristics, combined with high-quality OCR, can match or even outperform deep-learning approaches while remaining within reach of small and medium enterprises, reopening a critical path to cost-efficient, explainable PDF table extraction.
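
To make the idea of column whitespace analysis concrete, here is a minimal sketch of one common way to do it: a vertical projection profile over a binarized page. It is not taken from the SPARTAN implementation. The function name `find_column_gaps`, the `min_gap` parameter and the toy page are illustrative assumptions, and NumPy stands in for the OpenCV routines the abstract mentions.

```python
import numpy as np

def find_column_gaps(binary, min_gap=3):
    """Return (start, end) x-ranges of blank vertical bands between inked columns.

    binary  : 2-D array, nonzero where a pixel contains ink (hypothetical input format).
    min_gap : minimum width in pixels for a blank band to count as a separator
              (an illustrative threshold, not a value from the paper).
    """
    # Vertical projection profile: number of ink pixels in every pixel column.
    ink_per_column = (np.asarray(binary) != 0).sum(axis=0)
    blank = ink_per_column == 0

    inked = np.flatnonzero(~blank)
    if inked.size == 0:
        return []  # empty page, nothing to separate

    # Only look between the first and last inked column, so page margins are ignored.
    left, right = int(inked[0]), int(inked[-1])

    gaps, start = [], None
    for x in range(left, right + 1):
        if blank[x]:
            if start is None:
                start = x  # a blank run begins here
        elif start is not None:
            if x - start >= min_gap:
                gaps.append((start, x))  # end index is exclusive
            start = None
    return gaps

# Toy "page": three inked vertical bands separated by whitespace.
page = np.zeros((6, 20), dtype=np.uint8)
page[:, 2:6] = 1
page[:, 10:13] = 1
page[:, 16:19] = 1
gaps = find_column_gaps(page)
print(gaps)
```

On a real page the binary image would typically come from thresholding a rendered PDF page, and the gap positions would serve as candidate column boundaries for later cell parsing.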

Version published to 10.21203/rs.3.rs-6644838/v1 on Research Square
Jul 18, 2025

Related articles

Data Rescue of Historical Tables Through Semi-Supervised Table Structure Recognition

This article has 2 authors:
1. Loitongbam Gyanendro Singh
2. Stuart E. Middleton
This article has no evaluations. Latest version Jul 17, 2025

Image Detection and Data extraction Using Hybrid Deep Learning Techniques

This article has 4 authors:
1. R V Raghavendra Rao
2. Ch. Ram Mohan Reddy
3. Vishruth AC
4. Prajwal P K
This article has no evaluations. Latest version Jul 24, 2025

Post-OCR Correction Using Large Language Models with Constrained Decoding

This article has 5 authors:
1. Ignacio Sastre
2. Lorena Etcheverry
3. Guillermo Rey
4. Guillermo Moncecchi
5. Aiala Rosá
This article has no evaluations. Latest version Jul 15, 2025
