Data rescue of historical tables through semi-supervised table structure recognition

Loitongbam Gyanendro Singh
Stuart E. Middleton

Abstract

This study uses a novel semi-supervised learning framework to explore Tabular Structure Recognition (TSR) for digitizing historical documents, specifically employing the CascadeTabNet model. TSR is crucial for transforming archival tabular data into digital formats, enhancing accessibility and analysis across various research fields. Challenges such as physical degradation, inconsistent lighting, and non-standard handwriting hinder the generation of the high-quality annotations of historical documents needed for effective model training. To address these issues, this research explores two questions: (i) Can a semi-supervised training approach reduce the need for expensive data annotations? and (ii) Does semi-supervised training improve model robustness? We applied our methodology across three datasets: the GloSAT and ICDAR-2019 datasets, which are based on historical documents, and the PubTabNet dataset, which consists predominantly of modern documents. Our results indicate that semi-supervised learning substantially increases TSR accuracy and decreases dependency on extensive labelled datasets, providing a robust solution for large-scale digitization initiatives and contributing to the preservation and improved accessibility of historical data. All code from this paper is freely available on GitHub (https://github.com/stuartemiddleton/glosat_table_dataset).

Version published to 10.1007/s10032-025-00562-6
Dec 1, 2025
Version published to 10.21203/rs.3.rs-5842111/v1 on Research Square
Jul 17, 2025

DARE: A large-scale handwritten DAte REcognition system

This article has 5 authors:
1. Christian Møller Dahl
2. Torben Skov Dyg Johansen
3. Emil Nørmark Sørensen
4. Christian Emil Westermann
5. Simon Friis Wittrock
This article has no evaluations
Latest version Dec 18, 2025
APAU-Net: Adaptive Prior-Aware U-Net Text-Line Segmentation for Historical Documents

This article has 4 authors:
1. Mohamed Amine Beghoura
2. Abdelouahab Attia
3. Abderraouf Bouziane
4. M. Hassaballah
This article has no evaluations
Latest version Dec 15, 2025
Understanding the Impact of Dataset Characteristics on RAG-based Multi-hop QA Performance

This article has 3 authors:
1. Nimet Aksoy
2. Zekeriya Anıl Güven
3. Murat Osman Ünalır
This article has no evaluations
Latest version Dec 12, 2025
