A Methodological Framework for the Use of AI Tools in Automated Workflows for Generating Validated and Structured Historical Datasets

Abstract

The growing availability of digitized historical collections has enabled large-scale computational research; however, transforming heterogeneous and noisy textual data into structured and analyzable formats remains a major challenge within the Digital Humanities. This article presents a reproducible workflow for historical text processing that integrates Optical Character Recognition (OCR), Large Language Model (LLM)-assisted post-correction, and Named Entity Recognition (NER) into a unified pipeline. Implemented within the n8n automation framework, the workflow emphasizes transparency, modularity, and human-in-the-loop validation, enabling scholars to maintain interpretive control over data transformation. The pipeline is evaluated on five historical corpora in Spanish (18th–19th centuries), demonstrating substantial reductions in Character and Word Error Rates (up to 93.5% and 66.8%, respectively) using lightweight, open-weight LLMs (Gemma 3 27B, Qwen 2.5 32B, LLaMA 3 70B). NER performance using FLAIR achieves F1 scores above 0.96 for persons and organizations, and a complementary semantic-level similarity evaluation based on CoNES assesses distributed lexical recovery. Beyond reporting benchmarks, the study reflects on the epistemic implications of automated processing in historical research and argues that reproducible data pipelines are essential infrastructures for scaling relational analyses such as co-entity networks and computational historiography. All results contribute toward a transparent methodological model that bridges humanistic inquiry and computational automation while preserving scholarly traceability.
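The abstract's headline OCR metrics are the Character Error Rate (CER) and Word Error Rate (WER). As a point of reference only (the article's own evaluation code is not reproduced here), the sketch below shows the conventional definition of both measures: the Levenshtein edit distance between a hypothesis transcription and a gold reference, normalized by the reference length. The example strings are illustrative and not drawn from the article's corpora.

```python
# Minimal sketch of the standard CER/WER definitions: edit distance
# between hypothesis and reference, normalized by reference length.

def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: char-level edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)

# Hypothetical example: a noisy OCR line against its gold transcription.
gold = "los vecinos de la villa"
ocr_raw = "l0s vec1nos dc 1a vi11a"
print(f"CER: {cer(gold, ocr_raw):.3f}")  # 6 char substitutions / 23 chars
print(f"WER: {wer(gold, ocr_raw):.3f}")  # 5 word substitutions / 5 words
```

A reported reduction of, say, 93.5% then means the corrected text's CER is 6.5% of the raw OCR output's CER on the same reference.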
