A Methodological Framework for the Use of AI Tools in Automated Workflows for Generating Validated and Structured Historical Datasets

Abstract

The growing availability of digitized historical collections has enabled large-scale computational research; however, transforming heterogeneous and noisy textual data into structured and analyzable formats remains a major challenge within the Digital Humanities. This article presents a reproducible workflow for historical text processing that integrates Optical Character Recognition (OCR), Large Language Model (LLM)-assisted post-correction, and Named Entity Recognition (NER) into a unified pipeline. Implemented within the n8n automation framework, the workflow emphasizes transparency, modularity, and human-in-the-loop validation, enabling scholars to maintain interpretive control over data transformation. The pipeline is evaluated on five historical corpora in Spanish (18th–19th centuries), demonstrating substantial reductions in Character and Word Error Rates (up to 93.5% and 66.8%, respectively) using lightweight, open-weight LLMs (Gemma 3 27B, Qwen 2.5 32B, LLaMA 3 70B). NER performance using FLAIR achieves F1 scores above 0.96 for persons and organizations, and a complementary semantic-level similarity evaluation based on CoNES assesses distributed lexical recovery. Beyond reporting benchmarks, the study reflects on the epistemic implications of automated processing in historical research and argues that reproducible data pipelines are essential infrastructures for scaling relational analyses such as co-entity networks and computational historiography. All results contribute toward a transparent methodological model that bridges humanistic inquiry and computational automation while preserving scholarly traceability.
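The abstract's headline OCR metrics are the Character Error Rate (CER) and Word Error Rate (WER). As a point of reference only (the article's own evaluation code is not reproduced here), the sketch below shows the conventional definition of both measures: the Levenshtein edit distance between a hypothesis transcription and a gold reference, normalized by the reference length. The example strings are illustrative and not drawn from the article's corpora.

```python
# Minimal sketch of the standard CER/WER definitions: edit distance
# between hypothesis and reference, normalized by reference length.

def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: char-level edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)

# Hypothetical example: a noisy OCR line against its gold transcription.
gold = "los vecinos de la villa"
ocr_raw = "l0s vec1nos dc 1a vi11a"
print(f"CER: {cer(gold, ocr_raw):.3f}")  # 6 char substitutions / 23 chars
print(f"WER: {wer(gold, ocr_raw):.3f}")  # 5 word substitutions / 5 words
```

A reported reduction of, say, 93.5% then means the corrected text's CER is 6.5% of the raw OCR output's CER on the same reference.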
