Post-OCR Correction Using Large Language Models with Constrained Decoding

Abstract

This article addresses the problem of correcting noisy Optical Character Recognition (OCR) output from digitized historical documents, specifically those from the Berrutti Archive related to Uruguay’s civic-military dictatorship. These documents, produced on typewriters with diverse layouts and overlaid annotations, pose significant challenges for standard OCR tools, resulting in highly error-prone text. We present a novel post-OCR correction method that leverages fine-tuned open-source Large Language Models (LLMs) combined with a constrained decoding strategy. This strategy incorporates character-level similarity between the OCR input and the generated output at decoding time, steering the model toward corrections that closely preserve the original text structure. We evaluate our method on a gold-standard dataset of over 2,000 annotated lines and show that it outperforms prompting and standard fine-tuning approaches, reducing both the character error rate (CER) and the word error rate (WER). The corrected outputs provide more accurate input for downstream tasks such as named entity recognition, relation and event extraction, and knowledge graph construction, thereby supporting the broader goal of extracting knowledge from historically significant and sensitive archives.
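To make the similarity-guided decoding idea concrete, the sketch below shows one plausible way to bias candidate scores toward outputs that stay character-close to the OCR input. It is an illustrative simplification, not the authors' implementation: the function names, the alpha weight, the use of difflib, and the word-level (rather than subword-token) candidates are all assumptions.

```python
# Hedged sketch of similarity-constrained rescoring during decoding.
# Assumption: candidates map a possible next word to its log-probability
# from the language model; the real method operates on LLM subword tokens.
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Ratio of matching characters between two strings (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def rescore_candidates(ocr_line: str,
                       prefix: str,
                       candidates: dict[str, float],
                       alpha: float = 2.0) -> dict[str, float]:
    """Add a character-similarity bonus to each candidate's log-probability.

    ocr_line   -- noisy OCR text for the current line
    prefix     -- text generated so far
    candidates -- candidate continuation -> model log-probability
    alpha      -- weight of the similarity term (hypothetical value)
    """
    rescored = {}
    for candidate, logprob in candidates.items():
        hypothesis = prefix + candidate
        # Compare the hypothesis against the OCR prefix of the same length,
        # so the bonus rewards staying close to the original characters.
        reference = ocr_line[:len(hypothesis)]
        rescored[candidate] = logprob + alpha * char_similarity(hypothesis, reference)
    return rescored


# Toy usage: both candidates are fluent, but the similarity bonus favors the
# one that is character-closer to the noisy OCR line "Tbe c0mmittee ...".
ocr_line = "Tbe c0mmittee met in secret."
candidates = {"committee": -1.2, "commission": -0.9}
print(rescore_candidates(ocr_line, "The ", candidates))
```

The design intent, as described in the abstract, is that the similarity term keeps corrections structurally faithful to the OCR input while the language model handles fluency; the specific scoring formula above is only one way such a constraint could be combined with model probabilities.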
