Post-OCR Correction Using Large Language Models with Constrained Decoding

Abstract

This article addresses the problem of correcting noisy Optical Character Recognition (OCR) output from digitized historical documents, specifically those from the Berrutti Archive related to Uruguay’s civic-military dictatorship. These documents, produced on typewriters with diverse layouts and overlaid annotations, pose significant challenges for standard OCR tools, resulting in highly error-prone text. We present a novel post-OCR correction method that leverages fine-tuned open-source Large Language Models (LLMs) combined with a constrained decoding strategy. This strategy incorporates character-level similarity between the OCR input and the generated output at decoding time, steering the model toward corrections that closely preserve the original text structure. We evaluate our method on a gold-standard dataset of over 2,000 annotated lines and show that it outperforms prompting and standard fine-tuning approaches, reducing both the character error rate (CER) and the word error rate (WER). The corrected outputs provide more accurate input for downstream tasks such as named entity recognition, relation and event extraction, and knowledge graph construction, thereby supporting the broader goal of extracting knowledge from historically significant and sensitive archives.
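To make the similarity-guided decoding idea concrete, the sketch below shows one plausible way to bias candidate scores toward outputs that stay character-close to the OCR input. It is an illustrative simplification, not the authors' implementation: the function names, the alpha weight, the use of difflib, and the word-level (rather than subword-token) candidates are all assumptions.

```python
# Hedged sketch of similarity-constrained rescoring during decoding.
# Assumption: candidates map a possible next word to its log-probability
# from the language model; the real method operates on LLM subword tokens.
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Ratio of matching characters between two strings (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def rescore_candidates(ocr_line: str,
                       prefix: str,
                       candidates: dict[str, float],
                       alpha: float = 2.0) -> dict[str, float]:
    """Add a character-similarity bonus to each candidate's log-probability.

    ocr_line   -- noisy OCR text for the current line
    prefix     -- text generated so far
    candidates -- candidate continuation -> model log-probability
    alpha      -- weight of the similarity term (hypothetical value)
    """
    rescored = {}
    for candidate, logprob in candidates.items():
        hypothesis = prefix + candidate
        # Compare the hypothesis against the OCR prefix of the same length,
        # so the bonus rewards staying close to the original characters.
        reference = ocr_line[:len(hypothesis)]
        rescored[candidate] = logprob + alpha * char_similarity(hypothesis, reference)
    return rescored


# Toy usage: both candidates are fluent, but the similarity bonus favors the
# one that is character-closer to the noisy OCR line "Tbe c0mmittee ...".
ocr_line = "Tbe c0mmittee met in secret."
candidates = {"committee": -1.2, "commission": -0.9}
print(rescore_candidates(ocr_line, "The ", candidates))
```

The design intent, as described in the abstract, is that the similarity term keeps corrections structurally faithful to the OCR input while the language model handles fluency; the specific scoring formula above is only one way such a constraint could be combined with model probabilities.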
