Understanding Transformer-Based OCR for Medieval Manuscripts: A Systematic Ablation Study and Inspection Analysis


Abstract

Adapting Transformer-based Optical Character Recognition (TrOCR) models to medieval manuscripts involves a significant domain gap. This work provides a systematic investigation of TrOCR fine-tuning strategies using a 14th-15th century Italian manuscript. We conduct controlled ablation studies on preprocessing, data augmentation, and encoder layer freezing. Results demonstrate that full fine-tuning of all encoder layers is critical, achieving an 11.68% Character Error Rate (CER). We also show that Contrast-Limited Adaptive Histogram Equalization (CLAHE) preprocessing yields a 12.9% relative CER reduction. Our hyperparameter configuration generalized effectively, achieving 7.68% CER on the public READ-16 benchmark. As a key contribution, we perform a quantitative analysis of model localization maps. We establish that encoder-based Grad-CAM entropy and Gini impurity are much stronger correlates of token prediction loss than decoder cross-attention, and we propose these metrics as a robust diagnostic for visual uncertainty. This finding has direct applications for uncertainty sampling in active learning and pseudo-label filtering in semi-supervised learning workflows. This study offers both practical guidelines for adapting TrOCR and a novel method for interpreting model adaptation.
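The entropy and Gini impurity metrics mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the Grad-CAM localization map arrives as a non-negative 2D array, which is flattened and normalized into a probability distribution. A flat (spatially diffuse) map yields high entropy and high Gini impurity, signaling visual uncertainty; a peaked map yields low values.

```python
import numpy as np

def localization_uncertainty(cam: np.ndarray) -> tuple[float, float]:
    """Entropy and Gini impurity of a localization map (hypothetical helper).

    The map is clipped to be non-negative and normalized to sum to 1,
    then treated as a probability distribution over spatial positions.
    """
    p = np.clip(cam, 0.0, None).ravel().astype(float)
    total = p.sum()
    if total == 0.0:
        # Degenerate all-zero map: treat as maximally uncertain (uniform).
        p = np.full(p.size, 1.0 / p.size)
    else:
        p = p / total
    nz = p[p > 0]                      # avoid log(0)
    entropy = float(-(nz * np.log(nz)).sum())
    gini = float(1.0 - (p ** 2).sum())
    return entropy, gini

# A peaked map (confident localization) vs. a uniform map (uncertain).
peaked = np.zeros((8, 8))
peaked[3, 4] = 1.0
uniform = np.ones((8, 8))

e_peak, g_peak = localization_uncertainty(peaked)
e_flat, g_flat = localization_uncertainty(uniform)
assert e_peak < e_flat and g_peak < g_flat
```

In an active-learning or pseudo-label-filtering loop, such scores could be computed per token and thresholded to select samples for annotation or to discard low-confidence pseudo-labels, under the assumption (supported by the abstract's correlation finding) that diffuse encoder attention tracks prediction loss.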
