DARE: A large-scale handwritten DAte REcognition system

Abstract

Handwritten text recognition for historical documents is an important task, but it remains challenging because training data are scarce, writing styles vary widely, and historical documents are often degraded. For recognizing handwritten dates, we propose a model based on the EfficientNetV2 architecture. The model is characterized by fast training, robustness to parameter choices, and accurate transcription of handwritten dates from various sources. For training, we build and introduce a database containing nearly 10 million tokens derived from over 2.2 million images of handwritten dates, extracted and segmented from diverse historical documents. Because dates are among the most prevalent pieces of information in historical documents, and millions of such documents exist in historical archives, efficient automated transcription of dates has the potential to yield substantial cost savings compared to manual transcription. We demonstrate that training on handwritten text with substantial variability in writing styles yields robust models for recognizing general handwritten text, and that transfer learning from the DARE system increases transcription accuracy substantially, making high accuracy attainable even with relatively small training samples on entirely new types of documents. The DARE database is freely available at https://www.kaggle.com/datasets/sdusimonwittrock/dare-database. Code will be made available at https://github.com/TorbenSDJohansen/DARE.
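To make the notions of date "tokens" and transcription accuracy concrete, the following is a minimal sketch, not the DARE implementation. It assumes a hypothetical scheme in which each date is encoded as eight digit tokens in DDMMYYYY order; the tokenization actually used for the DARE database may differ. The function names `encode_date`, `decode_date`, `token_accuracy`, and `sequence_accuracy` are illustrative only.

```python
from datetime import date


def encode_date(d: date) -> list[int]:
    """Encode a date as 8 digit tokens in DDMMYYYY order (assumed scheme)."""
    text = f"{d.day:02d}{d.month:02d}{d.year:04d}"
    return [int(ch) for ch in text]


def decode_date(tokens: list[int]) -> date:
    """Invert encode_date; raises ValueError if the tokens are not a valid date."""
    text = "".join(str(t) for t in tokens)
    return date(int(text[4:8]), int(text[2:4]), int(text[0:2]))


def token_accuracy(preds: list[list[int]], truths: list[list[int]]) -> float:
    """Share of individual tokens that are predicted correctly."""
    correct = sum(p == t for ps, ts in zip(preds, truths) for p, t in zip(ps, ts))
    total = sum(len(ts) for ts in truths)
    return correct / total


def sequence_accuracy(preds: list[list[int]], truths: list[list[int]]) -> float:
    """Share of dates for which every token is predicted correctly."""
    return sum(ps == ts for ps, ts in zip(preds, truths)) / len(truths)


# Two ground-truth dates and hypothetical model outputs; the second
# prediction gets the last digit of the year wrong.
truths = [encode_date(date(1887, 3, 14)), encode_date(date(1901, 12, 1))]
preds = [encode_date(date(1887, 3, 14)), encode_date(date(1900, 12, 1))]
```

Under this assumed scheme a single wrong digit makes the whole date count as incorrect at the sequence level while barely affecting token-level accuracy, so the two measures can diverge noticeably.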
