Fidel: A Large-Scale Sentence Level Amharic OCR Dataset

Abstract

The Ethiopic script used to write Amharic presents persistent challenges for Optical Character Recognition (OCR) due to its large character vocabulary, diacritics, and high variability in handwriting. Although Amharic is spoken by over 58 million people, progress in OCR has been constrained by the lack of large, diverse, sentence-level datasets. Existing datasets are small, synthetic-only, or limited to character- or word-level annotations, preventing models from capturing the complexity of real documents. We introduce Fidel, the first large-scale Amharic OCR dataset spanning handwritten, typed, and synthetic text. Fidel contains 40k handwritten and 28k typed line images collected from 411 native writers, providing broad coverage of handwriting styles and modern vocabulary. We further formalize our approach as a scalable data acquisition and preprocessing pipeline (deskewing, line extraction, and alignment) designed to guide future dataset creation for low-resource scripts. To complement the real data, we generate high-quality synthetic Amharic text images to support robust model training. Using Fidel, we construct the first comprehensive benchmark for Amharic OCR, evaluating seven deep-learning-based OCR models. These models span CNN, CTC, transformer, and hybrid architectures, enabling a robust assessment of domain transfer and modality-specific performance across handwritten, typed, and synthetic text. The best-performing model trained on Fidel achieves state-of-the-art results, with a character error rate (CER) of 2.64% and a word error rate (WER) of 7.29%, demonstrating the substantial impact of our dataset on advancing practical, high-accuracy Amharic OCR.
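
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how CER and WER are commonly computed: the total Levenshtein edit distance between predictions and references, divided by the total reference length, counted in characters for CER and in whitespace-separated words for WER. The abstract does not state the exact evaluation protocol, so the corpus-level aggregation, the absence of text normalization, and the function names `edit_distance`, `cer`, and `wer` are assumptions made for illustration, not the authors' implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Character error rate over a corpus (assumed corpus-level aggregation)."""
    errors = sum(edit_distance(list(r), list(h))
                 for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


def wer(references, hypotheses):
    """Word error rate over a corpus; words are split on whitespace."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return errors / total


if __name__ == "__main__":
    # Hypothetical example pair: one character differs in the second word.
    refs = ["ሰላም ለዓለም"]
    hyps = ["ሰላም ለዓለሙ"]
    print(f"CER = {cer(refs, hyps):.4f}")
    print(f"WER = {wer(refs, hyps):.4f}")
```

Because a single wrong character makes the whole word count as an error, WER is typically higher than CER on the same output, which matches the ordering of the reported 2.64% and 7.29%.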
