Benchmarking OCR and Vision-Language Models for Turkish Text Recognition: A Comprehensive Evaluation Using Synthetic Data

Abstract

Purpose: We present the first systematic benchmark evaluation of Optical Character Recognition (OCR) systems and Vision-Language Models (VLMs) for Turkish text recognition, addressing a critical gap in low-resource language processing. Turkish, with its agglutinative structure and distinctive characters (ç, ğ, ı, İ, ö, ş, ü), poses challenges for models trained primarily on high-resource languages such as English. Methods: We developed a synthetic Turkish dataset of 6,600 images spanning three main text types: printed, handwritten, and scene text. The dataset varies the presence of Turkish-specific characters, word length, the unit of recognition (sentence versus word), and the distortion type (rotation, resolution, noise, and blur). Our evaluation compares three model categories: traditional OCR systems, open-source VLMs, and commercial VLMs. Results: Modern VLMs significantly outperform traditional OCR approaches, with GPT-4o and the Qwen2.5-VL models performing best. Notably, images containing Turkish-specific characters posed significant challenges for all models, and only GPT-4o maintained stable performance on them. This highlights the critical impact of training dataset composition on multilingual performance. While the agglutinative word structure did not significantly affect recognition accuracy, handwritten text recognition remains a persistent challenge across all evaluated systems. Conclusion: The open-source Qwen2.5-VL model achieved performance comparable to the commercial GPT-4o despite having fewer parameters, showing strong potential as a computationally efficient alternative. This benchmark study establishes a standardized evaluation framework for Turkish text recognition research. To support future work in this domain, we publicly release the synthetic dataset, enabling reproducible research in low-resource language text recognition.
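The split between samples with and without Turkish-specific characters can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names are hypothetical, exact-match accuracy is an assumed stand-in for whatever metric the paper uses, and the uppercase forms Ç, Ğ, Ö, Ş, Ü are added to the character set as an assumption (the abstract lists only ç, ğ, ı, İ, ö, ş, ü).

```python
# Characters listed in the abstract, plus uppercase counterparts (an assumption).
TURKISH_CHARS = set("çğıİöşü") | set("ÇĞÖŞÜ")


def has_turkish_chars(text: str) -> bool:
    """Return True if the text contains at least one Turkish-specific character."""
    return any(ch in TURKISH_CHARS for ch in text)


def exact_match_by_group(pairs):
    """Compute exact-match accuracy separately for the two sample groups.

    pairs: iterable of (ground_truth, prediction) strings.
    Grouping is based on the ground truth, not on the model output.
    A group with no samples gets None instead of a score.
    """
    totals = {True: 0, False: 0}
    hits = {True: 0, False: 0}
    for truth, pred in pairs:
        group = has_turkish_chars(truth)
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {
        "with_turkish": hits[True] / totals[True] if totals[True] else None,
        "without_turkish": hits[False] / totals[False] if totals[False] else None,
    }


# Illustrative, made-up samples: one prediction drops the diacritics.
samples = [
    ("çiçek", "cicek"),
    ("ışık", "ışık"),
    ("masa", "masa"),
    ("kalem", "kalem"),
]
scores = exact_match_by_group(samples)
```

Grouping by ground truth keeps a model that strips diacritics from being scored in the wrong group, which is the failure mode the abstract attributes to most of the evaluated systems.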
