Multi-Scale Encoder-Only Architectures for Enhanced Scene and Handwritten Text Recognition
Abstract
Accurate and efficient recognition of scene and handwritten text is crucial for document analysis and computer vision applications. Traditional architectures often rely on encoder-decoder frameworks, which, despite their strong performance, suffer from high computational costs and inference latency due to the decoder's autoregressive nature. This work introduces encoder-only transformer architectures that eliminate the decoder entirely, enabling faster and more parallelizable computation. We propose four families of encoder-only models—PVTSTR, Twins-PCPVTSTR, Twins-SVTSTR, and VANSTR—by adapting state-of-the-art vision backbones to the domains of Scene Text Recognition (STR) and Handwritten Text Recognition (HTR). These models use progressive shrinking strategies for multi-scale feature extraction, allowing them to efficiently capture both fine-grained character-level features and global word-level semantics. Our models achieve accuracy improvements of up to 4.8%, 0.5%, and 3.6% over the corresponding ViTSTR baselines, with specific variants such as VANSTR-b1 and Twins-SVTSTR-Small providing competitive accuracy at lower computational cost and faster inference speed. Here we show that our encoder-only architectures significantly enhance text recognition performance, demonstrating their potential for a wide range of practical applications. The code for the proposed work is available at doi.org/10.5281/zenodo.15791181.
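The key efficiency claim above—that dropping the autoregressive decoder makes inference parallelizable—can be illustrated with a minimal sketch. This is an illustrative toy, not the paper's code: the vocabulary, the per-position logits, and the greedy decoding routine are all hypothetical stand-ins for the output of an encoder-only model, which produces logits for every character position in a single forward pass.

```python
# Hypothetical toy vocabulary; "[s]" marks end-of-sequence, as in ViTSTR-style setups.
VOCAB = ["[GO]", "[s]", "a", "b", "c"]

def encoder_only_decode(logits_per_position):
    """Greedy decoding for an encoder-only recognizer.

    All per-position logits are available at once (one forward pass),
    so no sequential, autoregressive loop over previous predictions
    is needed: each position is just an independent argmax.
    """
    chars = []
    for logits in logits_per_position:  # every position computed in parallel upstream
        idx = max(range(len(logits)), key=logits.__getitem__)
        if VOCAB[idx] == "[s]":         # stop at the end-of-sequence marker
            break
        chars.append(VOCAB[idx])
    return "".join(chars)

# Toy logits for 4 token positions (in practice, produced by the encoder).
logits = [
    [0.0, 0.1, 2.0, 0.3, 0.2],  # argmax -> "a"
    [0.0, 0.2, 0.1, 1.5, 0.3],  # argmax -> "b"
    [0.0, 0.1, 0.2, 0.3, 2.5],  # argmax -> "c"
    [0.0, 3.0, 0.1, 0.2, 0.3],  # argmax -> "[s]", terminates the string
]
print(encoder_only_decode(logits))  # abc
```

By contrast, an autoregressive decoder would need one forward pass per character, feeding each prediction back as input—the sequential dependency the encoder-only design removes.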