Multi-Scale Encoder-Only Architectures for Enhanced Scene and Handwritten Text Recognition
Abstract
Accurate and efficient recognition of scene and handwritten text is crucial for document analysis and computer vision applications. Traditional architectures often rely on encoder-decoder frameworks, which, despite their strong performance, suffer from high computational costs and inference latency due to the decoder's autoregressive nature. This work introduces encoder-only transformer architectures that eliminate the decoder entirely, enabling faster and more parallelizable computation. We propose four families of encoder-only models—PVTSTR, Twins-PCPVTSTR, Twins-SVTSTR, and VANSTR—by adapting state-of-the-art vision backbones to the domains of Scene Text Recognition (STR) and Handwritten Text Recognition (HTR). These models use progressive shrinking strategies for multi-scale feature extraction, allowing them to efficiently capture both fine-grained character-level features and global word-level semantics. Our models achieve accuracy improvements of up to 4.8%, 0.5%, and 3.6% over the corresponding ViTSTR baselines, with specific variants such as VANSTR-b1 and Twins-SVTSTR-Small providing competitive accuracy at lower computational cost and faster inference speed. Here we show that our encoder-only architectures significantly enhance text recognition performance, demonstrating their potential for a wide range of practical applications. The code for the proposed work is available at doi.org/10.5281/zenodo.15791181.
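The key efficiency claim above—that dropping the autoregressive decoder makes inference parallelizable—can be illustrated with a minimal sketch. This is an illustrative toy, not the paper's code: the vocabulary, the per-position logits, and the greedy decoding routine are all hypothetical stand-ins for the output of an encoder-only model, which produces logits for every character position in a single forward pass.

```python
# Hypothetical toy vocabulary; "[s]" marks end-of-sequence, as in ViTSTR-style setups.
VOCAB = ["[GO]", "[s]", "a", "b", "c"]

def encoder_only_decode(logits_per_position):
    """Greedy decoding for an encoder-only recognizer.

    All per-position logits are available at once (one forward pass),
    so no sequential, autoregressive loop over previous predictions
    is needed: each position is just an independent argmax.
    """
    chars = []
    for logits in logits_per_position:  # every position computed in parallel upstream
        idx = max(range(len(logits)), key=logits.__getitem__)
        if VOCAB[idx] == "[s]":         # stop at the end-of-sequence marker
            break
        chars.append(VOCAB[idx])
    return "".join(chars)

# Toy logits for 4 token positions (in practice, produced by the encoder).
logits = [
    [0.0, 0.1, 2.0, 0.3, 0.2],  # argmax -> "a"
    [0.0, 0.2, 0.1, 1.5, 0.3],  # argmax -> "b"
    [0.0, 0.1, 0.2, 0.3, 2.5],  # argmax -> "c"
    [0.0, 3.0, 0.1, 0.2, 0.3],  # argmax -> "[s]", terminates the string
]
print(encoder_only_decode(logits))  # abc
```

By contrast, an autoregressive decoder would need one forward pass per character, feeding each prediction back as input—the sequential dependency the encoder-only design removes.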