Scene Text Recognition via Alternating Hierarchical-Global Attention in Encoder-Only Transformers
Abstract
Scene Text Recognition (STR) plays a pivotal role in real-world applications such as autonomous driving, document analysis, and assistive technologies by enabling machines to extract and interpret text from natural images. Despite significant advances in deep learning, existing transformer-based STR models often suffer from high computational costs, large memory requirements, and slow inference due to the autoregressive nature of decoder-based architectures. This paper proposes two model variants, FasterViTSTR and DualFasterViTSTR, to address these challenges. FasterViTSTR is built upon the FasterViT architecture, modified to an encoder-only design for efficient STR; it uses hierarchical attention to capture both local and global features, improving the model's ability to process variable-length sequences and complex spatial layouts. DualFasterViTSTR introduces a novel hybrid attention mechanism that alternates between hierarchical and global attention layers, which not only enhances recognition performance but also improves efficiency. The proposed models are evaluated on multiple STR benchmark datasets, where the FasterViTSTR V2 variant achieves up to a 0.9% improvement in word accuracy over existing baselines, while the DualFasterViTSTR V2 variant delivers up to a 2.11% improvement, along with reduced inference time, FLOPs, and parameter count compared to FasterViTSTR. The implementation is available at github.com/ShashNagendra/STR-via-Alternating-Hierarchical-Global-Attention.
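To make the alternating-attention idea concrete, the sketch below shows an encoder-only stack in which even-indexed layers apply windowed (local, "hierarchical") self-attention and odd-indexed layers apply full global self-attention. This is a minimal illustration of the general pattern, not the authors' implementation: the class names, layer depth, embedding dimension, head count, and window size are all illustrative assumptions, and MLP sublayers are omitted for brevity (see the repository linked above for the actual models).

```python
import torch
import torch.nn as nn

class WindowAttentionBlock(nn.Module):
    """Self-attention restricted to non-overlapping windows of tokens (local)."""
    def __init__(self, dim, heads, window):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, N, C); N must be divisible by window
        B, N, C = x.shape
        w = x.reshape(B * N // self.window, self.window, C)
        h = self.norm(w)
        w = w + self.attn(h, h, h)[0]            # attention only within each window
        return w.reshape(B, N, C)

class GlobalAttentionBlock(nn.Module):
    """Standard full self-attention over the whole token sequence (global)."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h, h, h)[0]

class AlternatingEncoder(nn.Module):
    """Encoder-only stack alternating local and global attention layers."""
    def __init__(self, dim=256, heads=8, depth=6, window=7):  # illustrative hyperparameters
        super().__init__()
        self.layers = nn.ModuleList(
            WindowAttentionBlock(dim, heads, window) if i % 2 == 0
            else GlobalAttentionBlock(dim, heads)
            for i in range(depth)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

tokens = torch.randn(2, 49, 256)                 # e.g. a flattened 7x7 patch grid
print(AlternatingEncoder()(tokens).shape)        # torch.Size([2, 49, 256])
```

The intuition this sketch captures is the trade-off stated in the abstract: windowed layers cost attention only within small groups of tokens (cheap, local detail), while the interleaved global layers restore long-range context across the whole image, so the stack covers both without paying full global attention at every layer.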