Scene Text Recognition via Alternating Hierarchical-Global Attention in Encoder-Only Transformers
Abstract
Scene Text Recognition (STR) plays a pivotal role in real-world applications such as autonomous driving, document analysis, and assistive technologies by enabling machines to extract and interpret text from natural images. Despite significant advances in deep learning, existing transformer-based STR models often suffer from high computational costs, large memory requirements, and slow inference due to the autoregressive nature of decoder-based architectures. This paper proposes two model variants, FasterViTSTR and DualFasterViTSTR, to address these challenges. FasterViTSTR is built upon the FasterViT architecture, modified to an encoder-only design for efficient STR; it uses hierarchical attention to capture both local and global features, improving the model's ability to process variable-length sequences and complex spatial layouts. DualFasterViTSTR introduces a novel hybrid attention mechanism that alternates between hierarchical and global attention layers, which not only enhances recognition performance but also improves efficiency. The proposed models are evaluated on multiple STR benchmark datasets, where the FasterViTSTR V2 variant achieves up to a 0.9% improvement in word accuracy over existing baselines, while the DualFasterViTSTR V2 variant delivers up to a 2.11% improvement, along with reduced inference time, FLOPs, and parameter count compared to FasterViTSTR. The implementation is available at github.com/ShashNagendra/STR-via-Alternating-Hierarchical-Global-Attention.
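To make the alternating-attention idea concrete, the sketch below shows an encoder-only stack in which even-indexed layers apply windowed (local, "hierarchical") self-attention and odd-indexed layers apply full global self-attention. This is a minimal illustration of the general pattern, not the authors' implementation: the class names, layer depth, embedding dimension, head count, and window size are all illustrative assumptions, and MLP sublayers are omitted for brevity (see the repository linked above for the actual models).

```python
import torch
import torch.nn as nn

class WindowAttentionBlock(nn.Module):
    """Self-attention restricted to non-overlapping windows of tokens (local)."""
    def __init__(self, dim, heads, window):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, N, C); N must be divisible by window
        B, N, C = x.shape
        w = x.reshape(B * N // self.window, self.window, C)
        h = self.norm(w)
        w = w + self.attn(h, h, h)[0]            # attention only within each window
        return w.reshape(B, N, C)

class GlobalAttentionBlock(nn.Module):
    """Standard full self-attention over the whole token sequence (global)."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h, h, h)[0]

class AlternatingEncoder(nn.Module):
    """Encoder-only stack alternating local and global attention layers."""
    def __init__(self, dim=256, heads=8, depth=6, window=7):  # illustrative hyperparameters
        super().__init__()
        self.layers = nn.ModuleList(
            WindowAttentionBlock(dim, heads, window) if i % 2 == 0
            else GlobalAttentionBlock(dim, heads)
            for i in range(depth)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

tokens = torch.randn(2, 49, 256)                 # e.g. a flattened 7x7 patch grid
print(AlternatingEncoder()(tokens).shape)        # torch.Size([2, 49, 256])
```

The intuition this sketch captures is the trade-off stated in the abstract: windowed layers cost attention only within small groups of tokens (cheap, local detail), while the interleaved global layers restore long-range context across the whole image, so the stack covers both without paying full global attention at every layer.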