Single-Character-Based Embedding Feature Aggregation Using Cross-Attention for Scene Text Super-Resolution
Abstract
In textual vision scenarios, super-resolution aims to enhance text quality and readability to facilitate downstream tasks. However, the ambiguity of character regions in complex backgrounds remains difficult to mitigate, particularly the interference between tightly connected characters. In this paper, we propose single-character-based embedding feature aggregation using cross-attention for scene text super-resolution (SCE-STISR) to address this problem. First, a dynamic feature extraction mechanism adaptively captures shallow features by adjusting multi-scale feature weights according to their spatial representations. During text–image interaction, a dual-level cross-attention mechanism aggregates the cropped single-character features with the textual prior while aligning the semantic sequence with the visual features. Finally, an adaptive normalized color correction operation mitigates the color distortion caused by background interference. On the TextZoom benchmark, the text recognition accuracies with three different recognizers are 53.6%, 60.9%, and 64.5%, improvements of 0.9–1.4% over the baseline TATT, together with the best SSIM of 0.7951 and a PSNR of 21.84 dB. Our approach also improves recognition accuracy by 0.2–2.2% over existing baselines on five text recognition datasets, validating the effectiveness of the model.
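To make the text–image interaction concrete, the sketch below illustrates the general idea of cross-attention in which visual features of cropped character regions query a textual-prior sequence. It is a minimal illustration, not the authors' implementation: the module name, dimensions, single-head design, and residual aggregation are assumptions made only for clarity.

```python
# Minimal sketch (assumed design, not the SCE-STISR code): character-region
# visual features attend over recognizer-derived textual-prior embeddings.
import torch
import torch.nn as nn

class TextVisualCrossAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from visual character features
        self.k = nn.Linear(dim, dim)   # keys from textual-prior embeddings
        self.v = nn.Linear(dim, dim)   # values from textual-prior embeddings
        self.scale = dim ** -0.5

    def forward(self, char_feats, text_prior):
        # char_feats: (B, N_char, dim) visual features of cropped characters
        # text_prior: (B, L_text, dim) semantic embeddings from a recognizer
        q = self.q(char_feats)
        k = self.k(text_prior)
        v = self.v(text_prior)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Aggregated semantics are added back to the visual stream (residual).
        return char_feats + attn @ v

# Toy usage: 26 character slots attending over a length-32 prior sequence.
x = torch.randn(2, 26, 64)
prior = torch.randn(2, 32, 64)
out = TextVisualCrossAttention(64)(x, prior)
print(out.shape)  # torch.Size([2, 26, 64])
```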