A Single Character Based Embedding Feature Aggregation Using Cross-Attention for Scene Text Super-Resolution
Abstract
In textual vision scenarios, super-resolution focuses on enhancing text quality and readability for downstream tasks. However, complex backgrounds often make character regions hard to distinguish, and the interference between tightly connected characters is especially difficult to relieve. In this paper, we propose a single-character-based embedding feature aggregation network using cross-attention to solve this problem. First, dynamic feature extraction is applied to adaptively capture shallow features by adjusting the weights of multi-scale features according to spatial representations. During text-image interaction, two levels of cross-attention are introduced to deeply aggregate the clipped single-character features with the textual prior, while also aligning semantic sequences with visual features. Finally, an adaptive normalised colour correction operation is used to reduce the colour drift caused by background interference. On the TextZoom benchmark, the text recognition accuracies are 53.6%, 60.9%, and 64.5% on three recognizers, with an SSIM of 0.7951 and a PSNR of 21.84, which are at the state-of-the-art level. In addition, our approach improves accuracy by 0.2%-2.2% over baselines on five text recognition datasets, which validates the model's generalization.
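To make the aggregation step concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention. It is not the network proposed in the paper: it assumes the clipped single-character visual features act as queries and the textual prior supplies the keys and values, it omits learned projections and the second attention level, and the names `cross_attention`, `char_feats`, and `text_prior` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: list of query vectors (assumed here: visual tokens of one character crop)
    keys, values: lists of vectors (assumed here: textual prior tokens)
    Returns the aggregated vectors (one per query) and the attention weights.
    """
    d = len(keys[0])
    out_dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values)) for c in range(out_dim)])
    return outputs, weights


# Hypothetical toy inputs: 3 visual tokens and 2 text-prior tokens, dimension 2.
char_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_prior = [[2.0, 0.0], [0.0, 2.0]]

aggregated, attn = cross_attention(char_feats, text_prior, text_prior)
```

In this toy setup each visual token attends most strongly to the prior token it resembles, and its output is a mixture of prior vectors weighted by that similarity. A practical model would add learned query, key, and value projections and multiple heads.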