Enhancing Vision Transformers for Scene Text Recognition and Spotting with Orthogonal Constraints
Abstract
In the realm of computer vision, Vision Transformers (ViTs) have emerged as a compelling choice for scene text recognition (STR) and spotting, the task of deciphering text embedded in complex natural environments. While ViTs such as ViTSTR offer a promising balance of accuracy, speed, and computational efficiency, their wider adoption in STR has been hindered by suboptimal accuracy. This shortfall is largely due to the homogeneous nature of feature processing within the multi-head self-attention mechanism, where different heads tend to learn redundant representations. To overcome this limitation, this paper integrates orthogonality constraints into the ViTSTR architecture. These constraints encourage the attention heads to detect diverse and distinct features, thereby enhancing the model's overall accuracy. With the constraints in place, the model captures a more comprehensive array of textual features, which is crucial for handling the varied and unpredictable nature of scene text. The introduced orthogonality not only improves the model's accuracy but also preserves its computational efficiency, making it a robust solution for real-world STR and spotting applications.
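As a rough illustration of how such a constraint can be imposed, the sketch below adds a soft orthogonality penalty on the per-head query projection weights of each multi-head self-attention layer and folds it into the training loss. This is a minimal sketch under assumed conventions, not the paper's exact formulation; the function names, the choice of regularising the query projection, and the weighting factor `lambda_orth` are illustrative assumptions.

```python
import torch
import torch.nn as nn

def head_orthogonality_penalty(proj_weight: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Soft orthogonality penalty that pushes attention heads toward distinct subspaces.

    proj_weight: an (embed_dim, embed_dim) projection weight of one self-attention
    layer (here, the query projection). The weight is split into one block per head,
    each block is flattened and L2-normalised, and the penalty is the squared sum of
    the off-diagonal entries of their Gram matrix, i.e. pairwise cosine similarities
    between heads. A value of zero means the head weights are mutually orthogonal.
    """
    embed_dim = proj_weight.shape[0]
    head_dim = embed_dim // num_heads
    # One flattened weight block per head: (num_heads, head_dim * embed_dim)
    heads = proj_weight.reshape(num_heads, head_dim, -1).flatten(1)
    heads = nn.functional.normalize(heads, dim=1)
    gram = heads @ heads.t()                               # (num_heads, num_heads)
    off_diag = gram - torch.eye(num_heads, device=gram.device)
    return off_diag.pow(2).sum()

def total_loss(recognition_loss: torch.Tensor, model: nn.Module,
               num_heads: int, lambda_orth: float = 1e-2) -> torch.Tensor:
    """Add the per-layer orthogonality penalties to the recognition loss.

    Assumes the backbone uses nn.MultiheadAttention modules; a real ViTSTR
    implementation may organise its attention weights differently.
    """
    penalty = sum(
        # in_proj_weight stacks Q/K/V projections; take the query block.
        head_orthogonality_penalty(m.in_proj_weight[:m.embed_dim], num_heads)
        for m in model.modules()
        if isinstance(m, nn.MultiheadAttention)
    )
    return recognition_loss + lambda_orth * penalty
```

In this sketch the penalty acts as a regulariser added to the usual recognition objective, so the model's forward pass and inference cost are unchanged, which is consistent with the stated goal of improving accuracy without sacrificing computational efficiency.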