Adaptive Transformer with Sequence-Guided Decoders for Enhanced Vision Captioning
Abstract
In recent years, Transformer architectures have been extensively applied to image captioning, achieving remarkable performance. The spatial and positional relationships between visual objects play a pivotal role in crafting meaningful and accurate captions. To further enhance Transformer-based image captioning, this paper introduces the Adaptive Geometry-Integrated Transformer (AGIT). This model incorporates geometry-aware mechanisms into both its encoder and decoder, enabling better representation and use of spatial information. Specifically, the proposed framework comprises two key components: (i) a geometry-enhanced self-attention module, termed the Geometry Attention Refiner (GAR), which explicitly integrates relative spatial relationships into the visual feature representations during encoding; and (ii) a sequence-guided decoding mechanism powered by Position-Sensitive LSTMs (PS-LSTMs) that models and preserves word-order semantics while generating captions. Experimental evaluations on the MS COCO and Flickr30k datasets demonstrate that AGIT outperforms state-of-the-art models in both accuracy and computational efficiency.
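The abstract does not give implementation details, but a geometry-enhanced self-attention step of the kind GAR describes is often realized as scaled dot-product attention with an additive bias computed from pairwise bounding-box geometry. The sketch below is illustrative only: the function names, the log-scaled offset encoding, and the scalar-bias projection are assumptions, not the paper's actual method.

```python
import numpy as np

def box_geometry_features(boxes):
    """Pairwise relative geometry between object bounding boxes.
    boxes: (N, 4) array of (cx, cy, w, h). Returns (N, N, 4) features
    (log-scaled center offsets and size ratios), a common encoding in
    geometry-aware attention; the exact encoding here is an assumption."""
    cx, cy, w, h = boxes.T
    dx = np.log(np.abs(cx[:, None] - cx[None, :]) / w[:, None] + 1e-3)
    dy = np.log(np.abs(cy[:, None] - cy[None, :]) / h[:, None] + 1e-3)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def geometry_biased_attention(Q, K, V, geom_bias):
    """Scaled dot-product attention with an additive geometry bias.
    Q, K, V: (N, d) per-object features; geom_bias: (N, N) scalar bias
    per object pair, e.g. a learned projection of the geometry features."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + geom_bias
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

In a full model the (N, N, 4) geometry tensor would pass through a small learned projection to produce one bias per attention head; here a fixed projection stands in for that learned step.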