Anticipatory Semantics with Bidirectional Guidance for Image Captioning
Abstract
Producing captions that are not only grammatically fluent but also semantically faithful to visual content has long stood as a central problem at the intersection of computer vision and natural language processing. Conventional encoder-decoder frameworks with attention modules, although powerful, typically confine the decoding process to a retrospective scope: every prediction is conditioned solely on previously generated tokens, ignoring the semantics that the remainder of the caption will need to express. This retrospective bias prevents models from fully capturing scene-level coherence. To address this limitation, we introduce FISRA (Future-Infused Semantic Revision Architecture), a dual-pass attention paradigm that supplements training with anticipatory semantic signals. Specifically, FISRA first constructs a global caption hypothesis that serves as a semantic scaffold, then refines the decoding trajectory through revision-oriented attention that aligns each step with both the preceding context and the projected continuation. This design enhances coherence by injecting forward-looking cues into the generative process. The framework is model-agnostic and integrates seamlessly with existing attention-based captioners. Empirical evaluations on MS-COCO show that FISRA consistently advances the state of the art, reaching 133.4 CIDEr-D on the Karpathy split and 131.6 CIDEr-D on the official evaluation server. These findings confirm that our approach significantly strengthens semantic alignment and alleviates the exposure bias inherent in unidirectional captioning systems.
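To make the dual-pass idea concrete, the following is a minimal PyTorch sketch of a "draft then revise" decoder in the spirit of the abstract: a first pass drafts a full caption hypothesis, and a second pass cross-attends to that scaffold so every step also conditions on the projected continuation. All module names, shapes, and wiring (DualPassDecoder, the greedy scaffold, the added scaffold attention) are illustrative assumptions, not the authors' FISRA implementation.

```python
# Sketch of a dual-pass "draft then revise" captioning decoder.
# Assumptions: greedy draft as the scaffold, Transformer decoders for
# both passes; the paper's actual architecture may differ.
import torch
import torch.nn as nn

class DualPassDecoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512,
                 nhead: int = 8, num_layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Pass 1: a standard left-to-right decoder that drafts a caption.
        self.draft_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Pass 2: a revision decoder whose input is enriched by attention
        # over the draft, so each step "sees" the projected continuation.
        self.scaffold_attn = nn.MultiheadAttention(d_model, nhead,
                                                   batch_first=True)
        self.revise_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats: torch.Tensor, tokens: torch.Tensor):
        # visual_feats: (B, R, d_model) region features from any encoder.
        # tokens: (B, T) caption tokens (teacher forcing during training).
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(
            tokens.device)
        tok_emb = self.embed(tokens)

        # Pass 1: draft a complete caption hypothesis (semantic scaffold).
        draft_h = self.draft_decoder(tok_emb, visual_feats, tgt_mask=causal)
        draft_logits = self.out(draft_h)

        # Freeze the draft's greedy predictions into a fixed scaffold.
        scaffold = self.embed(draft_logits.argmax(dim=-1)).detach()

        # Pass 2: attend over the whole scaffold (no causal mask) to inject
        # anticipatory context, then decode causally as usual.
        future_ctx, _ = self.scaffold_attn(tok_emb, scaffold, scaffold)
        revise_h = self.revise_decoder(tok_emb + future_ctx, visual_feats,
                                       tgt_mask=causal)
        return draft_logits, self.out(revise_h)
```

Under this reading, training would supervise both passes with cross-entropy (optionally followed by CIDEr-D reward optimization), while inference decodes with the revision pass only. Because the revision stage is an additive attention module on top of an ordinary decoder, it could in principle be attached to any attention-based captioner, which is consistent with the model-agnostic claim above.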