Anticipatory Semantics with Bidirectional Guidance for Image Captioning
Abstract
Producing captions that are not only grammatically fluent but also semantically faithful to visual content has long stood as a central problem at the intersection of computer vision and natural language processing. Conventional encoder-decoder frameworks with attention modules, although powerful, typically confine the decoding process to a retrospective scope: every prediction is conditioned solely on previously generated tokens, ignoring the semantics that the remainder of the caption will need to express. This retrospective bias prevents models from fully capturing scene-level coherence. To address this limitation, we introduce FISRA (Future-Infused Semantic Revision Architecture), a dual-pass attention paradigm that supplements training with anticipatory semantic signals. Specifically, FISRA first constructs a global caption hypothesis that serves as a semantic scaffold, then refines the decoding trajectory through revision-oriented attention that aligns each step with both the preceding context and the projected continuation. This design enhances coherence by injecting forward-looking cues into the generative process. The framework is model-agnostic and integrates seamlessly with existing attention-based captioners. Empirical evaluations on MS-COCO show that FISRA consistently advances the state of the art, reaching 133.4 CIDEr-D on the Karpathy split and 131.6 CIDEr-D on the official evaluation server. These findings confirm that our approach significantly strengthens semantic alignment and alleviates the exposure bias inherent in unidirectional captioning systems.
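To make the dual-pass idea concrete, the following is a minimal PyTorch sketch of a "draft then revise" decoder in the spirit of the abstract: a first pass drafts a full caption hypothesis, and a second pass cross-attends to that scaffold so every step also conditions on the projected continuation. All module names, shapes, and wiring (DualPassDecoder, the greedy scaffold, the added scaffold attention) are illustrative assumptions, not the authors' FISRA implementation.

```python
# Sketch of a dual-pass "draft then revise" captioning decoder.
# Assumptions: greedy draft as the scaffold, Transformer decoders for
# both passes; the paper's actual architecture may differ.
import torch
import torch.nn as nn

class DualPassDecoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512,
                 nhead: int = 8, num_layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Pass 1: a standard left-to-right decoder that drafts a caption.
        self.draft_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Pass 2: a revision decoder whose input is enriched by attention
        # over the draft, so each step "sees" the projected continuation.
        self.scaffold_attn = nn.MultiheadAttention(d_model, nhead,
                                                   batch_first=True)
        self.revise_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats: torch.Tensor, tokens: torch.Tensor):
        # visual_feats: (B, R, d_model) region features from any encoder.
        # tokens: (B, T) caption tokens (teacher forcing during training).
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(
            tokens.device)
        tok_emb = self.embed(tokens)

        # Pass 1: draft a complete caption hypothesis (semantic scaffold).
        draft_h = self.draft_decoder(tok_emb, visual_feats, tgt_mask=causal)
        draft_logits = self.out(draft_h)

        # Freeze the draft's greedy predictions into a fixed scaffold.
        scaffold = self.embed(draft_logits.argmax(dim=-1)).detach()

        # Pass 2: attend over the whole scaffold (no causal mask) to inject
        # anticipatory context, then decode causally as usual.
        future_ctx, _ = self.scaffold_attn(tok_emb, scaffold, scaffold)
        revise_h = self.revise_decoder(tok_emb + future_ctx, visual_feats,
                                       tgt_mask=causal)
        return draft_logits, self.out(revise_h)
```

Under this reading, training would supervise both passes with cross-entropy (optionally followed by CIDEr-D reward optimization), while inference decodes with the revision pass only. Because the revision stage is an additive attention module on top of an ordinary decoder, it could in principle be attached to any attention-based captioner, which is consistent with the model-agnostic claim above.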