A Multi-Task Intervention Framework for Semantically Faithful Image Captioning
Abstract
Image captioning has long stood as a cornerstone problem at the intersection of computer vision and natural language generation, aiming to translate rich visual scenes into coherent linguistic narratives. While recent Transformer- and LSTM-based encoder–decoder architectures have achieved remarkable success by adopting the extract-then-generate paradigm, persistent issues remain regarding factual reliability and content completeness. Specifically, many models produce semantically inconsistent or overly generic descriptions, frequently misinterpreting visual cues or omitting essential elements. These limitations stem not merely from inadequate supervision but from deeper causal flaws: the learned representations capture superficial correlations between high-frequency visual patterns and linguistic tokens, thereby conflating co-occurrence with causation. To address these challenges, we propose DEPICT (DEpendent multi-task framework with Proxy-confounder Intervened Captioning Transformer), a paradigm that reframes image captioning through the lens of causal reasoning and task dependency. Unlike conventional pipelines that treat caption generation as a monolithic sequence prediction task, DEPICT introduces an explicit intermediate supervision stage, bag-of-categories prediction, which serves as a semantically interpretable mediator between visual encoding and language decoding. This design encourages the model to build a structured understanding of image semantics before producing fluent sentences. Furthermore, DEPICT incorporates causal intervention by applying Pearl’s do-calculus to disentangle genuine causal signals from spurious correlations. To operationalize this principle, we introduce a set of high-frequency concept proxies that approximate latent confounders and estimate their influence through variational inference. This intervention effectively “cuts off” biased visual-linguistic links, guiding the model toward more faithful reasoning about the underlying causes of scene semantics. Finally, DEPICT is trained with a multi-agent reinforcement learning (MARL) strategy that jointly optimizes the intermediate and final tasks while mitigating the propagation of task-level errors. Each agent specializes in a distinct sub-task, and their cooperative reward structure balances semantic precision and narrative fluency during end-to-end learning. Extensive experiments on MSCOCO and other benchmarks demonstrate that DEPICT not only surpasses strong baselines but also matches or exceeds the performance of state-of-the-art Transformer captioners. More importantly, qualitative analyses show that DEPICT generates descriptions that are more causally grounded, diverse, and semantically faithful.
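To make the two structural ideas in the abstract concrete, the sketch below shows, in PyTorch, (a) a proxy-confounder module that mixes a small dictionary of concept-proxy embeddings into the visual features via attention, a common way to approximate a backdoor-style adjustment, and (b) a multi-task head that predicts a bag of categories as an intermediate task alongside a Transformer caption decoder. This is a minimal illustration under assumed dimensions and module names (ProxyConfounderIntervention, MultiTaskCaptioner, feature size 512, 80 categories, 100 proxies are all hypothetical); it does not reproduce DEPICT's actual architecture, its variational inference over confounders, or its MARL training.

```python
# Minimal sketch, not the authors' implementation: attention over proxy
# embeddings stands in for the variational confounder estimation described
# in the abstract, and supervised losses stand in for MARL training.
import torch
import torch.nn as nn


class ProxyConfounderIntervention(nn.Module):
    """Mix K concept-proxy embeddings into the visual features so the
    decoder does not rely solely on biased visual-linguistic co-occurrences."""

    def __init__(self, dim: int, num_proxies: int = 100):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_proxies, dim))  # assumed confounder dictionary
        self.query = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, visual: torch.Tensor) -> torch.Tensor:
        # visual: (batch, regions, dim)
        q = self.query(visual)                                        # (B, R, D)
        attn = torch.softmax(q @ self.proxies.t() / q.size(-1) ** 0.5, dim=-1)
        z = attn @ self.proxies                                       # expected proxy-confounder features
        return self.fuse(torch.cat([visual, z], dim=-1))              # "intervened" visual features


class MultiTaskCaptioner(nn.Module):
    """Intervened visual features -> bag-of-categories head (intermediate task)
    and Transformer caption decoder (final task)."""

    def __init__(self, dim: int = 512, num_categories: int = 80, vocab_size: int = 10000):
        super().__init__()
        self.intervene = ProxyConfounderIntervention(dim)
        self.category_head = nn.Linear(dim, num_categories)           # multi-label category logits
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.word_out = nn.Linear(dim, vocab_size)

    def forward(self, visual: torch.Tensor, caption_in: torch.Tensor):
        feats = self.intervene(visual)                                 # (B, R, D)
        cat_logits = self.category_head(feats.mean(dim=1))             # bag-of-categories prediction
        tgt = self.word_embed(caption_in)                              # (B, T, D)
        causal_mask = torch.triu(
            torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1
        )
        dec = self.decoder(tgt, feats, tgt_mask=causal_mask)
        return cat_logits, self.word_out(dec)                          # both task outputs


# Joint supervision over the intermediate and final tasks: binary cross-entropy
# for the multi-label bag of categories, cross-entropy for the caption words.
model = MultiTaskCaptioner()
visual = torch.randn(2, 36, 512)                  # e.g. 36 region features per image (assumed)
caption_in = torch.randint(0, 10000, (2, 12))
cat_target = torch.randint(0, 2, (2, 80)).float()
word_target = torch.randint(0, 10000, (2, 12))
cat_logits, word_logits = model(visual, caption_in)
loss = nn.functional.binary_cross_entropy_with_logits(cat_logits, cat_target) \
     + nn.functional.cross_entropy(word_logits.reshape(-1, 10000), word_target.reshape(-1))
```

In this reading, the bag-of-categories head supplies the explicit intermediate supervision, and the proxy mixture weakens direct shortcuts from raw visual patterns to words; how DEPICT actually parameterizes the confounders and couples the two tasks under MARL is only summarized in the abstract above.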