A Multi-Task Intervention Framework for Semantically Faithful Image Captioning
Abstract
Image captioning has long stood as a cornerstone problem at the intersection of computer vision and natural language generation, aiming to translate rich visual scenes into coherent linguistic narratives. While recent Transformer- and LSTM-based encoder–decoder architectures have achieved remarkable success by adopting the extract-then-generate paradigm, persistent issues remain regarding factual reliability and content completeness. Specifically, many models produce semantically inconsistent or overly generic descriptions, frequently misinterpreting visual cues or omitting essential elements. These limitations stem not merely from inadequate supervision but from deeper causal flaws: the learned representations capture superficial correlations between high-frequency visual patterns and linguistic tokens, thereby conflating co-occurrence with causation. To address these challenges, we propose DEPICT (DEpendent multi-task framework with Proxy-confounder Intervened Captioning Transformer), a paradigm that reframes image captioning through the lens of causal reasoning and task dependency. Unlike conventional pipelines that treat caption generation as a monolithic sequence prediction task, DEPICT introduces an explicit intermediate supervision stage, bag-of-categories prediction, which serves as a semantically interpretable mediator between visual encoding and language decoding. This design encourages the model to build a structured understanding of image semantics before producing fluent sentences. Furthermore, DEPICT incorporates causal intervention by applying Pearl’s do-calculus to disentangle genuine causal signals from spurious correlations. To operationalize this principle, we introduce a set of high-frequency concept proxies that approximate latent confounders and estimate their influence through variational inference. This intervention effectively “cuts off” biased visual-linguistic links, guiding the model toward more faithful reasoning about the underlying causes of scene semantics. Finally, DEPICT is trained with a multi-agent reinforcement learning (MARL) strategy that jointly optimizes the intermediate and final tasks while mitigating the propagation of task-level errors. Each agent specializes in a distinct sub-task, and their cooperative reward structure balances semantic precision and narrative fluency during end-to-end learning. Extensive experiments on MSCOCO and other benchmarks demonstrate that DEPICT not only surpasses strong baselines but also matches or exceeds the performance of state-of-the-art Transformer captioners. More importantly, qualitative analyses show that DEPICT generates descriptions that are more causally grounded, diverse, and semantically faithful.
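To make the two structural ideas in the abstract concrete, the sketch below shows, in PyTorch, (a) a proxy-confounder module that mixes a small dictionary of concept-proxy embeddings into the visual features via attention, a common way to approximate a backdoor-style adjustment, and (b) a multi-task head that predicts a bag of categories as an intermediate task alongside a Transformer caption decoder. This is a minimal illustration under assumed dimensions and module names (ProxyConfounderIntervention, MultiTaskCaptioner, feature size 512, 80 categories, 100 proxies are all hypothetical); it does not reproduce DEPICT's actual architecture, its variational inference over confounders, or its MARL training.

```python
# Minimal sketch, not the authors' implementation: attention over proxy
# embeddings stands in for the variational confounder estimation described
# in the abstract, and supervised losses stand in for MARL training.
import torch
import torch.nn as nn


class ProxyConfounderIntervention(nn.Module):
    """Mix K concept-proxy embeddings into the visual features so the
    decoder does not rely solely on biased visual-linguistic co-occurrences."""

    def __init__(self, dim: int, num_proxies: int = 100):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_proxies, dim))  # assumed confounder dictionary
        self.query = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, visual: torch.Tensor) -> torch.Tensor:
        # visual: (batch, regions, dim)
        q = self.query(visual)                                        # (B, R, D)
        attn = torch.softmax(q @ self.proxies.t() / q.size(-1) ** 0.5, dim=-1)
        z = attn @ self.proxies                                       # expected proxy-confounder features
        return self.fuse(torch.cat([visual, z], dim=-1))              # "intervened" visual features


class MultiTaskCaptioner(nn.Module):
    """Intervened visual features -> bag-of-categories head (intermediate task)
    and Transformer caption decoder (final task)."""

    def __init__(self, dim: int = 512, num_categories: int = 80, vocab_size: int = 10000):
        super().__init__()
        self.intervene = ProxyConfounderIntervention(dim)
        self.category_head = nn.Linear(dim, num_categories)           # multi-label category logits
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.word_out = nn.Linear(dim, vocab_size)

    def forward(self, visual: torch.Tensor, caption_in: torch.Tensor):
        feats = self.intervene(visual)                                 # (B, R, D)
        cat_logits = self.category_head(feats.mean(dim=1))             # bag-of-categories prediction
        tgt = self.word_embed(caption_in)                              # (B, T, D)
        causal_mask = torch.triu(
            torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1
        )
        dec = self.decoder(tgt, feats, tgt_mask=causal_mask)
        return cat_logits, self.word_out(dec)                          # both task outputs


# Joint supervision over the intermediate and final tasks: binary cross-entropy
# for the multi-label bag of categories, cross-entropy for the caption words.
model = MultiTaskCaptioner()
visual = torch.randn(2, 36, 512)                  # e.g. 36 region features per image (assumed)
caption_in = torch.randint(0, 10000, (2, 12))
cat_target = torch.randint(0, 2, (2, 80)).float()
word_target = torch.randint(0, 10000, (2, 12))
cat_logits, word_logits = model(visual, caption_in)
loss = nn.functional.binary_cross_entropy_with_logits(cat_logits, cat_target) \
     + nn.functional.cross_entropy(word_logits.reshape(-1, 10000), word_target.reshape(-1))
```

In this reading, the bag-of-categories head supplies the explicit intermediate supervision, and the proxy mixture weakens direct shortcuts from raw visual patterns to words; how DEPICT actually parameterizes the confounders and couples the two tasks under MARL is only summarized in the abstract above.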