Adaptive Semantic Fusion for Contextual Image Captioning
Abstract
The automatic generation of textual descriptions from visual data is a fundamental yet challenging task that requires the seamless integration of image understanding and language modeling: complex visual elements must not only be identified and interpreted, but also mapped to coherent, contextually relevant textual representations. In this paper, we propose the \textit{Dynamic Iterative Refinement Model (DIRM)}, a framework that addresses these challenges by dynamically adjusting the output vocabulary of the language decoder through decoder-driven visual semantics. By leveraging a dynamic gating mechanism and scatter-connected mappings, DIRM implicitly learns robust associations between visual tag words and the corresponding image regions, enabling it to generate captions that are semantically rich, contextually accurate, and attentive to fine-grained visual detail. The framework introduces a multi-step refinement strategy in which visual concepts are iteratively refined and integrated into the decoding process to strengthen semantic alignment. In addition, DIRM incorporates a visual-concept vocabulary that guides the generation of descriptive keywords, bridging the gap between high-level visual semantics and linguistic coherence. Together, these components allow the model to focus adaptively on salient image features, reducing reliance on generic language patterns and promoting content-specific caption generation. Extensive experiments on the MS-COCO dataset show that DIRM outperforms existing visual-semantic-based approaches, achieving state-of-the-art results on BLEU, CIDEr, and SPICE and producing captions with improved fluency, relevance, and descriptive depth. Qualitative analysis further highlights the model's ability to capture nuanced visual relationships and to produce detailed captions that align closely with human annotations. These results advance visual-semantic image captioning and point toward future research on dynamic visual-linguistic integration and multimodal generation tasks.
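To make the fusion step more concrete, the sketch below shows one way a decoder-driven gate could inject visual-concept evidence into the decoder's output vocabulary via a scatter-based mapping. It is a minimal illustration under our own assumptions: the module names, tensor shapes, and the mapping tensor \textit{concept\_to\_vocab} are hypothetical and do not reproduce the paper's exact implementation.

\begin{verbatim}
# Minimal sketch (PyTorch) of a decoder-driven gate that fuses
# visual-concept scores into the captioning decoder's output vocabulary.
# Shapes, names, and the scatter mapping are illustrative assumptions.
import torch
import torch.nn as nn


class DynamicVocabularyGate(nn.Module):
    def __init__(self, hidden_dim, visual_dim, vocab_size, concept_to_vocab):
        super().__init__()
        # concept_to_vocab: LongTensor [num_concepts] mapping each visual
        # tag word to its index in the full output vocabulary (assumed given).
        self.register_buffer("concept_to_vocab", concept_to_vocab)
        self.gate = nn.Linear(hidden_dim + visual_dim, 1)    # dynamic gate
        self.concept_scorer = nn.Linear(visual_dim, 1)       # per-concept score
        self.base_proj = nn.Linear(hidden_dim, vocab_size)   # standard LM head

    def forward(self, decoder_state, concept_feats):
        # decoder_state: [B, hidden_dim], the current decoder hidden state
        # concept_feats: [B, num_concepts, visual_dim], region features
        #                pooled per visual-concept (tag) word
        base_logits = self.base_proj(decoder_state)          # [B, vocab_size]

        # Score each visual concept and scatter the scores into the full
        # vocabulary, i.e. a scatter-connected concept-to-word mapping.
        concept_logits = self.concept_scorer(concept_feats).squeeze(-1)  # [B, C]
        scattered = torch.zeros_like(base_logits)
        index = self.concept_to_vocab.unsqueeze(0).expand_as(concept_logits)
        scattered.scatter_(1, index, concept_logits)

        # The decoder-driven gate decides how much visual evidence to
        # inject at this decoding step.
        pooled_visual = concept_feats.mean(dim=1)             # [B, visual_dim]
        g = torch.sigmoid(
            self.gate(torch.cat([decoder_state, pooled_visual], dim=-1)))
        return base_logits + g * scattered                    # adjusted logits
\end{verbatim}

In an iterative-refinement setting such as the one described above, a step like this would be applied at every decoding position, with the concept features themselves re-estimated between refinement passes; that outer loop is omitted here for brevity.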