Cross-Modal Semantic-Enhanced Image Captioning
Abstract
Semi-supervised image captioning requires generating descriptive captions for images with only limited annotation. We address this task with a hybrid cross-modal inference system, termed the Semantic Consistency and Predictive Regulation Framework (SCPRF). Unlike traditional methods that depend heavily on extensively annotated datasets, our approach leverages a small set of labeled image-caption pairs together with a larger corpus of unlabeled images. This paper introduces a novel methodology that bridges the descriptive gap by enforcing semantic consistency and using predictive cues from raw images to guide caption generation. Specifically, we address cross-modal disparities by embedding both images and their generated captions into a unified semantic space, where alignment is enforced through two complementary mechanisms: predictive alignment and relational consistency. This approach not only preserves the integrity of information across modalities but also strengthens learning under limited supervision. Experiments on the MS-COCO dataset show that SCPRF significantly surpasses existing methods, improving CIDEr-D scores by over 12\%, demonstrating robust performance in challenging semi-supervised settings.
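To make the two alignment mechanisms concrete, the following is a minimal PyTorch sketch of how such losses could be formulated. It is an illustration under stated assumptions, not SCPRF's published implementation: the function names, embedding dimensions, and exact loss forms (cosine alignment for the predictive term, matching of batch-level similarity matrices for the relational term) are all assumptions introduced here for clarity.

```python
# Illustrative sketch only: SCPRF's actual losses are not specified in the
# abstract, so the forms below are plausible stand-ins, not the paper's code.
import torch
import torch.nn.functional as F

def predictive_alignment_loss(img_emb: torch.Tensor, cap_emb: torch.Tensor) -> torch.Tensor:
    """Pull each generated-caption embedding toward its paired image
    embedding in the shared semantic space (one reading of 'predictive
    alignment'): 1 - cosine similarity, averaged over the batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    cap_emb = F.normalize(cap_emb, dim=-1)
    return (1.0 - (img_emb * cap_emb).sum(dim=-1)).mean()

def relational_consistency_loss(img_emb: torch.Tensor, cap_emb: torch.Tensor) -> torch.Tensor:
    """Require the pairwise similarity structure among captions to mirror
    that among images (one reading of 'relational consistency')."""
    img_emb = F.normalize(img_emb, dim=-1)
    cap_emb = F.normalize(cap_emb, dim=-1)
    rel_img = img_emb @ img_emb.t()  # (B, B) image-image similarities
    rel_cap = cap_emb @ cap_emb.t()  # (B, B) caption-caption similarities
    return F.mse_loss(rel_cap, rel_img)

# Toy usage: a batch of 8 image/caption embeddings in a 256-d shared space.
B, D = 8, 256
img_emb = torch.randn(B, D)  # stand-in for image-encoder outputs
cap_emb = torch.randn(B, D)  # stand-in for caption-encoder outputs
loss = predictive_alignment_loss(img_emb, cap_emb) \
     + relational_consistency_loss(img_emb, cap_emb)
print(f"combined alignment loss: {loss.item():.4f}")
```

Under this reading, the predictive term supervises each image-caption pair directly, while the relational term transfers structure across the batch, which is what lets unlabeled images contribute a training signal.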