Cross-Modal Semantic-Enhanced Image Captioning

Abstract

Semi-supervised image captioning requires generating descriptive captions for images under limited annotation. We address this task with a hybrid cross-modal inference system, the Semantic Consistency and Predictive Regulation Framework (SCPRF). Unlike traditional methods that depend heavily on extensively annotated datasets, our approach leverages a small set of labeled image-caption pairs together with a larger corpus of unlabeled images. This paper introduces a methodology that bridges the descriptive gap by enforcing semantic consistency and using predictive cues from raw images to guide caption generation. Specifically, we address cross-modal disparities by embedding both images and their generated captions into a unified semantic space, where alignment is enforced through two mechanisms: predictive alignment and relational consistency. This approach preserves the integrity of information across modalities and enhances learning under limited supervision. Experiments on the MS-COCO dataset show that SCPRF significantly surpasses existing methods, improving CIDEr-D scores by over 12% and demonstrating robust performance in challenging semi-supervised settings.
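The two alignment mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the cosine form of the predictive alignment term, and the similarity-matrix form of the relational consistency term are illustrative assumptions about how embeddings in a shared semantic space might be aligned.

```python
# Minimal sketch (assumed, not SCPRF's actual implementation) of two losses
# over image and caption embeddings already projected into one semantic space:
# (i) predictive alignment pulls each image toward its paired caption;
# (ii) relational consistency matches the pairwise similarity structure
#      within each modality.
import torch
import torch.nn.functional as F


def predictive_alignment_loss(img_emb: torch.Tensor, cap_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-based stand-in for predictive alignment: minimize
    1 - cos(image_i, caption_i) over paired embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    cap_emb = F.normalize(cap_emb, dim=-1)
    return (1.0 - (img_emb * cap_emb).sum(dim=-1)).mean()


def relational_consistency_loss(img_emb: torch.Tensor, cap_emb: torch.Tensor) -> torch.Tensor:
    """Stand-in for relational consistency: the image-image similarity
    matrix should mirror the caption-caption similarity matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    cap_emb = F.normalize(cap_emb, dim=-1)
    sim_img = img_emb @ img_emb.T  # (B, B) cosine similarities among images
    sim_cap = cap_emb @ cap_emb.T  # (B, B) cosine similarities among captions
    return F.mse_loss(sim_img, sim_cap)


# Toy usage: a batch of 8 paired embeddings in a 256-d shared space.
img_emb = torch.randn(8, 256, requires_grad=True)
cap_emb = torch.randn(8, 256, requires_grad=True)
loss = predictive_alignment_loss(img_emb, cap_emb) + relational_consistency_loss(img_emb, cap_emb)
loss.backward()
```

Under this reading, the predictive term supervises the labeled pairs directly, while the relational term can also regularize unlabeled images, since it only compares similarity structures rather than requiring ground-truth captions for every image.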
