RGFRCap: Enhancing Image Captioning with Retrieval-Guided Semantic Feature Refinement
Abstract
This paper introduces RGFRCap, an image captioning framework that improves caption quality through retrieval-guided semantic feature refinement. RGFRCap first employs an image-text retrieval (ITR) module to fetch candidate captions that closely match the input image; these serve as conditional priors for subsequent processing. The priors guide a semantic feature filtering (SFF) mechanism that refines semantic information extracted by object detection and semantic segmentation, retaining relevant objects, attributes, and pixel-level details. The filtered semantic features are then combined with region-level visual features in a visual-semantic fusion (VSF) module, enriching the visual representation. A vision-language transformer decoder uses this enhanced representation to generate accurate, contextually rich captions. Experiments on MSCOCO, Flickr30K, and two custom traffic datasets (City_cap and FoggyCity_cap) show that RGFRCap outperforms existing methods on several benchmarks. The code and datasets are publicly available at https://github.com/fjq-tongji/RGFRCap to foster further research.
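To make the pipeline concrete, the following is a minimal, hypothetical Python sketch of the four stages named in the abstract (ITR retrieval, SFF filtering, VSF fusion, decoding). All function names, scoring rules, and toy data here are illustrative placeholders, not the authors' actual implementation, which uses learned neural modules rather than word-overlap heuristics.

```python
def retrieve_captions(image_tags, caption_bank, k=2):
    """ITR stand-in: rank stored captions by word overlap with image tags."""
    scored = sorted(
        caption_bank,
        key=lambda c: len(set(c.split()) & set(image_tags)),
        reverse=True,
    )
    return scored[:k]

def filter_semantics(semantic_tokens, retrieved):
    """SFF stand-in: keep only detector/segmenter tokens that also appear
    in the retrieved caption priors (a crude relevance filter)."""
    prior_vocab = {w for cap in retrieved for w in cap.split()}
    return [t for t in semantic_tokens if t in prior_vocab]

def fuse(visual_feats, semantic_tokens):
    """VSF stand-in: combine region features with the kept semantic tokens."""
    return visual_feats + [("sem", t) for t in semantic_tokens]

def decode(fused):
    """Decoder stand-in: emit the kept semantic words as a caption."""
    return " ".join(t for kind, t in fused if kind == "sem")

# Toy end-to-end run of the four stages.
bank = ["a red car on a foggy street", "a dog runs in a park"]
tags = ["car", "street", "fog"]                      # detected image tags
priors = retrieve_captions(tags, bank, k=1)          # ITR
kept = filter_semantics(["car", "street", "tree"], priors)  # SFF
caption = decode(fuse([("vis", "region0")], kept))   # VSF + decode
```

In the real model, each stage would operate on embeddings (retrieval via image-text similarity, filtering via attention over priors, fusion and decoding inside a transformer); the sketch only mirrors the data flow between modules.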