RGFRCap: Enhancing Image Captioning with Retrieval-Guided Semantic Feature Refinement


Abstract

This paper introduces RGFRCap, an image captioning framework that leverages retrieval-guided semantic feature refinement to improve caption quality. At its core, RGFRCap uses an image-text retrieval (ITR) module to fetch candidate captions that closely match the input image; these serve as conditional priors for subsequent processing. The priors guide a semantic feature filtering (SFF) mechanism, which refines semantic information extracted via object detection and semantic segmentation, focusing on relevant objects, attributes, and pixel-level details. The refined semantic features are then combined with region-level visual features in a visual-semantic fusion (VSF) module, enriching the visual representation. A vision-language transformer decoder uses this enhanced representation to generate precise and contextually rich captions. Empirical evaluations on MSCOCO, Flickr30K, and two custom traffic datasets (City_cap and FoggyCity_cap) demonstrate that RGFRCap surpasses existing methods on several benchmarks. The codebase and datasets are publicly available at https://github.com/fjq-tongji/RGFRCap.
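The abstract describes a pipeline of retrieval priors guiding semantic filtering, followed by visual-semantic fusion. The paper's implementation details are not given here, so the following is a minimal, illustrative NumPy sketch of that SFF-to-VSF data flow; all shapes, the cross-attention-style filtering, and the random fusion projection are assumptions for illustration, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, S, P = 64, 36, 20, 5  # assumed: feature dim, regions, semantics, priors


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def semantic_feature_filter(semantic_feats, prior_feats):
    """SFF sketch: priors from retrieved captions attend over semantic
    features, keeping the objects/attributes relevant to the candidates."""
    scores = prior_feats @ semantic_feats.T           # (P, S) similarity
    weights = softmax(scores, axis=-1)                # attention over semantics
    return weights @ semantic_feats                   # (P, D) refined semantics


def visual_semantic_fusion(region_feats, refined_sem):
    """VSF sketch: pool refined semantics, concatenate with each region
    feature, and project back to the model dimension."""
    pooled = refined_sem.mean(axis=0, keepdims=True)  # (1, D)
    tiled = np.repeat(pooled, region_feats.shape[0], axis=0)
    fused_in = np.concatenate([region_feats, tiled], axis=1)  # (R, 2D)
    W = rng.normal(size=(2 * D, D)) / np.sqrt(2 * D)  # stand-in fusion weights
    return fused_in @ W                               # (R, D) enriched regions


region_feats = rng.normal(size=(R, D))    # detector region features
semantic_feats = rng.normal(size=(S, D))  # object/segmentation semantics
prior_feats = rng.normal(size=(P, D))     # embeddings of retrieved captions

refined = semantic_feature_filter(semantic_feats, prior_feats)
fused = visual_semantic_fusion(region_feats, refined)
print(refined.shape, fused.shape)  # (5, 64) (36, 64)
```

The fused `(R, D)` representation would then feed the transformer decoder described in the abstract; in practice the fusion weights and attention would be learned, not random.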