RGFRCap: Enhancing Image Captioning with Retrieval-Guided Semantic Feature Refinement
Abstract
This paper introduces RGFRCap, an image captioning framework that improves caption quality through retrieval-guided semantic feature refinement. RGFRCap first employs an image-text retrieval (ITR) module to fetch candidate captions that closely match the input image; these serve as conditional priors for subsequent processing. The priors guide a semantic feature filtering (SFF) mechanism that refines semantic information extracted by object detection and semantic segmentation, retaining relevant objects, attributes, and pixel-level details. The filtered semantic features are then combined with region-level visual features in a visual-semantic fusion (VSF) module, enriching the visual representation. A vision-language transformer decoder uses this enhanced representation to generate accurate, contextually rich captions. Experiments on MSCOCO, Flickr30K, and two custom traffic datasets (City_cap and FoggyCity_cap) show that RGFRCap outperforms existing methods on several benchmarks. The code and datasets are publicly available at https://github.com/fjq-tongji/RGFRCap to foster further research.
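To make the pipeline concrete, the following is a minimal, hypothetical Python sketch of the four stages named in the abstract (ITR retrieval, SFF filtering, VSF fusion, decoding). All function names, scoring rules, and toy data here are illustrative placeholders, not the authors' actual implementation, which uses learned neural modules rather than word-overlap heuristics.

```python
def retrieve_captions(image_tags, caption_bank, k=2):
    """ITR stand-in: rank stored captions by word overlap with image tags."""
    scored = sorted(
        caption_bank,
        key=lambda c: len(set(c.split()) & set(image_tags)),
        reverse=True,
    )
    return scored[:k]

def filter_semantics(semantic_tokens, retrieved):
    """SFF stand-in: keep only detector/segmenter tokens that also appear
    in the retrieved caption priors (a crude relevance filter)."""
    prior_vocab = {w for cap in retrieved for w in cap.split()}
    return [t for t in semantic_tokens if t in prior_vocab]

def fuse(visual_feats, semantic_tokens):
    """VSF stand-in: combine region features with the kept semantic tokens."""
    return visual_feats + [("sem", t) for t in semantic_tokens]

def decode(fused):
    """Decoder stand-in: emit the kept semantic words as a caption."""
    return " ".join(t for kind, t in fused if kind == "sem")

# Toy end-to-end run of the four stages.
bank = ["a red car on a foggy street", "a dog runs in a park"]
tags = ["car", "street", "fog"]                      # detected image tags
priors = retrieve_captions(tags, bank, k=1)          # ITR
kept = filter_semantics(["car", "street", "tree"], priors)  # SFF
caption = decode(fuse([("vis", "region0")], kept))   # VSF + decode
```

In the real model, each stage would operate on embeddings (retrieval via image-text similarity, filtering via attention over priors, fusion and decoding inside a transformer); the sketch only mirrors the data flow between modules.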