SCAP: Enhancing Image Captioning through Lightweight Feature Sifting and Hierarchical Decoding
Abstract
Image captioning aims to generate descriptive captions for visual content, strengthening the link between images and their semantic meaning. In this paper, we propose SCAP, a lightweight model that enhances image captioning through a novel sifting attention mechanism. SCAP incorporates a summary module and a forget module within its encoder to refine visual information, discarding noise and retaining essential details. The hierarchical decoder then leverages sifting attention to align image features with text captions, generating accurate and contextually relevant descriptions. Extensive experiments on the COCO dataset demonstrate SCAP's superior performance, achieving state-of-the-art results while maintaining computational efficiency. This lightweight design makes SCAP a promising solution for resource-constrained scenarios and advances the field of image captioning.
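The abstract does not spell out how the summary and forget modules are implemented, but the description suggests a condensed global representation paired with a per-token gate on the encoder features. Below is a minimal, hypothetical PyTorch sketch of such a sifting step; the class name `SiftingBlock`, the attention-pooled summary, and the sigmoid forget gate are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn as nn

class SiftingBlock(nn.Module):
    """Hypothetical sketch of encoder-side feature sifting: a summary module
    condenses the visual tokens, and a forget gate suppresses noisy features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Summary module: attention-pool the patch features into one global vector.
        self.summary_query = nn.Parameter(torch.randn(1, 1, dim))
        self.summary_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Forget module: per-token sigmoid gate conditioned on the token and the summary.
        self.forget_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, dim) features from a visual backbone.
        batch, num_patches, _ = patch_feats.shape
        query = self.summary_query.expand(batch, -1, -1)
        summary, _ = self.summary_attn(query, patch_feats, patch_feats)   # (batch, 1, dim)
        summary = summary.expand(-1, num_patches, -1)
        # Gate each token: values near 0 "forget" noisy features, near 1 retain them.
        gate = self.forget_gate(torch.cat([patch_feats, summary], dim=-1))
        return gate * patch_feats


# Usage: sift a 7x7 grid of 512-d features before passing them to the decoder.
feats = torch.randn(2, 49, 512)
sifted = SiftingBlock(512)(feats)
print(sifted.shape)  # torch.Size([2, 49, 512])
```

Under this reading, the refined features keep the original shape, so the hierarchical decoder can attend over them with standard cross-attention while benefiting from the noise suppression applied in the encoder.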