Conditioned Visual Captioning with Spatially-Aware Multimodal Modeling

Abstract

Understanding scene text in images is crucial for many real-world applications, especially for visually impaired individuals who rely on comprehensive and contextually relevant descriptions. Traditional text-aware image captioning systems, however, fail to generate personalized captions that cater to diverse user inquiries. To bridge this gap, we introduce a novel and challenging task, Question-driven Text-aware Image Captioning (Q-TAG), in which captions are dynamically tailored to specific user queries. Given an image containing multiple pieces of scene text, the system must comprehend user-posed questions, extract the relevant textual and visual features, and construct fluent, contextually enriched captions. To facilitate research in this domain, we construct benchmark datasets derived from existing text-aware captioning datasets through an automated data augmentation pipeline. These datasets provide quadruples of <image, initial coarse caption, control questions, enriched captions>. We propose a model, also named Q-TAG, that integrates a Spatially-aware Multimodal Encoder to fuse object-region and scene-text features while accounting for their geometric relationships. A Question-driven Feature Selector then filters the visual-textual elements most relevant to the user's queries, and a Multimodal Fusion Decoder synthesizes these components into highly informative captions. Experimental evaluations demonstrate that Q-TAG surpasses strong baselines in both captioning quality and question relevance, producing more diverse and context-sensitive descriptions than existing models.
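
For readers who find a concrete sketch helpful, the snippet below outlines how the three components named in the abstract could be wired together in PyTorch: a spatially-aware encoder that fuses region and scene-text features with their bounding-box geometry, a question-driven selector implemented here as cross-attention, and an autoregressive fusion decoder. Every class name, layer choice, and dimension is an illustrative assumption based only on this abstract, not the authors' actual implementation.

```python
# Minimal sketch of a Q-TAG-style pipeline; all hyperparameters and
# architectural choices below are illustrative assumptions.
import torch
import torch.nn as nn


class SpatiallyAwareEncoder(nn.Module):
    """Fuses object-region and scene-text (OCR) features, adding a projection
    of their bounding-box geometry before transformer self-attention."""
    def __init__(self, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.box_proj = nn.Linear(4, d_model)  # (x1, y1, x2, y2) -> d_model
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, region_feats, ocr_feats, region_boxes, ocr_boxes):
        feats = torch.cat([region_feats, ocr_feats], dim=1)
        boxes = torch.cat([region_boxes, ocr_boxes], dim=1)
        return self.encoder(feats + self.box_proj(boxes))


class QuestionDrivenSelector(nn.Module):
    """Cross-attends the fused visual-textual tokens to the encoded question,
    emphasizing the elements relevant to the user's query."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, fused_tokens, question_tokens):
        selected, _ = self.cross_attn(fused_tokens, question_tokens, question_tokens)
        return fused_tokens + selected  # residual keeps unselected context available


class MultimodalFusionDecoder(nn.Module):
    """Autoregressive transformer decoder over the selected multimodal memory."""
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, caption_ids, memory):
        tgt = self.embed(caption_ids)
        causal = torch.triu(
            torch.ones(tgt.size(1), tgt.size(1), dtype=torch.bool), diagonal=1
        )
        return self.lm_head(self.decoder(tgt, memory, tgt_mask=causal))
```

In a full system, region_feats would come from an object detector and ocr_feats from an OCR pipeline, both projected to the shared model dimension before encoding; the selector's output serves as the decoder's memory when generating the enriched caption.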
