Cognitive Grounding for Visual Question Reasoning via Dynamic Knowledge Imagination
Abstract
Visual question answering (VQA) represents a critical intersection of vision and language understanding, where models must perceive visual scenes and reason about their underlying semantics. However, human-like reasoning often extends beyond what is directly observable, requiring the invocation of prior knowledge, inference, and commonsense understanding. In this work, we reexamine the role of external knowledge in multimodal reasoning and propose a unified framework, the Cognitive Grounded Imagination Network (COGINet), that integrates dynamically generated commonsense imagination with vision-language understanding. Instead of merely retrieving symbolic facts from static knowledge bases, our framework leverages contextualized commonsense imagination synthesized by large-scale generative knowledge models, thereby grounding reasoning in both perceptual evidence and inferred world regularities. COGINet introduces a two-stage process: (1) a knowledge imagination module, which generates plausible contextual hypotheses from the interaction between visual regions and textual queries; and (2) a cross-modal reasoning transformer that fuses these contextualized inferences with multimodal embeddings through adaptive attention. We demonstrate that such cognitive grounding enables the model to reason about abstract or implied concepts, extending beyond explicit cues in the image or text. Extensive experiments on the OK-VQA and A-OKVQA benchmarks show that COGINet achieves consistent improvements over prior static-knowledge methods, providing both quantitative gains and qualitative interpretability. Further analysis reveals the model's ability to discern when external knowledge is useful, selectively invoking imagined context only when required for reasoning. Our findings highlight the importance of dynamic, cognitively inspired commonsense integration for achieving genuine multimodal understanding.
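
To make the two-stage process concrete, the following is a minimal PyTorch sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not the paper's implementation: the module names (KnowledgeImaginationModule, CrossModalReasoningTransformer), feature dimensions, number of hypothesis slots, the scalar gate used for "adaptive attention", and the answer-vocabulary size are all hypothetical, and a cross-attention block stands in for the large generative knowledge model that the abstract describes.

# Hypothetical sketch of the two-stage COGINet pipeline described in the abstract.
# All names, dimensions, and the gating formulation are illustrative assumptions.
import torch
import torch.nn as nn


class KnowledgeImaginationModule(nn.Module):
    """Stage 1: generate plausible contextual hypotheses ("imagined knowledge")
    from the interaction between visual region features and the textual query.
    A cross-attention block stands in for the generative knowledge model."""

    def __init__(self, dim: int = 768, num_hypotheses: int = 5):
        super().__init__()
        self.hypothesis_queries = nn.Parameter(torch.randn(num_hypotheses, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, region_feats, question_feats):
        # Condition learnable hypothesis slots on the joint visual/textual context.
        context = torch.cat([region_feats, question_feats], dim=1)
        queries = self.hypothesis_queries.unsqueeze(0).expand(context.size(0), -1, -1)
        imagined, _ = self.cross_attn(queries, context, context)
        return self.proj(imagined)  # (B, num_hypotheses, dim)


class CrossModalReasoningTransformer(nn.Module):
    """Stage 2: fuse imagined knowledge with multimodal embeddings; a learned
    gate decides how much imagined context to invoke for a given question."""

    def __init__(self, dim: int = 768, num_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.answer_head = nn.Linear(dim, 3129)  # e.g. a VQA answer vocabulary

    def forward(self, region_feats, question_feats, imagined_knowledge):
        # Scalar gate per example: selectively invoke imagined context.
        gate = self.gate(question_feats.mean(dim=1, keepdim=True))
        fused = torch.cat(
            [region_feats, question_feats, gate * imagined_knowledge], dim=1
        )
        fused = self.encoder(fused)
        return self.answer_head(fused.mean(dim=1))


# Toy usage with random features standing in for detector/encoder outputs.
B, R, T, D = 2, 36, 16, 768
regions, question = torch.randn(B, R, D), torch.randn(B, T, D)
imagination = KnowledgeImaginationModule(D)
reasoner = CrossModalReasoningTransformer(D)
logits = reasoner(regions, question, imagination(regions, question))
print(logits.shape)  # torch.Size([2, 3129])

In this sketch the gate plays the role the abstract attributes to the model's selective invocation of imagined context: when the question does not need external knowledge, the gate can suppress the imagined hypotheses before fusion.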