Cognitive Grounding for Visual Question Reasoning via Dynamic Knowledge Imagination
Abstract
Visual question answering (VQA) represents a critical intersection of vision and language understanding, where models must perceive visual scenes and reason about their underlying semantics. However, human-like reasoning often extends beyond what is directly observable, requiring the invocation of prior knowledge, inference, and commonsense understanding. In this work, we reexamine the role of external knowledge in multimodal reasoning and propose a unified framework, the Cognitive Grounded Imagination Network (COGINet), that integrates dynamically generated commonsense imagination with vision-language understanding. Instead of merely retrieving symbolic facts from static knowledge bases, our framework leverages contextualized commonsense imagination synthesized by large-scale generative knowledge models, thereby grounding reasoning in both perceptual evidence and inferred world regularities. COGINet introduces a two-stage process: (1) a knowledge imagination module, which generates plausible contextual hypotheses from the interaction between visual regions and textual queries; and (2) a cross-modal reasoning transformer that fuses these contextualized inferences with multimodal embeddings through adaptive attention. We demonstrate that such cognitive grounding enables the model to reason about abstract or implied concepts, extending beyond explicit cues in the image or text. Extensive experiments on the OK-VQA and A-OKVQA benchmarks show that COGINet achieves consistent improvements over prior static-knowledge methods, providing both quantitative gains and qualitative interpretability. Further analysis reveals the model's ability to discern when external knowledge is useful, selectively invoking imagined context only when required for reasoning. Our findings highlight the importance of dynamic, cognitively inspired commonsense integration for achieving genuine multimodal understanding.
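
To make the two-stage process concrete, the following is a minimal PyTorch sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not the paper's implementation: the module names (KnowledgeImaginationModule, CrossModalReasoningTransformer), feature dimensions, number of hypothesis slots, the scalar gate used for "adaptive attention", and the answer-vocabulary size are all hypothetical, and a cross-attention block stands in for the large generative knowledge model that the abstract describes.

# Hypothetical sketch of the two-stage COGINet pipeline described in the abstract.
# All names, dimensions, and the gating formulation are illustrative assumptions.
import torch
import torch.nn as nn


class KnowledgeImaginationModule(nn.Module):
    """Stage 1: generate plausible contextual hypotheses ("imagined knowledge")
    from the interaction between visual region features and the textual query.
    A cross-attention block stands in for the generative knowledge model."""

    def __init__(self, dim: int = 768, num_hypotheses: int = 5):
        super().__init__()
        self.hypothesis_queries = nn.Parameter(torch.randn(num_hypotheses, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, region_feats, question_feats):
        # Condition learnable hypothesis slots on the joint visual/textual context.
        context = torch.cat([region_feats, question_feats], dim=1)
        queries = self.hypothesis_queries.unsqueeze(0).expand(context.size(0), -1, -1)
        imagined, _ = self.cross_attn(queries, context, context)
        return self.proj(imagined)  # (B, num_hypotheses, dim)


class CrossModalReasoningTransformer(nn.Module):
    """Stage 2: fuse imagined knowledge with multimodal embeddings; a learned
    gate decides how much imagined context to invoke for a given question."""

    def __init__(self, dim: int = 768, num_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.answer_head = nn.Linear(dim, 3129)  # e.g. a VQA answer vocabulary

    def forward(self, region_feats, question_feats, imagined_knowledge):
        # Scalar gate per example: selectively invoke imagined context.
        gate = self.gate(question_feats.mean(dim=1, keepdim=True))
        fused = torch.cat(
            [region_feats, question_feats, gate * imagined_knowledge], dim=1
        )
        fused = self.encoder(fused)
        return self.answer_head(fused.mean(dim=1))


# Toy usage with random features standing in for detector/encoder outputs.
B, R, T, D = 2, 36, 16, 768
regions, question = torch.randn(B, R, D), torch.randn(B, T, D)
imagination = KnowledgeImaginationModule(D)
reasoner = CrossModalReasoningTransformer(D)
logits = reasoner(regions, question, imagination(regions, question))
print(logits.shape)  # torch.Size([2, 3129])

In this sketch the gate plays the role the abstract attributes to the model's selective invocation of imagined context: when the question does not need external knowledge, the gate can suppress the imagined hypotheses before fusion.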