Differentiable Retrieval-Guided Multimodal Reasoning for Knowledge-Intensive Visual Question Understanding

Abstract

Visual understanding in real-world scenarios often extends far beyond what is directly visible in an image, requiring the ability to reason with external and commonsense knowledge. Traditional Visual Question Answering (VQA) systems, while powerful in multimodal comprehension, typically confine their reasoning to the visual scene, making them inadequate when contextual or encyclopedic information is essential for accurate answers. To address this limitation, we introduce KnowSight, a unified framework for knowledge-grounded visual question answering, which integrates retrieval-based external knowledge reasoning with multimodal understanding in a single end-to-end architecture. Unlike prior work that separates document retrieval and answer generation, KnowSight establishes a joint optimization scheme that enables the model to dynamically align visual semantics with relevant knowledge sources. Our design incorporates a differentiable retrieval process that allows backpropagation through document scoring, ensuring that knowledge selection is directly informed by the downstream reasoning objective. This paradigm bridges the gap between perception and cognition, allowing the model to answer questions that require factual grounding, causal inference, or commonsense understanding. Comprehensive experiments on OK-VQA and related benchmarks demonstrate that KnowSight significantly surpasses previous retrieval-augmented systems in both knowledge efficiency and interpretability. Furthermore, we propose a new set of diagnostic metrics to disentangle the contributions of visual grounding and knowledge retrieval. Our analysis reveals that integrating structured and unstructured knowledge through joint training substantially reduces reliance on large retrieval sets, leading to faster convergence and more robust reasoning performance. Beyond outperforming existing methods, KnowSight offers a generalized blueprint for multimodal reasoning systems that continuously learn and adapt their external knowledge grounding in open-world environments.
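To make the differentiable retrieval idea concrete, the sketch below shows one common way to backpropagate through document scoring: convert relevance scores into a softmax distribution over retrieved candidates and marginalize the answer likelihood over them, in the spirit of RAG-style training. This is a minimal, hypothetical PyTorch rendering under that assumption, not KnowSight's actual implementation; all function and variable names (e.g., `differentiable_retrieve`, `answer_logprob_fn`) are illustrative.

```python
import torch
import torch.nn.functional as F

def differentiable_retrieve(query_emb, doc_embs, k=5, temperature=1.0):
    """Score all documents and return soft weights over the top-k candidates.

    The hard top-k index selection itself is non-differentiable, but the
    softmax weights over the selected scores carry gradient back into the
    query and document encoders, so the answer loss can reshape retrieval.
    """
    scores = doc_embs @ query_emb            # (num_docs,) inner-product relevance
    topk_scores, topk_idx = scores.topk(k)   # hard candidate selection
    weights = F.softmax(topk_scores / temperature, dim=0)  # soft, differentiable
    return topk_idx, weights

def marginal_answer_nll(query_emb, doc_embs, answer_logprob_fn):
    """Negative log of the answer likelihood marginalized over documents.

    answer_logprob_fn(doc_id) is assumed to return a scalar tensor
    log p(answer | image, question, doc_id) from the reader/generator.
    """
    idx, weights = differentiable_retrieve(query_emb, doc_embs)
    logps = torch.stack([answer_logprob_fn(i) for i in idx])  # (k,)
    # log sum_k w_k * p(answer | doc_k); gradient flows through both
    # the reader (via logps) and the retriever (via weights).
    marginal = torch.logsumexp(weights.log() + logps, dim=0)
    return -marginal
```

Because the answer likelihood is marginalized over the candidate set, minimizing this loss upweights documents that actually improved the answer, which is one plausible mechanism behind the abstract's claim that knowledge selection is informed by the downstream reasoning objective.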
