Differentiable Retrieval-Guided Multimodal Reasoning for Knowledge-Intensive Visual Question Understanding

Abstract

Visual understanding in real-world scenarios often extends far beyond what is directly visible in an image, requiring the ability to reason with external and commonsense knowledge. Traditional Visual Question Answering (VQA) systems, while powerful in multimodal comprehension, typically confine their reasoning to the visual scene, making them inadequate when contextual or encyclopedic information is essential for accurate answers. To address this limitation, we introduce KnowSight, a unified framework for knowledge-grounded visual question answering, which integrates retrieval-based external knowledge reasoning with multimodal understanding in a single end-to-end architecture. Unlike prior work that separates document retrieval and answer generation, KnowSight establishes a joint optimization scheme that enables the model to dynamically align visual semantics with relevant knowledge sources. Our design incorporates a differentiable retrieval process that allows backpropagation through document scoring, ensuring that knowledge selection is directly informed by the downstream reasoning objective. This paradigm bridges the gap between perception and cognition, allowing the model to answer questions that require factual grounding, causal inference, or commonsense understanding. Comprehensive experiments on OK-VQA and related benchmarks demonstrate that KnowSight significantly surpasses previous retrieval-augmented systems in both knowledge efficiency and interpretability. Furthermore, we propose a new set of diagnostic metrics to disentangle the contributions of visual grounding and knowledge retrieval. Our analysis reveals that integrating structured and unstructured knowledge through joint training substantially reduces reliance on large retrieval sets, leading to faster convergence and more robust reasoning performance. Beyond outperforming existing methods, KnowSight offers a generalized blueprint for multimodal reasoning systems that continuously learn and adapt their external knowledge grounding in open-world environments.
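To make the differentiable retrieval idea concrete, the sketch below shows one common way to backpropagate through document scoring: convert relevance scores into a softmax distribution over retrieved candidates and marginalize the answer likelihood over them, in the spirit of RAG-style training. This is a minimal, hypothetical PyTorch rendering under that assumption, not KnowSight's actual implementation; all function and variable names (e.g., `differentiable_retrieve`, `answer_logprob_fn`) are illustrative.

```python
import torch
import torch.nn.functional as F

def differentiable_retrieve(query_emb, doc_embs, k=5, temperature=1.0):
    """Score all documents and return soft weights over the top-k candidates.

    The hard top-k index selection itself is non-differentiable, but the
    softmax weights over the selected scores carry gradient back into the
    query and document encoders, so the answer loss can reshape retrieval.
    """
    scores = doc_embs @ query_emb            # (num_docs,) inner-product relevance
    topk_scores, topk_idx = scores.topk(k)   # hard candidate selection
    weights = F.softmax(topk_scores / temperature, dim=0)  # soft, differentiable
    return topk_idx, weights

def marginal_answer_nll(query_emb, doc_embs, answer_logprob_fn):
    """Negative log of the answer likelihood marginalized over documents.

    answer_logprob_fn(doc_id) is assumed to return a scalar tensor
    log p(answer | image, question, doc_id) from the reader/generator.
    """
    idx, weights = differentiable_retrieve(query_emb, doc_embs)
    logps = torch.stack([answer_logprob_fn(i) for i in idx])  # (k,)
    # log sum_k w_k * p(answer | doc_k); gradient flows through both
    # the reader (via logps) and the retriever (via weights).
    marginal = torch.logsumexp(weights.log() + logps, dim=0)
    return -marginal
```

Because the answer likelihood is marginalized over the candidate set, minimizing this loss upweights documents that actually improved the answer, which is one plausible mechanism behind the abstract's claim that knowledge selection is informed by the downstream reasoning objective.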
