Context-Aware Knowledge Harmonization for Visual Question Reasoning
Abstract
Knowledge-intensive visual question answering requires a model to fluidly integrate visual perception, linguistic comprehension, and external knowledge sources. Although recent advances in knowledge-based VQA have explored the incorporation of structured and unstructured knowledge, they frequently overlook discrepancies between the visual scene and the retrieved knowledge, resulting in semantic conflicts that degrade reasoning quality. To address this long-standing issue, we introduce \textsc{SenseAlign}, a unified context-harmonization framework designed to assess, quantify, and mitigate semantic divergence between image-grounded evidence and externally retrieved knowledge. The core principle behind \textsc{SenseAlign} is an adaptive mechanism that evaluates whether newly incorporated knowledge is consistent with the visual–linguistic context and proportionally adjusts its influence based on a principled inconsistency score. Specifically, we first formulate a novel semantic discrepancy estimator that combines caption-based uncertainty signals with a cross-context semantic similarity evaluation, allowing the system to diagnose whether external knowledge aligns with the underlying visual semantics. Building upon this inconsistency estimator, we further develop an adaptive knowledge assimilation strategy that dynamically regulates explicit knowledge from structured sources and implicit knowledge encoded in pretrained multimodal models. In this way, \textsc{SenseAlign} offers a general mechanism for preventing over-reliance on irrelevant or misleading facts while still enabling the model to leverage genuinely helpful knowledge. Comprehensive experiments on the OK-VQA benchmark demonstrate that our approach consistently surpasses strong baselines and establishes new state-of-the-art performance. These results highlight the significance of explicitly modeling semantic compatibility when integrating heterogeneous knowledge for visual reasoning tasks.
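To make the abstract's mechanism concrete, the sketch below illustrates one plausible reading of the inconsistency-scored knowledge assimilation described above: a score combining caption-based uncertainty with cross-context dissimilarity, used to gate the contribution of retrieved knowledge. The abstract does not specify the exact formulation, so the function names, the mixing weight alpha, and the additive gating form are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def inconsistency_score(caption_logprobs: torch.Tensor,
                        context_emb: torch.Tensor,
                        knowledge_emb: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical combination of a caption-based uncertainty signal and a
    cross-context semantic dissimilarity into a single score in [0, 1].

    caption_logprobs: per-token log-probabilities of a generated caption
                      (lower confidence -> higher uncertainty).
    context_emb:      embedding of the fused visual-linguistic context.
    knowledge_emb:    embedding of the retrieved external knowledge.
    alpha:            assumed mixing weight between the two signals.
    """
    # Uncertainty: 1 minus the mean token probability of the caption.
    uncertainty = 1.0 - caption_logprobs.exp().mean()
    # Dissimilarity: map cosine similarity from [-1, 1] to [0, 1], then invert.
    sim = F.cosine_similarity(context_emb, knowledge_emb, dim=-1)
    dissimilarity = 1.0 - (sim + 1.0) / 2.0
    return alpha * uncertainty + (1.0 - alpha) * dissimilarity


def assimilate_knowledge(context_emb: torch.Tensor,
                         knowledge_emb: torch.Tensor,
                         score: torch.Tensor) -> torch.Tensor:
    """Down-weight external knowledge in proportion to its inconsistency
    before fusing it with the visual-linguistic context (assumed form)."""
    gate = 1.0 - score  # high inconsistency -> small gate
    return context_emb + gate * knowledge_emb
```

In this reading, a retrieved fact that is both semantically distant from the image-grounded context and accompanied by an uncertain caption receives a small gate and therefore little influence on the answer, whereas well-aligned knowledge passes through largely unattenuated.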