Contextual Knowledge Infusion via Iterative Semantic Tracing for Vision–Language Understanding

Abstract

The challenge of integrating external knowledge into visual reasoning frameworks has motivated growing interest in models that bridge perceptual understanding with abstract, non-visual information. Unlike conventional visual question answering (VQA), knowledge-driven VQA demands joint interpretation of visible cues and facts absent from the image itself. This paper introduces a new perspective on the task and proposes KV-Trace, a unified semantic tracing framework built on iterative knowledge refinement and structured visual interpretation. Instead of treating the visual and knowledge modalities as homogeneous sources, our framework explicitly distinguishes their representational roles and organizes them into a progressive reasoning pipeline. Through a dynamic knowledge memory space and a query-sensitive semantic propagation mechanism, KV-Trace composes multi-stage reasoning steps that evolve with the underlying question. Extensive experiments on the KRVQR and FVQA benchmarks demonstrate improved reasoning depth and generalization capacity, and ablation studies verify the contribution of each reasoning component and highlight the interpretability gained from explicit knowledge structuring.
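The abstract does not specify the mechanism in detail. As a rough illustration only, the sketch below shows one plausible reading of "query-sensitive semantic propagation over a dynamic knowledge memory": a question embedding attends over knowledge slots, reads out evidence, and refines itself over several steps. All names here (IterativeSemanticTracer, query_update, read_proj) and the GRU-based update are our own assumptions, not the authors' design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IterativeSemanticTracer(nn.Module):
    """Illustrative sketch: iterated query-conditioned reads over a knowledge memory."""

    def __init__(self, dim: int, num_steps: int = 3):
        super().__init__()
        self.num_steps = num_steps
        self.query_update = nn.GRUCell(dim, dim)  # assumed update rule: refine query with evidence
        self.read_proj = nn.Linear(dim, dim)      # projects memory slots into attention keys

    def forward(self, query: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim) question embedding; memory: (batch, slots, dim) fact embeddings
        for _ in range(self.num_steps):
            keys = self.read_proj(memory)                        # (batch, slots, dim)
            scores = torch.einsum("bd,bsd->bs", query, keys)     # query-sensitive relevance
            attn = F.softmax(scores / keys.size(-1) ** 0.5, dim=-1)
            read = torch.einsum("bs,bsd->bd", attn, memory)      # retrieved knowledge readout
            query = self.query_update(read, query)               # one reasoning step
        return query

# Usage: trace a batch of 4 question embeddings over 8 knowledge slots of width 256.
tracer = IterativeSemanticTracer(dim=256)
q, mem = torch.randn(4, 256), torch.randn(4, 8, 256)
out = tracer(q, mem)  # (4, 256) refined representation after 3 steps
```

The key property this sketch captures is that the attention distribution is recomputed at every step from the evolving query, so successive reads can follow a chain of facts rather than retrieving everything at once.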
