Context-Guided Multi-Branch Fusion for Text-Dependent Visual Question Reasoning

Abstract

Visual Question Answering (VQA) represents one of the most complex and comprehensive challenges in multimodal understanding, demanding the seamless fusion of visual perception and natural language reasoning. Despite remarkable advances in deep multimodal learning, existing models still struggle with cases where the correct answer requires precise reading and interpretation of text embedded within images—an ability crucial in real-world scenarios such as understanding street signs, charts, or documents. The gap arises primarily from the inability of conventional visual encoders to align textual tokens extracted from the scene with semantic cues from the question. To address this, we introduce a novel Context-Guided Multi-Branch Fusion Network (CMFN), which adaptively distinguishes between text-dependent and general reasoning pathways. Our model integrates an Optical Character Recognition (OCR)-enhanced representation module that captures scene-text semantics, together with a dynamic routing mechanism that automatically determines whether to invoke a text-centric reasoning branch or a general visual reasoning branch. Furthermore, a contextual alignment gate refines the fusion between multimodal embeddings, ensuring that answer generation remains robust and semantically coherent. Extensive experiments on the VQA v2.0 benchmark demonstrate that CMFN achieves consistent improvements over state-of-the-art baselines, particularly on question types requiring textual understanding, with a notable boost in accuracy on the “number” and “reading” question categories. Our findings highlight the necessity of text-aware reasoning pathways and adaptive routing strategies for advancing visual question reasoning in complex real-world environments.
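To make the routing and gating ideas in the abstract concrete, the sketch below shows one plausible reading of a two-branch fusion with a question-conditioned routing weight and a contextual alignment gate. It is a minimal illustration only: the module and layer names, the pooled single-vector features, and the feature dimension are assumptions, not the authors' actual CMFN implementation.

```python
import torch
import torch.nn as nn


class GatedTwoBranchFusion(nn.Module):
    """Minimal sketch of a two-branch fusion with a learned routing gate.

    Hypothetical layer names and shapes; the paper's OCR-enhanced encoder
    and contextual alignment gate may be defined differently.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Text-centric branch: fuses question features with OCR-token features.
        self.text_branch = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        # General visual branch: fuses question features with region features.
        self.visual_branch = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        # Question-conditioned routing weight (soft choice between branches).
        self.router = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        # Contextual alignment gate refining the fused representation (assumed form).
        self.align_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, q_feat, ocr_feat, vis_feat):
        # q_feat, ocr_feat, vis_feat: (batch, dim) pooled question, scene-text,
        # and visual features, respectively.
        text_out = self.text_branch(torch.cat([q_feat, ocr_feat], dim=-1))
        vis_out = self.visual_branch(torch.cat([q_feat, vis_feat], dim=-1))
        w = self.router(q_feat)                     # text-centric vs. general reasoning
        fused = w * text_out + (1.0 - w) * vis_out  # adaptive branch mixture
        gate = self.align_gate(torch.cat([q_feat, fused], dim=-1))
        return gate * fused                         # gated, question-aligned representation


# Usage with dummy pooled features (batch of 4, 512-dim).
q, ocr, vis = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)
fusion = GatedTwoBranchFusion(dim=512)
print(fusion(q, ocr, vis).shape)  # torch.Size([4, 512])
```

A soft (sigmoid) routing weight is used here so the choice between branches stays differentiable; a hard or top-1 routing decision, as some dynamic-routing designs use, would be an equally valid reading of the abstract.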
