Context-Guided Multi-Branch Fusion for Text-Dependent Visual Question Reasoning

Abstract

Visual Question Answering (VQA) represents one of the most complex and comprehensive challenges in multimodal understanding, demanding the seamless fusion of visual perception and natural language reasoning. Despite remarkable advances in deep multimodal learning, existing models still struggle with cases where the correct answer requires precise reading and interpretation of text embedded within images—an ability crucial in real-world scenarios such as understanding street signs, charts, or documents. The gap arises primarily from the inability of conventional visual encoders to align textual tokens extracted from the scene with semantic cues from the question. To address this, we introduce a novel Context-Guided Multi-Branch Fusion Network (CMFN), which adaptively distinguishes between text-dependent and general reasoning pathways. Our model integrates an Optical Character Recognition (OCR)-enhanced representation module that captures scene-text semantics, together with a dynamic routing mechanism that automatically determines whether to invoke a text-centric reasoning branch or a general visual reasoning branch. Furthermore, a contextual alignment gate refines the fusion between multimodal embeddings, ensuring that answer generation remains robust and semantically coherent. Extensive experiments on the VQA v2.0 benchmark demonstrate that CMFN achieves consistent improvements over state-of-the-art baselines, particularly on question types requiring textual understanding, with a notable boost in accuracy on the “number” and “reading” question categories. Our findings highlight the necessity of text-aware reasoning pathways and adaptive routing strategies for advancing visual question reasoning in complex real-world environments.
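To make the routing and gating ideas in the abstract concrete, the sketch below shows one plausible reading of a two-branch fusion with a question-conditioned routing weight and a contextual alignment gate. It is a minimal illustration only: the module and layer names, the pooled single-vector features, and the feature dimension are assumptions, not the authors' actual CMFN implementation.

```python
import torch
import torch.nn as nn


class GatedTwoBranchFusion(nn.Module):
    """Minimal sketch of a two-branch fusion with a learned routing gate.

    Hypothetical layer names and shapes; the paper's OCR-enhanced encoder
    and contextual alignment gate may be defined differently.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Text-centric branch: fuses question features with OCR-token features.
        self.text_branch = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        # General visual branch: fuses question features with region features.
        self.visual_branch = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        # Question-conditioned routing weight (soft choice between branches).
        self.router = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        # Contextual alignment gate refining the fused representation (assumed form).
        self.align_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, q_feat, ocr_feat, vis_feat):
        # q_feat, ocr_feat, vis_feat: (batch, dim) pooled question, scene-text,
        # and visual features, respectively.
        text_out = self.text_branch(torch.cat([q_feat, ocr_feat], dim=-1))
        vis_out = self.visual_branch(torch.cat([q_feat, vis_feat], dim=-1))
        w = self.router(q_feat)                     # text-centric vs. general reasoning
        fused = w * text_out + (1.0 - w) * vis_out  # adaptive branch mixture
        gate = self.align_gate(torch.cat([q_feat, fused], dim=-1))
        return gate * fused                         # gated, question-aligned representation


# Usage with dummy pooled features (batch of 4, 512-dim).
q, ocr, vis = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)
fusion = GatedTwoBranchFusion(dim=512)
print(fusion(q, ocr, vis).shape)  # torch.Size([4, 512])
```

A soft (sigmoid) routing weight is used here so the choice between branches stays differentiable; a hard or top-1 routing decision, as some dynamic-routing designs use, would be an equally valid reading of the abstract.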
