Relation-Sensitive VQA with a Unified Tri-Modal Graph Framework

Abstract

Visual question answering (VQA) requires a model to interpret heterogeneous semantic cues in an image and align them with a natural-language query. Traditional approaches benefit from scene graph representations, yet they often handle rich semantic structures in a severely imbalanced way, especially when reasoning demands simultaneous consideration of objects, relations, and fine-grained attributes. Existing models frequently overlook the subtle interactions among these three information streams, leading to faulty attribute inference or missed relational cues. Addressing these long-standing limitations calls for a more principled integration of all semantic constituents within a unified and expressive reasoning space. In this paper, we introduce TriUnity-GNN, a tri-modal fusion framework that redefines scene graph reasoning by jointly enhancing object-centric, relation-centric, and attribute-centric representations under a unified graph neural paradigm. Instead of treating scene graphs as monolithic structures, our approach restructures the given graph into two complementary views, an object-dominant perspective and a relation-dominant perspective, enabling the model to capture multi-granular semantics that are typically under-explored. To further strengthen the expressivity of these representations, TriUnity-GNN integrates attribute cues through an explicit fusion design, substantially amplifying attribute signals that are otherwise marginalized in classic architectures. Moreover, we design a novel message-passing enhancement module that increases cross-type semantic exchange among objects, relations, and attributes, ensuring that all three modalities collectively shape the final reasoning embedding. We perform comprehensive evaluations on benchmark datasets including GQA, VG, and motif-VG. Across all benchmarks, TriUnity-GNN consistently surpasses prior graph-based VQA systems by a clear margin, demonstrating robustness on both straightforward and semantically composite queries. The results verify that a tri-modal, explicitly balanced graph reasoning mechanism is crucial for improving interpretability and accuracy in challenging VQA scenarios.
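To make the tri-modal idea more concrete, the sketch below illustrates one possible round of cross-type message passing among object, relation, and attribute nodes in plain PyTorch. This is a minimal, hypothetical sketch for exposition only, not the authors' implementation: the class name TriModalMessagePassing, the GRU-based update rules, and the rel_index/attr_index conventions are all assumptions, and the paper's actual fusion and enhancement modules may differ.

```python
# Illustrative sketch only; NOT the authors' released code.
# Assumes a scene graph given as object, relation, and attribute feature matrices,
# plus index tensors linking each relation/attribute to its objects.
import torch
import torch.nn as nn


class TriModalMessagePassing(nn.Module):
    """One hypothetical round of cross-type message passing among the three node types."""

    def __init__(self, dim: int):
        super().__init__()
        self.obj_update = nn.GRUCell(dim, dim)   # update object states from incoming messages
        self.rel_update = nn.GRUCell(dim, dim)   # update relation states from endpoint objects
        self.attr_to_obj = nn.Linear(dim, dim)   # attribute -> object messages
        self.rel_to_obj = nn.Linear(dim, dim)    # relation -> object messages
        self.obj_to_rel = nn.Linear(2 * dim, dim)  # (subject, object) -> relation messages

    def forward(self, obj, rel, attr, rel_index, attr_index):
        # obj:        [N_obj, D]  object node features
        # rel:        [N_rel, D]  relation node features
        # attr:       [N_attr, D] attribute node features
        # rel_index:  [N_rel, 2]  (subject, object) indices for each relation
        # attr_index: [N_attr]    owning-object index for each attribute

        # Object-dominant view: each object aggregates messages from its
        # incoming relations and its attributes, then updates its state.
        msg_obj = torch.zeros_like(obj)
        msg_obj.index_add_(0, rel_index[:, 1], self.rel_to_obj(rel))
        msg_obj.index_add_(0, attr_index, self.attr_to_obj(attr))
        obj_new = self.obj_update(msg_obj, obj)

        # Relation-dominant view: each relation node reads its updated
        # subject/object endpoints and refreshes its own state.
        endpoints = torch.cat([obj_new[rel_index[:, 0]], obj_new[rel_index[:, 1]]], dim=-1)
        rel_new = self.rel_update(self.obj_to_rel(endpoints), rel)

        return obj_new, rel_new, attr


if __name__ == "__main__":
    # Toy scene graph: 3 objects, 2 relations, 2 attributes, feature dim 8.
    torch.manual_seed(0)
    layer = TriModalMessagePassing(dim=8)
    obj = torch.randn(3, 8)
    rel = torch.randn(2, 8)
    attr = torch.randn(2, 8)
    rel_index = torch.tensor([[0, 1], [1, 2]])  # relations 0->1 and 1->2
    attr_index = torch.tensor([0, 2])           # attribute owners
    obj, rel, attr = layer(obj, rel, attr, rel_index, attr_index)
    print(obj.shape, rel.shape, attr.shape)
```

In a full model, several such rounds would typically be stacked, with the resulting object and relation embeddings pooled and fused with the question embedding to produce the final answer representation.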
