Relation-Sensitive VQA with a Unified Tri-Modal Graph Framework

Abstract

Visual question answering (VQA) requires a model to interpret heterogeneous semantic cues in an image and align them with a natural-language query. Traditional approaches benefit from scene graph representations, yet they often handle rich semantic structures in a severely imbalanced way, especially when reasoning demands simultaneous consideration of objects, relations, and fine-grained attributes. Existing models frequently overlook the subtle interactions among these three information streams, leading to faulty attribute inference or missed relational cues. Addressing these long-standing limitations calls for a more principled integration of all semantic constituents within a unified and expressive reasoning space. In this paper, we introduce TriUnity-GNN, a tri-modal fusion framework that redefines scene graph reasoning by jointly enhancing object-centric, relation-centric, and attribute-centric representations under a unified graph neural paradigm. Instead of treating scene graphs as monolithic structures, our approach restructures the given graph into two complementary views, an object-dominant perspective and a relation-dominant perspective, enabling the model to capture multi-granular semantics that are typically under-explored. To further strengthen the expressivity of these representations, TriUnity-GNN integrates attribute cues through an explicit fusion design, substantially amplifying attribute signals that are otherwise marginalized in classic architectures. Moreover, we design a novel message-passing enhancement module that increases cross-type semantic exchange among objects, relations, and attributes, ensuring that all three modalities collectively shape the final reasoning embedding. We perform comprehensive evaluations on benchmark datasets including GQA, VG, and motif-VG. Across all benchmarks, TriUnity-GNN consistently surpasses prior graph-based VQA systems by a clear margin, demonstrating robustness on both straightforward and semantically composite queries. The results verify that a tri-modal, explicitly balanced graph reasoning mechanism is crucial for improving interpretability and accuracy in challenging VQA scenarios.
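To make the tri-modal idea more concrete, the sketch below illustrates one possible round of cross-type message passing among object, relation, and attribute nodes in plain PyTorch. This is a minimal, hypothetical sketch for exposition only, not the authors' implementation: the class name TriModalMessagePassing, the GRU-based update rules, and the rel_index/attr_index conventions are all assumptions, and the paper's actual fusion and enhancement modules may differ.

```python
# Illustrative sketch only; NOT the authors' released code.
# Assumes a scene graph given as object, relation, and attribute feature matrices,
# plus index tensors linking each relation/attribute to its objects.
import torch
import torch.nn as nn


class TriModalMessagePassing(nn.Module):
    """One hypothetical round of cross-type message passing among the three node types."""

    def __init__(self, dim: int):
        super().__init__()
        self.obj_update = nn.GRUCell(dim, dim)   # update object states from incoming messages
        self.rel_update = nn.GRUCell(dim, dim)   # update relation states from endpoint objects
        self.attr_to_obj = nn.Linear(dim, dim)   # attribute -> object messages
        self.rel_to_obj = nn.Linear(dim, dim)    # relation -> object messages
        self.obj_to_rel = nn.Linear(2 * dim, dim)  # (subject, object) -> relation messages

    def forward(self, obj, rel, attr, rel_index, attr_index):
        # obj:        [N_obj, D]  object node features
        # rel:        [N_rel, D]  relation node features
        # attr:       [N_attr, D] attribute node features
        # rel_index:  [N_rel, 2]  (subject, object) indices for each relation
        # attr_index: [N_attr]    owning-object index for each attribute

        # Object-dominant view: each object aggregates messages from its
        # incoming relations and its attributes, then updates its state.
        msg_obj = torch.zeros_like(obj)
        msg_obj.index_add_(0, rel_index[:, 1], self.rel_to_obj(rel))
        msg_obj.index_add_(0, attr_index, self.attr_to_obj(attr))
        obj_new = self.obj_update(msg_obj, obj)

        # Relation-dominant view: each relation node reads its updated
        # subject/object endpoints and refreshes its own state.
        endpoints = torch.cat([obj_new[rel_index[:, 0]], obj_new[rel_index[:, 1]]], dim=-1)
        rel_new = self.rel_update(self.obj_to_rel(endpoints), rel)

        return obj_new, rel_new, attr


if __name__ == "__main__":
    # Toy scene graph: 3 objects, 2 relations, 2 attributes, feature dim 8.
    torch.manual_seed(0)
    layer = TriModalMessagePassing(dim=8)
    obj = torch.randn(3, 8)
    rel = torch.randn(2, 8)
    attr = torch.randn(2, 8)
    rel_index = torch.tensor([[0, 1], [1, 2]])  # relations 0->1 and 1->2
    attr_index = torch.tensor([0, 2])           # attribute owners
    obj, rel, attr = layer(obj, rel, attr, rel_index, attr_index)
    print(obj.shape, rel.shape, attr.shape)
```

In a full model, several such rounds would typically be stacked, with the resulting object and relation embeddings pooled and fused with the question embedding to produce the final answer representation.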
