A Tri-Branch Structure-Aware Network for Visual Question Answering over Structured Analytical Graphics via Visual-Relational-Numerical Alignment

Abstract

Visual question answering on structured analytical graphics, such as bar charts, line charts, and scientific plots, requires models to jointly interpret visual marks, relational layouts, and latent numerical values. Generic vision-language models often treat charts as ordinary images: they fail to capture the hierarchical composition of axes, legends, series, and marks, and they lack explicit mechanisms for value recovery and question-type-aware reasoning. This paper proposes a tri-branch structure-aware network that formulates the task as a joint alignment problem among visual elements, latent numerical organization, and question semantics. The visual branch extracts multi-scale mark features. The structural branch builds a heterogeneous hierarchical graph over page-level, group-level, and element-level nodes, encodes relations with a typed graph transformer, and reconstructs an implicit data matrix as an auxiliary task. The semantic branch encodes the question and predicts a reasoning routing distribution over five question types: lookup, comparison, aggregation, ranking, and trend. Cross-branch co-attention aligns the three representations, and a routing gate dynamically weights specialized answer heads. In experiments on ChartQA, PlotQA, DVQA, and FigureQA, the proposed method achieves 76.8% overall accuracy on ChartQA, outperforming strong baselines including UniChart (72.4%) and DePlot (69.7%). Ablation studies confirm the necessity of hierarchical graph encoding, table reconstruction, and the routing mechanism. The model also demonstrates strong cross-dataset generalization and provides interpretable alignments between language and visual structure.
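
The routing gate described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the gate combines the answer heads, so the convex (softmax-weighted) mixture below is an assumption, and the names `route_answer_heads`, `routing_logits`, and `head_scores` are hypothetical and not taken from the paper.

```python
import math

# The five reasoning types named in the abstract.
REASONING_TYPES = ["lookup", "comparison", "aggregation", "ranking", "trend"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_answer_heads(routing_logits, head_scores):
    """Mix per-head answer scores with a routing distribution.

    routing_logits: one logit per reasoning type, e.g. from the semantic branch.
    head_scores: dict mapping each reasoning type to a list of scores over the
        same candidate answers (one list per specialized head).
    Returns the routing weights and the weighted sum of head scores.
    Assumption: the gate forms a convex combination of head outputs.
    """
    weights = softmax(routing_logits)
    num_candidates = len(head_scores[REASONING_TYPES[0]])
    mixed = [0.0] * num_candidates
    for weight, rtype in zip(weights, REASONING_TYPES):
        for i, score in enumerate(head_scores[rtype]):
            mixed[i] += weight * score
    return weights, mixed


if __name__ == "__main__":
    # Toy example: the question looks like a comparison, so that head dominates.
    logits = [0.1, 3.0, 0.2, 0.0, -0.5]
    heads = {rtype: [0.5, 0.5] for rtype in REASONING_TYPES}
    heads["comparison"] = [0.1, 0.9]
    weights, mixed = route_answer_heads(logits, heads)
    print(dict(zip(REASONING_TYPES, weights)))
    print(mixed)
```

A soft mixture keeps the whole pipeline differentiable, so the routing distribution can be trained jointly with the heads. A hard selection of the single most likely head would be an alternative design, and the abstract does not indicate which one the authors use.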
