Caption-Grounded Structural Parsing for Compound Scientific Visuals


Abstract

The rapid growth of scholarly literature has been accompanied by a parallel growth in visual artifacts, particularly figures that summarize experimental findings. More than 30% of these figures are compound, comprising multiple heterogeneous subfigures, which makes automated parsing and comprehension difficult. Conventional retrieval and analysis pipelines typically assume that each figure embodies a single, coherent semantic theme. This assumption breaks down for compound figures, where diverse and semantically independent components coexist. To overcome this limitation, we propose SEMCLIP, a layout-sensitive, semantics-driven framework for figure decomposition. Instead of merely segmenting visual regions based on low-level appearance, SEMCLIP introduces the notion of master images: semantically aligned units constructed by explicitly modeling the symbolic labels embedded within figures. The system employs a cascaded two-stage design. First, a label localization network identifies these labels as reference anchors, which encode both structural layout and semantic grouping. These anchors are then fused with learned descriptors of regional visual features, producing coherent segments aligned with caption semantics. To address uneven annotation distributions and sparse symbolic cues, we develop a bifurcated training paradigm that independently refines detection sensitivity and classification robustness. Experimental results on a large-scale annotated benchmark confirm that SEMCLIP significantly outperforms heuristic- and detection-based baselines, achieving higher segmentation fidelity and better alignment between visual segments and textual captions. This work establishes a new pathway toward semantically grounded interpretation of visual evidence in scholarly communication.
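As a rough illustration of the anchor-based grouping idea, the sketch below assigns candidate regions to their nearest label anchor and merges each group into one bounding box, standing in for a master image. It is not SEMCLIP's method: the abstract describes learned fusion of anchors with visual descriptors, whereas this toy version substitutes plain center-to-center distance. The names `Box`, `union`, and `group_by_label`, and all coordinates, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def union(boxes):
    """Smallest box that encloses every box in the list."""
    return Box(
        min(b.x0 for b in boxes),
        min(b.y0 for b in boxes),
        max(b.x1 for b in boxes),
        max(b.y1 for b in boxes),
    )


def group_by_label(labels, regions):
    """Assign each region to the nearest label anchor, then merge per label.

    labels:  dict mapping label text (e.g. "a") to the label's Box
    regions: list of candidate region Boxes
    Assumption: nearest-center distance replaces the learned fusion step.
    """
    groups = defaultdict(list)
    for region in regions:
        rx, ry = region.center
        nearest = min(
            labels,
            key=lambda name: hypot(
                labels[name].center[0] - rx, labels[name].center[1] - ry
            ),
        )
        groups[nearest].append(region)
    return {name: union(boxes) for name, boxes in groups.items()}


# Toy compound figure: two labels and three candidate regions.
labels = {"a": Box(0, 0, 10, 10), "b": Box(100, 0, 110, 10)}
regions = [Box(5, 12, 28, 50), Box(30, 12, 60, 50), Box(105, 12, 190, 50)]
masters = group_by_label(labels, regions)
```

In this toy layout the two left-hand regions fall under label "a" and merge into one unit, while the right-hand region forms the unit for label "b". A real system would also need to handle regions with no nearby label, which this sketch ignores.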
