Caption-Grounded Structural Parsing for Compound Scientific Visuals

Omar Al-Mansoori
Aiden Johnson
Ava Martinez
Oliver Smith

Abstract

The rapid growth of scholarly literature has been accompanied by a parallel growth in visual artifacts, particularly figures that summarize experimental findings. More than 30% of these figures are compound, comprising multiple heterogeneous subfigures, which poses substantial obstacles to automated parsing and comprehension. Conventional retrieval and analysis pipelines typically assume that each figure embodies a single, coherent semantic theme. This assumption breaks down for compound figures, where diverse and semantically independent components coexist. To overcome this limitation, we propose SEMCLIP, a layout-sensitive, semantics-driven framework for figure decomposition. Instead of merely segmenting visual regions based on low-level appearance, SEMCLIP introduces the notion of master images: semantically aligned units constructed by explicitly modeling the symbolic labels embedded within figures. The system uses a cascaded two-stage design. First, a label localization network identifies label references, which serve as anchors encoding both structural layout and semantic grouping. These anchors are then fused with learned descriptors of regional visual features, producing coherent segments aligned with caption semantics. To address uneven annotation distributions and sparse symbolic cues, we develop a bifurcated training paradigm that independently refines detection sensitivity and classification robustness. Experimental results on a large-scale annotated benchmark show that SEMCLIP significantly outperforms heuristic- and detection-based baselines, achieving higher segmentation fidelity and better alignment between visual segments and textual captions. This work establishes a pathway toward semantically grounded interpretation of visual evidence in scholarly communication.
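
The second stage described in the abstract, in which label anchors group visual regions into master images, can be illustrated with a toy sketch. This is not SEMCLIP's method: the learned fusion of anchors with regional descriptors is replaced here by a plain nearest-anchor rule, and the names `LabelAnchor`, `group_regions`, and the box coordinates are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from math import hypot

# Axis-aligned box as (x0, y0, x1, y1); an assumption of this sketch.
Box = tuple[float, float, float, float]


@dataclass
class LabelAnchor:
    """A detected subfigure label such as "a" or "b" (hypothetical type)."""
    text: str
    box: Box


def center(box: Box) -> tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def union(a: Box, b: Box) -> Box:
    """Smallest box that covers both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group_regions(anchors: list[LabelAnchor], regions: list[Box]) -> dict[str, Box]:
    """Assign each region to its nearest label anchor and merge per label.

    Stand-in for the learned fusion step: distance between box centers
    is the only cue used here.
    """
    if not anchors:
        raise ValueError("at least one label anchor is required")
    masters: dict[str, Box] = {}
    for region in regions:
        rx, ry = center(region)
        nearest = min(
            anchors,
            key=lambda a: hypot(center(a.box)[0] - rx, center(a.box)[1] - ry),
        )
        previous = masters.get(nearest.text)
        masters[nearest.text] = region if previous is None else union(previous, region)
    return masters


# Toy compound figure: two labels, three detected regions.
anchors = [LabelAnchor("a", (0, 0, 10, 10)), LabelAnchor("b", (100, 0, 110, 10))]
regions = [(10, 10, 50, 50), (10, 55, 50, 90), (110, 10, 150, 90)]
masters = group_regions(anchors, regions)
print(masters)
```

In this toy layout the two left-hand regions are merged under label "a" and the right-hand region falls under label "b", which mirrors the idea that a master image can span several visually distinct panels sharing one caption reference.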

Version published to 10.20944/preprints202508.2200.v1
Aug 29, 2025

