Capturing Narrative Semantics from Captions for Relational Scene Abstraction

Abstract

Understanding visual scenes as structured graphs of objects and their interactions is central to advancing high-level visual reasoning. Conventional scene graph generation methods rely on dense, carefully annotated supervision, where each subject-predicate-object triplet is coupled with explicit bounding-box labels. Such supervision, however, is expensive to obtain and scales poorly to the open world. In contrast, natural image captions provide abundant descriptions of scenes at a fraction of the cost, though they remain weakly aligned and inherently noisy. In this work, we introduce LINGGRAPH, a new framework that transforms captions into an indirect yet powerful supervisory signal for scene graph generation. Unlike prior efforts that reduce supervision to isolated triplets, we exploit the global semantic organization encoded in captions, where entities, modifiers, and actions co-occur in narrative structures, to capture interdependent relationships and commonsense scene dynamics. LINGGRAPH extracts structured linguistic cues from captions, such as nominal groups, adjectival modifiers, and verbal relations, and leverages them to guide the detection and classification of graph components. To mitigate the noise and incompleteness of captions, we devise an iterative refinement process that progressively aligns textual spans with visual regions, discarding irrelevant associations while strengthening meaningful ones. Our study demonstrates that the linguistic regularities encoded in captions can effectively substitute for fine-grained annotations when training robust relational models. Experiments reveal that integrating both global narrative semantics and local syntactic features yields superior interpretability and accuracy in graph generation, surpassing existing weakly supervised baselines. By disambiguating visually similar entities and ensuring semantic coherence, our approach establishes captions as a scalable and practical form of weak supervision. This work highlights the potential of free-form language as a bridge for structured visual understanding, underscoring its role in unifying vision and language at the relational level.
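
To make the kind of linguistic cues the abstract mentions concrete, below is a minimal sketch of caption parsing, assuming spaCy's English dependency parser. The function name, heuristics, and example caption are hypothetical illustrations of nominal-group, modifier, and verbal-relation extraction in general, not LINGGRAPH's actual pipeline.

```python
# Minimal sketch: pulling nominal groups, adjectival modifiers, and
# subject-verb-object relations out of a caption with spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_cues(caption: str):
    doc = nlp(caption)
    # Nominal groups: spaCy's noun chunks (e.g., "a brown dog").
    nominals = [chunk.text for chunk in doc.noun_chunks]
    # Adjectival modifiers attached to head nouns, e.g., ("dog", "brown").
    modifiers = [(tok.head.lemma_, tok.lemma_)
                 for tok in doc if tok.dep_ == "amod"]
    # Verbal relations: (subject, verb, object) triplets read off the
    # dependency tree around each verb.
    triplets = []
    for tok in doc:
        if tok.pos_ == "VERB":
            subjects = [c for c in tok.children
                        if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in tok.children
                       if c.dep_ in ("dobj", "obj")]
            for s in subjects:
                for o in objects:
                    triplets.append((s.lemma_, tok.lemma_, o.lemma_))
    return nominals, modifiers, triplets

print(extract_cues("A brown dog chases a small ball across the grass."))
# -> (['A brown dog', 'a small ball', 'the grass'],
#     [('dog', 'brown'), ('ball', 'small')],
#     [('dog', 'chase', 'ball')])
```

In a caption-supervised setting such as the one the abstract describes, triplets like ('dog', 'chase', 'ball') could serve as weak relational labels, while the noun chunks and modifiers provide candidate entity and attribute cues to be aligned against visual regions.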
