Contextual Synergy through Explicit and Implicit Relations: A Unified Perspective for Image Description Generation
Abstract
Automatic image description generation, commonly referred to as image captioning, has long been recognized as a demanding challenge in artificial intelligence because it requires bridging the gap between visual understanding and natural language expression. Conventional encoder-decoder pipelines typically translate salient image regions into textual sentences and achieve reasonable performance across diverse datasets. Nevertheless, such models remain constrained by their limited capacity to capture the nuanced contextual interactions that naturally exist among objects in complex scenes. These contextual cues are often conveyed through visual relationships, some explicitly manifested through spatial or semantic connections and others implicitly embedded in higher-order global associations. In this study, we propose a novel framework, termed VisRelNet, which explores object relationships both explicitly and implicitly to enrich regional semantics for image captioning. On the explicit side, we construct semantic graphs among detected objects and introduce a Gated Graph Convolutional Network (Gated GCN) that dynamically filters relational edges to emphasize informative connections. On the implicit side, we employ a region-level bidirectional transformer encoder (Region BERT) that directly models latent dependencies across all regions without relying on external relationship annotations. To harmonize the complementary strengths of explicit and implicit cues, we further design a Dynamic Mixture Attention (DMA) mechanism that adaptively balances region-level features through channel-wise gating. We evaluate VisRelNet on the Microsoft COCO benchmark and observe consistent and significant improvements over a range of competitive baselines. Experimental evidence demonstrates that leveraging both explicit and implicit relational reasoning enhances the contextual richness of image representations, producing captions that are more coherent, descriptive, and human-like. This work highlights the importance of multi-faceted relational modeling and provides a pathway toward unified relational reasoning for vision-language tasks.
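To make the fusion step concrete, the sketch below illustrates one way a channel-wise gating mechanism in the spirit of DMA could combine explicit (graph-derived) and implicit (transformer-derived) region features. The class name, layer sizes, and the exact gating formula are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of channel-wise gated fusion in the spirit of the
# Dynamic Mixture Attention (DMA) described in the abstract. Names, shapes,
# and the gating formula are assumptions for illustration only.
import torch
import torch.nn as nn


class DynamicMixtureAttention(nn.Module):
    """Fuse explicit (Gated GCN) and implicit (Region BERT) region features
    with a learned, per-channel gate."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate sees both feature views and produces one weight per channel.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.Sigmoid(),
        )

    def forward(self, explicit_feats: torch.Tensor,
                implicit_feats: torch.Tensor) -> torch.Tensor:
        # explicit_feats, implicit_feats: (batch, num_regions, dim)
        g = self.gate(torch.cat([explicit_feats, implicit_feats], dim=-1))
        # Convex, channel-wise mixture of the two relational views.
        return g * explicit_feats + (1.0 - g) * implicit_feats


if __name__ == "__main__":
    dma = DynamicMixtureAttention(dim=512)
    explicit = torch.randn(2, 36, 512)   # e.g. 36 detected regions per image
    implicit = torch.randn(2, 36, 512)
    fused = dma(explicit, implicit)
    print(fused.shape)  # torch.Size([2, 36, 512])
```

The convex mixture is one plausible reading of "adaptively balancing region-level features"; other formulations (e.g. additive gating or attention over the two views) would fit the same description.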