Context-Aware Multi-Anchor Captioning for Text-Rich Image Understanding

Abstract

Understanding images embedded with textual elements is fundamental for advancing fine-grained visual reasoning. Unlike traditional image captioning, which focuses on object and scene descriptions, text-based image captioning (TextCap) demands the ability to read, comprehend, and contextualize text within complex visual environments. This challenge arises from the intricate relationships between visual semantics and embedded texts such as road signs, brand names, or product labels, which together convey richer scene-level narratives. Existing models typically adapt classical captioning architectures to this task by generating a single global caption, which inevitably oversimplifies the nuanced interdependencies between visual regions and textual content. In this work, we introduce a new framework named Multi-Anchor Captioner (MACap), which produces diverse and fine-grained captions through a structured anchoring mechanism. Instead of treating the image as a whole, MACap decomposes it into multiple anchor-centered subgraphs, each focusing on a specific text region and its corresponding contextual neighborhood. The framework involves three sequential stages: (1) an Anchor Proposal Module (APM) that identifies informative text tokens and groups them with their relevant visual contexts; (2) an Anchor Graph Constructor (AGC) that models semantic dependencies across anchors via graph propagation; and (3) a Multi-View Caption Generator (MCG) that synthesizes multiple captions under distinct anchor views, ensuring both accuracy and content diversity. Empirical evaluations on the TextCaps benchmark demonstrate that MACap achieves state-of-the-art performance, surpassing existing baselines in both descriptive fidelity and caption diversity metrics. Beyond these quantitative gains, qualitative results show that MACap generates complementary captions covering multifaceted aspects of a single image, from object appearance to textual semantics, highlighting its capacity for comprehensive scene understanding.
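
To make the three-stage decomposition concrete, the sketch below shows one way an anchor-centered pipeline of this kind could be wired together. It is not the paper's implementation: the function names (propose_anchors, build_anchor_graph, generate_views), the cosine-similarity grouping rule, the residual graph propagation, and the stubbed caption generator are all assumptions made for illustration, whereas the actual APM, AGC, and MCG are learned neural modules.

    # Illustrative sketch only; all names and rules below are assumptions,
    # not the paper's released code.
    import numpy as np

    def cosine(a, b):
        """Cosine similarity between two batches of row vectors."""
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
        return a @ b.T

    def propose_anchors(text_feats, region_feats, k=3):
        """APM-style step (assumed): pair each OCR-token feature with its k most
        similar visual regions, forming one anchor-centered subgraph per token."""
        sim = cosine(text_feats, region_feats)          # (num_tokens, num_regions)
        neighbors = np.argsort(-sim, axis=1)[:, :k]     # top-k regions per token
        anchors = []
        for t, regs in enumerate(neighbors):
            # An anchor = the token feature plus the mean of its visual context.
            context = region_feats[regs].mean(axis=0)
            anchors.append(np.concatenate([text_feats[t], context]))
        return np.stack(anchors)                        # (num_anchors, 2 * dim)

    def build_anchor_graph(anchor_feats, steps=2):
        """AGC-style step (assumed): propagate features over a similarity graph so
        each anchor absorbs context from semantically related anchors."""
        adj = np.maximum(cosine(anchor_feats, anchor_feats), 0.0)
        adj = adj / adj.sum(axis=1, keepdims=True)      # row-normalized adjacency
        feats = anchor_feats
        for _ in range(steps):
            feats = 0.5 * feats + 0.5 * adj @ feats     # simple residual propagation
        return feats

    def generate_views(anchor_feats):
        """MCG-style step (stub): the real model decodes one caption per anchor
        view; here each view is just returned as a summary state."""
        return [f for f in anchor_feats]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        ocr_tokens = rng.normal(size=(4, 16))    # e.g. a sign, a brand name, a price tag
        regions = rng.normal(size=(10, 16))      # detected visual region features
        anchors = propose_anchors(ocr_tokens, regions)
        refined = build_anchor_graph(anchors)
        views = generate_views(refined)
        print(f"{len(views)} anchor views, each a {views[0].shape[0]}-d state for captioning")

Under these assumptions, each OCR token yields one anchor view, so the number of candidate captions scales with the amount of informative text in the image, which is the intuition behind generating multiple complementary captions per image.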
