Context-Aware Multi-Anchor Captioning for Text-Rich Image Understanding

Abstract

Understanding images embedded with textual elements is fundamental for advancing fine-grained visual reasoning. Unlike traditional image captioning, which focuses on object and scene descriptions, text-based image captioning (TextCap) demands the ability to read, comprehend, and contextualize text within complex visual environments. This challenge arises from the intricate relationships between visual semantics and embedded texts such as road signs, brand names, or product labels, which together convey richer scene-level narratives. Existing models typically adapt classical captioning architectures to this task by generating a single global caption, which inevitably oversimplifies the nuanced interdependencies between visual regions and textual content. In this work, we introduce a new framework named Multi-Anchor Captioner (MACap), which produces diverse and fine-grained captions through a structured anchoring mechanism. Instead of treating the image as a whole, MACap decomposes it into multiple anchor-centered subgraphs, each focusing on a specific text region and its corresponding contextual neighborhood. The framework involves three sequential stages: (1) an Anchor Proposal Module (APM) that identifies informative text tokens and groups them with their relevant visual contexts; (2) an Anchor Graph Constructor (AGC) that models semantic dependencies across anchors via graph propagation; and (3) a Multi-View Caption Generator (MCG) that synthesizes multiple captions under distinct anchor views, ensuring both accuracy and content diversity. Empirical evaluations on the TextCaps benchmark demonstrate that MACap achieves state-of-the-art performance, surpassing existing baselines in both descriptive fidelity and caption diversity metrics. Beyond these quantitative gains, qualitative results show that MACap generates complementary captions covering multifaceted aspects of a single image, from object appearance to textual semantics, highlighting its capacity for comprehensive scene understanding.
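
To make the three-stage decomposition concrete, the sketch below shows one way an anchor-centered pipeline of this kind could be wired together. It is not the paper's implementation: the function names (propose_anchors, build_anchor_graph, generate_views), the cosine-similarity grouping rule, the residual graph propagation, and the stubbed caption generator are all assumptions made for illustration, whereas the actual APM, AGC, and MCG are learned neural modules.

    # Illustrative sketch only; all names and rules below are assumptions,
    # not the paper's released code.
    import numpy as np

    def cosine(a, b):
        """Cosine similarity between two batches of row vectors."""
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
        return a @ b.T

    def propose_anchors(text_feats, region_feats, k=3):
        """APM-style step (assumed): pair each OCR-token feature with its k most
        similar visual regions, forming one anchor-centered subgraph per token."""
        sim = cosine(text_feats, region_feats)          # (num_tokens, num_regions)
        neighbors = np.argsort(-sim, axis=1)[:, :k]     # top-k regions per token
        anchors = []
        for t, regs in enumerate(neighbors):
            # An anchor = the token feature plus the mean of its visual context.
            context = region_feats[regs].mean(axis=0)
            anchors.append(np.concatenate([text_feats[t], context]))
        return np.stack(anchors)                        # (num_anchors, 2 * dim)

    def build_anchor_graph(anchor_feats, steps=2):
        """AGC-style step (assumed): propagate features over a similarity graph so
        each anchor absorbs context from semantically related anchors."""
        adj = np.maximum(cosine(anchor_feats, anchor_feats), 0.0)
        adj = adj / adj.sum(axis=1, keepdims=True)      # row-normalized adjacency
        feats = anchor_feats
        for _ in range(steps):
            feats = 0.5 * feats + 0.5 * adj @ feats     # simple residual propagation
        return feats

    def generate_views(anchor_feats):
        """MCG-style step (stub): the real model decodes one caption per anchor
        view; here each view is just returned as a summary state."""
        return [f for f in anchor_feats]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        ocr_tokens = rng.normal(size=(4, 16))    # e.g. a sign, a brand name, a price tag
        regions = rng.normal(size=(10, 16))      # detected visual region features
        anchors = propose_anchors(ocr_tokens, regions)
        refined = build_anchor_graph(anchors)
        views = generate_views(refined)
        print(f"{len(views)} anchor views, each a {views[0].shape[0]}-d state for captioning")

Under these assumptions, each OCR token yields one anchor view, so the number of candidate captions scales with the amount of informative text in the image, which is the intuition behind generating multiple complementary captions per image.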
