Text-Enriched Vision-Language Captioning for Contextual Scene Understanding and Accessibility

Abstract

Understanding visual scenes that contain both pictorial and textual elements remains one of the most underexplored yet socially impactful challenges in multimodal AI. For visually impaired individuals, the ability to interpret text embedded in their surroundings—such as signs, labels, or documents—is indispensable for independent daily functioning. Existing image captioning systems, however, are primarily optimized for general-purpose datasets and fail to attend to textual cues embedded within the image. This omission significantly degrades their utility in real-world accessibility contexts, where the text often conveys crucial semantic details. In this work, we propose TEXTSight, a unified multimodal captioning framework that bridges visual perception and textual reasoning. Unlike conventional models that treat visual and textual elements separately, TEXTSight introduces a joint representation pipeline that explicitly integrates scene text recognized via Optical Character Recognition (OCR) with high-level visual embeddings. Furthermore, we design a selective pointer-copy mechanism that dynamically decides whether to generate a token from the language model or directly copy OCR tokens, preserving factual precision when describing entities, prices, or location names. To validate our approach, we evaluate TEXTSight on the VizWiz dataset, which comprises real-world photos taken by blind users under challenging conditions. Our system demonstrates significant improvements over the AoANet baseline, achieving a relative gain of 32.8% on CIDEr and 15.7% on SPICE metrics, while qualitatively providing more contextually faithful and informative captions. We also present detailed ablations highlighting the complementary roles of OCR-aware attention and pointer-copy modules. These results underscore the potential of multimodal grounding between visual and textual modalities in advancing accessibility-driven AI.
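
The abstract describes a selective pointer-copy mechanism that mixes a generated vocabulary distribution with a copy distribution over recognized OCR tokens. The sketch below illustrates the general idea of such a copy gate; it is not the authors' TEXTSight implementation, and the module names, dimensions, and OCR-attention scoring used here are illustrative assumptions.

```python
# Minimal sketch of a selective pointer-copy gate (assumed PyTorch formulation,
# not the published TEXTSight code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PointerCopyGate(nn.Module):
    """At each decoding step, mix a vocabulary distribution produced by the
    language model with a copy distribution over OCR tokens from the image."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)  # generate from vocabulary
        self.copy_gate = nn.Linear(hidden_dim, 1)             # scalar gate p_copy
        self.ocr_attn = nn.Linear(hidden_dim, hidden_dim)     # score OCR token features

    def forward(self, dec_state, ocr_feats, ocr_token_ids, vocab_size_ext):
        # dec_state:      (B, H) decoder hidden state at the current step
        # ocr_feats:      (B, N, H) embeddings of N OCR tokens per image
        # ocr_token_ids:  (B, N) indices of OCR tokens in an extended vocabulary
        # vocab_size_ext: size of the extended vocabulary (base vocab + OCR tokens)
        p_vocab = F.softmax(self.vocab_proj(dec_state), dim=-1)                      # (B, V)
        scores = torch.bmm(ocr_feats, self.ocr_attn(dec_state).unsqueeze(-1)).squeeze(-1)
        p_ocr = F.softmax(scores, dim=-1)                                             # (B, N)
        p_copy = torch.sigmoid(self.copy_gate(dec_state))                             # (B, 1)

        # Scatter the copy distribution into the extended vocabulary, then
        # blend it with the generated distribution via the copy gate.
        out = torch.zeros(dec_state.size(0), vocab_size_ext, device=dec_state.device)
        out[:, : p_vocab.size(1)] = (1.0 - p_copy) * p_vocab
        out.scatter_add_(1, ocr_token_ids, p_copy * p_ocr)
        return out  # (B, vocab_size_ext) per-step token distribution
```

In this kind of design, the gate lets the decoder fall back on copying exact OCR strings (prices, street names, product labels) whenever generating them from the base vocabulary would be unreliable, which matches the factual-precision goal stated in the abstract.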
