Text-Enriched Vision-Language Captioning for Contextual Scene Understanding and Accessibility

Abstract

Understanding visual scenes that contain both pictorial and textual elements remains one of the most underexplored yet socially impactful challenges in multimodal AI. For visually impaired individuals, the ability to interpret text embedded in their surroundings—such as signs, labels, or documents—is indispensable for independent daily functioning. Existing image captioning systems, however, are primarily optimized for general-purpose datasets and fail to attend to textual cues embedded within the image. This omission significantly degrades their utility in real-world accessibility contexts, where the text often conveys crucial semantic details. In this work, we propose TEXTSight, a unified multimodal captioning framework that bridges visual perception and textual reasoning. Unlike conventional models that treat visual and textual elements separately, TEXTSight introduces a joint representation pipeline that explicitly integrates scene text recognized via Optical Character Recognition (OCR) with high-level visual embeddings. Furthermore, we design a selective pointer-copy mechanism that dynamically decides whether to generate a token from the language model or directly copy OCR tokens, preserving factual precision when describing entities, prices, or location names. To validate our approach, we evaluate TEXTSight on the VizWiz dataset, which comprises real-world photos taken by blind users under challenging conditions. Our system demonstrates significant improvements over the AoANet baseline, achieving a relative gain of 32.8% on CIDEr and 15.7% on SPICE metrics, while qualitatively providing more contextually faithful and informative captions. We also present detailed ablations highlighting the complementary roles of OCR-aware attention and pointer-copy modules. These results underscore the potential of multimodal grounding between visual and textual modalities in advancing accessibility-driven AI.
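
The abstract describes a selective pointer-copy mechanism that mixes a generated vocabulary distribution with a copy distribution over recognized OCR tokens. The sketch below illustrates the general idea of such a copy gate; it is not the authors' TEXTSight implementation, and the module names, dimensions, and OCR-attention scoring used here are illustrative assumptions.

```python
# Minimal sketch of a selective pointer-copy gate (assumed PyTorch formulation,
# not the published TEXTSight code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PointerCopyGate(nn.Module):
    """At each decoding step, mix a vocabulary distribution produced by the
    language model with a copy distribution over OCR tokens from the image."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)  # generate from vocabulary
        self.copy_gate = nn.Linear(hidden_dim, 1)             # scalar gate p_copy
        self.ocr_attn = nn.Linear(hidden_dim, hidden_dim)     # score OCR token features

    def forward(self, dec_state, ocr_feats, ocr_token_ids, vocab_size_ext):
        # dec_state:      (B, H) decoder hidden state at the current step
        # ocr_feats:      (B, N, H) embeddings of N OCR tokens per image
        # ocr_token_ids:  (B, N) indices of OCR tokens in an extended vocabulary
        # vocab_size_ext: size of the extended vocabulary (base vocab + OCR tokens)
        p_vocab = F.softmax(self.vocab_proj(dec_state), dim=-1)                      # (B, V)
        scores = torch.bmm(ocr_feats, self.ocr_attn(dec_state).unsqueeze(-1)).squeeze(-1)
        p_ocr = F.softmax(scores, dim=-1)                                             # (B, N)
        p_copy = torch.sigmoid(self.copy_gate(dec_state))                             # (B, 1)

        # Scatter the copy distribution into the extended vocabulary, then
        # blend it with the generated distribution via the copy gate.
        out = torch.zeros(dec_state.size(0), vocab_size_ext, device=dec_state.device)
        out[:, : p_vocab.size(1)] = (1.0 - p_copy) * p_vocab
        out.scatter_add_(1, ocr_token_ids, p_copy * p_ocr)
        return out  # (B, vocab_size_ext) per-step token distribution
```

In this kind of design, the gate lets the decoder fall back on copying exact OCR strings (prices, street names, product labels) whenever generating them from the base vocabulary would be unreliable, which matches the factual-precision goal stated in the abstract.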
