Image Narrator: Bridging Visuals and Language Through AI-Powered Captioning

Abstract

An image caption generator is an AI- and machine-learning-based tool that automatically creates natural-language descriptions of the visual content of images, improving accessibility for visually impaired individuals and supporting applications such as image retrieval. Building such systems typically requires large paired image-text datasets; to remove this dependency, we introduce a novel method for generating image captions without paired image-text data. Our system uses Convolutional Neural Network (CNN) encoders, including VGG16, VGG19, VGG30, and InceptionV3, to extract detailed image features. These features are then decoded by Long Short-Term Memory (LSTM) and Transformer models to generate captions. Instead of relying on supervised learning, the model is trained on unpaired data and optimized through deep learning, with a focus on language fluency and semantic coherence. This unsupervised approach allows for more flexible and autonomous caption generation. Experimental results on benchmark datasets such as Flickr8k and Conceptual Captions demonstrate that our method achieves competitive performance, with BLEU scores of 0.53 (VGG16), 0.54 (VGG19), 0.55 (VGG30), 0.42 (Transformer), and 0.72 (InceptionV3 with optimization). The model consistently produces fluent and contextually accurate captions, highlighting the effectiveness of vision-language alignment in an unsupervised framework. This research marks a significant step toward building scalable and autonomous image captioning systems.
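To make the encoder-decoder pipeline concrete, here is a minimal sketch of one configuration the abstract describes: a pretrained CNN (VGG16) as the image encoder and an LSTM as the caption decoder. The vocabulary size, maximum caption length, embedding dimension, and the "merge" fusion of image and text features are illustrative assumptions, not values or design details taken from the paper.

```python
# Sketch of a CNN-encoder / LSTM-decoder captioning model (assumed hyperparameters).
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, Model

VOCAB_SIZE = 8000   # assumed vocabulary size
MAX_LEN = 34        # assumed maximum caption length in tokens
EMBED_DIM = 256     # assumed embedding / hidden dimension

# Encoder: VGG16 without its classification head; the 4096-d "fc2"
# activations serve as the image feature vector fed to img_input below.
base = VGG16(weights="imagenet")
encoder = Model(base.input, base.get_layer("fc2").output)

# Image branch: project the 4096-d CNN feature into the embedding space.
img_input = layers.Input(shape=(4096,))
img_feat = layers.Dropout(0.5)(img_input)
img_feat = layers.Dense(EMBED_DIM, activation="relu")(img_feat)

# Text branch: embed the partial caption and summarize it with an LSTM.
seq_input = layers.Input(shape=(MAX_LEN,))
seq_feat = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(seq_input)
seq_feat = layers.Dropout(0.5)(seq_feat)
seq_feat = layers.LSTM(EMBED_DIM)(seq_feat)

# Fuse both modalities and predict the next word of the caption.
merged = layers.add([img_feat, seq_feat])
merged = layers.Dense(EMBED_DIM, activation="relu")(merged)
output = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

caption_model = Model(inputs=[img_input, seq_input], outputs=output)
caption_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```

At inference time the decoder is run autoregressively: starting from a start token, the model predicts one word at a time, appending each prediction to the input sequence until an end token or MAX_LEN is reached.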
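The BLEU figures reported above can be reproduced in principle with a standard corpus-level BLEU implementation; the snippet below uses NLTK's corpus_bleu with placeholder captions, since the paper's actual model outputs are not shown here.

```python
# Corpus-level BLEU as commonly used for caption evaluation (placeholder data).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of tokenized reference captions per image.
references = [
    [["a", "dog", "runs", "through", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "outside"]],
]
# One tokenized generated caption per image.
hypotheses = [["a", "dog", "is", "running", "in", "the", "grass"]]

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"BLEU: {score:.2f}")
```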
