Rethinking Convolutional Semantics for Image Caption Generation Beyond Recurrent Paradigms
Abstract
The task of automatically generating natural language descriptions for images has become a cornerstone in bridging visual perception and linguistic understanding. While Recurrent Neural Networks (RNNs) and their variants such as LSTMs have long dominated the decoder component in image captioning systems, recent explorations suggest that Convolutional Neural Networks (CNNs) can serve as viable alternatives. However, the capability of CNN-based decoders to fully capture temporal and semantic dependencies in language has not been comprehensively assessed. In this paper, we introduce VISCON (Visual-Semantic Convolutional Network), a new convolutional decoder framework designed to investigate the strengths and weaknesses of CNN-based architectures in caption generation. We conduct a rigorous analysis across multiple dimensions, including network depth, convolutional filter complexity, integration of attention mechanisms, the role of sentence length in training, and the effectiveness of data augmentation strategies. Experiments are carried out on two widely adopted benchmarks, Flickr8k and Flickr30k, where we perform extensive comparisons with RNN-based decoders. Contrary to conventional wisdom from recurrent models, our findings reveal that deeper convolutional stacks do not necessarily yield performance improvements, and that the utility of visual attention is significantly less pronounced in convolutional decoding pipelines. Moreover, we observe that VISCON maintains competitive accuracy only when trained with relatively short captions, and that performance degrades sharply as sentence length increases, indicating difficulty in modeling long-range dependencies. Finally, despite achieving comparable BLEU and METEOR scores under certain settings, convolutional approaches consistently underperform on CIDEr, raising questions about their capacity to model human-like semantic richness. This comprehensive analysis highlights the underexplored trade-offs in convolutional decoding and contributes new insights into designing future captioning systems that harmonize visual-semantic reasoning with efficient sequence modeling.
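As a rough illustration of the kind of decoder the abstract describes, the following PyTorch sketch stacks causal (left-padded) 1D convolutions over word embeddings fused with a global image feature vector. All layer sizes, the fusion scheme, and the omission of attention are illustrative assumptions and do not reflect VISCON's actual configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvCaptionDecoder(nn.Module):
    """Causal 1D-convolutional decoder: predicts the next word at every position."""

    def __init__(self, vocab_size, embed_dim=256, img_dim=2048, kernel_size=3, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)   # project CNN-encoder features to embed_dim
        self.left_pad = kernel_size - 1                 # left-only padding keeps convolutions causal
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, embed_dim, kernel_size) for _ in range(num_layers)]
        )
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens, img_feats):
        # tokens: (batch, seq_len) word ids; img_feats: (batch, img_dim) global image features
        x = self.embed(tokens) + self.img_proj(img_feats).unsqueeze(1)  # fuse image into each step
        x = x.transpose(1, 2)                                           # (batch, embed_dim, seq_len)
        for conv in self.convs:
            residual = x
            x = F.pad(x, (self.left_pad, 0))          # no right padding -> no access to future words
            x = torch.relu(conv(x)) + residual        # residual connection around each conv block
        return self.out(x.transpose(1, 2))            # (batch, seq_len, vocab_size) logits

# Quick shape check on dummy data (hypothetical vocabulary and feature sizes)
decoder = ConvCaptionDecoder(vocab_size=10_000)
logits = decoder(torch.randint(0, 10_000, (2, 12)), torch.randn(2, 2048))
print(logits.shape)  # torch.Size([2, 12, 10000])

Unlike an LSTM decoder, such a stack computes all positions in parallel during training, which is the main practical appeal of convolutional decoding; the trade-off, as the abstract notes, is a limited receptive field for long-range dependencies.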