Improved IEC Performance via Emotional Stimuli-Aware Captioning


Abstract

Image emotion classification (IEC), a crucial task in computer vision, aims to infer the emotional state of subjects in images. Existing techniques have focused on using semantic information to support visual features. However, a significant affective gap persists between low-level pixel information and high-level emotions, owing to the abstract and complex nature of cognitive processes. This gap limits the corresponding semantic representations and hinders model performance. In this study, we draw inspiration from psychological findings and advances in natural language processing. Specifically, we explore the use of image captions as auxiliary information, combined with visual features, for enhanced emotional discernment. We introduce the emotional stimuli-aware captioning network (ESCNet), which leverages generative captions to augment visual representations. An affective captioning dataset, based on emotional attributes, is also developed to generate emotion-related captions and to pre-train the image captioning model. Visual features related to the captions are then generated to highlight emotionally charged words, and a fusion module combining cross-attention with self-attention is introduced to learn correlations between images and captions. We also introduce a variable-weight loss function that emphasizes hard-to-classify samples. Extensive validation experiments on multiple public datasets demonstrate that our approach outperforms state-of-the-art models. Ablation studies and visualization results further confirm the effectiveness of the proposed network and its modules.
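The abstract does not include implementation details, but the two components it names, a fusion module combining cross-attention with self-attention and a variable-weight loss for hard samples, can be illustrated concretely. The following is a minimal sketch assuming PyTorch; all module names, dimensions, and the exponent-based weighting are illustrative assumptions, not ESCNet's actual code.

```python
# Illustrative sketch only: cross-attention + self-attention fusion and a
# difficulty-weighted classification loss. Names, dimensions, and the gamma
# weighting are assumptions, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossSelfFusion(nn.Module):
    """Fuse visual tokens with caption tokens: cross-attention, then self-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, caption: torch.Tensor) -> torch.Tensor:
        # Visual tokens attend to caption tokens (cross-attention across modalities).
        fused, _ = self.cross_attn(query=visual, key=caption, value=caption)
        fused = self.norm1(visual + fused)
        # Self-attention over the fused sequence to model intra-sequence correlations.
        refined, _ = self.self_attn(fused, fused, fused)
        return self.norm2(fused + refined)


def variable_weight_loss(logits: torch.Tensor, targets: torch.Tensor,
                         gamma: float = 2.0) -> torch.Tensor:
    """Cross-entropy re-weighted so low-confidence (hard) samples contribute more;
    the exact weighting used in the paper is assumed here, not reproduced."""
    log_probs = F.log_softmax(logits, dim=-1)
    true_log_p = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    weight = (1.0 - true_log_p.exp()) ** gamma  # larger weight for harder samples
    return (-weight * true_log_p).mean()


if __name__ == "__main__":
    visual = torch.randn(2, 49, 512)    # e.g. 7x7 grid of visual tokens
    caption = torch.randn(2, 20, 512)   # e.g. 20 caption tokens
    fused = CrossSelfFusion()(visual, caption)
    logits = fused.mean(dim=1) @ torch.randn(512, 8)  # toy 8-class emotion head
    loss = variable_weight_loss(logits, torch.tensor([0, 3]))
    print(fused.shape, loss.item())
```

In this sketch, the fusion block follows the usual transformer pattern of residual connections and layer normalization around each attention step; the loss reduces the contribution of samples the classifier already predicts confidently, which is one common way to emphasize hard-to-classify examples.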
