CLARA: Enhancing Multimodal Sentiment Analysis via Efficient Vision-Language Fusion

Abstract

Understanding sentiment in social images requires integrating visual content with short text, where cross-modal conflicts are prevalent. We introduce CLARA, a parameter-efficient vision-language framework for multimodal sentiment analysis on image-text pairs. CLARA employs lightweight LoRA adapters on frozen encoders, coupled with multi-head co-attention for aligning visual regions and textual spans. A consistency-verification step refines the fused representation before classification. Our approach achieves state-of-the-art results on three diverse datasets: MVSA-Single (83.04% weighted F1), MVSA-Multiple (73.45% weighted F1), and HFM hate speech detection (87.82% macro F1), demonstrating effective generalization while maintaining parameter efficiency (7.45% trainable parameters). Here, we show that CLARA significantly improves neutral-class prediction and provides well-calibrated predictions under modal disagreement. The implementation of this work is available at https://doi.org/10.5281/zenodo.17862924 and https://github.com/phuonglamgithub/CLARA.
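For readers who want a concrete picture of the components named in the abstract, the sketch below is a minimal, hypothetical PyTorch rendering of the idea: LoRA-style low-rank adapters over frozen projections, bidirectional multi-head co-attention between text tokens and visual regions, and a simple sigmoid gate standing in for the consistency-verification step. All class names, dimensions, and the gating form are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch (assumptions): generic 768-d frozen encoder features, LoRA rank 8,
# and a sigmoid gate as a stand-in for CLARA's consistency-verification step.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank (LoRA) update."""

    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)  # frozen "pretrained" weight
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Linear(in_dim, rank, bias=False)   # trainable down-projection
        self.lora_B = nn.Linear(rank, out_dim, bias=False)  # trainable up-projection
        nn.init.zeros_(self.lora_B.weight)                   # start as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))


class CoAttentionFusion(nn.Module):
    """Bidirectional multi-head co-attention over text tokens and image regions."""

    def __init__(self, dim=768, heads=8, num_classes=3):
        super().__init__()
        self.txt_proj = LoRALinear(dim, dim)
        self.img_proj = LoRALinear(dim, dim)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Illustrative stand-in for consistency verification: a gate that can
        # down-weight fused features when the two modalities disagree.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2 * dim), nn.Sigmoid())
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, T, dim) frozen text-encoder token features
        # image_feats: (B, R, dim) frozen vision-encoder region/patch features
        t = self.txt_proj(text_feats)
        v = self.img_proj(image_feats)
        t_attn, _ = self.txt2img(t, v, v)  # text spans attend to visual regions
        v_attn, _ = self.img2txt(v, t, t)  # visual regions attend to text spans
        fused = torch.cat([t_attn.mean(dim=1), v_attn.mean(dim=1)], dim=-1)
        fused = self.gate(fused) * fused   # gated refinement before classification
        return self.classifier(fused)


# Toy usage: batch of 4 pairs, 32 text tokens, 49 image patches.
model = CoAttentionFusion()
logits = model(torch.randn(4, 32, 768), torch.randn(4, 49, 768))  # -> (4, 3) logits
```

In this sketch only the LoRA matrices, the gate, and the classifier carry gradients, which mirrors (but does not reproduce) the paper's parameter-efficiency claim of training a small fraction of the total weights.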
