Multimodal Fusion Network for Multimodal Sentiment Analysis
Abstract
Recent advances in pretrained language models have reshaped multimodal learning, yet this progress often comes with increased computational demands. In this paper, we introduce the \textbf{Enhanced Multimodal Fusion Network (EMFN)}, a novel architecture that integrates textual, acoustic, and visual signals for sentiment analysis. EMFN incorporates specialized adapter modules and cross-layer fusion strategies to effectively combine multimodal representations while preserving the robust features of the underlying frozen language model. By decoupling the pretrained weights from task-specific updates and utilizing lightweight, trainable fusion layers, our approach enables rapid and data-efficient adaptation. Empirical evaluations on the CMU-MOSEI dataset show that EMFN achieves a relative error reduction of $3.7\%$ and a $2.4\%$ improvement in seven-class classification accuracy compared to conventional fine-tuning. Comprehensive experiments, including evaluations under challenging, noisy conditions, further attest to the robustness and adaptability of EMFN.
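To make the adapter-and-fusion idea described above concrete, the following is a minimal PyTorch sketch assuming a frozen Transformer text encoder, bottleneck adapters, and gated cross-attention fusion of acoustic and visual features at each layer. All module names, feature dimensions, and the gating scheme are illustrative assumptions, not the authors' exact EMFN implementation.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Lightweight bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection preserves the frozen model's representation.
        return x + self.up(self.act(self.down(x)))


class CrossModalFusion(nn.Module):
    """Inject acoustic or visual features into text states via cross-attention."""
    def __init__(self, dim: int, modal_dim: int, heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(modal_dim, dim)      # map the modality into text space
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # zero-initialized: start text-only

    def forward(self, text, modality):
        mod = self.proj(modality)
        fused, _ = self.attn(query=text, key=mod, value=mod)
        return text + torch.tanh(self.gate) * fused


class EMFNSketch(nn.Module):
    """Frozen Transformer text encoder with trainable adapters and per-layer fusion."""
    def __init__(self, dim=768, layers=4, audio_dim=74, video_dim=35, classes=7):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(layers)
        )
        for p in self.blocks.parameters():         # pretrained weights stay frozen
            p.requires_grad = False
        self.adapters = nn.ModuleList(Adapter(dim) for _ in range(layers))
        self.audio_fusion = nn.ModuleList(
            CrossModalFusion(dim, audio_dim) for _ in range(layers))
        self.video_fusion = nn.ModuleList(
            CrossModalFusion(dim, video_dim) for _ in range(layers))
        self.head = nn.Linear(dim, classes)        # seven-class sentiment head

    def forward(self, text, audio, video):
        h = text
        for block, adapter, fuse_a, fuse_v in zip(
            self.blocks, self.adapters, self.audio_fusion, self.video_fusion
        ):
            h = block(h)            # frozen pretrained layer
            h = adapter(h)          # lightweight task-specific update
            h = fuse_a(h, audio)    # cross-layer acoustic fusion
            h = fuse_v(h, video)    # cross-layer visual fusion
        return self.head(h.mean(dim=1))  # pooled utterance-level prediction


if __name__ == "__main__":
    model = EMFNSketch()
    text = torch.randn(2, 20, 768)    # (batch, tokens, hidden)
    audio = torch.randn(2, 50, 74)    # e.g. COVAREP-style acoustic frames
    video = torch.randn(2, 50, 35)    # e.g. facial-feature frames
    print(model(text, audio, video).shape)  # torch.Size([2, 7])
```

Because only the adapters, fusion modules, and classification head receive gradients, the trainable parameter count stays small relative to the frozen backbone, which is what enables the rapid, data-efficient adaptation claimed in the abstract.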