Harmonious Multi-Grained Integration for Robust Multimodal Emotion Analysis
Abstract
In this paper, we introduce \textbf{HarmoFusion}, a comprehensive framework that seamlessly integrates multi-granular information for robust multimodal emotion analysis. Traditional approaches to emotion recognition often rely solely on either holistic, pre-trained utterance-level embeddings or isolated fine-grained features, which can lead to suboptimal performance given the subtle and dynamic nature of emotional expression. HarmoFusion bridges this gap by unifying pre-trained global representations with detailed interactions at the phoneme and word levels. Inspired by advances in transformer-based text-to-speech systems, our model employs a hierarchical attention mechanism to capture intricate cross-modal dependencies. Specifically, our architecture fuses phonetic details and lexical semantics via a novel transformer module that computes cross-granular interactions between phoneme-level and word-level representations, and we introduce additional formulations for integrating these fine-grained embeddings with the pre-trained global utterance representation. Extensive experiments on the IEMOCAP dataset demonstrate that HarmoFusion surpasses current state-of-the-art methods in accuracy and robustness, and that performance improves further when fine-grained interactions are incorporated. Ablation studies highlight the contribution of each component to capturing the complex nuances of multimodal emotional signals, paving the way for more effective human-computer interaction systems.
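Since the abstract only sketches the architecture, the following PyTorch-style snippet is a minimal illustrative sketch of the multi-granular fusion idea: word-level tokens attend to phoneme-level tokens through a transformer-style cross-attention block, and the pooled result is combined with a pre-trained utterance-level embedding before classification. All module names, dimensions, and the concatenation-based fusion are assumptions for illustration, not the paper's actual HarmoFusion implementation.

\begin{verbatim}
import torch
import torch.nn as nn


class MultiGrainedFusion(nn.Module):
    """Hypothetical fusion of phoneme-, word-, and utterance-level features."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 4):
        super().__init__()
        # Word tokens (queries) attend to phoneme tokens (keys/values)
        # to inject phonetic detail into the lexical stream.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Classifier over the concatenation of the pooled fine-grained
        # representation and the global utterance embedding
        # (num_classes = 4 mirrors the usual IEMOCAP emotion setup).
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, word_emb, phoneme_emb, utterance_emb):
        # word_emb:      (batch, n_words, dim)   fine-grained lexical features
        # phoneme_emb:   (batch, n_phones, dim)  fine-grained phonetic features
        # utterance_emb: (batch, dim)            pre-trained global embedding
        attended, _ = self.cross_attn(word_emb, phoneme_emb, phoneme_emb)
        fused = self.norm1(word_emb + attended)
        fused = self.norm2(fused + self.ffn(fused))
        pooled = fused.mean(dim=1)
        return self.classifier(torch.cat([pooled, utterance_emb], dim=-1))


# Example usage with random tensors standing in for real features.
model = MultiGrainedFusion()
logits = model(torch.randn(2, 20, 256), torch.randn(2, 60, 256), torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 4])
\end{verbatim}

Concatenating the pooled fine-grained stream with the global embedding is one simple way to realize the "unifying pre-trained global representations with detailed interactions" idea; the paper's hierarchical attention mechanism may combine these streams differently.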