Harmonious Multi-Grained Integration for Robust Multimodal Emotion Analysis
Abstract
In this paper, we introduce \textbf{HarmoFusion}, a comprehensive framework that seamlessly integrates multi-granular information for robust multimodal emotion analysis. Traditional approaches to emotion recognition often rely solely on either holistic, pre-trained utterance-level embeddings or isolated fine-grained features, which can lead to suboptimal performance given the subtle and dynamic nature of emotional expression. HarmoFusion bridges this gap by unifying pre-trained global representations with detailed interactions at the phoneme and word levels. Inspired by advances in transformer-based text-to-speech systems, our model employs a hierarchical attention mechanism to capture intricate cross-modal dependencies. Specifically, our architecture fuses phonetic details with lexical semantics via a transformer module that explicitly models their interactions, and we introduce additional formulations for integrating word-level and phoneme-level embeddings. Extensive experiments on the IEMOCAP dataset demonstrate that HarmoFusion not only surpasses current state-of-the-art methods in accuracy and robustness but also exhibits further gains when fine-grained interactions are incorporated. Our ablation studies highlight the contribution of each component to capturing the complex nuances of multimodal emotional signals, paving the way for more effective human-computer interaction systems.
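
To make the described architecture concrete, the sketch below illustrates one plausible reading of the hierarchical fusion idea: word-level queries attend to phoneme-level keys and values, the fused sequence passes through a transformer encoder layer, and the pooled result is concatenated with a pre-trained utterance-level embedding before classification. This is a minimal illustrative sketch, not the authors' implementation; the class name HierarchicalFusion, the dimensions, the mean pooling, and the residual connection are all assumptions made for demonstration.

import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Illustrative sketch of hierarchical phoneme/word/utterance fusion
    (hypothetical hyperparameters; not the paper's exact architecture)."""

    def __init__(self, d_model=256, n_heads=4, n_classes=4):
        super().__init__()
        # Cross-attention: word-level queries attend to phoneme-level keys/values.
        self.phone_to_word = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Self-attention over the fused word-level sequence.
        self.word_encoder = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        # Classifier over [pooled fine-grained features ; global utterance embedding].
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, word_emb, phone_emb, utt_emb):
        # word_emb:  (B, T_w, d)  word-level embeddings
        # phone_emb: (B, T_p, d)  phoneme-level embeddings
        # utt_emb:   (B, d)       pre-trained utterance-level embedding
        fused, _ = self.phone_to_word(word_emb, phone_emb, phone_emb)
        fused = self.word_encoder(fused + word_emb)   # residual fusion of the two granularities
        pooled = fused.mean(dim=1)                    # pool the fine-grained sequence
        return self.classifier(torch.cat([pooled, utt_emb], dim=-1))

# Example with random tensors (batch of 2 utterances).
model = HierarchicalFusion()
logits = model(torch.randn(2, 12, 256), torch.randn(2, 40, 256), torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 4])

In this reading, the cross-attention step supplies the phoneme-to-word interactions described above, while the final concatenation is one simple way to combine fine-grained evidence with the holistic pre-trained representation; the paper's actual fusion mechanism may differ.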