Multigranular Unified Synthesis Encoder for Fine-grained Multimodal Emotion Understanding
Abstract
Accurate emotion understanding from multimodal signals has become a pivotal research area, particularly for improving human-computer interaction systems. However, the inherent complexity of emotional expression across modalities, coupled with the scarcity of high-quality annotated data, poses a significant barrier to progress. In this work, we present MUSE, a multigranular unified synthesis encoder framework that integrates fine-grained representations with global pre-trained embeddings for emotion recognition. In contrast to prior studies that emphasize either modality-level pretraining or local feature alignment in isolation, our method combines both perspectives. Drawing inspiration from advances in text-to-speech synthesis, MUSE employs a multilevel Transformer-based module that explicitly models cross-modal associations among phonemes, words, and utterances, and it leverages self-supervised learning backbones to exploit large-scale unlabeled corpora efficiently. Extensive evaluations on the widely adopted IEMOCAP benchmark show that MUSE consistently surpasses existing approaches, setting a new state of the art. We further demonstrate that our multigranular fusion strategy yields substantial gains over conventional fusion schemes.
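To make the multigranular fusion idea concrete, the sketch below illustrates one plausible reading of the described architecture in PyTorch: per-level Transformer encoders for phoneme-, word-, and utterance-granularity features, cross-modal attention between text and speech streams, and a classifier over combined local and global views. All layer sizes, names, and the pooling scheme are our assumptions for illustration, not the paper's exact implementation.

    import torch
    import torch.nn as nn

    class MultigranularFusion(nn.Module):
        """Illustrative sketch of multigranular cross-modal fusion.

        Hypothetical: dimensions, layer counts, and pooling choices are
        assumptions, not the authors' exact MUSE architecture.
        """
        def __init__(self, dim=256, heads=4, num_classes=4):
            super().__init__()
            # One Transformer encoder layer per granularity level.
            def layer():
                return nn.TransformerEncoderLayer(
                    d_model=dim, nhead=heads, batch_first=True)
            self.phoneme_enc = layer()
            self.word_enc = layer()
            self.utterance_enc = layer()
            # Cross-modal attention: word-level text queries attend to
            # phoneme-level speech keys/values.
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            # Classifier over concatenated local + global views
            # (e.g. 4 IEMOCAP emotion classes).
            self.classifier = nn.Linear(2 * dim, num_classes)

        def forward(self, speech_feats, text_feats):
            # speech_feats: (B, T_s, dim) frame-level features from a
            # self-supervised speech backbone (stand-in for the paper's
            # pretrained embeddings).
            # text_feats: (B, T_t, dim) token embeddings from a text backbone.
            ph = self.phoneme_enc(speech_feats)   # fine-grained speech level
            wd = self.word_enc(text_feats)        # word level
            # Align granularities across modalities via cross-attention.
            fused, _ = self.cross_attn(wd, ph, ph)
            utt = self.utterance_enc(fused)       # utterance level
            # Combine a local (mean-pooled) and a global (first-token) view.
            local = utt.mean(dim=1)
            global_view = utt[:, 0]
            return self.classifier(torch.cat([local, global_view], dim=-1))

    # Usage with random tensors standing in for real features:
    model = MultigranularFusion()
    logits = model(torch.randn(2, 120, 256), torch.randn(2, 20, 256))
    print(logits.shape)  # torch.Size([2, 4])

The key design point this sketch tries to capture is that fusion happens at multiple granularities before utterance-level aggregation, rather than concatenating a single pooled vector per modality as in conventional late-fusion schemes.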