FusionX: A Symbolic-fused Multimodal Emotion Interaction Framework

Abstract

Understanding human emotion from multimodal signals such as linguistic content, vocal acoustics, and facial expressions remains a complex challenge for artificial systems. Unlike humans, who intuitively infer emotion from intricate cross-modal cues, machines must systematically decode heterogeneous information. To address this gap, we propose FusionX, a multimodal emotion recognition framework that models inter-modal dynamics from multiple perspectives. FusionX decomposes multimodal input signals into three complementary types of interaction representations: modality-complete (preserving full unimodal information), modality-synergistic (capturing shared inter-modal contributions), and modality-unique (highlighting the distinctive aspects of each modality). To refine the integration of these representations, we introduce a text-prioritized fusion mechanism, Text-Centric Hierarchical Tensor Fusion (TCHF). This module constructs a deep hierarchical tensor network that accentuates the semantic richness of the textual modality while balancing its contribution against the audio and visual streams. We evaluate FusionX on three widely used benchmarks: MOSEI, MOSI, and IEMOCAP. Results show that our method significantly surpasses previous state-of-the-art baselines on both classification accuracy and regression metrics, demonstrating the value of hierarchical, perspective-aware interaction modeling for emotion understanding.
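
To make the decomposition and the text-prioritized fusion concrete, the sketch below shows one way the three interaction representations and a two-stage, text-first tensor fusion could be wired together in PyTorch. It is an illustration only, assuming pre-extracted unimodal feature vectors; the module names (e.g. FusionXSketch, outer_fuse), dimensions, and the outer-product fusion with compression layers are assumptions made for clarity, not the authors' released implementation of FusionX or TCHF.

```python
# Minimal, self-contained sketch of the ideas described in the abstract.
# All layer choices, dimensions, and names are illustrative assumptions,
# not the authors' implementation.

import torch
import torch.nn as nn


def outer_fuse(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Tensor-fusion-style outer product with an appended constant 1,
    so unimodal terms survive alongside bimodal interaction terms."""
    ones = torch.ones(a.size(0), 1, device=a.device)
    a1 = torch.cat([a, ones], dim=-1)  # (B, Da+1)
    b1 = torch.cat([b, ones], dim=-1)  # (B, Db+1)
    return torch.bmm(a1.unsqueeze(2), b1.unsqueeze(1)).flatten(1)  # (B, (Da+1)(Db+1))


class FusionXSketch(nn.Module):
    def __init__(self, d_text: int, d_audio: int, d_visual: int, d_hidden: int = 32):
        super().__init__()
        # Modality-complete: per-modality projections that keep full unimodal information.
        self.complete = nn.ModuleDict({
            "t": nn.Linear(d_text, d_hidden),
            "a": nn.Linear(d_audio, d_hidden),
            "v": nn.Linear(d_visual, d_hidden),
        })
        # Modality-synergistic: projections into one common space, whose average
        # stands in for the contribution shared across modalities.
        self.shared = nn.ModuleDict({
            "t": nn.Linear(d_text, d_hidden),
            "a": nn.Linear(d_audio, d_hidden),
            "v": nn.Linear(d_visual, d_hidden),
        })
        # Text-prioritized hierarchical tensor fusion: text is fused with audio first,
        # compressed, then the result is fused with visual, so textual semantics
        # anchor both fusion stages.
        self.compress_ta = nn.Linear((d_hidden + 1) ** 2, d_hidden)
        self.compress_tav = nn.Linear((d_hidden + 1) ** 2, d_hidden)
        self.head = nn.Linear(d_hidden * 5, 1)  # regression head (e.g. sentiment score)

    def forward(self, x_t, x_a, x_v):
        # 1) Three interaction views per modality.
        comp = {m: torch.relu(self.complete[m](x)) for m, x in zip("tav", (x_t, x_a, x_v))}
        syn = {m: torch.relu(self.shared[m](x)) for m, x in zip("tav", (x_t, x_a, x_v))}
        synergy = torch.stack(list(syn.values())).mean(0)   # shared contribution
        unique = {m: comp[m] - syn[m] for m in "tav"}        # distinctive residuals

        # 2) Text-centric hierarchical tensor fusion:
        #    text x audio -> compress -> (text-audio) x visual -> compress.
        ta = torch.relu(self.compress_ta(outer_fuse(comp["t"], comp["a"])))
        tav = torch.relu(self.compress_tav(outer_fuse(ta, comp["v"])))

        # 3) Combine fused, synergistic, and unique views for prediction.
        summary = torch.cat([tav, synergy, unique["t"], unique["a"], unique["v"]], dim=-1)
        return self.head(summary)


if __name__ == "__main__":
    model = FusionXSketch(d_text=300, d_audio=74, d_visual=35)
    out = model(torch.randn(4, 300), torch.randn(4, 74), torch.randn(4, 35))
    print(out.shape)  # torch.Size([4, 1])
```

The feature dimensions in the usage example (300/74/35) simply mirror common GloVe, COVAREP, and Facet feature sizes used on MOSI/MOSEI-style data; any encoder outputs of matching shape would work.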
