A Deep Learning Framework for Emotion Recognition in Music Using Multimodal Data Fusion
Abstract
Advancements in human-media interaction and computer vision have increasingly underscored the significance of multimodal data fusion in enhancing semantic understanding within computational systems. The ability to seamlessly integrate auditory, visual, and contextual information plays a pivotal role in a wide range of applications, including emotion recognition, content-based retrieval, and interactive multimedia systems. Despite substantial progress in this area, existing methodologies continue to grapple with persistent challenges: they often generalize poorly across diverse musical genres, styles, and cultural contexts, and they inadequately model the hierarchical temporal structures that make musical compositions inherently complex and dynamic. To overcome these limitations, we present a deep learning framework that incorporates two key advancements: the Harmonic Semantic Encoder (HSE) and the Contrastive Harmonic Alignment (CHA) strategy. The HSE module captures both fine-grained acoustic patterns and long-range temporal dependencies by integrating convolutional layers with transformer-based architectures; this dual structure allows the model to learn local textures and global harmonic progressions simultaneously. Complementing the HSE, the CHA strategy introduces multi-level contrastive learning objectives that align learned representations with harmonic and rhythmic structures while enforcing temporal consistency across musical segments. Extensive empirical evaluations on standard multimodal music datasets show that our method consistently surpasses state-of-the-art baselines on tasks such as emotion recognition and semantic music retrieval. By capturing more nuanced emotional expressions and structural patterns in music, our framework advances the field of human-media interaction and computer vision, offering a robust and scalable solution for multimodal semantic analysis in real-world applications.
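The abstract describes the HSE as a convolutional front end feeding a transformer back end, but gives no implementation details. The following is a minimal PyTorch sketch of that general conv-plus-transformer design; all hyperparameters (n_mels, d_model, head and layer counts) and the log-mel input assumption are illustrative choices of ours, not values reported by the paper.

```python
import torch
import torch.nn as nn

class HarmonicSemanticEncoder(nn.Module):
    """Sketch of a conv + transformer encoder over spectrogram frames.

    Illustrative only: the paper's actual HSE architecture and
    hyperparameters are not specified in the abstract.
    """

    def __init__(self, n_mels=128, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        # Convolutional front end: captures fine-grained local acoustic texture.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.GELU(),
        )
        # Transformer back end: models long-range temporal dependencies
        # such as harmonic progressions spanning the whole clip.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, mel):
        # mel: (batch, n_mels, time) log-mel spectrogram
        x = self.conv(mel)           # (batch, d_model, time) local features
        x = x.transpose(1, 2)        # (batch, time, d_model) for the transformer
        return self.transformer(x)   # frame-level contextual representations
```

The division of labor mirrors the dual structure the abstract claims: the convolutions operate on short windows (local texture), while self-attention relates frames across the full sequence (global harmonic structure).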
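For the CHA strategy, the abstract specifies only that it uses multi-level contrastive objectives with temporal consistency; the concrete loss is not given. Below is a hedged sketch of one plausible level of such an objective, using a standard InfoNCE-style loss with in-batch negatives. The pairing scheme (a segment and a harmonically or temporally related view of it) and the temperature value are our assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(anchor, positive, temperature=0.1):
    """InfoNCE-style loss: a stand-in for one level of a CHA-like objective.

    anchor, positive: (batch, dim) embeddings of paired views, e.g. a
    segment and a temporally adjacent or harmonically related segment.
    Other pairs in the batch serve as negatives. Illustrative only.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Each anchor should score highest against its own positive.
    return F.cross_entropy(logits, targets)
```

Applying such a loss at several granularities (frame, segment, clip) would realize the "multi-level" alignment the abstract describes, pulling representations of related musical material together while pushing unrelated material apart.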