A Deep Learning Framework for Emotion Recognition in Music Using Multimodal Data Fusion
Abstract
Advancements in human-media interaction and computer vision have increasingly underscored the significance of multimodal data fusion in enhancing semantic understanding within computational systems. The ability to seamlessly integrate auditory, visual, and contextual information plays a pivotal role in a wide range of applications, including emotion recognition, content-based retrieval, and interactive multimedia systems. Despite substantial progress in this area, existing methodologies continue to grapple with persistent challenges: they often generalize poorly across diverse musical genres, styles, and cultural contexts, and they inadequately model the hierarchical temporal structures that make musical compositions inherently complex and dynamic. To overcome these limitations, we present a deep learning framework that incorporates two key advancements: the Harmonic Semantic Encoder (HSE) and the Contrastive Harmonic Alignment (CHA) strategy. The HSE module captures both fine-grained acoustic patterns and long-range temporal dependencies by integrating convolutional layers with transformer-based architectures; this dual structure allows the model to learn local textures and global harmonic progressions simultaneously. Complementing the HSE, the CHA strategy introduces multi-level contrastive learning objectives that align learned representations with harmonic and rhythmic structures while enforcing temporal consistency across musical segments. Extensive empirical evaluations on standard multimodal music datasets show that our method consistently surpasses state-of-the-art baselines on tasks such as emotion recognition and semantic music retrieval. By capturing more nuanced emotional expressions and structural patterns in music, our framework advances the field of human-media interaction and computer vision, offering a robust and scalable solution for multimodal semantic analysis in real-world applications.
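The abstract describes the HSE as a convolutional front end feeding a transformer back end, but gives no implementation details. The following is a minimal PyTorch sketch of that general conv-plus-transformer design; all hyperparameters (n_mels, d_model, head and layer counts) and the log-mel input assumption are illustrative choices of ours, not values reported by the paper.

```python
import torch
import torch.nn as nn

class HarmonicSemanticEncoder(nn.Module):
    """Sketch of a conv + transformer encoder over spectrogram frames.

    Illustrative only: the paper's actual HSE architecture and
    hyperparameters are not specified in the abstract.
    """

    def __init__(self, n_mels=128, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        # Convolutional front end: captures fine-grained local acoustic texture.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.GELU(),
        )
        # Transformer back end: models long-range temporal dependencies
        # such as harmonic progressions spanning the whole clip.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, mel):
        # mel: (batch, n_mels, time) log-mel spectrogram
        x = self.conv(mel)           # (batch, d_model, time) local features
        x = x.transpose(1, 2)        # (batch, time, d_model) for the transformer
        return self.transformer(x)   # frame-level contextual representations
```

The division of labor mirrors the dual structure the abstract claims: the convolutions operate on short windows (local texture), while self-attention relates frames across the full sequence (global harmonic structure).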
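For the CHA strategy, the abstract specifies only that it uses multi-level contrastive objectives with temporal consistency; the concrete loss is not given. Below is a hedged sketch of one plausible level of such an objective, using a standard InfoNCE-style loss with in-batch negatives. The pairing scheme (a segment and a harmonically or temporally related view of it) and the temperature value are our assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(anchor, positive, temperature=0.1):
    """InfoNCE-style loss: a stand-in for one level of a CHA-like objective.

    anchor, positive: (batch, dim) embeddings of paired views, e.g. a
    segment and a temporally adjacent or harmonically related segment.
    Other pairs in the batch serve as negatives. Illustrative only.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Each anchor should score highest against its own positive.
    return F.cross_entropy(logits, targets)
```

Applying such a loss at several granularities (frame, segment, clip) would realize the "multi-level" alignment the abstract describes, pulling representations of related musical material together while pushing unrelated material apart.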