Research on a Multimodal Emotion Perception Model Based on a GCN+GIN Hybrid Model
Abstract
Graph neural networks (GNNs) have demonstrated strong performance on graph-structured data in recent years, particularly in capturing complex relationships among data samples, showing advantages over traditional neural networks. However, challenges persist, including difficulties in cross-modal information fusion, inadequate modeling of inter-modal relationships, and high computational costs. To address these limitations, this paper proposes GGMEN, a novel model that integrates the local neighborhood aggregation capability of graph convolutional networks (GCNs) with the global structural expressiveness of graph isomorphism networks (GINs). For shallow feature extraction, joint time-frequency analysis is used to extract 14 representative statistical features from physiological signals. In parallel, a Transformer captures spatial features from individual facial expression video frames, enabling spatio-temporal modeling of facial expressions. The GCN layer models temporal dependencies in physiological signals and spatial relationships among facial key points, while the GIN layer strengthens the modeling of complex higher-order relationships. Multimodal emotion perception is achieved through attention-based modality fusion. Experiments on the DEAP dataset validate the model's effectiveness across multiple emotion perception benchmarks, achieving an emotion recognition accuracy of 81.25%. Comparative analyses with existing models confirm that the proposed framework improves recognition accuracy.
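To make the described architecture concrete, the following is a minimal sketch (not the authors' implementation) of a GCN+GIN hybrid with attention-based modality fusion, assuming PyTorch Geometric. The layer sizes, the 14-dimensional physiological feature input, the facial-feature dimension, and the number of emotion classes are illustrative assumptions.

```python
# Minimal sketch of a GCN+GIN hybrid with attention-based modality fusion.
# All dimensions and class names are illustrative assumptions, not the
# authors' published code.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, GINConv, global_mean_pool


class ModalityBranch(nn.Module):
    """GCN layer for local neighborhood aggregation, followed by a GIN
    layer for higher-order structural modeling, pooled per graph."""

    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.gcn = GCNConv(in_dim, hidden_dim)
        self.gin = GINConv(nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)))

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.gcn(x, edge_index))
        h = torch.relu(self.gin(h, edge_index))
        return global_mean_pool(h, batch)          # one embedding per graph


class GGMENSketch(nn.Module):
    """Two modality branches (physiological signals, facial key points)
    fused with learned attention weights, then classified."""

    def __init__(self, physio_dim=14, face_dim=64, hidden_dim=64, n_classes=4):
        super().__init__()
        self.physio_branch = ModalityBranch(physio_dim, hidden_dim)
        self.face_branch = ModalityBranch(face_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim, 1)       # scores each modality embedding
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, physio, face):
        # Each input is a (node_features, edge_index, batch) triple.
        z_p = self.physio_branch(*physio)          # (batch, hidden_dim)
        z_f = self.face_branch(*face)
        z = torch.stack([z_p, z_f], dim=1)         # (batch, 2, hidden_dim)
        w = torch.softmax(self.attn(z), dim=1)     # attention over modalities
        fused = (w * z).sum(dim=1)                 # weighted fusion
        return self.classifier(fused)              # emotion logits
```

In this sketch, each modality is represented as a graph (e.g., physiological feature nodes connected over time, facial key points connected spatially); the attention weights determine how much each modality's graph embedding contributes to the final prediction.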