Multi-Domain Feature Enhancement and Fusion Transformer with Bilateral Facial Structure Awareness for Robust and Cross-Domain Facial Expression Recognition

Abstract

Facial expression recognition (FER) is essential for affective computing and human–computer interaction, but robust performance under unconstrained conditions (illumination, pose, occlusion, cultural diversity) remains difficult to achieve. Traditional CNNs capture local detail but struggle to model global dependencies, while Vision Transformers (ViTs) capture global context yet often overlook the fine-grained texture and frequency cues that are crucial for discriminating subtle expressions. To address these issues, we propose a unified Multi-Domain Feature Enhancement and Fusion (MDFEF) framework that combines a ViT-based global encoder with channel, spatial, and frequency branches for complementary feature learning. Taking into account the approximate bilateral symmetry of human faces and the asymmetric distortions introduced by pose, occlusion, and illumination, MDFEF learns symmetry-aware and asymmetry-robust representations for facial expression recognition across diverse domains. An adaptive Cross-Domain Feature Enhancement and Fusion (CDFEF) module further aligns and integrates these heterogeneous features, yielding domain-consistent and illumination-robust expression understanding. Experiments on KDEF, FER2013, and RAF-DB show that the proposed model outperforms representative CNN-, Transformer-, and ensemble-based baselines in both accuracy and F1-score, confirming its effectiveness and strong generalization for real-world FER.
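To make the multi-branch design concrete, the sketch below illustrates how a backbone feature map could be refined by parallel channel, spatial, and frequency branches and then fused. It is a minimal PyTorch sketch, assuming SE-style channel attention, a convolutional spatial attention map, an FFT-magnitude frequency branch, and concatenation followed by a 1x1 convolution as the fusion rule; the paper's actual MDFEF and CDFEF modules are not specified here and may differ.

# Minimal PyTorch sketch of a multi-branch feature-enhancement design in the
# spirit of MDFEF. All branch definitions, dimensions, and the fusion rule are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class ChannelBranch(nn.Module):
    """Squeeze-and-excitation-style channel reweighting (assumed design)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # global average pool -> (B, C)
        return x * w[:, :, None, None]         # reweight channels


class SpatialBranch(nn.Module):
    """Single-channel spatial attention map (assumed design)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=7, padding=3)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))


class FrequencyBranch(nn.Module):
    """Frequency cue from the 2-D FFT magnitude of the features (assumed design)."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        mag = torch.fft.fft2(x, norm="ortho").abs()   # spectral magnitude
        return self.proj(mag)


class MultiBranchFusion(nn.Module):
    """Concatenate branch outputs and fuse with a 1x1 conv (assumed rule)."""
    def __init__(self, channels, num_classes=7):
        super().__init__()
        self.channel = ChannelBranch(channels)
        self.spatial = SpatialBranch(channels)
        self.freq = FrequencyBranch(channels)
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):                      # x: backbone feature map (B, C, H, W)
        f = torch.cat([self.channel(x), self.spatial(x), self.freq(x)], dim=1)
        f = self.fuse(f).mean(dim=(2, 3))      # fuse, then global average pool
        return self.head(f)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 14, 14)         # e.g. ViT patch tokens reshaped to a grid
    logits = MultiBranchFusion(64)(feats)
    print(logits.shape)                        # torch.Size([2, 7])

In this sketch the three branches operate on the same backbone features and are merged by concatenation; the paper's adaptive CDFEF module presumably replaces this fixed fusion with a learned, domain-aware alignment.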
