Multimodal Model Based on Contrastive Language-Image Pretraining for Micro-Expression Recognition
Abstract
Recognizing involuntary, low-intensity micro-expressions (MEs) is challenging due to their subtlety and the lack of large-scale annotated data. This study introduces MECLIP, a novel dual-modal framework based on Contrastive Language-Image Pretraining (CLIP), to enhance ME recognition accuracy through enriched semantic supervision. MECLIP reformulates the CLIP architecture by incorporating a hierarchical temporal transformer to model visual dynamics and leverages a large language model to generate fine-grained physiological descriptors as textual guidance. An adaptive weighting mechanism fuses these spatiotemporal visual features with the nuanced textual semantics via contrastive learning. On the CAS(ME)³ dataset, MECLIP achieved a state-of-the-art unweighted average recall (UAR) of 39.6% and an unweighted F1-score (UF1) of 40.0%, outperforming existing benchmarks. The model also demonstrated strong zero-shot learning capabilities on the CASME II dataset. Language-augmented multimodal learning presents a promising paradigm for improving micro-expression analysis, effectively compensating for data scarcity through fine-grained semantic feature alignment.
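To make the abstract's pipeline concrete, the sketch below illustrates one plausible reading of the described components: a temporal transformer pooling per-frame visual features, a gated (adaptive-weighting) fusion of class-name and LLM-generated descriptor embeddings, and a symmetric CLIP-style contrastive loss aligning the two modalities. All module names, dimensions, the gating form, and the placeholder inputs are illustrative assumptions, not the authors' released implementation.

```python
# Minimal PyTorch sketch of the CLIP-style alignment described in the abstract.
# Module names, dimensions, and the adaptive-gating form are assumptions for
# illustration only; they do not reproduce the MECLIP codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalVisualEncoder(nn.Module):
    """Encodes a clip of per-frame features with a small transformer over time."""

    def __init__(self, feat_dim=512, embed_dim=256, num_layers=2, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, frame_feats):          # (B, T, feat_dim)
        x = self.temporal(frame_feats)       # model dynamics across frames
        x = x.mean(dim=1)                    # pool over the temporal axis
        return F.normalize(self.proj(x), dim=-1)


class AdaptiveTextEncoder(nn.Module):
    """Fuses class-name and physiological-descriptor embeddings with a learned gate."""

    def __init__(self, text_dim=512, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(text_dim, embed_dim)
        self.gate = nn.Sequential(nn.Linear(2 * text_dim, 1), nn.Sigmoid())

    def forward(self, label_emb, descriptor_emb):   # both (B, text_dim)
        w = self.gate(torch.cat([label_emb, descriptor_emb], dim=-1))
        fused = w * label_emb + (1.0 - w) * descriptor_emb
        return F.normalize(self.proj(fused), dim=-1)


def clip_contrastive_loss(vis, txt, temperature=0.07):
    """Symmetric InfoNCE loss aligning matched visual/text pairs."""
    logits = vis @ txt.t() / temperature
    targets = torch.arange(vis.size(0), device=vis.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, T = 8, 16                              # batch of 8 clips, 16 frames each
    vis_enc, txt_enc = TemporalVisualEncoder(), AdaptiveTextEncoder()
    frame_feats = torch.randn(B, T, 512)      # placeholder per-frame features
    label_emb = torch.randn(B, 512)           # placeholder class-name embeddings
    desc_emb = torch.randn(B, 512)            # placeholder LLM descriptor embeddings
    loss = clip_contrastive_loss(vis_enc(frame_feats), txt_enc(label_emb, desc_emb))
    print(loss.item())
```

The gate here weighs the coarse class label against the fine-grained descriptor per sample, which is one simple way an "adaptive weighting mechanism" could trade off the two text sources before contrastive alignment.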