Multimodal Model Based on Contrastive Language-Image Pretraining for Micro-Expression Recognition

Abstract

Recognizing involuntary, low-intensity micro-expressions (MEs) is challenging due to their subtlety and the lack of large-scale annotated data. This study introduces MECLIP, a dual-modal framework based on Contrastive Language-Image Pretraining (CLIP), designed to enhance ME recognition accuracy through enriched semantic supervision. MECLIP extends the CLIP architecture by incorporating a hierarchical temporal transformer to model visual dynamics and leverages a large language model to generate fine-grained physiological descriptors as textual guidance. An adaptive weighting mechanism fuses these spatiotemporal visual features with the nuanced textual semantics via contrastive learning. On the CAS(ME)³ dataset, MECLIP achieved a state-of-the-art unweighted average recall (UAR) of 39.6% and an unweighted F1-score (UF1) of 40.0%, outperforming existing benchmarks. The model also demonstrated strong zero-shot learning capabilities on the CASME II dataset. Language-augmented multimodal learning thus presents a promising paradigm for micro-expression analysis, compensating for data scarcity through fine-grained semantic feature alignment.
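To make the described pipeline concrete, the sketch below illustrates one plausible reading of the architecture: per-frame visual features pass through a small temporal transformer, LLM-generated physiological descriptors are encoded as text embeddings, the two branches are aligned with a symmetric contrastive (InfoNCE-style) loss, and a learned gate fuses them for classification. This is a minimal illustration, not the authors' implementation; the encoder stand-ins, feature dimensions, gate design, and loss weighting are all assumptions.

```python
# Minimal sketch (assumed design, not the authors' code) of a CLIP-style
# dual-branch model for micro-expression recognition with a temporal
# transformer, contrastive alignment, and adaptive modality weighting.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalTransformer(nn.Module):
    """Models dynamics across the T frame embeddings of one ME clip."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats):           # (B, T, dim)
        out = self.encoder(frame_feats)       # (B, T, dim)
        return out.mean(dim=1)                # temporal pooling -> (B, dim)


class MEContrastiveModel(nn.Module):
    def __init__(self, feat_dim=768, embed_dim=512, num_classes=3):
        super().__init__()
        # Linear projections stand in for the CLIP image/text encoders,
        # which would supply the raw feat_dim features.
        self.image_proj = nn.Linear(feat_dim, embed_dim)
        self.text_proj = nn.Linear(feat_dim, embed_dim)
        self.temporal = TemporalTransformer(embed_dim)
        # Adaptive weighting: a learned gate balancing the two modalities
        # before classification (assumed form, for illustration only).
        self.gate = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(embed_dim, num_classes)
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T, feat_dim) per-frame visual features
        # text_feats:  (B, feat_dim) embedding of the LLM-generated descriptor
        v = self.temporal(self.image_proj(frame_feats))    # (B, embed_dim)
        t = self.text_proj(text_feats)                     # (B, embed_dim)
        v_n, t_n = F.normalize(v, dim=-1), F.normalize(t, dim=-1)

        # Symmetric InfoNCE: matched clip/descriptor pairs lie on the diagonal.
        logits = self.logit_scale.exp() * v_n @ t_n.t()    # (B, B)
        labels = torch.arange(logits.size(0), device=logits.device)
        contrastive = 0.5 * (F.cross_entropy(logits, labels)
                             + F.cross_entropy(logits.t(), labels))

        # Gated fusion of the two modalities for the ME class prediction.
        w = self.gate(torch.cat([v, t], dim=-1))           # (B, 1) in (0, 1)
        fused = w * v + (1 - w) * t
        return self.classifier(fused), contrastive
```

In this reading, the contrastive term supplies the fine-grained semantic supervision that compensates for scarce ME annotations, while the gate lets the model lean on whichever modality is more informative per sample; the actual fusion and loss formulation in MECLIP may differ.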
