EGY-MER: Establishing The First Egyptian Arabic Multimodal Emotion Recognition Dataset for Affective Computing
Abstract
This paper introduces EGY-MER, a new multimodal dataset for emotion recognition in Egyptian Arabic, which aims to fill an important gap in affective computing for this dialect. The data samples were collected and organized using the MODALINK pipeline, which provides synchronized multimodal alignment and high-quality annotations. Each sample comprises transcribed speech, the corresponding audio, and facial frames, together with an associated emotion category. Three pretrained encoders were used to establish baseline results: text was processed with AraBERTv2, speech with Wav2Vec2-ER, and vision with a Swin Transformer. A late-fusion strategy combined the high-level representations from each encoder. Baseline experiments showed that combining the modalities improves emotion recognition performance over the unimodal configurations. Weighted-F1 and macro-F1 scores suggest the potential of cross-modal features for capturing affective cues in Egyptian Arabic, and the results demonstrate the dataset's consistency and applicability to multimodal learning research. This work presents the first dataset for multimodal emotion recognition in Egyptian Arabic, along with reproducible baselines. The aim is for the dataset and the provided benchmark models to facilitate further research on emotion recognition for low-resource languages, multimodal fusion, and affective computing in Arabic.
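The late-fusion baseline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimensions, the number of emotion categories, and the random stand-in vectors (in place of real AraBERTv2, Wav2Vec2-ER, and Swin Transformer outputs) are all assumptions made for the example, and the linear classification head uses placeholder weights rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in pooled embeddings for one sample. The 768-dim size is an
# assumption for illustration; the paper does not specify feature sizes.
text_emb = rng.standard_normal(768)    # stands in for an AraBERTv2 text embedding
audio_emb = rng.standard_normal(768)   # stands in for Wav2Vec2-ER audio features
vision_emb = rng.standard_normal(768)  # stands in for Swin Transformer visual features

# Late fusion: concatenate the high-level representation from each
# modality encoder into a single joint feature vector.
fused = np.concatenate([text_emb, audio_emb, vision_emb])  # shape (2304,)

# Classify the fused vector with a linear head followed by softmax.
# num_emotions and the weights are hypothetical placeholders.
num_emotions = 6
W = rng.standard_normal((num_emotions, fused.shape[0])) * 0.01
b = np.zeros(num_emotions)

logits = W @ fused + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

predicted = int(np.argmax(probs))
print(fused.shape, predicted)
```

The design point this illustrates is that late fusion keeps each encoder independent: only the final pooled representations interact, at the classification head, rather than at intermediate layers as in early or mid-level fusion.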