A Word2vec-BERT Model for Enhanced Sentiment Analysis of Arabic Social Media

Abstract

Let T denote a corpus of short, colloquial texts from social media, where each text t ∈ T is composed of a sequence of tokens t = (w1, w2, …, wn), with wi representing a word, emoji, or other linguistic unit. Sentiment analysis on T, particularly for Arabic texts, is challenging due to linguistic complexity, dialectal variations, code-switching, and the pervasive use of emojis. Traditional methods, such as lexicon-based approaches and standalone embedding models (e.g., Word2vec), suffer from semantic sparsity and lack contextual awareness, while transformer-based models like BERT face computational inefficiency and domain-specific adaptation issues. We propose a novel hybrid framework H that integrates Word2vec's semantic richness, learned from a large corpus, with BERT's contextualized representations, augmented by a comprehensive emoji-to-text translation module E. Formally, let vWord2vec(wi) and vBERT(wi) denote the embeddings of token wi generated by Word2vec and BERT, respectively. The hybrid embedding vH(wi) is defined as: vH(wi) = α · vWord2vec(wi) + (1 − α) · vBERT(wi), where α ∈ [0, 1] is a weighting parameter optimized during training. The emoji translation module E maps each emoji ej to its corresponding textual sentiment representation vE(ej), which is concatenated with vH(wi) to form the final input representation vfinal(wi). Our model is evaluated on the Arabic Sentiment Twitter Corpus D, consisting of 58K Arabic tweets (47K training, 11K test). The framework achieves state-of-the-art performance with an accuracy A = 91.63% and a Macro-F1 score F1 = 0.9162. These results significantly outperform standalone Word2vec embeddings (∆A = +19.13%) and fine-tuned BERT models (∆A = +9.03%). Ablation studies confirm the critical roles of both the emoji translation module (∆A = +11.33%) and the hybrid embeddings (∆A = +19.13% against a Word2vec-only baseline).
The implementation, curated dataset D, and pre-trained models are made openly available to ensure full reproducibility and serve as a benchmark for future research in Arabic social media analysis.
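The combination rule in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' released implementation: the function names, the α value, and the toy vectors are all assumptions, and it presumes the Word2vec and BERT embeddings have already been projected to a common dimension.

```python
import numpy as np

def hybrid_embedding(v_word2vec, v_bert, alpha=0.7):
    """vH(wi) = alpha * vWord2vec(wi) + (1 - alpha) * vBERT(wi).
    alpha is a weighting parameter in [0, 1]; 0.7 here is illustrative,
    as the paper learns it during training."""
    v_word2vec = np.asarray(v_word2vec, dtype=float)
    v_bert = np.asarray(v_bert, dtype=float)
    return alpha * v_word2vec + (1.0 - alpha) * v_bert

def final_representation(v_h, v_emoji):
    """Concatenate the hybrid token embedding with the emoji-translation
    vector vE(ej) to form vfinal(wi)."""
    return np.concatenate([v_h, np.asarray(v_emoji, dtype=float)])

# Toy 4-dimensional example with a hypothetical 2-dimensional emoji vector
v_h = hybrid_embedding([1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], alpha=0.7)
v_final = final_representation(v_h, [0.2, 0.8])
# v_h is the elementwise weighted average; v_final has length 4 + 2 = 6
```

In practice the emoji module would first map each emoji to a textual sentiment description, which is then embedded before concatenation; the toy vector above stands in for that step.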
