A Word2vec-BERT Model for Enhanced Sentiment Analysis of Arabic Social Media

Abstract

Let T denote a corpus of short, colloquial texts from social media, where each text t ∈ T is composed of a sequence of tokens t = (w1, w2, …, wn), with wi representing a word, emoji, or other linguistic unit. Sentiment analysis on T, particularly for Arabic texts, is challenging due to linguistic complexity, dialectal variations, code-switching, and the pervasive use of emojis. Traditional methods, such as lexicon-based approaches and standalone embedding models (e.g., Word2vec), suffer from semantic sparsity and lack contextual awareness, while transformer-based models like BERT face computational inefficiency and domain-specific adaptation issues. We propose a novel hybrid framework H that integrates Word2vec's semantic richness, learned from a large corpus, with BERT's contextualized representations, augmented by a comprehensive emoji-to-text translation module E. Formally, let vWord2vec(wi) and vBERT(wi) denote the embeddings of token wi generated by Word2vec and BERT, respectively. The hybrid embedding vH(wi) is defined as: vH(wi) = α · vWord2vec(wi) + (1 − α) · vBERT(wi), where α ∈ [0, 1] is a weighting parameter optimized during training. The emoji translation module E maps each emoji ej to its corresponding textual sentiment representation vE(ej), which is concatenated with vH(wi) to form the final input representation vfinal(wi). Our model is evaluated on the Arabic Sentiment Twitter Corpus D, consisting of 58K Arabic tweets (47K training, 11K test). The framework achieves state-of-the-art performance with an accuracy A = 91.63% and a Macro-F1 score F1 = 0.9162. These results significantly outperform standalone Word2vec embeddings (∆A = +19.13%) and fine-tuned BERT models (∆A = +9.03%). Ablation studies confirm the critical roles of both the emoji translation module (∆A = +11.33%) and the hybrid embeddings (∆A = +19.13% against a Word2vec-only baseline).
The implementation, curated dataset D, and pre-trained models are made openly available to ensure full reproducibility and serve as a benchmark for future research in Arabic social media analysis.
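The combination rule in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' released implementation: the function names, the α value, and the toy vectors are all assumptions, and it presumes the Word2vec and BERT embeddings have already been projected to a common dimension.

```python
import numpy as np

def hybrid_embedding(v_word2vec, v_bert, alpha=0.7):
    """vH(wi) = alpha * vWord2vec(wi) + (1 - alpha) * vBERT(wi).
    alpha is a weighting parameter in [0, 1]; 0.7 here is illustrative,
    as the paper learns it during training."""
    v_word2vec = np.asarray(v_word2vec, dtype=float)
    v_bert = np.asarray(v_bert, dtype=float)
    return alpha * v_word2vec + (1.0 - alpha) * v_bert

def final_representation(v_h, v_emoji):
    """Concatenate the hybrid token embedding with the emoji-translation
    vector vE(ej) to form vfinal(wi)."""
    return np.concatenate([v_h, np.asarray(v_emoji, dtype=float)])

# Toy 4-dimensional example with a hypothetical 2-dimensional emoji vector
v_h = hybrid_embedding([1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], alpha=0.7)
v_final = final_representation(v_h, [0.2, 0.8])
# v_h is the elementwise weighted average; v_final has length 4 + 2 = 6
```

In practice the emoji module would first map each emoji to a textual sentiment description, which is then embedded before concatenation; the toy vector above stands in for that step.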
