Analyzing Multilingual Conversations During COVID-19: An Imbalanced Class-Ensemble Learning Approach with Reweighted AdaBoost-SVM for Code-Switched Text Classification


Abstract

This study addresses the challenge of analyzing multilingual, code-switched conversations during the COVID-19 pandemic, a setting in which traditional classifiers often fall short. We developed a cost-sensitive ensemble learning approach, a reweighted AdaBoost model with an SVM as its base learner, designed to handle the class imbalance common in code-switched communication. The key innovation of our approach is a rebalancing of the AdaBoost weights: by incrementally adjusting the weights of misclassified samples from both the minority and majority classes, we achieve a more balanced classification in each iteration. This strategy improves classification accuracy on the minority class, a common weakness of existing models. In the testing phase, we evaluated a range of machine learning and deep learning classifiers, including Naive Bayes, Decision Trees, SMOTEBoost, CNN, and Bi-LSTM, among others. All classifiers were evaluated on two multilingual datasets using six metrics, including P-mean. Our experiments show that the proposed ensemble approach, with tuned hyperparameters and M-BERT for feature extraction, achieved accuracies of 78.84%, 86.56% and 83.96% on the test sets of the CTSA, TUNIZI and combined CTSA-TUNIZI datasets, respectively. This performance surpassed traditional classification methods and also outperformed deep learning models such as Bi-LSTM.
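The abstract does not state the exact reweighting rule, so the following is only a minimal sketch of what one cost-sensitive AdaBoost weight update could look like. It assumes binary labels, the standard AdaBoost learner weight, and a hypothetical `minority_boost` factor that raises the weight of misclassified minority samples more than that of misclassified majority samples. The function name, the parameter, and the value 1.5 are illustrative assumptions, not the authors' method. The base learner (an SVM in the paper) is left out; only its predictions are needed here.

```python
import math

def reweighted_update(weights, y_true, y_pred, minority_label, minority_boost=1.5):
    """One boosting round: return (alpha, new normalized sample weights).

    Assumption: misclassified samples of BOTH classes are up-weighted,
    with an extra hypothetical factor `minority_boost` for the minority class.
    """
    total = sum(weights)
    # Weighted error of the current base learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
    # Standard AdaBoost learner weight.
    alpha = 0.5 * math.log((1 - err) / err)

    new_weights = []
    for w, t, p in zip(weights, y_true, y_pred):
        if t != p:
            factor = math.exp(alpha)          # up-weight any mistake
            if t == minority_label:
                factor *= minority_boost      # extra emphasis on minority mistakes
        else:
            factor = math.exp(-alpha)         # down-weight correct predictions
        new_weights.append(w * factor)

    norm = sum(new_weights)
    return alpha, [w / norm for w in new_weights]

# Toy round: one minority sample (label 1) and four majority samples (label -1).
y_true = [1, -1, -1, -1, -1]
y_pred = [-1, -1, -1, -1, 1]   # one mistake in each class
alpha, new_w = reweighted_update([0.2] * 5, y_true, y_pred, minority_label=1)
```

In this toy round the misclassified minority sample ends up with the largest weight, the misclassified majority sample with the next largest, and the correctly classified samples with the smallest, so the next base learner is pushed toward both kinds of error while favoring the minority class.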
