Comparing Cost-Sensitive and Data-Level Strategies to Address Extreme Class Imbalance in Educational Review Sentiment Analysis

Abstract

Background: Sentiment analysis of educational reviews is hampered by class imbalance: positive reviews (4-5 stars) predominate, and critical negative feedback is scarce. Models trained on such data fail to recognize minority views, which limits their usefulness for measuring the quality of education.

Objective: This paper explores and compares class imbalance management strategies for improving sentiment classification, particularly for the minority classes in educational review data.

Methods: A large-scale dataset of 107,018 educational course reviews was analyzed and found to be severely imbalanced (32:1 between the majority and minority classes). Five imbalance handling methods were systematically compared: a baseline classifier, cost-sensitive weighted learning, manual oversampling, weighted ensemble learning, and combined resampling methods. Each was evaluated on macro F1-score, per-class F1-scores, and statistical significance.

Results: Weighted Logistic Regression was the best method, with the largest gain in macro F1-score (from 0.3691 to 0.4087, a 10.7% relative change) compared to the traditional methods. The strategy showed significant improvement across the rating classes: 39.3% in 2-star reviews, 40.9% in 3-star reviews, 49.7% in 4-star reviews, and 99.9% in 5-star reviews. The statistical analysis revealed significant improvements across all underrepresented classes.

Conclusions: This study demonstrates that basic cost-sensitive learning strategies can effectively counteract extreme class imbalance in educational sentiment analysis, without the need for complicated resampling techniques or ensemble methods. The results offer practical recommendations for building more balanced and reliable sentiment analysis systems for academic use, so that the critical feedback needed to improve the quality of education can be identified.
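
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the abstract does not say which weighting scheme was used, so the inverse-frequency "balanced" heuristic below (the same formula scikit-learn applies for `class_weight="balanced"`) is an assumption, and the function names and toy labels are hypothetical. The sketch shows how a 32:1 imbalance translates into class weights, how macro F1 penalizes a classifier that ignores the minority class, and how the reported 10.7% follows from the two macro F1 values.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: the abstract does not state its weighting scheme;
    this is the common "balanced" heuristic.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_change(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100


# Hypothetical toy labels reproducing a 32:1 majority/minority ratio.
y_true = [5] * 32 + [1]
weights = balanced_class_weights(y_true)
print(weights[1] / weights[5])  # the minority class is weighted 32x heavier

# A classifier that always predicts the majority class scores poorly on macro F1.
print(macro_f1(y_true, [5] * 33))

# Macro F1 values reported in the abstract.
print(round(relative_change(0.3691, 0.4087), 1))
```

In a cost-sensitive classifier these weights multiply each sample's contribution to the loss, so an error on a rare negative review costs as much as many errors on common positive ones, and no resampling of the data is needed.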
