Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Upsampling under Varying Imbalance Levels
Abstract
This study examines the efficacy of Random Forest and XGBoost classifiers in conjunction with three upsampling techniques, SMOTE, ADASYN, and Gaussian Noise Up-Sampling (GNUS), across datasets with class imbalance ranging from moderate to extreme (churn rates from 15% down to 1%). Using F1-Score, ROC AUC, PR AUC, Matthews Correlation Coefficient (MCC), and Cohen's Kappa, the research provides a comprehensive evaluation of classifier performance under different imbalance scenarios, with a focus on applications in the telecommunications domain. The findings show that XGBoost paired with SMOTE consistently achieved the highest F1-Score and remained robust across all imbalance levels, making SMOTE the most effective upsampling method, whereas Random Forest performed poorly under severe imbalance. ADASYN was moderately effective with XGBoost but underperformed with Random Forest, and GNUS produced inconsistent results. The study also underscores the impact of data imbalance itself: MCC, Kappa, and F1-Score fluctuated substantially across imbalance levels, whereas ROC AUC and PR AUC remained relatively stable. Rigorous statistical analysis using the Friedman test and Nemenyi post-hoc comparisons confirmed that the observed improvements in F1-Score, PR AUC, Kappa, and MCC were statistically significant (p < 0.05), with Tuned_XGB_SMOTE significantly outperforming Tuned_RF_GNUS. Although differences in ROC AUC were not significant, the consistency of these results across multiple performance metrics underscores the reliability of the framework, offering a statistically validated and practical basis for model selection in imbalanced classification scenarios.
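As a rough illustration of the kind of evaluation pipeline described above (not the authors' code), the sketch below pairs SMOTE with XGBoost and scores the held-out predictions with the five metrics used in the study. It assumes scikit-learn, imbalanced-learn, and xgboost, and runs on a synthetic dataset with a 5% positive rate (between the 15% and 1% extremes studied). The gnus_upsample helper is a hypothetical implementation of Gaussian Noise Up-Sampling; the paper's exact GNUS procedure may differ.

```python
# Illustrative sketch of one upsampler + classifier evaluation run.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import (f1_score, roc_auc_score, average_precision_score,
                             matthews_corrcoef, cohen_kappa_score)
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

def gnus_upsample(X, y, minority_class=1, noise_scale=0.05, random_state=0):
    """Hypothetical Gaussian Noise Up-Sampling: replicate minority samples
    with small Gaussian perturbations until the classes are balanced."""
    rng = np.random.default_rng(random_state)
    X_min = X[y == minority_class]
    n_needed = (y != minority_class).sum() - len(X_min)
    idx = rng.integers(0, len(X_min), size=n_needed)
    noise = rng.normal(0.0, noise_scale * X_min.std(axis=0),
                       size=(n_needed, X.shape[1]))
    X_new = X_min[idx] + noise
    y_new = np.full(n_needed, minority_class)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# Synthetic dataset with ~5% minority ("churn") class.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.3, random_state=42)

# Resample the training split only, so the test set keeps its natural
# imbalance. Swap in ADASYN() or gnus_upsample(X_tr, y_tr) to compare.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

clf = XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42)
clf.fit(X_res, y_res)
proba = clf.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)

print(f"F1:      {f1_score(y_te, pred):.3f}")
print(f"ROC AUC: {roc_auc_score(y_te, proba):.3f}")
print(f"PR AUC:  {average_precision_score(y_te, proba):.3f}")
print(f"MCC:     {matthews_corrcoef(y_te, pred):.3f}")
print(f"Kappa:   {cohen_kappa_score(y_te, pred):.3f}")
```

Per-fold metric scores collected from repeated runs of such pipelines could then be compared with scipy.stats.friedmanchisquare and, for the post-hoc step, scikit_posthocs.posthoc_nemenyi_friedman, mirroring the statistical validation reported in the abstract.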