The Decline of Synthetic Oversampling in Large-Scale Imbalanced Learning: A Post-SMOTE Empirical and Theoretical Study (2020–2025)
Abstract
For over twenty years, SMOTE has been the standard default for addressing class imbalance. Yet a striking paradox has emerged: while researchers continue citing SMOTE extensively, practitioners have largely abandoned it in production systems. To understand this disconnect, we conducted a large-scale empirical and theoretical study of 821 papers published between 2020 and 2025, revealing a fundamental paradigm shift. Approximately 30% of new solutions now employ Generative AI (diffusion models), 30% rely on cost-sensitive loss functions, and the remainder explore hybrid approaches. We show mathematically why SMOTE fails at modern scales: its O(N²) complexity exhausts memory on billion-sample datasets, its nearest-neighbor logic distorts high-dimensional manifolds, and its CPU-bound design is incompatible with GPU pipelines. Beyond theory, we identify the novel “SMOTE Paradox”: the critical divergence between academic citations and real-world deployment. This systematic study maps the emerging post-SMOTE landscape, synthesizes theoretical foundations across three competing paradigms, and provides decision rules for practitioners. Empirically, we validate these findings on real-world fraud data (N = 284,807, imbalance ratio 578:1), confirming that cost-sensitive learning achieves parity with oversampling (+0.29% F1-score improvement) while eliminating preprocessing overhead. Our contributions include: (1) quantitative documentation of the paradigm shift through systematic analysis, (2) mathematical proofs of SMOTE’s failure modes, (3) the novel SMOTE Paradox framework, and (4) empirical validation on production-scale data. We conclude with actionable guidelines for practitioners and identify two critical open problems for future research in large-scale imbalanced learning.
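To make the comparison described above concrete, the following is a minimal sketch (not the paper's code) of how cost-sensitive learning via class weights can be contrasted with SMOTE oversampling on an imbalanced binary task. It assumes scikit-learn and imbalanced-learn as tooling and uses a synthetic dataset as a stand-in for the fraud data cited in the abstract; the sample size, feature count, and model choice are illustrative assumptions only.

```
# Minimal sketch: cost-sensitive learning (class weights) vs. SMOTE oversampling.
# Synthetic stand-in data; not the fraud dataset or the models used in the study.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # assumed dependency: imbalanced-learn

# Synthetic imbalanced data (roughly 580:1, mimicking the abstract's setting).
X, y = make_classification(
    n_samples=100_000, n_features=30, weights=[0.9983, 0.0017], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: cost-sensitive learning via class weights (no resampling step).
clf_cs = LogisticRegression(max_iter=1000, class_weight="balanced")
clf_cs.fit(X_tr, y_tr)
f1_cs = f1_score(y_te, clf_cs.predict(X_te))

# Option 2: SMOTE oversampling of the training set, then an unweighted model.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf_sm = LogisticRegression(max_iter=1000)
clf_sm.fit(X_sm, y_sm)
f1_sm = f1_score(y_te, clf_sm.predict(X_te))

print(f"F1 (cost-sensitive): {f1_cs:.4f}")
print(f"F1 (SMOTE):          {f1_sm:.4f}")
```

The design point the abstract makes is visible in the structure of this sketch: the cost-sensitive path needs no resampling pass over the data, whereas the SMOTE path adds a nearest-neighbor-based preprocessing step whose cost grows with the size of the training set.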