Benchmarking Ensemble Machine Learning Algorithms for the Early Prediction of Stroke in Imbalanced Clinical Cohorts: A Comparative Analysis and Decision Curve Assessment

Abstract

Background: Stroke remains a leading cause of mortality and long-term disability worldwide, making effective primary prevention essential. While machine learning (ML) models offer superior predictive performance compared to traditional linear risk scores, their application in clinical practice is often hindered by the class imbalance problem: the rarity of stroke events leads to biased, low-sensitivity models. Furthermore, the literature currently lacks rigorous head-to-head benchmarking of modern boosting algorithms on moderate-sized clinical datasets. This study aimed to identify the optimal predictive model for stroke by systematically benchmarking seven ensemble algorithms and validating their clinical utility using Decision Curve Analysis (DCA).

Methods: We analyzed a retrospective multi-center cohort of 5,110 patients characterized by severe class imbalance (4.9% stroke incidence). Feature engineering included the encoding of sociodemographic determinants and clinical biomarkers. We conducted a 10-fold stratified cross-validation tournament comparing seven classifiers: Linear Discriminant Analysis (LDA), Extra Trees, AdaBoost, Gradient Boosting, XGBoost, LightGBM, and CatBoost. Performance was evaluated using the Area Under the Receiver Operating Characteristic Curve (AUC) and the Brier score for calibration. To address clinical safety, decision thresholds were optimized to maximize sensitivity. Clinical utility was assessed with Decision Curve Analysis, quantifying net benefit across clinically relevant risk thresholds.

Results: The classical Gradient Boosting classifier emerged as the top-performing model, achieving a mean AUC of 0.842 (95% CI: 0.82–0.86). It statistically outperformed both the linear baseline (LDA, AUC = 0.833) and more complex modern implementations such as XGBoost (AUC = 0.787) and Extra Trees (AUC = 0.748). With the decision threshold tuned to 0.01, the champion model achieved a screening sensitivity of 86.0% and specificity of 53.6%. SHAP (SHapley Additive exPlanations) analysis identified age, average glucose level, and BMI as the dominant non-linear predictors. Crucially, Decision Curve Analysis demonstrated that the Gradient Boosting model provided a higher net clinical benefit than "treat-all" or "treat-none" strategies across threshold probabilities of 1% to 40%.

Conclusion: Contrary to current trends favoring deep learning or complex boosting implementations, classical Gradient Boosting architectures demonstrated superior generalization on imbalanced tabular clinical data. The developed model combines high discriminatory power with demonstrated clinical utility, supporting its deployment as an automated, high-sensitivity screening tool in primary care settings.
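
For readers who wish to reproduce a comparable benchmark, the following minimal sketch illustrates the kind of 10-fold stratified cross-validation tournament described in the Methods, assuming scikit-learn-compatible implementations of the seven classifiers. The synthetic dataset, variable names, and default hyperparameters are placeholders for illustration, not the study's actual cohort or configuration.

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Synthetic stand-in for the clinical cohort: ~5,110 patients, ~4.9% positives.
X, y = make_classification(n_samples=5110, n_features=10, weights=[0.951],
                           random_state=42)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "ExtraTrees": ExtraTreesClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    "LightGBM": LGBMClassifier(random_state=42),
    "CatBoost": CatBoostClassifier(verbose=0, random_state=42),
}

# Stratified 10-fold CV preserves the rare-event rate in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv,
        scoring={"auc": "roc_auc", "brier": "neg_brier_score"},
    )
    print(f"{name}: AUC={scores['test_auc'].mean():.3f}, "
          f"Brier={-scores['test_brier'].mean():.3f}")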
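
The sensitivity-oriented threshold tuning reported in the Results can be sketched in the same style. Continuing from the benchmark above, this hypothetical snippet derives out-of-fold probabilities for the Gradient Boosting model and applies the low decision threshold of 0.01 cited in the abstract; the exact tuning procedure used in the study may differ.

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Out-of-fold predicted probabilities for the top model (Gradient Boosting).
oof_prob = cross_val_predict(models["GradientBoosting"], X, y, cv=cv,
                             method="predict_proba")[:, 1]

# A deliberately low threshold trades specificity for screening sensitivity.
threshold = 0.01
y_pred = (oof_prob >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print(f"Sensitivity={tp / (tp + fn):.3f}, Specificity={tn / (tn + fp):.3f}")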
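
Finally, the net-benefit comparison from the Decision Curve Analysis follows the standard formulation NB = TP/N - (FP/N) * pt/(1 - pt) at threshold probability pt. The sketch below (reusing y and oof_prob from the snippets above) evaluates the model against the "treat-all" and "treat-none" strategies across the 1% to 40% threshold range; it illustrates the standard method rather than the authors' exact implementation.

import numpy as np

def net_benefit(y_true, y_prob, pt):
    # Net benefit of treating patients whose predicted risk is >= pt.
    pred_pos = y_prob >= pt
    tp = np.sum(pred_pos & (y_true == 1))
    fp = np.sum(pred_pos & (y_true == 0))
    n = len(y_true)
    return tp / n - (fp / n) * pt / (1 - pt)

prevalence = y.mean()
for pt in np.arange(0.01, 0.41, 0.01):
    nb_model = net_benefit(y, oof_prob, pt)
    nb_treat_all = prevalence - (1 - prevalence) * pt / (1 - pt)
    nb_treat_none = 0.0
    print(f"pt={pt:.2f}: model={nb_model:.4f}, "
          f"treat-all={nb_treat_all:.4f}, treat-none={nb_treat_none:.4f}")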
