Mitigating Data Leakage and Class Imbalance in Explainable AI for Stroke Prediction
Abstract
Background: Stroke prediction using machine learning (ML) is highly sensitive to data preprocessing strategies, with significant implications for data leakage and model interpretability.

Methods: Using a public stroke dataset, this study systematically investigates how three key preprocessing components (missing value imputation, class imbalance correction, and clinically guided feature binning) affect the performance and explainability of six ML models: logistic regression (LR), support vector machine (SVM), decision tree (DT), random forest (RF), CatBoost, and XGBoost.

Results: Our findings demonstrate that applying SMOTE before data splitting introduces significant data leakage, artificially inflating AUC values to as high as 0.99, a misleading representation of model performance. Mitigating this leakage by restricting SMOTE to the training set caused a marked drop in performance: CatBoost's recall declined from 0.96 to 0.08, and XGBoost's AUC decreased from 0.99 to 0.84. Similarly, imputing missing values before splitting inflated metrics, albeit to a lesser extent. In contrast, class-weight adjustment, a leakage-free strategy, consistently achieved robust and balanced results (AUC up to 0.86). Clinically guided feature binning improved interpretability with minimal performance trade-off, and SHAP analysis confirmed that improper preprocessing distorted feature importance rankings, reducing the clinical plausibility and trustworthiness of model interpretations.

Conclusion: These findings underscore that rigorous, leakage-free preprocessing is essential for developing reliable, interpretable, and clinically meaningful stroke prediction models. This study offers methodologically grounded guidance for constructing trustworthy AI systems in healthcare.
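The leakage-free ordering the abstract recommends can be made concrete. The following Python sketch, using scikit-learn and imbalanced-learn, illustrates the general pattern under stated assumptions: the file name, the "bmi" and "stroke" column names, and the use of logistic regression (standing in for the six models studied) are all illustrative, not the authors' exact pipeline. The key points are that the train/test split happens first, the imputer is fitted on the training fold only, and SMOTE resamples the training fold only, with class weighting shown as the leakage-free alternative.

```python
# A minimal sketch of leakage-free preprocessing for stroke prediction.
# Assumptions: the public Kaggle stroke dataset, a binary "stroke" target,
# and a "bmi" column with missing values. Names and path are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

df = pd.read_csv("healthcare-dataset-stroke-data.csv")  # hypothetical path
X = pd.get_dummies(df.drop(columns=["stroke"]), drop_first=True)
y = df["stroke"]

# 1. Split FIRST, so no test-set information can reach any preprocessing step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Fit the imputer on the training fold only; transform (never refit) the test fold.
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

# 3. Oversample the TRAINING fold only; the test fold keeps its natural imbalance.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train_imp, y_train)
smote_model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print("SMOTE-on-train AUC:",
      roc_auc_score(y_test, smote_model.predict_proba(X_test_imp)[:, 1]))

# Leakage-free alternative: class weighting instead of synthetic resampling.
weighted_model = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted_model.fit(X_train_imp, y_train)
print("Class-weight AUC:",
      roc_auc_score(y_test, weighted_model.predict_proba(X_test_imp)[:, 1]))
```

Running SMOTE or imputer fitting before the `train_test_split` call, by contrast, lets synthetic minority samples and test-set statistics contaminate evaluation, which is the mechanism behind the inflated AUC values reported above.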