Predicting Mortality and Risk Factors in Cystic Fibrosis Using a Boruta-Enhanced Machine Learning Pipeline: Comparative Evaluation of Ensemble and Penalized Regression Models


Abstract

Background: Early and accurate prediction of mortality risk in patients with cystic fibrosis (CF) can guide clinical decision-making and resource allocation, especially in settings with limited access to advanced therapies. Traditional prognostic tools often rely on one or a few variables (e.g., FEV₁) and may fail to capture complex, nonlinear interactions among clinical and laboratory features.

Methods: We collected clinical and laboratory data from 349 CF patients monitored at Masih Daneshvari Hospital (Tehran, 2021–2024), excluding records with unavailable vital status. After filtering out features with more than 30% missingness, we applied the Boruta algorithm to select relevant predictors. The dataset was split 80/20 into training and test sets. To address missingness, we performed multiple imputation (MICE, m = 5) separately on the training and test sets to avoid leakage. On each imputed training set, we applied SMOTE (K = 5, dup_size = 4) to balance classes and trained three models: Random Forest (300 trees), XGBoost (eta = 0.1, max_depth = 6, 100 rounds), and penalized logistic regression (elastic net, α = 0.5, λ chosen by 5-fold CV). For each model-imputation pair, the optimal probability threshold was derived using the Youden index on training predictions; the final threshold for each model was the median across imputations. Test predictions were pooled by averaging probabilities across imputations and applying the median threshold. Models were evaluated on accuracy, sensitivity, specificity, precision, F1-score, and AUC (ROC; PRROC for precision-recall).

Results: Boruta selected 17 predictors (e.g., ALP, WBC, PCO₂, respiratory distress). In the test set, Random Forest achieved accuracy = 0.83, specificity = 0.91, sensitivity = 0.40, precision = 0.89, F1 = 0.90, and AUC = 0.75. XGBoost achieved accuracy = 0.85, specificity = 0.89, sensitivity = 0.60, precision = 0.92, F1 = 0.91, and AUC = 0.77. Penalized logistic regression (GLMnet) achieved accuracy = 0.81, specificity = 0.70, sensitivity = 0.70, precision = 0.94, F1 = 0.88, and AUC = 0.75.

Conclusions: Among the evaluated models, XGBoost offered the best balance of sensitivity and specificity, making it a promising candidate for clinical deployment in mortality risk stratification in CF. The 17 selected features are biologically plausible and align with CF pathophysiology. Future work should validate these findings in multi-center cohorts and incorporate longitudinal data to further improve prognostic performance.
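The thresholding and pooling steps described in the Methods can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `youden_threshold` and `pool_predictions`, the toy probabilities, the use of observed probabilities as candidate cutoffs, and the `>=` rule for calling a case positive are all assumptions, since the abstract does not specify them.

```python
from statistics import mean, median


def youden_threshold(y_true, probs):
    """Return the cutoff maximizing Youden's J = sensitivity + specificity - 1.

    Assumption: candidate cutoffs are the observed probabilities, and a case
    is called positive when its probability is >= the cutoff.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_j, best_t = -1.0, 0.5
    for t in sorted(set(probs)):
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, probs) if y == 0 and p < t)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:  # ties keep the lowest cutoff (an arbitrary choice)
            best_j, best_t = j, t
    return best_t


def pool_predictions(train_sets, test_prob_sets):
    """Pool one model's predictions across imputed datasets.

    train_sets: one (labels, training probabilities) pair per imputation.
    test_prob_sets: one list of test probabilities per imputation,
    with patients in the same order in every list.
    """
    thresholds = [youden_threshold(y, p) for y, p in train_sets]
    cutoff = median(thresholds)  # median threshold across imputations
    pooled = [mean(col) for col in zip(*test_prob_sets)]  # average per patient
    labels = [1 if p >= cutoff else 0 for p in pooled]
    return cutoff, pooled, labels


# Hypothetical toy data: 3 imputations, 4 training and 2 test patients.
y_train = [0, 0, 1, 1]
train_sets = [
    (y_train, [0.10, 0.40, 0.35, 0.80]),
    (y_train, [0.20, 0.30, 0.60, 0.70]),
    (y_train, [0.10, 0.20, 0.50, 0.90]),
]
test_prob_sets = [[0.2, 0.7], [0.3, 0.9], [0.1, 0.8]]

cutoff, pooled, labels = pool_predictions(train_sets, test_prob_sets)
print(cutoff, labels)
```

Taking the median of the per-imputation thresholds keeps a single unstable imputation from dominating the cutoff, while averaging probabilities before thresholding yields one decision per test patient.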