Predicting Mortality and Risk Factors in Cystic Fibrosis Using a Boruta-Enhanced Machine Learning Pipeline: Comparative Evaluation of Ensemble and Penalized Regression Models

Farzaneh Hamidi
Anoshirvan Kazemnejad
Maryam Hassanzad
Mina Jahangiri

Abstract

Background Early and accurate prediction of mortality risk in patients with cystic fibrosis (CF) can guide clinical decision-making and resource allocation, especially in settings with limited access to advanced therapies. Traditional prognostic tools often rely on one or a few variables (e.g., FEV₁) and may fail to capture complex, nonlinear interactions among clinical and laboratory features.

Methods We collected clinical and laboratory data from 349 CF patients monitored at Masih Daneshvari Hospital (Tehran, 2021–2024), excluding records with unavailable vital status. After filtering out features with > 30% missingness, we applied the Boruta algorithm to select relevant predictors. The dataset was split 80/20 into training and test sets. To address missingness, we performed multiple imputation (MICE, m = 5) separately on the training and test sets to avoid leakage. On each imputed training set, we applied SMOTE (K = 5, dup_size = 4) to balance classes and trained three models: Random Forest (300 trees), XGBoost (eta = 0.1, max_depth = 6, 100 rounds), and penalized logistic regression (elastic net, α = 0.5, λ via 5-fold CV). For each model-imputation pair, the optimal probability threshold was derived using the Youden index on training predictions; the final threshold for each model was the median across imputations. Test predictions were pooled by averaging probabilities across imputations and applying the median threshold. Models were evaluated on accuracy, sensitivity, specificity, precision, F1-score, and AUC (ROC; PRROC for precision-recall).

Results Boruta selected 17 predictors (e.g., ALP, WBC, PCO₂, respiratory distress). In the test set, Random Forest achieved accuracy = 0.83, specificity = 0.91, sensitivity = 0.40, precision = 0.89, F1 = 0.90, and AUC = 0.75. XGBoost achieved accuracy = 0.85, specificity = 0.89, sensitivity = 0.60, precision = 0.92, F1 = 0.91, and AUC = 0.77. Penalized logistic regression (GLMnet) achieved accuracy = 0.81, specificity = 0.70, sensitivity = 0.70, precision = 0.94, F1 = 0.88, and AUC = 0.75.

Conclusions Among the evaluated models, XGBoost offers the best balance of sensitivity and specificity, making it a promising candidate for clinical deployment in mortality risk stratification in CF. The 17 selected features are biologically plausible and align with CF pathophysiology. Future work should validate our findings in multi-center cohorts and incorporate longitudinal data to further improve prognostic performance.
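The thresholding and pooling steps described in the Methods can be illustrated with a short sketch. The authors' code is not shown on this page, so what follows is a standard-library Python illustration under stated assumptions, not their implementation: the function names (`youden_threshold`, `pool_predictions`, `confusion_metrics`) are hypothetical, the probabilities are made-up toy values, the positive class (1) is assumed to mean death, and three imputations are used instead of five to keep the example short.

```python
from statistics import mean, median


def youden_threshold(y_true, probs):
    """Return the cutoff maximizing Youden's J = sensitivity + specificity - 1."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are required to compute Youden's J")
    best_cutoff, best_j = 0.5, -1.0
    # Every distinct predicted probability is a candidate cutoff.
    for cutoff in sorted(set(probs)):
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= cutoff)
        tn = sum(1 for y, p in zip(y_true, probs) if y == 0 and p < cutoff)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # ties keep the lowest cutoff
            best_cutoff, best_j = cutoff, j
    return best_cutoff


def pool_predictions(test_probs_per_imputation, thresholds):
    """Average test probabilities across imputations, then apply the median cutoff."""
    cutoff = median(thresholds)
    pooled = [mean(column) for column in zip(*test_probs_per_imputation)]
    labels = [1 if p >= cutoff else 0 for p in pooled]
    return pooled, cutoff, labels


def confusion_metrics(y_true, y_pred):
    """Compute the threshold-dependent metrics listed in the abstract."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, hat in pairs if y == 1 and hat == 1)
    tn = sum(1 for y, hat in pairs if y == 0 and hat == 0)
    fp = sum(1 for y, hat in pairs if y == 0 and hat == 1)
    fn = sum(1 for y, hat in pairs if y == 1 and hat == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Toy data (hypothetical): one model, three imputed training sets, four test patients.
y_train = [0, 0, 0, 1, 1, 0, 1, 0]
train_probs = [
    [0.10, 0.20, 0.35, 0.80, 0.60, 0.30, 0.55, 0.15],
    [0.12, 0.25, 0.40, 0.75, 0.50, 0.28, 0.65, 0.18],
    [0.08, 0.22, 0.45, 0.85, 0.58, 0.33, 0.52, 0.20],
]
y_test = [0, 1, 0, 1]
test_probs = [
    [0.30, 0.70, 0.45, 0.50],
    [0.20, 0.60, 0.55, 0.40],
    [0.25, 0.65, 0.50, 0.48],
]

# One Youden threshold per imputation, computed on training predictions only.
thresholds = [youden_threshold(y_train, probs) for probs in train_probs]
pooled, cutoff, labels = pool_predictions(test_probs, thresholds)
metrics = confusion_metrics(y_test, labels)
print(thresholds, cutoff, labels, metrics)
```

Deriving the cutoff from training predictions and taking the median across imputations keeps the test set out of threshold selection, which matches the leakage concern stated in the Methods. AUC is omitted from the sketch because it does not depend on the chosen threshold.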

Version published to 10.21203/rs.3.rs-8908152/v1 on Research Square
Mar 27, 2026

Related articles

Predicting Mortality Risk in Sepsis-Induced Early Coagulopathy: A Multicenter Comparison of Machine Learning and Nomogram Approaches

This article has 2 authors:
1. hongwei duan
2. Yan Huang
This article has no evaluations. Latest version Apr 12, 2026

Screening of key variables and development and validation of a prognostic model for hepatocellular carcinoma

This article has 8 authors:
1. Jiang Chen
2. Hangyu Zhi
3. Mian Guo
4. Xin Meng
5. Yibo Zhang
6. Huan Xia
7. Cong Yao
8. Kai Qu
This article has no evaluations. Latest version Mar 23, 2026

Ensemble Machine Learning and SMOTE-NC for the Multi-Stage Classification of Chronic Kidney Disease Using Routine Clinical Data

This article has 5 authors:
1. Shruthi Mohan
2. Akshat Choudhary
3. Rohit Rajesh
4. Nandini K
5. Arpita Paria
This article has no evaluations. Latest version Mar 30, 2026
