Machine Learning Model Based on Routine Blood Indicators for High-Risk Screening of Interstitial Lung Disease in Patients with Connective Tissue Disease: A Cost-Effective Triage Strategy

Haoran Wang
Lele Zhang
Huifang Xing
Dong Yang
Chenshen Liu
Hongping Liang

Abstract

Background: A machine learning model based on routine blood test indicators was constructed to predict the risk of interstitial lung disease (ILD) in patients with connective tissue disease (CTD), thereby providing a convenient tool for clinical screening of high-risk populations.

Methods: A total of 225 inpatients with CTD admitted to Shanxi Provincial People's Hospital from May 2022 to May 2023 were retrospectively enrolled, including 85 cases in the CTD-ILD group and 140 cases in the pure CTD group. Clinical and laboratory data were collected, including gender, age, serum KL-6 levels, and a full set of routine blood test indicators (including derived inflammatory and immune indexes such as the systemic immune-inflammation index [SII] and the neutrophil-to-lymphocyte ratio [NLR]). The Adaptive Synthetic Sampling (ADASYN) technique (with three proportions: 50%, 75%, and 100%) was applied to address data imbalance, and the balanced dataset was divided into a training set and a test set at a ratio of 7:3. Five machine learning models were constructed: eXtreme Gradient Boosting (XGBoost), logistic regression, random forest (RF), support vector machine (SVM), and k-nearest neighbor (KNN). Hyperparameters were optimized via 10-fold cross-validation and grid search. Model performance was evaluated along four dimensions: discriminative ability (ROC curve, PR curve), calibration (calibration plot), clinical utility (decision curve analysis [DCA]), and confusion matrix metrics (AUC, accuracy, sensitivity, etc.). The SHAP (SHapley Additive exPlanations) method was used to interpret key predictive features, and the Bootstrap method was applied to verify the stability of the models and the reliability of the results.
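The abstract names SII and NLR as derived indexes but does not give their formulas. The minimal sketch below uses the definitions conventionally found in the literature (NLR = neutrophils / lymphocytes; SII = platelets × neutrophils / lymphocytes). It is an assumption that the authors computed them this way, and the function names and input units (all counts in 10^9/L) are illustrative only.

```python
def nlr(neutrophils: float, lymphocytes: float) -> float:
    """Neutrophil-to-lymphocyte ratio (conventional definition, assumed here)."""
    if lymphocytes <= 0:
        raise ValueError("lymphocyte count must be positive")
    return neutrophils / lymphocytes


def sii(platelets: float, neutrophils: float, lymphocytes: float) -> float:
    """Systemic immune-inflammation index: platelets * neutrophils / lymphocytes.

    This is the commonly used formula; the abstract does not confirm it.
    """
    return platelets * nlr(neutrophils, lymphocytes)


# Hypothetical blood counts in 10^9/L, not taken from the study.
example_nlr = nlr(neutrophils=4.0, lymphocytes=2.0)
example_sii = sii(platelets=250.0, neutrophils=4.0, lymphocytes=2.0)
print(example_nlr, example_sii)
```

Both indexes need only a standard complete blood count, which is consistent with the low-cost screening rationale described in the abstract.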
Results: Sixteen indicators, including gender, systemic immune-inflammation index (SII), white blood cell count (WBC), mean corpuscular volume (MCV), and red blood cell distribution width standard deviation (RDW-SD), exhibited statistically significant differences between the CTD-ILD group and the pure CTD group (p < 0.05). After feature screening, SII, WBC, MCV, and RDW-SD were identified as the core predictive variables. Overall model performance was best at the 75% ADASYN sampling ratio, at which the random forest (RF) model performed best: in the validation set, the area under the curve (AUC) was 0.846, average precision (AP) was 0.896, and F1-score was 0.842. The calibration plot indicated that the RF model had the smallest deviation between predicted probabilities and actual risks (calibration error = 0.146), and decision curve analysis (DCA) confirmed that the model yielded net clinical benefit across the entire threshold range. SHAP analysis characterized how each core variable contributed to the predictions. Finally, Bootstrap resampling validation with 1000 iterations showed that the RF model had a mean AUC of 0.740 ± 0.040 with a 95% confidence interval (95% CI) of [0.608, 0.848]. All performance indicators had low coefficients of variation, demonstrating favorable stability and reliability of the model.

Conclusions: The random forest (RF) model based on routine blood test indicators demonstrated favorable discriminative ability, calibration, and clinical utility in the risk prediction of CTD-ILD. Although its specificity was inferior to that of serum markers such as KL-6, the model leverages the advantages of routine blood tests (wide availability, low cost, and rapid turnaround) and can therefore serve as a convenient tool for the preliminary screening of CTD patients at high risk of ILD. It is particularly suitable for primary medical institutions and large-scale screening scenarios.

Trial registration: Clinical trial number: not applicable.
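The abstract reports a Bootstrap mean AUC with a 95% CI but does not describe the procedure. One common approach, sketched below as an assumption rather than the authors' method, resamples the evaluation set with replacement, recomputes the AUC each time, and takes the 2.5th and 97.5th percentiles. The function names, the seed, and the toy labels and scores are hypothetical.

```python
import random
import statistics


def auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels, scores, n_boot=1000, seed=0):
    """Return (mean, standard deviation, (lower, upper)) of the AUC over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample indices with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        aucs.append(auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int(0.025 * n_boot)]
    upper = aucs[int(0.975 * n_boot) - 1]
    return statistics.mean(aucs), statistics.stdev(aucs), (lower, upper)


# Toy data for illustration only (1 = CTD-ILD, 0 = pure CTD); not study data.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
print(auc(toy_labels, toy_scores))  # 0.9375
print(bootstrap_auc(toy_labels, toy_scores, n_boot=200, seed=1))
```

A percentile interval like this need not be symmetric around the mean, which may be why the reported CI of [0.608, 0.848] is wider than the ± 0.040 spread alone would suggest.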

Version published to 10.21203/rs.3.rs-9173851/v1 on Research Square
Mar 31, 2026

Related articles

Ensemble Machine Learning and SMOTE-NC for the Multi-Stage Classification of Chronic Kidney Disease Using Routine Clinical Data

This article has 5 authors:
1. Shruthi Mohan
2. Akshat Choudhary
3. Rohit Rajesh
4. Nandini K
5. Arpita Paria
This article has no evaluations. Latest version: Mar 30, 2026

Predicting Mortality and Risk Factors in Cystic Fibrosis Using a Boruta-Enhanced Machine Learning Pipeline: Comparative Evaluation of Ensemble and Penalized Regression Models

This article has 4 authors:
1. Farzaneh Hamidi
2. Anoshirvan Kazemnejad
3. Maryam Hassanzad
4. Mina Jahangiri
This article has no evaluations. Latest version: Mar 27, 2026

Application Of Multi-Inflammatory Index To Predict 28-Day Mortality In ICU Patients With Heart Failure: A Retrospective Machine Learning Study Based On The MIMIC-IV Database

This article has 5 authors:
1. Longcha Liu
2. Zhenjie Dai
3. Xueshu Yu
4. Zhi Chen
5. Yanqiu Lin
This article has no evaluations. Latest version: Mar 28, 2026
