Machine Learning Model Based on Routine Blood Indicators for High-Risk Screening of Interstitial Lung Disease in Patients with Connective Tissue Disease: A Cost-Effective Triage Strategy


Abstract

Background This study constructed a machine learning model based on routine blood test indicators to predict the risk of interstitial lung disease (ILD) in patients with connective tissue disease (CTD), providing a convenient tool for clinical screening of high-risk populations.
Methods A total of 225 inpatients with CTD admitted to Shanxi Provincial People's Hospital from May 2022 to May 2023 were retrospectively enrolled: 85 in the CTD-ILD group and 140 in the pure CTD group. Clinical and laboratory data were collected, including gender, age, serum KL-6 levels, and a full set of routine blood test indicators, together with derived inflammatory and immune indexes such as the systemic immune-inflammation index (SII) and the neutrophil-to-lymphocyte ratio (NLR). The Adaptive Synthetic Sampling (ADASYN) technique was applied at three proportions (50%, 75%, and 100%) to address class imbalance, and the balanced dataset was divided into a training set and a test set at a ratio of 7:3. Five machine learning models were constructed: eXtreme Gradient Boosting (XGBoost), logistic regression, random forest (RF), support vector machine (SVM), and k-nearest neighbor (KNN). Hyperparameters were optimized via grid search with 10-fold cross-validation. Model performance was evaluated along four dimensions: discriminative ability (ROC curve, PR curve), calibration (calibration plot), clinical utility (decision curve analysis [DCA]), and confusion matrix metrics (AUC, accuracy, sensitivity, etc.). The SHAP (SHapley Additive exPlanations) method was used to interpret key predictive features, and bootstrap resampling was applied to verify the stability of the models and the reliability of the results.
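The abstract does not state the formulas used for the derived indexes. As a minimal sketch, assuming the commonly used definitions (NLR = neutrophils / lymphocytes; SII = platelets × neutrophils / lymphocytes, all counts in 10^9/L), they could be computed as follows. The function names and the example values are hypothetical and are not taken from the study.

```python
def nlr(neutrophils: float, lymphocytes: float) -> float:
    """Neutrophil-to-lymphocyte ratio (assumed definition: N / L)."""
    if lymphocytes <= 0:
        raise ValueError("lymphocyte count must be positive")
    return neutrophils / lymphocytes


def sii(platelets: float, neutrophils: float, lymphocytes: float) -> float:
    """Systemic immune-inflammation index (assumed definition: P * N / L)."""
    return platelets * nlr(neutrophils, lymphocytes)


# Illustrative values only (10^9/L); not patient data from the study.
example = {"platelets": 250.0, "neutrophils": 4.0, "lymphocytes": 2.0}
print(nlr(example["neutrophils"], example["lymphocytes"]))  # 2.0
print(sii(**example))                                       # 500.0
```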
Results Sixteen indicators, including gender, SII, white blood cell count (WBC), mean corpuscular volume (MCV) and red blood cell distribution width standard deviation (RDW-SD), differed significantly between the CTD-ILD group and the pure CTD group (p < 0.05). After feature screening, SII, WBC, MCV and RDW-SD were identified as the core predictive variables. Overall model performance was best at the 75% ADASYN sampling ratio, at which the RF model performed best: in the validation set, the area under the curve (AUC) was 0.846, average precision (AP) was 0.896, and the F1-score was 0.842. The calibration plot indicated that the RF model had the smallest deviation between predicted probabilities and observed risks (calibration error = 0.146), and DCA showed that the model yielded a net clinical benefit across the entire threshold range. SHAP analysis characterized how each core variable contributed to the predictions. Finally, validation with 1,000 bootstrap resamples showed that the RF model had a mean AUC of 0.740 ± 0.040, with a 95% confidence interval (95% CI) of [0.608, 0.848]. All performance indicators had low coefficients of variation, indicating that the model was stable and reliable.
Conclusions The RF model based on routine blood test indicators demonstrated favorable discriminative ability, calibration and clinical utility in predicting the risk of CTD-ILD. Although its specificity was inferior to that of serum markers such as KL-6, the model draws on the advantages of routine blood tests (wide availability, low cost and rapid turnaround) and can therefore serve as a convenient tool for the preliminary screening of CTD patients at high risk of ILD. It is particularly suitable for primary medical institutions and large-scale screening scenarios.
Trial registration Clinical trial number: not applicable.
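To illustrate how a bootstrap mean AUC, standard deviation and percentile 95% CI of the kind reported above can be obtained, here is a minimal standard-library sketch. It is not the authors' pipeline: the abstract does not say which resampling scheme or interval type was used, so the percentile interval, the function names (`roc_auc`, `bootstrap_auc`) and the toy labels and scores are all assumptions made for illustration.

```python
import random
import statistics


def roc_auc(y_true, y_score):
    """AUC as the probability that a positive case outscores a negative one
    (ties count as one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc(y_true, y_score, n_resamples=1000, seed=0):
    """Resample cases with replacement and summarize the AUC distribution.

    Returns (mean, standard deviation, (2.5th percentile, 97.5th percentile)).
    Resamples that contain only one class are redrawn.
    """
    rng = random.Random(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if len(set(labels)) < 2:
            continue
        scores = [y_score[i] for i in idx]
        aucs.append(roc_auc(labels, scores))
    aucs.sort()
    lower = aucs[int(0.025 * n_resamples)]
    upper = aucs[max(int(0.975 * n_resamples) - 1, 0)]
    return statistics.mean(aucs), statistics.stdev(aucs), (lower, upper)


# Toy labels (1 = CTD-ILD, 0 = CTD only) and predicted risks; illustrative only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

print(roc_auc(y_true, y_score))  # 0.9375
mean_auc, sd_auc, (ci_low, ci_high) = bootstrap_auc(y_true, y_score)
print(f"mean AUC {mean_auc:.3f} ± {sd_auc:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

With a real model, `y_score` would be the predicted probabilities on held-out cases; a fixed `seed` keeps the interval reproducible.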
