Application of Machine Learning in Early Screening of Yu Disease: Model Construction and Analysis Based on Routine Laboratory Data
Abstract
Background: Yu disease, a representative "Shenzhi Disease" in Traditional Chinese Medicine (TCM), is characterized by liver qi stagnation, emotional depression, and chest/hypochondriac distension, but lacks objective early screening tools. This study pioneers an interpretable machine learning (ML) model that converts "qi-blood disharmony" (a TCM concept) into quantifiable biomarkers, using only routine blood biochemical indicators to enable low-cost early warning.

Methods: Clinical data of 3,347 patients (including those with Yu disease and non-Yu disease controls) were collected from Chifeng Mental Health Prevention and Treatment Hospital, covering the period from March 2013 to September 2019. The dataset included demographic information and routine laboratory test results. After data cleaning, 54 features were retained for baseline analysis, and 16 optimal features were selected using the backward elimination method. Four ML algorithms, Deep Neural Network (DNN), Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), and Support Vector Machine (SVM), were employed to build Yu disease prediction models. Model performance was evaluated via cross-validation. To interpret the XGBoost model results intuitively, the Shapley Additive exPlanations (SHAP) method was used.

Results: The XGBoost model outperformed the other models, achieving an accuracy of 0.904, a sensitivity of 0.886, a specificity of 0.915, and an area under the receiver operating characteristic curve (ROC-AUC) of 0.964. SHAP analysis identified albumin, basophil percentage, platelet distribution width, age, globulin, and indirect bilirubin as the key features influencing the predictions.

Conclusion: This study developed a high-accuracy early screening model for Yu disease using routine laboratory data, providing a new tool for clinical practice. Future studies should focus on multicenter validation and the incorporation of additional biomarkers to enhance model generalizability.
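For readers less familiar with the reported evaluation metrics, the following is a minimal sketch of how accuracy, sensitivity, specificity, and ROC-AUC are conventionally computed for a binary screening model. It is not the authors' code. The function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical and chosen only for illustration; they are not study data.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 = Yu disease and 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def screening_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity (true positive rate), and specificity (true negative rate).

    Assumes both classes are present, so no denominator is zero.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based ROC-AUC: the probability that a randomly chosen positive
    receives a higher score than a randomly chosen negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): 4 cases, 6 controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05]

# Assumed decision threshold of 0.5 for converting scores to class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = screening_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Sensitivity and specificity depend on the chosen threshold, whereas ROC-AUC summarizes ranking quality across all thresholds, which is why the two kinds of figures are reported side by side.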