Application of Explainable Machine Learning in Early Diagnosis Models for Risk Prediction of Severe Mycoplasma pneumoniae Pneumonia in Children
Abstract
Background: Mycoplasma pneumoniae is a leading cause of community-acquired pneumonia in children, and severe Mycoplasma pneumoniae pneumonia (SMPP) poses a significant threat to pediatric health. Current diagnostic approaches rely primarily on imaging and clinical signs and lack objective, quantitative tools for early risk prediction. Traditional statistical models are limited in their ability to capture the complexity of this condition, whereas machine learning (ML) methods can uncover nonlinear relationships. However, the "black box" nature of many ML models hinders their clinical application. This study aimed to develop an interpretable ML model for the early prediction of SMPP risk in children and to improve model transparency using SHapley Additive exPlanations (SHAP), thereby supporting informed clinical decision-making.

Methods: This retrospective study analyzed data from 286 inpatients with Mycoplasma pneumoniae pneumonia (MPP) admitted to the Affiliated Hospital of Yan'an University between August 2023 and August 2024. Patients were divided into an MPP group (n = 163) and an SMPP group (n = 123) based on their clinical condition. Forty-four clinical variables, including symptoms, laboratory parameters, and imaging features, were collected. Pearson correlation analysis, the Mann-Whitney U test, the chi-square test, and LASSO regression were used to identify key predictors. Seven machine learning models (CatBoost, XGBoost, LightGBM, support vector machine [SVM], k-nearest neighbors [KNN], logistic regression [LR], and Gaussian naive Bayes [GNB]) were constructed in Python. Hyperparameters were optimized through grid search with 5-fold cross-validation, and model performance was evaluated by accuracy, AUC, and other metrics. Model interpretability was analyzed using the SHAP method.

Results: Twenty-one key features, such as thermal peak, thermal path, D-dimer, CRP, pleural effusion, and bronchoscopic manifestations, were identified. Among the seven machine learning models, the CatBoost model performed best, achieving an AUC of 0.961, an accuracy of 0.907, a recall of 0.923, and an F1 score of 0.900. The performance gap between the training and test sets was minimal, suggesting robust generalization. SHAP visual analysis identified thermal peak, thermal range, D-dimer, and CRP as the most important positive predictors. Decision curve analysis further showed that the CatBoost model yielded a higher clinical net benefit across a broad range of threshold probabilities.

Conclusions: This study developed and internally validated an interpretable machine learning model based on the CatBoost algorithm that effectively predicts early risk of SMPP in children and outperforms traditional methods. The model incorporates multidimensional clinical features, emphasizes the significance of bronchoscopic findings, and offers a quantitative tool for early identification of high-risk children. Future multi-center, prospective studies are needed to further assess the model's generalizability and clinical applicability.
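The evaluation metrics and the decision curve analysis named in the abstract can be illustrated with a short, self-contained sketch. It is not the authors' code. The labels and predicted probabilities below are invented toy values, and the function names are hypothetical. Net benefit is computed with the standard decision curve formula, TP/n - (FP/n) * pt / (1 - pt), where pt is the threshold probability.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN, FN when predicting positive for prob >= threshold."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, recall, precision and F1 at a fixed probability threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on the model at threshold probability pt (0 < pt < 1)."""
    tp, fp, _, _ = confusion_counts(y_true, y_prob, threshold)
    n = len(y_true)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every patient as high risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data (invented for illustration): 1 = SMPP, 0 = non-severe MPP.
toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]
toy_probs = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

toy_metrics = classification_metrics(toy_labels, toy_probs, threshold=0.5)
toy_curve = [
    (pt, net_benefit(toy_labels, toy_probs, pt), net_benefit_treat_all(toy_labels, pt))
    for pt in (0.1, 0.3, 0.5, 0.7)
]
```

A decision curve plots the second and third elements of each `toy_curve` tuple against the threshold. A model is considered clinically useful over the thresholds where its net benefit exceeds both the treat-all reference and zero (the treat-none strategy).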
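To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by enumerating all feature coalitions for a tiny hypothetical risk score. It rests on several assumptions. The scoring function, its coefficients, and the input values are invented. Absent features are replaced by a single baseline vector rather than averaged over a background dataset. Practical SHAP analyses of tree models such as CatBoost typically rely on the `shap` library's optimized tree algorithms instead of brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value.
    Cost grows as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk_score(z):
    """Hypothetical score with one interaction term (not the study's model)."""
    return 0.5 * z[0] + 0.2 * z[1] + 0.1 * z[0] * z[2]


toy_x = [2.0, 3.0, 4.0]         # invented feature values for one patient
toy_baseline = [0.0, 0.0, 0.0]  # invented reference point

toy_phi = shapley_values(toy_risk_score, toy_x, toy_baseline)
```

The contributions in `toy_phi` sum to the difference between the score for `toy_x` and the score for `toy_baseline`. This additivity property is what allows a SHAP plot to decompose an individual prediction into per-feature effects, and the interaction term here is split equally between the two features involved.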