A Machine Learning Model Based on CBC-Derived Parameters to Distinguish Benign from Malignant Lymphoproliferative Disorders
Abstract
Objective Infectious mononucleosis (IM) and malignant lymphoproliferative disorders often present with similar initial symptoms and signs. The purpose of this study was to develop a machine learning-based model to distinguish IM from lymphoid hematologic malignancies using both routine and research parameters derived from complete blood count (CBC) analysis. Methods This multicenter model development and validation study used data from three independent institutions. Patients with a final confirmed diagnosis of IM, acute lymphoblastic leukemia (ALL), or chronic lymphoproliferative disorders (CLPD) were included. A total of 24 routine and 21 report-derived parameters from the CBC at initial presentation to our institution were collected. Nine candidate biomarkers and five machine learning classifiers were used to construct predictive models from the training data. The models were validated using an independent test dataset. Model performance was assessed by calculating the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), Matthews correlation coefficient (MCC), and Cohen's kappa coefficient. The diagnostic performance of all five models was evaluated using both internal ROC analysis and external validation datasets. Results A total of 114 patients with IM, 108 with hematologic diseases, and 150 healthy controls were included in the study. In both the IM versus hematologic disease classification model and the IM versus healthy control model, all methods except the decision tree (DT) performed well, with most evaluation metrics above 87% in the validation cohorts, suggesting reliable diagnostic capability.
In the classification model distinguishing healthy individuals from patients with hematologic diseases, the XGBoost algorithm showed stable and high performance across the training, validation, and test sets. The receiver operating characteristic (ROC) curves confirmed that XGBoost was the most effective model, with an area under the curve (AUC) of 0.995 in the test sets. Conclusion The XGBoost model demonstrated the most satisfactory performance. Machine learning algorithms show promise for clinical implementation, and the proposed model may aid in the early identification of IM and malignant lymphoid disorders with overlapping initial presentations. This could help clinicians intervene early and improve prognosis.
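The evaluation metrics named in the Methods (AUC, sensitivity, specificity, accuracy, PPV, NPV, MCC, and Cohen's kappa) all have standard definitions for a binary task such as IM versus hematologic disease. The following is a minimal sketch of those definitions in plain Python, not the authors' implementation. The function names (`confusion_counts`, `binary_metrics`, `roc_auc`) and the label coding (1 = positive class, 0 = negative class) are assumptions made for illustration; the abstract does not state which software was used.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (assumed coding: 1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def binary_metrics(y_true, y_pred):
    """Threshold-dependent metrics computed from the 2x2 confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    n = tp + fp + tn + fn
    accuracy = safe_div(tp + tn, n)

    # Matthews correlation coefficient.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = safe_div((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), n * n)
    kappa = safe_div(p_observed - p_chance, 1 - p_chance)

    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "accuracy": accuracy,
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
        "mcc": mcc,
        "kappa": kappa,
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

`binary_metrics` takes hard class predictions, whereas `roc_auc` takes continuous scores such as predicted probabilities, so AUC does not depend on a decision threshold. The pairwise AUC computation is quadratic in the number of cases, which is acceptable for cohorts of a few hundred patients like the one described here.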