Machine Learning-Based Cancer Prediction Using Complete Blood Count: A Retrospective Study on the Diagnostic Potential of Hematological Parameters in Lung Cancer Screening


Abstract

Background: Early detection of lung cancer is crucial for improving outcomes, yet existing screening methods are costly and limited in accessibility. This study evaluated the diagnostic potential of routine complete blood count (CBC) parameters combined with machine learning (ML) for lung cancer prediction.

Design and methods: Data from 12,964 lung cancer patients and 169,703 healthy controls were retrospectively collected, including CBC, coagulation, and tumor marker results. After rigorous data preprocessing, multiple ML models were developed and validated, with CatBoost showing the best performance (AUC 95.84%, precision 92.53%, accuracy 89.81%, recall 86.60%, F1-score 89.46%).

Results: Feature importance analysis identified platelet distribution width (PDW), age, neutrophil percentage (NE%), and red blood cell count (RBC) as the most significant predictors. A reduced model using only these four features retained high discriminative performance (AUC 94.93%, versus 95.84% for the full model). Compared to tumor markers and coagulation data, CBC-derived features alone were robust for lung cancer prediction.

Conclusions: In this retrospective, single-center dataset, routine CBC parameters paired with ML predicted lung cancer accurately, suggesting a possible route to cost-effective screening. Key features such as PDW, NE%, and RBC may serve as early diagnostic indicators. The approach could offer a scalable option for early cancer detection, particularly in resource-limited settings, but requires prospective, multi-center validation prior to clinical implementation.
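As a reading aid, the reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. The short Python sketch below does only that arithmetic. The function and variable names are illustrative and not taken from the study, and the code says nothing about how the authors computed their metrics.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CatBoost metrics as reported in the abstract, converted from percent to fractions.
reported_precision = 0.9253
reported_recall = 0.8660
reported_f1 = 0.8946

computed_f1 = f1_score(reported_precision, reported_recall)

# The two values should agree up to rounding of the published figures.
print(f"computed F1 = {computed_f1:.4f}, reported F1 = {reported_f1:.4f}")
```

The computed value agrees with the reported F1-score to within rounding, so the three published figures are mutually consistent.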
