Lung Cancer Prediction Using Machine Learning on Structured Clinical Data: A Systematic Review of Diagnosis, Risk, and Survival Models
Abstract
Context: Lung cancer is a leading cause of cancer-related mortality worldwide. While machine learning (ML) offers significant potential for improving prediction tasks, comprehensive reviews synthesizing its application across the full spectrum of lung cancer prediction, from diagnosis and risk to survival, using structured data are scarce.

Objective: This systematic review aims to comprehensively analyze ML techniques for three key lung cancer prediction tasks (diagnosis, risk assessment, and survival analysis) using structured or tabular data sources.

Method: We followed the PRISMA 2020 guidelines, systematically searching five databases (PubMed, Scopus, IEEE Xplore, ACM Digital Library, ScienceDirect) for studies published by August 2025. From an initial 772 records, 42 studies met our inclusion criteria, which mandated the use of structured data and ML models for prediction.

Results: Ensemble methods, particularly XGBoost and Random Forest, were the most prevalent and highest-performing models across all tasks; performance, however, was highly task-dependent. Key predictive features included demographics, clinical parameters, and lifestyle factors. Datasets were also task-specific: SEER and NLST for diagnosis; population registries (e.g., Danish) for risk; and SEER and TCGA for survival. Common validation techniques included holdout and cross-validation, with SHAP and LIME emerging as the dominant interpretability tools. While many studies reported high performance (e.g., accuracies up to 99%), these results must be interpreted with caution due to dataset imbalances and a general lack of external validation.

Conclusion: This review provides a structured synthesis of ML applications across the lung cancer prediction continuum. It highlights the dominance of ensemble methods and the critical importance of task-specific data and modeling. The findings reveal a pressing need for more rigorous external validation, standardized reporting, and direct comparison to established clinical models to foster the development of robust, clinically actionable ML tools.
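To make the validation schemes named in the Results concrete, the sketch below trains an ensemble classifier (Random Forest, one of the two models the review found most prevalent) on entirely synthetic structured clinical data and evaluates it with both a stratified holdout split and 5-fold cross-validation. The feature names (age, pack-years, BMI, FEV1) and the data-generating rule are hypothetical illustrations, not drawn from any reviewed study.

```python
# Illustrative sketch only: synthetic tabular data, hypothetical features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
n = 500
# Hypothetical structured features: age, pack-years smoked, BMI, FEV1 (% predicted)
X = np.column_stack([
    rng.normal(62, 10, n),      # age
    rng.gamma(2.0, 15.0, n),    # pack-years
    rng.normal(26, 4, n),       # BMI
    rng.normal(85, 15, n),      # FEV1 percent predicted
])
# Synthetic binary label loosely tied to age and smoking history
logits = 0.04 * (X[:, 0] - 62) + 0.03 * (X[:, 1] - 30)
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Holdout validation, stratified to preserve the class balance in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
holdout_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# 5-fold cross-validation on the full dataset
cv_acc = cross_val_score(model, X, y, cv=5).mean()
print(f"holdout accuracy: {holdout_acc:.3f}, CV accuracy: {cv_acc:.3f}")
```

Cross-validation reuses every record for both training and testing, which matters for the modestly sized clinical cohorts common in this literature; neither scheme, however, substitutes for the external validation on independent cohorts that the Conclusion calls for.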