Predicting Lung Cancer Stages Using Data Mining and Machine Learning Techniques: A Comparative Analysis of Logistic Regression, Random Forest, and XGBoost Models

Omar Anwar Zegama
Anas Albakar
Soobia Saeed

Abstract

Lung cancer remains one of the leading causes of cancer death worldwide, mainly because diagnosis is often delayed and treatment options at advanced stages are limited. Early detection is critical to improving survival rates, which makes predictive analytics a valuable tool in healthcare. This study applies data mining and machine learning methods to classify lung cancer patients into early and late disease stages, using a Kaggle dataset of 53,427 clinical and demographic records. After data cleaning, handling of missing values, and encoding of categorical variables, three classification models (Logistic Regression, Random Forest, and XGBoost) were built and evaluated. Exploratory data analysis showed a balanced class distribution and slight multicollinearity among variables such as age, gender, tobacco use, race, and days to diagnosis. Model performance was measured by accuracy, precision, recall, F1-score, and ROC-AUC. Logistic Regression performed best (Accuracy = 0.56, F1-score = 0.57, AUC = 0.58), outperforming Random Forest and XGBoost. Although overall predictive accuracy was modest, the results point to the potential of data-driven modeling to help clinicians prioritize high-risk patients for early treatment. Recommendations for future work include advanced feature engineering, hyperparameter tuning, handling of class imbalance, and the incorporation of additional clinical variables to make the model more robust and more useful for diagnosis.
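
For readers unfamiliar with the five evaluation metrics named in the abstract, the following is a minimal sketch of how they can be computed for a binary early/late-stage classifier. It is not the authors' code: the function names, the label coding (1 = late stage, 0 = early stage), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the numbers it produces are unrelated to the results reported above.

```python
from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn); label 1 = late stage is an assumed coding."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; tied scores count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one case of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> dict[str, float]:
    """Compute accuracy, precision, recall, F1 and ROC-AUC from scores.

    The threshold turns scores into hard predictions; 0.5 is an assumption.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc(y_true, y_score),
    }


# Toy data, invented purely for illustration (not from the Kaggle dataset).
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.6, 0.3, 0.7, 0.2, 0.8, 0.5]
toy_metrics = classification_metrics(toy_labels, toy_scores)

if __name__ == "__main__":
    for name, value in toy_metrics.items():
        print(f"{name}: {value:.3f}")
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas ROC-AUC is computed from the raw scores and is threshold-independent, which is one reason studies of this kind commonly report both.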

Version published to 10.20944/preprints202511.1913.v1
Nov 26, 2025

Related articles

Machine Learning-Based Survival Time Prediction in Colorectal Cancer with Peritoneal Metastasis: A Multi-Institutional Registry-Based Study

This article has 32 authors:
1. Yoshiko Bamba
2. Michio Itabashi
3. Hirotoshi Kobayashi
4. Kenjiro Kotake
5. Masayasu Kawasaki
6. Yukihide Kanemitsu
7. Yusuke Kinurgasa
8. Hideki Ueno
9. Kotaro Maeda
10. Takeshi Suto
11. Kimihiko Funahashi
12. Heita Ozawa
13. Fumikazu Koyama
14. Shingo Noura
15. Hideyuki Ishida
16. Masayuki Ohue
17. Tomomichi Kiyomatsu
18. Soichiro Ishihara
19. Keiji Koda
20. Hideo Baba
21. Kenji Kawada
22. Yojiro Hashiguchi
23. Takanori Goi
24. Yuji Toiyama
25. Naohiro Tomita
26. Eiji Sunami
27. Yoshito Akagi
28. Jun Watanabe
29. Kenichi Hakamada
30. Goro Nakayama
31. Kenichi Sugihara
32. Yoichi Ajioka
This article has no evaluations
Latest version Jan 21, 2026

Development and validation of machine learning models for predicting short- and long-term mortality in gastroparesis patients: a retrospective cohort study using the MIMIC-IV database

This article has 5 authors:
1. Lei Zhu
2. Qi Han
3. Bei Pei
4. Jie Zhang
5. Haolong Qi
This article has no evaluations
Latest version Dec 31, 2025

RETRACTED: Development and Validation of a Simplified Machine Learning Model Based on T-SPOT.TB and Routine Clinical Data for the Diagnosis of Tuberculous Pleural Effusion

This article has 5 authors:
1. Shuangyin Yang
2. Kuiliang Yang
3. Lizhi Wang
4. Jie Pu
5. Pu Wang
This article has no evaluations
Latest version Dec 12, 2025
