Predicting Lung Cancer Stages Using Data Mining and Machine Learning Techniques: A Comparative Analysis of Logistic Regression, Random Forest, and XGBoost Models
Abstract
Lung cancer remains one of the leading causes of cancer death worldwide, largely because diagnosis is often delayed and treatment options at advanced stages are limited. Early detection is critical to improving survival, which makes predictive analytics a valuable tool in healthcare. This study applies data mining and machine learning methods to classify lung cancer patients into early and late disease stages, using a Kaggle dataset of 53,427 clinical and demographic records. After data cleaning, handling of missing values, and encoding of categorical variables, three classification models (Logistic Regression, Random Forest, and XGBoost) were built and evaluated. Exploratory data analysis showed a balanced class distribution and only slight multicollinearity among variables such as age, gender, tobacco usage, race, and days to diagnosis. Model performance was measured by accuracy, precision, recall, F1-score, and ROC-AUC. Logistic Regression performed best (Accuracy = 0.56, F1-score = 0.57, AUC = 0.58), outperforming Random Forest and XGBoost. Although overall predictive accuracy was modest, the results point to the potential of data-driven modeling to help clinicians prioritize high-risk patients for early treatment. Recommendations for future work include advanced feature engineering, hyperparameter tuning, handling of class imbalance, and incorporation of additional clinical variables to make the model more robust and more useful for diagnosis.
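The five evaluation metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `binary_metrics`, the 0/1 coding (0 = early stage, 1 = late stage), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and none of the numbers come from the study's dataset.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int],
    y_score: Sequence[float],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and ROC-AUC for a binary classifier.

    Assumed coding (hypothetical): 0 = early stage, 1 = late stage.
    y_score is the predicted probability of class 1.
    """
    # Turn scores into hard predictions at the chosen threshold.
    y_pred = [1 if s >= threshold else 0 for s in y_score]

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # ROC-AUC via its rank interpretation: the probability that a randomly
    # chosen positive scores higher than a randomly chosen negative,
    # counting ties as one half. Quadratic in sample size, fine for a demo.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if pos and neg:
        wins = sum(
            1.0 if p > n else 0.5 if p == n else 0.0
            for p in pos
            for n in neg
        )
        roc_auc = wins / (len(pos) * len(neg))
    else:
        roc_auc = float("nan")  # undefined when only one class is present

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc,
    }


if __name__ == "__main__":
    # Toy data only; not drawn from the Kaggle dataset used in the study.
    toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
    toy_score = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.8, 0.55]
    for name, value in binary_metrics(toy_true, toy_score).items():
        print(f"{name}: {value:.3f}")
```

Accuracy, precision, recall, and F1 depend on the decision threshold, whereas ROC-AUC depends only on how the scores rank the two classes. An AUC of 0.5 corresponds to random ranking, which is why the reported AUC of 0.58 indicates only limited separation between early and late stages.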