Interpretable Machine Learning Models for Bladder Cancer Overall Survival Prediction: Development and External Validation via SEER Database and Chinese Cohort Analysis
Abstract
Objective: We developed interpretable machine learning (ML) models to predict overall survival in patients with bladder cancer, with the aim of making the modeling results more interpretable and transparent.
Methods: We collected clinical and pathological information on patients with bladder cancer from the SEER database and split it into training and validation sets in a 7:3 ratio. We also obtained an external validation cohort from Kashgar First People's Hospital in Xinjiang, China. We performed LASSO regression and Cox regression analyses to identify relevant risk factors and then used these factors to develop CoxPH and six ML models: random survival forest (RSF), gradient boosting with component-wise linear models (GLMboost), decision tree (dt), boosted tree (bt), DeepSurv, and neural multi-task logistic regression (NMTLR). We evaluated the predictive performance of these models using the concordance index (C-index), the cumulative/dynamic area under the curve (AUC), the integrated Brier score, and the Kolmogorov-Smirnov (KS) statistic. For interpretability assessment, we employed three complementary methods: (1) time-dependent variable importance to quantify feature contribution across follow-up periods; (2) partial correlation survival plots to visualize individual variable effects; and (3) aggregated survival SHapley Additive exPlanations (SurvSHAP) plots with mean absolute deviation metrics to validate the stability of feature impact at both the individual and population levels.
Results: The final ML model comprised 14 factors, including age, AJCCStage, chemotherapy, Mstage, marital status, Tstage, bone metastasis (BoneMets), stage, radiation, histology, liverMets, Nstage, and sex. The predictive models demonstrated good discriminative ability, with the boosted tree model performing best.
The AUCs for 1-year, 3-year, and 5-year overall survival (OS) exceeded 0.770 in the training, validation, and external validation sets, and the overall Brier score was consistently below 0.180. Interpretability analysis of the boosted tree model further indicated that AJCCStage, age, chemotherapy, stage, Mstage, and marital status were the most influential predictors, as quantified by SurvSHAP values and time-dependent importance weights, with their effects visually confirmed by partial correlation survival curves.
Conclusions: The boosted tree model performed best and can be used to predict OS in patients with bladder cancer, helping physicians assess patients' overall survival more accurately and providing a useful reference for diagnosis, treatment, and prognosis evaluation.
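As a rough illustration of the C-index named among the evaluation metrics, the following is a minimal pure-Python sketch of Harrell's concordance index for right-censored data. It is not the authors' implementation; the function name `harrell_c_index`, the toy data, and the simplified handling of tied times (such pairs are skipped) are assumptions made only for this example.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: observed follow-up times
    events: 1 if death was observed, 0 if the patient was censored
    risk_scores: model output where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        # Simplification (assumption): pairs with tied times are skipped.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier death was given the higher risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data, not taken from SEER or the external cohort.
toy_times = [5, 12, 20, 31, 40]
toy_events = [1, 1, 0, 1, 0]
toy_risk = [0.9, 0.4, 0.6, 0.3, 0.1]
c = harrell_c_index(toy_times, toy_events, toy_risk)
print(round(c, 3))
```

On this toy data, 7 of the 8 comparable pairs are ranked correctly, so the sketch returns 0.875. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.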