Pathway-based machine learning for breast cancer risk stratification: an interpretable framework validated in two independent cohorts
Abstract
Background. Patients with the same breast cancer molecular subtype can have markedly different survival outcomes. Standard PAM50 subtype labels group patients into five categories but do not explain within-subtype variability. Gene expression data provide deeper biological insight, yet models using thousands of genes are often difficult to interpret and prone to overfitting in small datasets.

Methods. We derived seven pathway activity scores from RNA-sequencing data in two independent cohorts: TCGA-BRCA (n = 213, training) and SCAN-B/GSE96058 (n = 1,483, validation). The pathways were cell proliferation, estrogen response, immune infiltration, apoptosis, epithelial-to-mesenchymal transition, HER2 signaling, and angiogenesis. Each score was calculated by averaging standardized gene expression within the pathway. We compared this approach to single-sample gene set enrichment analysis (ssGSEA) and a PAM50-only baseline. Performance was evaluated using Cox proportional hazards models, five-year survival classification with three machine learning classifiers under five-fold cross-validation, and external validation via fixed-model transfer. SHAP values, calibration, and decision curve analysis were also assessed.

Results. The combined model (pathway scores and clinical features) achieved an AUC of 0.856 (95% CI: 0.833–0.879) and a concordance index of 0.827 (±0.031), exceeding the PAM50-only baseline by 0.243 AUC and 0.214 C-index. Pathway scoring exceeded ssGSEA by 0.010 to 0.038 AUC. In 719 Luminal A patients, the model significantly stratified survival (log-rank p = 5.9 × 10⁻²²; C-index 0.848). Feature importance was stable (Spearman ρ = 0.82). External validation yielded AUC 0.719–0.762.

Conclusions. Pathway activity scores improve prognostic accuracy beyond subtype classification and generalize across cohorts, particularly when clinical data are limited. Prospective validation is warranted.
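The scoring step described in the Methods (averaging standardized gene expression within each pathway) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list the gene sets or say how standardization was performed, so the gene names, the toy expression values, the function names, and the choice to z-score each gene across the samples of a single cohort are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_genes(expr):
    """Standardize each gene across samples (assumed: per-cohort z-score).

    expr maps gene name -> list of expression values, one per sample.
    """
    z = {}
    for gene, values in expr.items():
        mu = mean(values)
        sd = pstdev(values)
        # A gene with zero variance carries no information; map it to 0.
        z[gene] = [0.0 if sd == 0 else (v - mu) / sd for v in values]
    return z


def pathway_scores(expr, pathways):
    """Average the standardized expression of each pathway's genes per sample."""
    z = zscore_genes(expr)
    n_samples = len(next(iter(expr.values())))
    scores = {}
    for name, genes in pathways.items():
        # Only genes actually measured in this dataset contribute.
        present = [g for g in genes if g in z]
        if not present:
            continue
        scores[name] = [
            mean(z[g][i] for g in present) for i in range(n_samples)
        ]
    return scores


# Toy data: three samples, hypothetical gene sets (not the paper's).
expr = {
    "MKI67": [1.0, 2.0, 3.0],
    "TOP2A": [2.0, 4.0, 6.0],
    "ESR1": [5.0, 5.0, 5.0],
}
pathways = {
    "proliferation": ["MKI67", "TOP2A"],
    "estrogen_response": ["ESR1", "GATA3"],  # GATA3 absent from expr
}

scores = pathway_scores(expr, pathways)
for name, values in scores.items():
    print(name, [round(v, 3) for v in values])
```

Each sample ends up with one number per pathway, so the seven scores form a small, interpretable feature set that can be passed to a Cox model or a classifier. For transfer to an external cohort, one plausible design choice is to reuse the training cohort's per-gene means and standard deviations rather than recomputing them, but the abstract does not state which approach was taken.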