Pathway-based machine learning for breast cancer risk stratification: an interpretable framework validated in two independent cohorts

Suhaan Thayyil
Eshaan Nidee

Abstract

Background. Patients with the same breast cancer molecular subtype can have markedly different survival outcomes. Standard PAM50 subtype labels group patients into five categories but do not explain within-subtype variability. Gene expression data provide deeper biological insight, yet models using thousands of genes are often difficult to interpret and prone to overfitting in small datasets.

Methods. We derived seven pathway activity scores from RNA-sequencing data in two independent cohorts: TCGA-BRCA (n = 213, training) and SCAN-B/GSE96058 (n = 1,483, validation). Pathways included cell proliferation, estrogen response, immune infiltration, apoptosis, epithelial-to-mesenchymal transition, HER2 signaling, and angiogenesis. Scores were calculated by averaging standardized gene expression within each pathway. We compared this approach to single-sample gene set enrichment analysis (ssGSEA) and a PAM50-only baseline. Performance was evaluated using Cox proportional hazards models, five-year survival classification with three machine learning classifiers under five-fold cross-validation, and external validation via fixed-model transfer. SHAP values, calibration, and decision curve analysis were also assessed.

Results. The combined model (pathway scores plus clinical features) achieved an AUC of 0.856 (95% CI: 0.833–0.879) and a concordance index (C-index) of 0.827 (±0.031), exceeding the PAM50-only baseline by 0.243 AUC and 0.214 C-index. Pathway scoring exceeded ssGSEA by 0.010 to 0.038 AUC. In 719 Luminal A patients, the model significantly stratified survival (log-rank p = 5.9 × 10⁻²²; C-index 0.848). Feature importance rankings were stable (Spearman ρ = 0.82). External validation yielded AUCs of 0.719–0.762.

Conclusions. Pathway activity scores improve prognostic accuracy beyond subtype classification and generalize across cohorts, particularly when clinical data are limited. Prospective validation is warranted.
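The scoring step in the Methods (averaging standardized gene expression within each pathway) can be illustrated with a minimal sketch. This is not the authors' code. The gene sets, expression values, and function names below are hypothetical toy choices, and the sketch assumes that "standardized" means a per-gene z-score across samples, which the abstract does not specify.

```python
from statistics import mean, pstdev


def zscore_genes(expr):
    """Z-score each gene across samples.

    expr maps gene name -> list of expression values, one per sample.
    Assumption: standardization is per gene, across samples.
    """
    standardized = {}
    for gene, values in expr.items():
        mu = mean(values)
        sd = pstdev(values)
        # A gene with zero variance carries no signal; map it to zeros.
        standardized[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return standardized


def pathway_scores(expr, pathways):
    """Average the standardized expression of each pathway's genes, per sample."""
    z = zscore_genes(expr)
    n_samples = len(next(iter(expr.values())))
    scores = {}
    for name, genes in pathways.items():
        present = [g for g in genes if g in z]  # ignore genes not measured
        if not present:
            raise ValueError(f"no measured genes for pathway {name!r}")
        scores[name] = [mean(z[g][i] for g in present) for i in range(n_samples)]
    return scores


# Toy expression matrix for three samples (made-up values, illustration only).
expr = {
    "MKI67": [1.0, 2.0, 3.0],
    "TOP2A": [2.0, 4.0, 6.0],
    "ESR1": [5.0, 5.0, 5.0],
    "PGR": [3.0, 1.0, 2.0],
}

# Hypothetical two-gene sets; the preprint's actual gene lists are not given here.
pathways = {
    "proliferation": ["MKI67", "TOP2A"],
    "estrogen_response": ["ESR1", "PGR"],
}

scores = pathway_scores(expr, pathways)
for name, values in scores.items():
    print(name, [round(v, 3) for v in values])
```

Each pathway becomes a single feature per patient, so a downstream Cox model or classifier would see seven inputs rather than thousands of genes. That reduction is what makes the approach easier to interpret than gene-level models.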

Version published to 10.21203/rs.3.rs-9297397/v1 on Research Square
Apr 8, 2026

Related articles

Integrated Multi-Omics Analysis for the Identification of Disease-Associated Variations and Prognostic Biomarkers in Triple-Negative Breast Cancer (TNBC)

This article has 2 authors:
1. NAGENDRA MANNEKUNTA
2. ELAMATHI NATRAJAN
This article has no evaluations. Latest version: May 6, 2026

Integrated bulk and single-cell transcriptomic analysis reveals a tryptophan metabolism-driven prognostic signature and therapeutic landscape in triple-negative breast cancer

This article has 6 authors:
1. Youjun Wu
2. Xiaorong Pang
3. Feng Cen
4. Liang Xie
5. Xiang Feng
6. Xianglan Mo
This article has no evaluations. Latest version: Apr 3, 2026

Decoding Tumor Phenotypes: A Radiologist-Inspired Deep Learning Framework for Breast Cancer Recurrence Prediction

This article has 17 authors:
1. Tao Tan
2. Chunyao Lu
3. Tianyu Zhang
4. Xinglong Liang
5. Antonio Portaluri
6. Luyi Han
7. Yaqian Chen
8. Nika Rasoolzadeh
9. Ruixiang Qi
10. Yuan Gao
11. Xin Wang
12. Yaofei Duan
13. Zahra Aghdam
14. Muzhen He
15. Jonas Teuwen
16. Maciej Mazurowski
17. Ritse Mann
This article has no evaluations. Latest version: Apr 15, 2026
