Cross-Platform Reproducible Modeling of Breast Cancer Prognosis Using the Core-PAM50 Gene Signature
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Background PAM50 is a widely adopted multigene signature for breast-cancer subtyping and prognosis, but cross-platform variability and incomplete gene coverage limit its portability. We developed a streamlined, platform-agnostic core-PAM50 panel (40 genes) and a fully documented pipeline to deliver reproducible prognostic modeling across major public cohorts. Methods Transcriptomes and clinical data from METABRIC (microarray, n = 2,173), TCGA-BRCA (RNA-seq, n = 1,098), and GSE25066 (microarray, neoadjuvant chemotherapy, n = 508) were harmonized using HGNC symbol mapping and intra-cohort gene-wise z-scaling. Models were trained in METABRIC with LASSO-penalized Cox regression and explored with Random Survival Forests; the fixed METABRIC coefficients were applied without recalibration to TCGA and GSE25066. Performance was assessed by C-index, time-dependent AUC, calibration at 5 years, decision-curve analysis (DCA), and meta-analysis of hazard ratios (HR). Intrinsic subtypes were assigned by nearest-centroid correlation restricted to the 40 genes, and cross-cohort subtype centroids were compared by Pearson r. Results The LASSO model retained 20/40 genes capturing a luminal–proliferative axis; internal discrimination in METABRIC was C-index 0.584. External discrimination was AUC₆₀ ≈ 0.60–0.63 in GSE25066 and attenuated in TCGA (C-index ≈ 0.42), consistent with short follow-up and low event rates. Using Low vs High risk orientation, HRs were 0.50 (METABRIC OS; ~0.40–0.67), 0.89 (TCGA OS; 0.67–1.19), and 0.50 (GSE25066 DRFS; 0.35–0.73). The random-effects pooled estimate across validation cohorts was HR 0.68 (0.39–1.20), indicating a consistent protective direction for the low-risk group. Calibration was excellent in METABRIC and good in GSE25066; DCA showed positive net benefit in clinically relevant threshold ranges in both. Subtype centroids were highly concordant across platforms (r > 0.8, often ≈ 0.9), and PCA reproduced expected basal–luminal separation. Conclusions The core-PAM50 condenses PAM50 to 40 cross-platform genes while preserving intrinsic-subtype biology and yielding a portable, reproducible prognostic score validated across microarray and RNA-seq cohorts. Its transparency and parsimony provide a practical path toward cost-effective assays (qPCR/targeted RNA-seq) and facilitate meta-analytic reuse. Prospective studies and integration with clinical or immune features may further enhance clinical utility. Trial registration Not applicable.