Cross-Platform Reproducible Modeling of Breast Cancer Prognosis Using the Core-PAM50 Gene Signature

Abstract

Background: PAM50 is a widely adopted multigene signature for breast-cancer subtyping and prognosis, but cross-platform variability and incomplete gene coverage limit its portability. We developed a streamlined, platform-agnostic core-PAM50 panel (40 genes) and a fully documented pipeline to deliver reproducible prognostic modeling across major public cohorts.

Methods: Transcriptomes and clinical data from METABRIC (microarray, n = 2,173), TCGA-BRCA (RNA-seq, n = 1,098), and GSE25066 (microarray, neoadjuvant chemotherapy, n = 508) were harmonized using HGNC symbol mapping and intra-cohort gene-wise z-scaling. Models were trained in METABRIC with LASSO-penalized Cox regression and explored with Random Survival Forests; the fixed METABRIC coefficients were applied without recalibration to TCGA and GSE25066. Performance was assessed by C-index, time-dependent AUC, calibration at 5 years, decision-curve analysis (DCA), and meta-analysis of hazard ratios (HR). Intrinsic subtypes were assigned by nearest-centroid correlation restricted to the 40 genes, and cross-cohort subtype centroids were compared by Pearson r.

Results: The LASSO model retained 20 of the 40 genes, capturing a luminal–proliferative axis; internal discrimination in METABRIC was C-index 0.584. External discrimination was AUC₆₀ ≈ 0.60–0.63 in GSE25066 and attenuated in TCGA (C-index ≈ 0.42), consistent with short follow-up and low event rates. Comparing the low-risk with the high-risk group, HRs were 0.50 (METABRIC OS; ~0.40–0.67), 0.89 (TCGA OS; 0.67–1.19), and 0.50 (GSE25066 DRFS; 0.35–0.73). The random-effects pooled estimate across validation cohorts was HR 0.68 (0.39–1.20), indicating a consistent protective direction for the low-risk group, although the pooled interval includes 1. Calibration was excellent in METABRIC and good in GSE25066; DCA showed positive net benefit in clinically relevant threshold ranges in both. Subtype centroids were highly concordant across platforms (r > 0.8, often ≈ 0.9), and PCA reproduced the expected basal–luminal separation.
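The harmonization and scoring steps named in the Methods (intra-cohort gene-wise z-scaling, applying fixed coefficients without recalibration, and nearest-centroid subtype assignment) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the gene subset, expression values, coefficients, and centroids are hypothetical placeholders, and the use of the population standard deviation for scaling and Pearson correlation for centroid assignment are assumptions, since the abstract does not specify either.

```python
from statistics import mean, pstdev

def zscale_genes(expr):
    """Z-scale each gene within one cohort (expr: gene -> values per sample)."""
    scaled = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)  # population SD is an assumption
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled

def risk_scores(scaled, coefs):
    """Linear predictor per sample from fixed coefficients (no refitting)."""
    n_samples = len(next(iter(scaled.values())))
    return [sum(beta * scaled[g][i] for g, beta in coefs.items())
            for i in range(n_samples)]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den > 0 else 0.0

def assign_subtype(profile, centroids, genes):
    """Return the subtype whose centroid correlates best with the profile."""
    x = [profile[g] for g in genes]
    return max(centroids,
               key=lambda s: pearson(x, [centroids[s][g] for g in genes]))

# Hypothetical toy cohort: 3 genes x 3 samples (placeholder values only).
expr = {"ESR1": [1.0, 2.0, 3.0], "MKI67": [3.0, 1.0, 2.0], "KRT5": [2.0, 3.0, 1.0]}
# Hypothetical coefficients standing in for the fixed training-cohort model.
coefs = {"ESR1": -0.2, "MKI67": 0.3}
# Hypothetical centroids for two subtypes.
centroids = {
    "LumA": {"ESR1": 1.0, "KRT5": -1.0, "MKI67": -0.5},
    "Basal": {"ESR1": -1.0, "KRT5": 1.0, "MKI67": 0.5},
}

scaled = zscale_genes(expr)
scores = risk_scores(scaled, coefs)
genes = sorted(expr)
subtypes = [assign_subtype({g: scaled[g][i] for g in genes}, centroids, genes)
            for i in range(3)]
print(scores, subtypes)
```

Scaling within each cohort before applying the coefficients is what lets one coefficient vector be reused across microarray and RNA-seq data, at the cost of making scores relative to the cohort's own distribution.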
Conclusions: The core-PAM50 condenses PAM50 to 40 cross-platform genes while preserving intrinsic-subtype biology and yielding a portable, reproducible prognostic score validated across microarray and RNA-seq cohorts. Its transparency and parsimony provide a practical path toward cost-effective assays (qPCR/targeted RNA-seq) and facilitate meta-analytic reuse. Prospective studies and integration with clinical or immune features may further enhance clinical utility.

Trial registration: Not applicable.
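The random-effects pooling of hazard ratios reported in the abstract can be worked through with a short Python sketch. Two assumptions are made that the abstract does not state: the intervals in parentheses are treated as 95% confidence intervals, and the between-study variance is estimated with the DerSimonian-Laird method. The function names are illustrative, and only the two external cohorts (TCGA and GSE25066) are pooled.

```python
import math

Z95 = 1.959964  # normal quantile; assumes the reported intervals are 95% CIs

def log_hr_and_var(hr, lo, hi):
    """Convert an HR and its interval into a log-HR and its variance."""
    se = (math.log(hi) - math.log(lo)) / (2 * Z95)
    return math.log(hr), se ** 2

def dersimonian_laird(estimates):
    """Random-effects pooling of (log_hr, variance) pairs (DL estimator)."""
    y = [e[0] for e in estimates]
    v = [e[1] for e in estimates]
    w = [1.0 / vi for vi in v]                      # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (vi + tau2) for vi in v]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return (math.exp(pooled),
            math.exp(pooled - Z95 * se),
            math.exp(pooled + Z95 * se),
            tau2)

# Low- vs high-risk HRs for the two external cohorts, as given in the abstract.
validation = [
    log_hr_and_var(0.89, 0.67, 1.19),  # TCGA, OS
    log_hr_and_var(0.50, 0.35, 0.73),  # GSE25066, DRFS
]
pooled_hr, ci_low, ci_high, tau2 = dersimonian_laird(validation)
print(f"pooled HR {pooled_hr:.2f} ({ci_low:.2f}-{ci_high:.2f}), tau^2 {tau2:.3f}")
```

Under these assumptions the result lands close to the reported pooled HR of 0.68 (0.39–1.20). With only two cohorts the between-study variance is poorly determined, which contributes to the wide pooled interval.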
