Batch-Harmonized Machine Learning Framework for Cross-Cohort RNA Biomarker Discovery in Pancreatic Adenocarcinoma
Abstract
Background
Pancreatic ductal adenocarcinoma (PDAC) lacks reliable prognostic biomarkers. RNA-based signatures suffer from poor reproducibility due to batch effects and platform heterogeneity between microarray and RNA-seq data, limiting machine learning applications.
Methods
We developed a computational pipeline that harmonizes RNA-seq data from multiple repositories using ComBat batch correction, followed by Random Forest and XGBoost classification. Restricting the analysis to RNA-seq platforms yielded 14,137 genes common to TCGA-PAAD (n=178) and the validation cohort GSE71729 (n=357). We quantified batch-correction efficacy with silhouette coefficients and trained the models on survival outcomes.
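The harmonization step can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' R code. It intersects the gene sets of two cohorts and then standardizes each gene within each cohort. This per-gene centering and scaling is a deliberately simplified stand-in for ComBat, which additionally shrinks batch parameters with an empirical Bayes step. The function names and the toy expression values are hypothetical.

```python
from statistics import mean, pstdev


def common_genes(expr_a, expr_b):
    """Return the sorted gene IDs present in both cohorts."""
    return sorted(set(expr_a) & set(expr_b))


def standardize_within_batch(expr, genes):
    """Z-score each gene within one cohort (simplified location/scale
    adjustment; NOT full ComBat, which adds empirical Bayes shrinkage)."""
    adjusted = {}
    for gene in genes:
        values = expr[gene]
        mu = mean(values)
        sd = pstdev(values) or 1.0  # guard against constant genes
        adjusted[gene] = [(v - mu) / sd for v in values]
    return adjusted


# Toy data (hypothetical values): gene -> expression per sample.
cohort_a = {
    "LAMC2": [5.0, 6.0, 7.0],
    "DKK1": [2.0, 2.5, 3.0],
    "GENE_A_ONLY": [1.0, 1.0, 1.0],
}
cohort_b = {
    "LAMC2": [9.0, 10.0, 11.0],
    "DKK1": [6.0, 6.5, 7.0],
    "GENE_B_ONLY": [0.0, 1.0, 2.0],
}

shared = common_genes(cohort_a, cohort_b)
adj_a = standardize_within_batch(cohort_a, shared)
adj_b = standardize_within_batch(cohort_b, shared)
```

In this toy case the two cohorts differ only by a constant offset per gene, so the adjusted values coincide after standardization. Real cohorts also differ in variance and composition, which is why a model-based method such as ComBat is used in practice.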
Results
ComBat correction eliminated dataset-specific clustering (silhouette coefficient: 0.866 → -0.012). Random Forest achieved 64% training accuracy and identified five prognostic biomarkers: LAMC2, DKK1, ITGB6, GPRC5A, and MAL2. These genes showed consistent importance across models and biological relevance to invasion, epithelial-mesenchymal transition, and tumor suppression. Models successfully generalized to independent validation data.
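To make the silhouette values concrete: when samples are labelled by dataset, a coefficient near 1 means they cluster by dataset, and a coefficient near or below 0 means the datasets overlap. The following is a minimal Python sketch of the standard silhouette definition, s = (b - a) / max(a, b), using Euclidean distance. It is not the authors' implementation, and the toy coordinates and labels are hypothetical.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette over all samples, with s = (b - a) / max(a, b)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same group
        same = [
            dist(p, q)
            for j, (q, other_lab) in enumerate(zip(points, labels))
            if other_lab == lab and j != i
        ]
        if not same:
            scores.append(0.0)  # common convention for singleton groups
            continue
        a = mean(same)
        # b: smallest mean distance to the members of any other group
        b = min(
            mean(dist(p, q) for q, l in zip(points, labels) if l == other)
            for other in set(labels)
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


# Hypothetical 2-D embeddings of samples, labelled by source dataset.
separated_points = [(0, 0), (0, 1), (1, 0), (1, 1),
                    (10, 10), (10, 11), (11, 10), (11, 11)]
separated_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

mixed_points = [(0, 0), (1, 1), (0, 1), (1, 0)]
mixed_labels = ["A", "A", "B", "B"]

score_separated = silhouette_score(separated_points, separated_labels)
score_mixed = silhouette_score(mixed_points, mixed_labels)
```

Here `score_separated` is high because each dataset forms its own tight cluster, while `score_mixed` is not positive because the two datasets are interleaved, which is the pattern expected after effective batch correction.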
Conclusions
We present the first open-source R pipeline optimized for RNA-seq-based, cross-cohort biomarker discovery in pancreatic cancer. Platform-matched datasets yielded superior gene coverage versus multi-platform approaches, enabling robust machine learning classification. Our framework identifies five novel prognostic genes and provides a reproducible method for multi-center RNA biomarker studies, available through an interactive Shiny application.
Availability
All code, processed data, and the interactive Shiny application are available at https://github.com/MarkBarsoumMarkarian/rna-harmonization-ai