Batch-Harmonized Machine Learning Framework for Cross-Cohort RNA Biomarker Discovery in Pancreatic Adenocarcinoma

Mark Barsoum Markarian


Abstract

Background

Pancreatic ductal adenocarcinoma (PDAC) lacks reliable prognostic biomarkers. RNA-based signatures suffer from poor reproducibility due to batch effects and platform heterogeneity between microarray and RNA-seq data, limiting machine learning applications.

Methods

We developed a computational pipeline that harmonizes RNA-seq data from multiple repositories using ComBat batch correction, followed by Random Forest and XGBoost classification. Restricting the analysis to RNA-seq platforms yielded 14,137 genes common to TCGA-PAAD (n=178) and the validation cohort GSE71729 (n=357). We quantified batch correction efficacy with silhouette coefficients and trained the models on survival outcomes.
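The harmonization step can be illustrated with a minimal Python sketch. This is not the authors' R pipeline and it is not ComBat: it only intersects gene identifiers across cohorts and subtracts each gene's within-cohort mean, whereas ComBat additionally estimates scale effects and shrinks its estimates with empirical Bayes. The function names, the gene names `GENE_A_ONLY` and `GENE_B_ONLY`, and all expression values are hypothetical.

```python
from statistics import mean


def common_genes(*cohorts):
    """Return the sorted gene IDs present in every cohort."""
    shared = set(cohorts[0])
    for cohort in cohorts[1:]:
        shared &= set(cohort)
    return sorted(shared)


def center_per_batch(cohort, genes):
    """Subtract each gene's within-cohort mean.

    This is a location-only adjustment, a simplified stand-in for ComBat,
    which also adjusts scale and applies empirical Bayes shrinkage.
    """
    return {g: [x - mean(cohort[g]) for x in cohort[g]] for g in genes}


# Toy data (hypothetical values): gene -> expression per sample.
cohort_a = {"LAMC2": [5.0, 7.0], "DKK1": [2.0, 4.0], "GENE_A_ONLY": [1.0, 1.0]}
cohort_b = {"LAMC2": [9.0, 11.0], "DKK1": [6.0, 8.0], "GENE_B_ONLY": [3.0, 3.0]}

genes = common_genes(cohort_a, cohort_b)      # genes measured in both cohorts
adj_a = center_per_batch(cohort_a, genes)     # cohort offset removed
adj_b = center_per_batch(cohort_b, genes)
```

In this toy case the two cohorts differ only by a constant offset per gene, so centering makes their adjusted values identical; real batch effects are rarely this simple.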

Results

ComBat correction eliminated dataset-specific clustering: the silhouette coefficient fell from 0.866 to -0.012. Random Forest achieved 64% training accuracy and identified five prognostic biomarkers: LAMC2, DKK1, ITGB6, GPRC5A, and MAL2. These genes showed consistent importance across models and biological relevance to invasion, epithelial-mesenchymal transition, and tumor suppression. The models generalized to the independent validation data.
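To make the silhouette figures concrete, here is a minimal Python sketch of the mean silhouette coefficient computed over dataset labels. It assumes Euclidean distance and uses hypothetical two-dimensional toy points; the abstract does not state the distance metric or the embedding the authors used. A value near 1 means samples sit closest to their own dataset, and a value near or below 0 means the datasets are intermixed.

```python
import math
from statistics import mean


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance.

    For each sample: a = mean distance to its own group, b = smallest mean
    distance to any other group, s = (b - a) / max(a, b).
    Requires at least two distinct labels.
    """
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.get(labels[i])
        if not own:
            scores.append(0.0)  # common convention for a single-member group
            continue
        a = mean(own)
        b = min(mean(d) for lab, d in dists.items() if lab != labels[i])
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return mean(scores)


batch = ["A", "A", "B", "B"]

# Before correction (toy): each dataset forms its own distant cluster.
separated = silhouette([(0, 0), (0, 1), (10, 0), (10, 1)], batch)

# After correction (toy): samples from both datasets are interleaved.
mixed = silhouette([(0, 0), (1, 1), (0, 1), (1, 0)], batch)
```

With these toy points, `separated` is high and `mixed` is below zero, which mirrors the direction of the reported change.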

Conclusions

We present the first open-source R pipeline optimized for RNA-seq-based, cross-cohort biomarker discovery in pancreatic cancer. Platform-matched datasets yielded superior gene coverage versus multi-platform approaches, enabling robust machine learning classification. Our framework identifies five novel prognostic genes and provides a reproducible method for multi-center RNA biomarker studies, available through an interactive Shiny application.

Availability

All code, processed data, and the interactive Shiny application are available at https://github.com/MarkBarsoumMarkarian/rna-harmonization-ai

Graphical Abstract: Machine Learning Workflow for Prognostic Biomarker Discovery

Version published to 10.1101/2025.11.14.688421 on bioRxiv
Nov 14, 2025

Related articles

Cross-Platform Reproducible Modeling of Breast Cancer Prognosis Using the Core-PAM50 Gene Signature
Authors: Rafael de Negreiros Botan, Joao Batista de Sousa
No evaluations. Latest version: Dec 19, 2025

Multi-Omic Integration and Machine Learning Reveal Regulatory Networks Driving Breast Cancer Progression
Authors: Unmilita Das Moon, Kushal Raj Roy
No evaluations. Latest version: Dec 11, 2025

Prospective Germline Exome and Machine Learning-Based Risk Score Identify Predictive and Prognostic Biomarkers of Immunotherapy Outcomes in Advanced Non-Small Cell Lung Cancer
Authors: Andrea González-Hernández, Alberto Ríos, Juan Luis Onieva, Alexandra Cantero, Marina Rivero-Aguilar, Guillermo Paz-López, Antonio Rueda-Dominguez, María Garrido-Barros, Beatriz Martínez-Gálvez, Juan Zafra, Laura Cristina Figueroa-Ortiz, Elisabeth Pérez-Ruiz, José Carlos Benitez, Isabel Barragan, Javier Oliver
No evaluations. Latest version: Jan 20, 2026