Robust methylome analysis and tumour–normal classification in TCGA–COAD: a reproducible workflow
Abstract
Colorectal adenocarcinoma is caused in part by widespread epigenetic deregulation, yet genome-wide analysis of its DNA methylation is complicated by spatial correlation among CpG sites, multiscale patterns of differential methylation, and confounding cellular heterogeneity in bulk tissue. This study develops a simple yet effective framework that combines rigorous statistical modelling with modern machine-learning-based prediction, applied to the Illumina 450K data from The Cancer Genome Atlas (TCGA) colorectal adenocarcinoma cohort. In our framework, differentially methylated regions (DMRs) are first detected using functional smoothing and permutation-based bump hunting, revealing both focal CpG island hypermethylation and broad hypomethylated domains spanning hundreds of kilobases. Next, reference-based cell-type deconvolution and surrogate variable analysis (SVA) control for immune/stromal admixture and hidden confounding effects, yielding well-calibrated single-site and region-level inference. Finally, for tumour-status prediction, we empirically compare a classical logistic regression model against random forest, gradient boosting (XGBoost), and a feed-forward neural network; logistic regression achieves the lowest root-mean-square error and Brier score, reflecting its superior probability calibration. Overall, our integrated framework provides biologically interpretable and highly predictive methylation signatures of colorectal cancer and offers a transferable baseline for future large-scale cancer epigenomics studies. Our code is publicly available at https://github.com/matekum/tcga-coad-methylation-ekum-2025a .
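To make the two calibration metrics named above concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that the root-mean-square error is computed between predicted tumour probabilities and 0/1 labels, in which case it equals the square root of the Brier score. The function names, labels, and probabilities are hypothetical illustrations and are not taken from the study.

```python
import math


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and 0/1 label."""
    if len(y_true) != len(p_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def rmse(y_true, p_pred):
    """Root-mean-square error of the probabilities (square root of the Brier score)."""
    return math.sqrt(brier_score(y_true, p_pred))


# Hypothetical data: 1 = tumour, 0 = normal, with predicted tumour probabilities.
labels = [1, 1, 1, 0, 0]
probs = [0.9, 0.8, 0.6, 0.2, 0.1]

print("Brier score:", brier_score(labels, probs))
print("RMSE:", rmse(labels, probs))
```

Lower values of either metric indicate predicted probabilities that sit closer to the observed outcomes, which is the sense in which a model can be called better calibrated here.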
Author summary
We present a reproducible framework for analysing genome-wide DNA methylation in colorectal cancer using public TCGA data. Our aim is twofold: first, we make single-site and regional findings statistically reliable by accounting for spatial correlation, hidden confounding, and cell-type mixtures; second, we evaluate practical classifiers for tumour vs. normal prediction. We smooth effect sizes across neighbouring CpG sites and use permutation tests to detect differentially methylated regions, while surrogate variable analysis and reference-based cell-fraction estimates reduce unwanted variation. We then compare logistic regression with random forests, gradient boosting, and a neural network using principal component features. Across models, we observe near-perfect ranking of samples and strong probability calibration, with logistic regression performing best on calibration. All code, parameters, and figure scripts are openly available so others can reproduce and adapt the framework. Beyond colorectal cancer, the approach provides a template for robust methylome analysis and predictive modelling in other large-scale epigenomic studies.
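The idea of smoothing effect sizes across neighbouring CpG sites and testing them by permutation can be sketched as follows. This is a deliberately simplified stand-in, not the authors' pipeline: it replaces functional smoothing with a plain moving average, uses the maximum absolute smoothed effect as the test statistic, and runs on a made-up toy matrix. All function names, the window size, and the data are assumptions made for illustration.

```python
import random


def mean_diff(beta, is_tumour):
    """Per-CpG difference in mean methylation (tumour minus normal)."""
    tumour = [row for row, t in zip(beta, is_tumour) if t]
    normal = [row for row, t in zip(beta, is_tumour) if not t]
    n_sites = len(beta[0])
    return [
        sum(r[j] for r in tumour) / len(tumour)
        - sum(r[j] for r in normal) / len(normal)
        for j in range(n_sites)
    ]


def smooth(values, half_window=1):
    """Moving average over neighbouring sites (the window shrinks at the edges)."""
    out = []
    for j in range(len(values)):
        lo, hi = max(0, j - half_window), min(len(values), j + half_window + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def max_bump(beta, is_tumour, half_window=1):
    """Largest absolute smoothed effect across the region."""
    return max(abs(v) for v in smooth(mean_diff(beta, is_tumour), half_window))


def permutation_p_value(beta, is_tumour, n_perm=200, seed=0):
    """Share of label shuffles whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = max_bump(beta, is_tumour)
    labels = list(is_tumour)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # shuffling keeps both group sizes fixed
        if max_bump(beta, labels) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Hypothetical toy data: 6 samples x 5 ordered CpG sites, tumour rows first.
beta = [
    [0.20, 0.80, 0.90, 0.80, 0.20],
    [0.25, 0.75, 0.85, 0.80, 0.15],
    [0.15, 0.85, 0.95, 0.80, 0.25],
    [0.20, 0.20, 0.20, 0.20, 0.20],
    [0.25, 0.15, 0.25, 0.20, 0.15],
    [0.15, 0.25, 0.15, 0.20, 0.25],
]
is_tumour = [True, True, True, False, False, False]

print("observed statistic:", max_bump(beta, is_tumour))
print("permutation p-value:", permutation_p_value(beta, is_tumour))
```

With only three samples per group there are few distinct label assignments, so the p-value from this toy example cannot become very small; real cohorts provide far more permutations to draw from.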