CpG Traceability and Pathway Mapping in Epigenetic Aging with Explainable AI
Abstract
DNA methylation at CpG sites is among the most reliable molecular markers of aging. Machine learning models can predict biological age from methylation data with reasonable accuracy, but interpreting those predictions remains the real challenge. Most models work as black boxes: they return an age estimate but reveal little about how specific CpGs influence gene regulation or downstream pathways. This study addresses that gap by combining classic regression models with explainable AI methods to make CpG traceability clear and direct. We started with whole-blood methylation data from 656 individuals (GSE40279) and used feature selection to identify the most informative CpGs. We then trained predictive models using XGBoost, LightGBM, and several ensemble methods, and evaluated their accuracy with cross-validation. The best stacked ensemble reached an R² of 0.73 and a mean absolute error of 6.1 years, giving a solid foundation for interpretation. Beyond prediction, we traced each CpG through enhancer annotations to its target genes, then mapped those genes to biological processes. Sankey diagrams of these links consistently showed enrichment of pathways related to transcriptional regulation and cell proliferation, both major contributors to the aging process. These results show that explainable AI can go beyond prediction and connect methylation markers to meaningful biological functions. Linking CpGs to enhancers, genes, and Gene Ontology terms gives a transparent view of how epigenetic drift might drive aging at the molecular level. This work lays the groundwork for interpretable epigenetic modeling; the next step is to validate these findings across different tissues.
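The tracing step described in the abstract (CpG to enhancer to gene to biological process) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the identifiers (`cpg_1`, `enh_A`, `gene_X`) are placeholders rather than real annotations, the function names are hypothetical, and weighting each Sankey link by the number of paths passing through it is one plausible convention, not necessarily the one used in the study.

```python
from collections import Counter

# Hypothetical toy annotations; every identifier is a placeholder, not real data.
CPG_TO_ENHANCERS = {
    "cpg_1": ["enh_A"],
    "cpg_2": ["enh_A", "enh_B"],
    "cpg_3": [],  # a CpG with no annotated enhancer drops out of the trace
}
ENHANCER_TO_GENES = {
    "enh_A": ["gene_X"],
    "enh_B": ["gene_X", "gene_Y"],
}
GENE_TO_GO_TERMS = {
    "gene_X": ["transcriptional regulation"],
    "gene_Y": ["transcriptional regulation", "cell proliferation"],
}


def trace_cpgs(cpgs):
    """Return every (cpg, enhancer, gene, go_term) path for the given CpGs."""
    paths = []
    for cpg in cpgs:
        for enhancer in CPG_TO_ENHANCERS.get(cpg, []):
            for gene in ENHANCER_TO_GENES.get(enhancer, []):
                for term in GENE_TO_GO_TERMS.get(gene, []):
                    paths.append((cpg, enhancer, gene, term))
    return paths


def sankey_links(paths):
    """Count how many paths use each link between adjacent layers.

    Assumed convention: link weight = number of traced paths through it.
    """
    links = Counter()
    for path in paths:
        for source, target in zip(path, path[1:]):
            links[(source, target)] += 1
    return links


paths = trace_cpgs(["cpg_1", "cpg_2", "cpg_3"])
links = sankey_links(paths)
term_counts = Counter(term for *_, term in paths)

for (source, target), weight in sorted(links.items()):
    print(f"{source} -> {target}: {weight}")
```

The `links` counter has the source, target, and weight shape that most Sankey plotting tools expect, and `term_counts` shows which process terms recur most often across traced CpGs. A real analysis would replace the toy dictionaries with enhancer and Gene Ontology annotation tables and would test enrichment statistically rather than by raw counts.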