System-level health profiling from blood DNA methylation with explainable deep learning

Abstract

Genome-scale DNA methylation (DNAm) profiles capture organismal physiology, but most predictive models lack transparency and multi-level applicability. Here we develop an explainable framework that quantifies respiratory, cardiovascular, and metabolic status as bounded health scores (0–1) derived from sex-specific clinical reference ranges and disease penalties, and then predicts these scores from whole-blood DNAm. Using Generation Scotland and case-control samples (n = 14,496 individuals), we screened 39 covariates for disease relevance and DNAm predictability, yielding system-relevant panels that were aggregated into scores. We compressed DNAm profiles with a protein-interaction-guided autoencoder and trained health predictors on the resulting 128-dimensional embeddings using fully connected networks. On held-out samples, the models reproduced the composite scores with strong rank agreement (Spearman ρ = 0.87, R² = 0.71 for respiratory health; ρ = 0.82, R² = 0.66 for cardiovascular; ρ = 0.81, R² = 0.64 for metabolic) and recovered the expected population structure in a generally healthy cohort, with clear separation between “single-system low” and “multi-system low” phenotypes and graded coupling across systems without redundancy. Further, the top features retrieved from each explainable predictor aligned with system biology: airway epithelial repair, hypoxia, and inflammatory trafficking for respiratory; endothelial remodeling and cardiomyocyte programs for cardiovascular; glucose-lipid metabolism and metaflammation for metabolic. These results show that DNAm embeddings can yield accurate, transparent, and system-aware health profiling from blood, providing actionable summaries while revealing the molecular processes the models use to infer multi-system status. The approach positions DNAm embeddings combined with interpretable penalty targets as a practical bridge from epigenomic signal to system-level triage, and it can be extended for evaluation in larger, more diverse cohorts.
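
The abstract does not give the scoring formula, so the following is only a minimal sketch of how a bounded (0–1) system score could be built from sex-specific reference ranges and a disease penalty. Everything in it is an assumption made for illustration: the names `ReferenceRange`, `marker_score`, and `system_score`, the linear decay outside the range, the multiplicative penalty, and the placeholder markers and ranges, none of which are clinical values or the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceRange:
    """Hypothetical reference interval for one marker and one sex."""
    low: float
    high: float


def marker_score(value: float, ref: ReferenceRange) -> float:
    """Return 1.0 inside the range, decaying linearly to 0.0 once the
    value lies one full range-width outside it (assumed decay rule)."""
    if ref.low <= value <= ref.high:
        return 1.0
    width = ref.high - ref.low
    distance = ref.low - value if value < ref.low else value - ref.high
    return max(0.0, 1.0 - distance / width)


def system_score(values: dict, ranges: dict, sex: str,
                 disease_penalty: float = 0.0) -> float:
    """Average the per-marker scores for one physiological system, then
    shrink the result by an assumed multiplicative disease penalty in [0, 1]."""
    scores = [marker_score(v, ranges[name][sex]) for name, v in values.items()]
    mean_score = sum(scores) / len(scores)
    return min(1.0, max(0.0, mean_score * (1.0 - disease_penalty)))


# Placeholder markers and ranges, purely illustrative (not clinical values).
ranges = {
    "marker_a": {"F": ReferenceRange(4.0, 6.0), "M": ReferenceRange(4.0, 6.0)},
    "marker_b": {"F": ReferenceRange(4.0, 6.0), "M": ReferenceRange(5.0, 9.0)},
}
values = {"marker_a": 5.0, "marker_b": 7.0}

score_f = system_score(values, ranges, sex="F", disease_penalty=0.2)
score_m = system_score(values, ranges, sex="M")
print(f"F: {score_f:.2f}  M: {score_m:.2f}")
```

In this sketch the same measurements give different scores for the two sexes because `marker_b` falls outside the placeholder female range but inside the male one, which is the kind of behaviour a sex-specific reference range is meant to produce. A target bounded in this way could then be regressed on the DNAm embeddings.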
