CpGPT: a Foundation Model for DNA Methylation
Abstract
DNA methylation is a type of epigenetic modification that plays a significant role in development, aging, and disease. Despite extensive research, the molecular mechanisms of DNA methylation remain poorly understood. Foundation models are a class of machine learning models that leverage vast quantities of data to make sense of complex data types, such as genome sequences or single-cell transcriptomes. Here, we present the Cytosine-phosphate-Guanine Pretrained Transformer (CpGPT), a novel foundation model pretrained on CpGCorpus, a new database of more than 2,000 DNA methylation datasets encompassing over 150,000 samples from diverse conditions. CpGPT leverages an improved transformer architecture to learn comprehensive representations of methylation patterns, allowing it to impute and reconstruct genome-wide methylation profiles from limited input data. By capturing sequence, positional, and epigenetic contexts, CpGPT outperforms specialized models when finetuned for aging-related tasks, such as mortality risk and morbidity assessments. The model is highly adaptable and can impute beta values across different methylation platforms, tissue types, mammalian species, and even single-cell data. As a foundation model, CpGPT can be leveraged as a new tool for biological discovery in the field of epigenetics. The open-source code and model can be found at http://github.com/lcamillo/CpGPT .
Highlights
- CpGPT is a novel foundation model for DNA methylation analysis, pretrained on over 2,000 datasets encompassing 150,000+ samples.
- The model demonstrates strong performance in zero-shot tasks including imputation, array conversion, and reference mapping.
- CpGPT achieves state-of-the-art results in mortality prediction and chronological age estimation.