scWGBS-GPT: A Foundation Model for Capturing Long-Range CpG Dependencies in Single-Cell Whole-Genome Bisulfite Sequencing to Enhance Epigenetic Analysis
Abstract
Single-cell whole-genome bisulfite sequencing (scWGBS) is a powerful technique for profiling DNA methylation at single-cell and single-nucleotide resolution, providing critical insights into epigenetic regulation in development, disease, and cellular heterogeneity. However, analyzing scWGBS data remains a significant challenge because each cell spans ultra-long genomic sequences with millions of CpG sites, and coverage of those sites is sparse and stochastic. Here, we introduce scWGBS-GPT, the first generative foundation language model specifically designed for high-resolution analysis of scWGBS data at the single-CpG level. By combining a CpG special token design, a Mamba backbone, and a cross-attention head, scWGBS-GPT efficiently processes ultra-long sequences while preserving both local CpG interactions and broader genomic context, enabling biologically meaningful interpretations. Pretrained on 1,000,000 single cells from 28 human and mouse tissues, scWGBS-GPT reconstructs sparse methylation landscapes, enhancing the resolution and accuracy of epigenetic analyses. It significantly outperforms existing methods in key biomedical applications: improving cell clustering to uncover previously unrecognized cellular heterogeneity, enhancing trajectory inference to map precise differentiation pathways, and automating the identification of critical epigenetic biomarkers relevant to disease progression and therapeutic targeting. These advancements establish scWGBS-GPT as a transformative tool in single-cell epigenomics, setting a new standard for DNA methylation analysis and unlocking novel insights into epigenetic mechanisms underlying health and disease. The code for scWGBS-GPT is available at https://github.com/ChaoqiLiang/scWGBS-GPT.
Highlights
- scWGBS-GPT is the first foundation language model for scWGBS data analysis, trained on over 1,000,000 single cells from 28 human and mouse tissues, spanning 221 cell types.
- scWGBS-GPT combines a CpG special token design, a Mamba backbone, and a cross-attention head for efficient processing of up to 2 million CpG sites per cell.
- scWGBS-GPT achieves state-of-the-art performance in cell clustering, annotation, and pseudotime inference.
- scWGBS-GPT is compatible with bulk WGBS data and excels at deconvoluting cell-free DNA methylation data, precisely inferring the tissue or organ of origin and enhancing non-invasive diagnostic capabilities.
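To make the idea of per-CpG tokens under sparse coverage concrete, the following is a minimal sketch of how one cell's methylation calls could be turned into a token sequence. It is an illustration only, not the scWGBS-GPT tokenizer: the three-token vocabulary, the majority-vote rule, and the names `encode_cell` and `coverage` are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical three-token vocabulary (an assumption for illustration;
# the actual CpG special tokens of scWGBS-GPT are not described above).
UNCOVERED, UNMETH, METH = 0, 1, 2


def encode_cell(
    cpg_positions: List[int],
    calls: Dict[int, Tuple[int, int]],
) -> List[int]:
    """Map every reference CpG position to one token id.

    `calls` maps position -> (methylated_reads, total_reads) for the sites
    this cell happens to cover. Sites missing from `calls`, or with zero
    reads, become UNCOVERED, so sparse coverage is explicit in the sequence.
    """
    tokens = []
    for pos in cpg_positions:
        meth, total = calls.get(pos, (0, 0))
        if total == 0:
            tokens.append(UNCOVERED)
        elif 2 * meth >= total:  # majority of reads methylated (ties count as methylated)
            tokens.append(METH)
        else:
            tokens.append(UNMETH)
    return tokens


def coverage(tokens: List[int]) -> float:
    """Fraction of CpG sites in the sequence that have an observed call."""
    if not tokens:
        return 0.0
    covered = sum(1 for t in tokens if t != UNCOVERED)
    return covered / len(tokens)


# Toy example: four reference CpG sites, two of them covered in this cell.
positions = [100, 150, 420, 900]
cell_calls = {100: (1, 1), 420: (0, 2)}
cell_tokens = encode_cell(positions, cell_calls)
cell_coverage = coverage(cell_tokens)
```

In this toy cell half of the sites carry the uncovered token. Under this framing, those are the positions a generative model would be asked to fill in when reconstructing a sparse methylation landscape.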