GIMMEcpg: Global Imputation of Mean CpG MEthylation in Real-time
Abstract
Whole-genome DNA methylation (methylome) analysis is of broad interest to biomedical research due to its central role in human development and disease. However, generating high-quality methylomes at scale remains challenging due to inherent technical limitations. While imputation has the potential to help overcome this problem, no existing approach adequately addresses the scaling issue.
Here, we present GIMMEcpg (Global Imputation of Mean CpG MEthylation), a novel imputation tool that scales efficiently from single samples to large cohort studies. GIMMEcpg uses a custom feature set built from known CpG sites within the same dataset to impute each missing value as the distance-weighted mean of the methylation values of the two immediately neighbouring CpG sites.
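As an illustration of the distance-weighted mean described above, the following is a minimal sketch and not the GIMMEcpg implementation. It assumes CpG sites on a single chromosome sorted by position, missing values encoded as NaN, and weights inversely proportional to genomic distance (equivalent to linear interpolation between the two nearest measured sites). The function name `impute_distance_weighted` and the exact weighting scheme are assumptions made for this example.

```python
import numpy as np

def impute_distance_weighted(positions, values):
    """Fill NaN methylation values from the nearest measured CpG on each side.

    Hypothetical helper: assumes `positions` is sorted and all sites lie on
    one chromosome. Sites lacking a measured neighbour on either side are
    left as NaN.
    """
    positions = np.asarray(positions, dtype=float)
    values = np.asarray(values, dtype=float)
    imputed = values.copy()

    known = np.flatnonzero(~np.isnan(values))   # indices of measured sites
    for i in np.flatnonzero(np.isnan(values)):  # indices of missing sites
        k = np.searchsorted(known, i)
        if k == 0 or k == len(known):
            continue  # no measured neighbour on one side
        left, right = known[k - 1], known[k]
        d_left = positions[i] - positions[left]
        d_right = positions[right] - positions[i]
        # Assumed weighting: the closer neighbour contributes more.
        w_left = d_right / (d_left + d_right)
        w_right = d_left / (d_left + d_right)
        imputed[i] = w_left * values[left] + w_right * values[right]
    return imputed

# A missing site 10 bp from a neighbour at 20% and 30 bp from one at 80%.
example = impute_distance_weighted([100, 110, 140], [0.2, np.nan, 0.8])
```

In this example the missing site sits three times closer to the left neighbour, so the left value receives a weight of 0.75 and the right value a weight of 0.25.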
We benchmarked GIMMEcpg for speed and accuracy against multiple imputation methods using downsampled datasets produced from high-quality (∼100x) Whole Genome Bisulfite Sequencing (WGBS) data. GIMMEcpg processed a 10x downsampled dataset and imputed 9.14 million CpG sites within 7 seconds (R: 0.78, MAE: +5.6%, RMSE: +10.9%). Our results demonstrate that GIMMEcpg is 39 to 2,562 times faster than three existing methylation imputation tools (BoostMe, DeepCpG, and MethImpute) while maintaining comparable accuracy.
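The accuracy measures quoted here (Pearson correlation R, mean absolute error, and root mean squared error) are standard statistics. The sketch below shows how they could be computed when imputed values are compared with held-out values from the high-coverage data. It is not the benchmarking code used in this work, and the function name `imputation_metrics` is an assumption. MAE and RMSE are returned in the same units as the input methylation values.

```python
import numpy as np

def imputation_metrics(truth, imputed):
    """Compare imputed methylation values against ground-truth values.

    Hypothetical helper: both inputs are equal-length sequences with no
    missing values, on the same scale (fractions or percentages).
    """
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    err = imputed - truth
    return {
        "pearson_r": float(np.corrcoef(truth, imputed)[0, 1]),
        "mae": float(np.mean(np.abs(err))),        # mean absolute error
        "rmse": float(np.sqrt(np.mean(err ** 2))),  # root mean squared error
    }

metrics = imputation_metrics([0.0, 0.5, 1.0], [0.1, 0.5, 0.9])
```

RMSE penalises large individual errors more heavily than MAE does, which is why the two are usually reported together.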
To quantify GIMMEcpg’s scalability, we applied it to the most extensive single collection of WGBS data (N=645 at variable coverage) from the EpiATLAS generated by the International Human Epigenome Consortium (IHEC). Using a single, standard CPU server, GIMMEcpg processed and imputed an additional 2.4 billion CpG methylation values across the 645 datasets in less than a day, enriching the EpiATLAS methylome resource by 20%. This demonstrates that GIMMEcpg scales to large cohort studies with only a minor impact on accuracy, as illustrated by our benchmark.
We also developed a machine learning variant, GIMMEcpg.ml, which delivers higher accuracy than existing methodologies. Using the same 10x downsampled benchmarking dataset, GIMMEcpg.ml achieved a Pearson correlation of 0.87 with the ground truth, an improvement of 0.11 over the best-performing alternative method. Additionally, GIMMEcpg.ml has a Mean Absolute Error (MAE) of 8.67%, which is 2.63% lower than that of the most accurate alternative. While this enhanced accuracy comes at the cost of increased computational requirements, GIMMEcpg.ml is a useful tool where higher accuracy is preferred over scalability.
For increased accessibility, GIMMEcpg is freely available under an MIT license as R and Python packages at https://github.com/ucl-medical-genomics/gimmecpg-r and https://github.com/ucl-medical-genomics/gimmecpg-python, respectively.