Complete Reference Genome and Pangenome Expand Biologically Relevant Information for Genome-Wide DNA Methylation Analysis Using Short-Read Sequencing and Array Data
Abstract
Background
The new complete telomere-to-telomere human genome assembly, T2T-CHM13, and the first draft of the human pangenome reference provide unique opportunities to update the reference genome for epigenetics investigations and clinical research. However, it is largely unclear how these reference genome updates may impact DNA methylation (DNAm) analysis.
Results
Using data from four commonly used short-read sequencing DNAm profiling methods, we found that T2T-CHM13 called an average of 7.4% more CpGs genome-wide than the previous GRCh38 assembly (range 5.4%–9.9% across samples and sequencing methods). This increase facilitated the discovery of 88 new differentially methylated CpGs within cancer driver genes in an epigenome-wide association study (EWAS) of colon cancer. Further, by aligning probe sequences from the commonly used and recently released Illumina DNAm arrays to T2T-CHM13 and GRCh38, we showed the enhanced utility of T2T-CHM13 for evaluating potential probe cross-reactivity (i.e., where probes match multiple regions) and mismatch (i.e., where probes do not perfectly match the target region), resulting in the identification of new and more reproducible sets of unambiguous probes (i.e., probes uniquely mapping to the target region) (HM450K, n = 430,719; EPIC, n = 777,491; EPICv2, n = 859,216). In EWASs of 24 cancer types, an average of 945 more differentially methylated CpG sites were identified in the new unambiguous probe set than in the GRCh38-based unambiguous probe set, with enrichments in cancer driver genes and cancer signaling pathways. Moreover, the pangenome called 4.5% more CpGs on average in short-read sequencing data than T2T-CHM13 and identified cross-population and population-specific unambiguous probes in DNAm arrays, owing to its improved representation of genetic diversity. These additional CpGs overlapped the promoters and gene bodies of various biologically and medically relevant genes, and pangenome-based unambiguous probes can potentially facilitate the discovery of DNAm alterations in more than 200 cancer driver genes in each cancer type.
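The three probe categories defined above (cross-reactive, mismatch, unambiguous) can be illustrated with a minimal Python sketch. It is not the study's pipeline. The input layout (one tuple per alignment hit), the function name `classify_probes`, the probe IDs, and the simplified decision rules are all assumptions made for illustration. In particular, any probe with more than one reported hit is treated here as cross-reactive, whereas the actual thresholds used in the study are not stated in this abstract.

```python
from collections import defaultdict


def classify_probes(hits, targets):
    """Label each probe as 'unambiguous', 'cross-reactive', or 'mismatch'.

    hits    : iterable of (probe_id, chrom, pos, n_mismatches) tuples,
              one per alignment hit reported for a probe (hypothetical layout).
    targets : dict mapping probe_id -> (chrom, pos) of the intended target.
    """
    # Group all reported hits by probe.
    by_probe = defaultdict(list)
    for probe_id, chrom, pos, n_mismatches in hits:
        by_probe[probe_id].append((chrom, pos, n_mismatches))

    labels = {}
    for probe_id, target in targets.items():
        probe_hits = by_probe.get(probe_id, [])
        # Locations where the probe sequence matches the reference perfectly.
        perfect = [(c, p) for c, p, mm in probe_hits if mm == 0]
        if len(probe_hits) > 1:
            # Assumption: more than one hit anywhere means potential cross-reactivity.
            labels[probe_id] = "cross-reactive"
        elif perfect == [target]:
            # Exactly one hit, and it is a perfect match at the intended target.
            labels[probe_id] = "unambiguous"
        else:
            # No hit, an imperfect hit, or a single hit away from the target.
            labels[probe_id] = "mismatch"
    return labels


# Toy example with made-up probe IDs and coordinates.
example_hits = [
    ("cg01", "chr1", 100, 0),
    ("cg02", "chr2", 200, 0),
    ("cg02", "chr5", 900, 1),
    ("cg03", "chr3", 300, 2),
]
example_targets = {
    "cg01": ("chr1", 100),
    "cg02": ("chr2", 200),
    "cg03": ("chr3", 300),
    "cg04": ("chr4", 400),
}
example_labels = classify_probes(example_hits, example_targets)
```

Running the same classification once per reference (for example GRCh38, T2T-CHM13, or a pangenome-derived sequence set) and comparing the resulting label sets is one way to see how a more complete reference can change which probes count as unambiguous.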
Conclusions
Use of the T2T-CHM13 and pangenome references can benefit epigenome-wide association studies by including CpGs previously unobserved in short-read sequencing data and by improving the identification of unambiguous probes for DNAm arrays, thus expanding biologically relevant information. This study highlights the practical applications of T2T-CHM13 and the pangenome for genome biology and provides a basis for expanding epigenetics investigations.