Comparative Analysis of Six Correlation Metrics on Identifying DNA Co-Methylation Patterns
Abstract
DNA methylation is an important epigenetic event and is significantly associated with cancer. Different genomic sites tend to be co-methylated. It is unclear which correlation metrics should be used to study co-methylation and how different metrics perform in identifying highly co-methylated (HCM) sites. The impact of data features such as outliers, low variance, and data transformation (from B to M = logit(B)) is also unclear. We therefore conducted comprehensive analyses of six metrics: Pearson, Spearman, Kendall, Hoeffding, Distance, and Maximal Information Coefficient (MIC). Key findings are summarized below. First, the runtimes of the six metrics differed drastically and increased with sample size, especially for MIC. Spearman and Pearson were much faster than the other metrics when there were no missing data. Second, the numbers of HCM sites identified by the six metrics were very different. Pearson and Distance both identified more HCM sites and produced strongly similar results. However, these two metrics were susceptible to outliers and data transformation. They identified more HCM sites when using B values, but these sites tended to have outliers and lower variance. Third, Kendall and Hoeffding's scores were significantly lower than the other metrics’ correlation coefficients, making it difficult to identify HCM sites without a proper cutoff. Fourth, MIC required a large sample size to perform properly. Although it may detect unique correlation patterns, these patterns are difficult to interpret biologically. Finally, considering all factors together (runtime, outliers, low variance, and data transformation), Spearman performed relatively better for co-methylation analysis when there were no missing data.
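The outlier and transformation effects described above can be illustrated with a minimal Python sketch. It is not the analysis pipeline used in the study. The two methylation profiles are made-up toy values, the function names are hypothetical, and the base-2 logarithm in the logit is an assumption (the abstract only states M = logit(B)). The sketch compares Pearson and Spearman coefficients for two low-variance sites that share a single outlier sample, on both the B and M scales.

```python
import math

def b_to_m(b, eps=1e-6):
    """Convert a B value to an M value via the logit.
    Assumption: log base 2; the abstract only says M = logit(B)."""
    b = min(max(b, eps), 1 - eps)  # keep away from 0 and 1
    return math.log2(b / (1 - b))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(v):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy B values (made up): two low-variance sites, last sample is an outlier.
site_a_b = [0.90, 0.92, 0.91, 0.93, 0.89, 0.92, 0.90, 0.10]
site_b_b = [0.93, 0.90, 0.92, 0.89, 0.91, 0.90, 0.94, 0.12]

# The same sites on the M scale.
site_a_m = [b_to_m(v) for v in site_a_b]
site_b_m = [b_to_m(v) for v in site_b_b]

pearson_b = pearson(site_a_b, site_b_b)
pearson_m = pearson(site_a_m, site_b_m)
spearman_b = spearman(site_a_b, site_b_b)
spearman_m = spearman(site_a_m, site_b_m)

print(f"Pearson  B: {pearson_b:.3f}  M: {pearson_m:.3f}")
print(f"Spearman B: {spearman_b:.3f}  M: {spearman_m:.3f}")
```

In this toy case the single shared outlier pushes the Pearson coefficient close to 1, and the value changes between the B and M scales. The Spearman coefficient is negative and identical on both scales, because the logit is monotonic and leaves the ranks unchanged.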