A spatially-aware unsupervised pipeline to identify co-methylation regions in DNA methylation data
Abstract
DNA methylation (DNAm) plays a central role in modern epigenetic research; however, the high dimensionality of DNAm data, which comprise hundreds of thousands of spatially ordered probes, continues to present major analytical challenges. The multiple-testing burden in these data introduces redundancy and reduces statistical power, contributing to the limited reproducibility often observed in association studies. Moreover, DNAm probes frequently exhibit methylation patterns that are correlated with those of neighboring sites, reflecting underlying biological co-regulation and spatial dependence along the genome. Ignoring these spatial correlations can bias parameter and standard error estimates, inflate type I error rates, and obscure biologically meaningful effects. Existing methods for detecting methylation co-regulation and reducing the dimensionality of DNAm data typically rely on fixed distance or correlation thresholds and arbitrary hyperparameter settings that lack data adaptivity. In this study, we introduce SACOMA (Spatially-Aware Clustering for Co-Methylation Analysis), a flexible, data-driven, and unsupervised framework designed to identify co-methylated regions, that is, genomic regions in which adjacent sites show correlated methylation levels. SACOMA employs spatially constrained hierarchical clustering to group neighboring DNAm sites based on both spatial proximity and methylation similarity. A tunable, data-adaptive mixing parameter allows SACOMA to avoid rigid assumptions and remain robust to hyperparameter choices. Although developed for DNAm array data, SACOMA provides a generalizable framework applicable to any data exhibiting spatial dependence, enabling the identification of spatially correlated features across diverse domains. In extensive simulations, SACOMA demonstrated higher sensitivity than existing methods while maintaining effective false-positive control.
In population-level DNAm data analyses, SACOMA successfully identified biologically relevant co-regulated methylation regions with functional roles. Overall, SACOMA reduces the multiple-testing burden and enhances both the discovery and specificity of statistical associations, leading to improved reproducibility and more reliable biological inference.
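The general idea of spatially constrained hierarchical clustering with a mixing parameter can be illustrated with a small sketch. This is not the SACOMA implementation, which the abstract does not specify. It assumes probes sorted by genomic position, a fixed (not data-adaptive) mixing weight `alpha`, a dissimilarity of the form alpha * (scaled distance) + (1 - alpha) * (1 - Pearson correlation), average linkage, and a stopping `threshold`. All function names, parameters, and data values below are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant probe: treat as uncorrelated (assumption)
    return sxy / sqrt(sxx * syy)

def mixed_dissimilarity(i, j, positions, probes, alpha, max_dist):
    """Blend spatial and methylation dissimilarity, each scaled to [0, 1]."""
    spatial = min(abs(positions[i] - positions[j]) / max_dist, 1.0)
    meth = min(max(1.0 - pearson(probes[i], probes[j]), 0.0), 1.0)
    return alpha * spatial + (1.0 - alpha) * meth

def cluster_adjacent(positions, probes, alpha=0.5, max_dist=1000.0, threshold=0.3):
    """Agglomerative clustering in which only neighboring clusters may merge.

    positions: probe coordinates in base pairs, sorted ascending.
    probes:    one list of per-sample methylation values per probe.
    Returns a list of clusters, each a list of contiguous probe indices.
    """
    clusters = [[i] for i in range(len(positions))]

    def linkage(a, b):
        # Average linkage over all cross-cluster probe pairs.
        total = sum(
            mixed_dissimilarity(i, j, positions, probes, alpha, max_dist)
            for i in a for j in b
        )
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Spatial constraint: only adjacent clusters are merge candidates.
        costs = [linkage(clusters[k], clusters[k + 1])
                 for k in range(len(clusters) - 1)]
        k = min(range(len(costs)), key=costs.__getitem__)
        if costs[k] > threshold:
            break  # the closest neighbors are too dissimilar to merge
        clusters[k:k + 2] = [clusters[k] + clusters[k + 1]]
    return clusters

# Toy data: two groups of three probes, six samples each (values are made up).
a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
b = [0.5, 0.1, 0.6, 0.2, 0.4, 0.3]
positions = [100, 150, 220, 5000, 5060, 5100]
probes = [
    a, [0.9 * v + 0.05 for v in a], [0.8 * v + 0.1 for v in a],
    b, [0.9 * v + 0.02 for v in b], [0.7 * v + 0.2 for v in b],
]

regions = cluster_adjacent(positions, probes)
print(regions)  # [[0, 1, 2], [3, 4, 5]]
```

In this toy example, the first three probes are close together and strongly correlated, as are the last three, so two regions are recovered. Setting `alpha` closer to 1 would weight genomic distance more heavily, and closer to 0 would weight methylation similarity more heavily; how SACOMA chooses this weight from the data is not described in the abstract.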