Optimizing effect sizes and specificity trumps machine learning when building DNA methylation reference panels for cell-type deconvolution

Xiaolong Guo
Andrew E Teschendorff

Abstract

Accurate cell-type deconvolution is critical for the correct interpretation of Epigenome-Wide Association Studies. Deconvolution estimates the underlying cell-type fractions in a sample using a DNA methylation (DNAm) reference panel built from sorted or single-cell DNAm data. Two competing approaches have emerged for building such reference panels: one that uses machine learning, and another based on optimizing effect size and cell-type specificity. Here we demonstrate that the latter approach is preferable. Because of the relatively small number of sorted samples used in building panels, standard machine learning does not optimize effect size and cell-type specificity, which causes the model to overfit and underperform when tested on independent data. Furthermore, adult blood panels built from cell-type-specific hypomethylated markers improve inference compared with panels built from hypermethylated ones. These insights provide important guidelines for optimizing the construction of future DNAm reference panels. To aid this task, we have added a function for building an optimized DNAm reference panel to our EpiDISH R package.
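As a rough illustration of what selecting markers by effect size and cell-type specificity can look like, the following Python sketch scores each CpG by the gap between the target cell type and the closest other cell type, and keeps only hypomethylated markers whose gap clears a threshold. This is not the EpiDISH function mentioned in the abstract, and it is not the authors' procedure: the function name `select_hypo_markers`, the `min_gap` threshold, the scoring rule, and the toy beta values are all assumptions made for the example.

```python
# Hypothetical toy data: mean DNAm beta values (0..1) per CpG and sorted
# cell type. CpG IDs and values are invented for illustration only.
REFERENCE = {
    "cg_a": {"B": 0.10, "T": 0.90, "Mono": 0.85},
    "cg_b": {"B": 0.80, "T": 0.15, "Mono": 0.90},
    "cg_c": {"B": 0.50, "T": 0.45, "Mono": 0.55},
    "cg_d": {"B": 0.95, "T": 0.90, "Mono": 0.30},
    "cg_e": {"B": 0.20, "T": 0.25, "Mono": 0.90},
}


def select_hypo_markers(reference, min_gap=0.5, top_n=50):
    """Return, per cell type, CpGs hypomethylated in that cell type only.

    The score of a CpG for a target cell type is the gap between the lowest
    beta among all other cell types and the beta in the target. A large gap
    requires both a large effect size and specificity, because every other
    cell type must sit well above the target. (Assumed scoring rule.)
    """
    cell_types = sorted({ct for betas in reference.values() for ct in betas})
    markers = {}
    for target in cell_types:
        scored = []
        for cpg, betas in reference.items():
            others = [b for ct, b in betas.items() if ct != target]
            gap = min(others) - betas[target]
            if gap >= min_gap:
                scored.append((gap, cpg))
        # Largest gap first; ties broken by CpG ID for reproducibility.
        scored.sort(key=lambda item: (-item[0], item[1]))
        markers[target] = [cpg for _, cpg in scored[:top_n]]
    return markers


markers = select_hypo_markers(REFERENCE, min_gap=0.5)
print(markers)
```

In this toy panel, `cg_e` is hypomethylated in both B and T cells, so it fails the specificity requirement for either cell type despite its large difference from monocytes, while `cg_c` is dropped for its small effect size.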

Version published to 10.1101/2025.08.20.671293 on bioRxiv
Aug 23, 2025
