Imputation strategy for population DNA methylation sequencing data

Abstract

Background: DNA methylation is a central epigenetic mechanism involved in regulating gene expression and responses to environmental factors. Although methylation marks can sometimes be transmitted across generations, their heritability varies with species and biological context. These characteristics make DNA methylation a key marker for studying genotype-environment interactions. However, whole-genome sequencing for DNA methylation analysis remains costly when applied to large numbers of individuals, prompting researchers to focus on specific regions of interest. This targeted approach often yields data matrices with missing values for some individuals, which can hinder downstream analyses.

Results: Our study used 200 poplar and 189 oak individuals from natural populations. We tested and compared seven methods for missing-data imputation in the specific context of targeted DNA methylation sequencing data, obtained for the three DNA methylation contexts found in plants (CpG, CHG, and CHH). Comparing the imputation results allowed us to evaluate the performance of each method and to determine the most suitable approach for this type of data. Among the seven methods, NIPALS, MissForest, and LOESS provided the highest accuracy. NIPALS delivered the best overall performance but at a moderate computational cost, MissForest achieved similar accuracy with faster computation, and LOESS offered competitive results suitable for large datasets.

Conclusions: Our results provide a reference for selecting imputation strategies in targeted sequencing studies, improving the reliability of DNA methylation analyses and broadening the applicability of this type of data in epigenomic research.
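One common way to compare imputation methods on a matrix of this kind is to hide a fraction of the observed values, impute them, and score the imputed values against the hidden truth. The sketch below illustrates that general idea only; it is not the authors' pipeline. The toy matrix, the 10% masking fraction, the RMSE metric, and the per-site mean imputer (a simple stand-in, not NIPALS, MissForest, or LOESS) are all assumptions made for illustration.

```python
import math
import random


def mask_values(matrix, fraction, rng):
    """Hide a random fraction of the observed entries.

    Returns a masked copy of the matrix and the list of hidden (row, col) positions.
    """
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, value in enumerate(row) if value is not None]
    hidden = rng.sample(observed, max(1, int(len(observed) * fraction)))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_site_mean(matrix):
    """Stand-in imputer: fill each missing entry with the mean of its column (site)."""
    filled = [list(row) for row in matrix]
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix if row[j] is not None]
        mean = sum(column) / len(column) if column else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = mean
    return filled


def rmse(truth, imputed, positions):
    """Root-mean-square error restricted to the artificially hidden positions."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in positions]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(0)
# Hypothetical toy data: 20 individuals x 5 sites, methylation levels in [0, 1].
truth = [[round(rng.random(), 3) for _ in range(5)] for _ in range(20)]
masked, hidden = mask_values(truth, 0.1, rng)
imputed = impute_site_mean(masked)
score = rmse(truth, imputed, hidden)
print(f"RMSE on {len(hidden)} hidden values: {score:.3f}")
```

Any other imputer can be substituted for `impute_site_mean` and scored on the same hidden positions, which keeps the comparison between methods like for like.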
