Imputation and Maximum Likelihood Haplotype Refinement of Simulated Ancient Mitochondrial Genomes

Abstract

Background: Mitochondrial DNA (mtDNA) has long served as a foundational target in ancient DNA (aDNA) and palaeogenomic research, owing to its high copy number and well-resolved phylogenetic structure. Yet external taphonomic and diagenetic factors, including burial environment, hydrolytic and oxidative damage, microbial colonization, and soil chemistry, promote molecular fragmentation. These factors complicate haplotype determination and raise uncertainty about the minimal sequencing coverage needed for reliable haplogroup assignment and phylogenetic inference. Although aDNA studies often apply thresholds between 2x and 10x depth of coverage, a systematic assessment for mtDNA quality has not yet been undertaken. Moreover, while genotype imputation is routinely employed to recover missing data from genomic DNA, its accuracy for ancient mtDNA remains largely untested.

Results: Here, we compiled a reference panel of 46,791 complete human mtDNA genomes and simulated aDNA degradation across coverage depths from 0.25x to 15x (n=3500 mtDNA simulations) using gargammel. Simulated paired-end FASTQ files were processed with the EAGER pipeline (v2.5.2), consensus sequences were classified in Haplogrep3, and the sequences were then imputed using our novel hidden Markov model (HMM)-based pipeline, MAVEN, alongside an existing k-Nearest Neighbor (kNN)-based imputation tool, MitoImp. Analyses reveal that a mean depth of 10x or breadth of coverage of 88% is necessary for robust haplogroup assignment in whole mtDNA genomes, with minimal gains in correctness of assignment at coverages greater than 10x. In addition, MAVEN consistently outperformed MitoImp at ultra-low coverage (<2x), particularly under more stringent criteria for correct assignment, specifically a Haplogrep3 quality score ≥0.90. Nonetheless, absolute probabilities of correct haplotype classification remained modest at the sub-cluster level, highlighting the inherent difficulties of imputing low-coverage haploid genomes.

Conclusion: Our findings establish the first comprehensive evaluation of coverage thresholds for mtDNA analysis and underscore the limitations of applying imputation to highly degraded mtDNA. Our results suggest that a minimum depth of coverage of 10x and a breadth of coverage of at least 88% (i.e., no more than 12% missing nucleotides) are required for accurate haplogroup assessment, and that HMM-based models outperform unsupervised kNN models at mtDNA imputation.
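To make the two coverage metrics concrete, the following is a minimal sketch of how mean depth and breadth of coverage could be computed from per-position read depths and compared with the 10x and 88% thresholds reported above. It is not part of the authors' pipeline: the function name `coverage_summary`, the parameter `min_site_depth`, and the definition of a "covered" site as one with at least one read are assumptions made for illustration. Because the abstract states the thresholds with "or" in one place and "and" in another, the sketch reports each check separately rather than combining them.

```python
from typing import Dict, Sequence, Union

MIN_MEAN_DEPTH = 10.0  # mean depth threshold reported in the abstract (10x)
MIN_BREADTH = 0.88     # breadth threshold reported in the abstract (88%)


def coverage_summary(
    depths: Sequence[int],
    min_site_depth: int = 1,  # assumption: a site counts as covered at >= 1 read
) -> Dict[str, Union[float, bool]]:
    """Summarise per-position depths for one mtDNA genome.

    `depths` holds one read-depth value per reference position
    (hypothetical input, e.g. parsed from a per-base depth table).
    """
    if not depths:
        raise ValueError("depths must contain at least one position")

    n_sites = len(depths)
    # Mean depth: total aligned bases divided by the number of positions.
    mean_depth = sum(depths) / n_sites
    # Breadth: fraction of positions covered by at least `min_site_depth` reads.
    breadth = sum(1 for d in depths if d >= min_site_depth) / n_sites

    return {
        "mean_depth": mean_depth,
        "breadth": breadth,
        "meets_depth": mean_depth >= MIN_MEAN_DEPTH,
        "meets_breadth": breadth >= MIN_BREADTH,
    }


if __name__ == "__main__":
    # Toy example: 100 positions, 90 covered at 12x and 10 uncovered.
    toy_depths = [12] * 90 + [0] * 10
    print(coverage_summary(toy_depths))
```

In practice the depth vector would span the full mitochondrial reference (16,569 positions for the rCRS), and the per-site cutoff could be raised if a consensus caller requires more than one read to call a base.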
