Genome complexity, not ploidy, dictates long-read variant-calling accuracy
Abstract
Accurate characterization of genetic variation is fundamental to genomics. While long-read sequencing technologies promise to resolve complex genomic regions and improve variant detection, their application in polyploid and complex genomes remains challenging. Here, we systematically investigate the factors influencing variant-calling accuracy using long reads. Using human trio data with known variants to simulate different ploidy levels (diploid, tetraploid, hexaploid), we demonstrate that while variant sites can often be identified accurately, genotyping accuracy decreases significantly with increasing ploidy due to allelic-dosage uncertainty. This highlights a specific challenge in assigning correct allele counts in polyploids, even at high sequencing depth, that is separate from the initial variant discovery. We then assessed variant detection performance in genomes of varying complexity: the relatively simple diploid Fragaria vesca, the tetraploid Solanum tuberosum, and the highly repetitive diploid Zea mays. Our results reveal that overall variant-calling accuracy correlates more strongly with inherent genome complexity (e.g., repeat content) than with ploidy level alone. Furthermore, we identify a critical mechanism impacting variant discovery: structural variations between the reference and sample genomes, particularly those containing repetitive elements, induce spurious read mapping. This leads to false variant calls, a source of error that is distinct from, and more dominant than, allelic-dosage uncertainty. Our findings underscore the multifaceted challenges in long-read variant analysis and highlight the need for ploidy-aware genotypers, complexity-informed variant callers, and bias-aware mapping strategies to fully realize the potential of long reads in diverse organisms.
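To make the allelic-dosage problem concrete, the following is a minimal illustrative sketch and not the genotyping method used in this study. It assumes a simple binomial read-sampling model with a symmetric per-read error rate and a flat prior over dosages. The function name `dosage_posteriors` and the default `error_rate` of 0.01 are hypothetical choices made for illustration only.

```python
from math import comb


def dosage_posteriors(alt_reads, depth, ploidy, error_rate=0.01):
    """Posterior probability of each alt-allele dosage (0..ploidy).

    Assumptions (illustrative, not from the study): reads are sampled
    binomially, errors are symmetric, and all dosages are equally
    likely a priori.
    """
    likelihoods = []
    for dosage in range(ploidy + 1):
        frac = dosage / ploidy
        # Expected fraction of alt-supporting reads after read errors.
        p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
        likelihoods.append(
            comb(depth, alt_reads)
            * p_alt ** alt_reads
            * (1 - p_alt) ** (depth - alt_reads)
        )
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]


if __name__ == "__main__":
    # Same evidence (15 alt reads out of 30) interpreted at three ploidies.
    for ploidy in (2, 4, 6):
        post = dosage_posteriors(alt_reads=15, depth=30, ploidy=ploidy)
        best = max(range(ploidy + 1), key=lambda k: post[k])
        print(f"ploidy={ploidy} best_dosage={best} posterior={post[best]:.3f}")
```

Under these assumptions, the same read evidence supports the balanced genotype with decreasing confidence as ploidy rises, because neighbouring dosages predict increasingly similar alt-read fractions. The site itself is still clearly variant at every ploidy, which mirrors the distinction drawn above between variant discovery and dosage assignment.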