Benchmarking long-read variant calling in diploid and polyploid genomes: insights from human and plants


Abstract

Accurate characterization of genetic variation is fundamental to genomics. While long-read sequencing technologies promise to resolve complex genomic regions and improve variant detection, their application in complex genomes has not been well validated. Here, we systematically investigate the factors influencing variant calling accuracy using accurate long reads. Using human trio data with known variants to simulate variable ploidy levels (diploid, tetraploid, hexaploid), we demonstrate that while variant sites can often be identified accurately, genotyping accuracy decreases with increasing ploidy due to allelic dosage uncertainty. This highlights a specific challenge in polyploids: assigning correct allele counts remains difficult even at high depth, and this difficulty is separate from the initial variant discovery. We then assessed genotyping and variant detection performance in real genomes of varying complexity: the relatively simple diploid Fragaria vesca, the tetraploid Solanum tuberosum, and the highly repetitive diploid Zea mays. Our results reveal that overall variant calling accuracy is strongly influenced by inherent genome complexity (e.g., repeat content). Furthermore, we identify a critical mechanism impacting variant discovery: structural variations between the reference and sample genomes, particularly those containing repetitive elements, can induce spurious read mapping, an effect that is likely exacerbated by the length and accuracy of long reads. The resulting false variant calls constitute a distinct and more dominant source of error than allelic dosage uncertainty. Our findings underscore the multifaceted challenges in long-read variant analysis and highlight the need for ploidy-aware genotypers and bias-aware mapping strategies to fully realize the potential of long reads in diverse organisms.
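
To make the allelic dosage problem concrete, the following is a minimal sketch, not the method used in the preprint. It assumes a simple binomial read-sampling model, a symmetric per-read error rate of 0.01, and a flat prior over dosages; all function names are hypothetical. Under those assumptions, the expected ALT-read fractions of adjacent dosages move closer together as ploidy rises, so a maximum-likelihood dosage call at fixed depth becomes less reliable even when the variant site itself is obvious.

```python
from math import comb


def allele_fractions(ploidy, error=0.01):
    """Expected ALT-read fraction for each dosage 0..ploidy.

    Assumption: a symmetric per-read error rate (not taken from the preprint).
    """
    return [
        (d / ploidy) * (1 - error) + (1 - d / ploidy) * error
        for d in range(ploidy + 1)
    ]


def binom_pmf(k, n, p):
    """Probability of k ALT reads out of n when each read is ALT with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def call_dosage(alt_reads, depth, ploidy, error=0.01):
    """Maximum-likelihood ALT dosage under a flat prior (an illustrative choice)."""
    fracs = allele_fractions(ploidy, error)
    likelihoods = [binom_pmf(alt_reads, depth, f) for f in fracs]
    return max(range(ploidy + 1), key=likelihoods.__getitem__)


def expected_accuracy(depth, ploidy, error=0.01):
    """Probability that the call equals the true dosage, averaged over
    heterozygous dosages 1..ploidy-1."""
    fracs = allele_fractions(ploidy, error)
    het_dosages = range(1, ploidy)
    total = 0.0
    for true_d in het_dosages:
        # Sum the probability of every read count that yields the correct call.
        total += sum(
            binom_pmf(k, depth, fracs[true_d])
            for k in range(depth + 1)
            if call_dosage(k, depth, ploidy, error) == true_d
        )
    return total / len(het_dosages)


if __name__ == "__main__":
    for ploidy in (2, 4, 6):
        print(ploidy, round(expected_accuracy(30, ploidy), 3))
```

Running the script prints the expected heterozygous genotyping accuracy at 30x depth for ploidy 2, 4, and 6. In this toy model the value falls as ploidy increases, which mirrors the trend described in the abstract without reproducing its data or pipeline.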