Benchmarking Framework to Catalyze Individual Human Genome Projects



Abstract

Individual human genome projects still aim for chromosome-level gapless assemblies, which rely on high-coverage reads from multiple long-read sequencing platforms and on multiple assembly pipelines. Moreover, these assemblies depend on DNA derived from primary cell lines, which makes such projects prohibitively expensive to scale for individual genome initiatives and limits their use in clinical applications.

Over the past decades, genome assembly quality has advanced remarkably, from draft assemblies in the early 2010s, to chromosome-level assemblies using error-prone long reads in the late 2010s, to the recent T2T gapless assemblies enabled by high-quality next-generation long-read technologies. Even so, a systematic evaluation of the trade-offs among assemblies obtained at various coverages, starting at 3x, from a single long-read sequencing platform is critical for developing a cost-effective and practical strategy for catalyzing individual genome initiatives.

Here, by assembling contigs at various coverage levels through downsampling of existing PacBio HiFi reads from three individuals, we demonstrate that high-quality assemblies can be achieved at approximately 12x coverage. Quality is measured by standard assembly metrics and by DNA-level linearity relative to a reference across most chromosomes, a measure developed in-house. Interestingly, assembly metrics, including BUSCO scores and DNA-level linearity, begin to saturate at coverages as low as 6x, suggesting minimal trade-offs. Furthermore, we show that known structural variants (e.g., the 8p23.1 inversion) can be reliably identified even at 6x coverage.
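The downsampling step described above comes down to simple arithmetic: coverage is total read bases divided by genome size, and the fraction of reads to keep is the target coverage divided by the current coverage. The following is a minimal sketch of that calculation, not the authors' pipeline. The function names, the 3.1 Gb genome size, the 15 kb read length, and the simulated read set are all illustrative assumptions.

```python
import random

# Assumed haploid human genome size in bases (illustrative, not from the abstract).
ASSUMED_GENOME_SIZE = 3_100_000_000


def coverage(total_read_bases, genome_size=ASSUMED_GENOME_SIZE):
    """Mean sequencing depth: total sequenced bases divided by genome size."""
    return total_read_bases / genome_size


def downsample_fraction(total_read_bases, target_coverage,
                        genome_size=ASSUMED_GENOME_SIZE):
    """Fraction of reads to keep so the expected coverage equals the target."""
    current = coverage(total_read_bases, genome_size)
    if target_coverage <= 0:
        raise ValueError("target coverage must be positive")
    if target_coverage > current:
        raise ValueError("cannot downsample to more than the current coverage")
    return target_coverage / current


def subsample_reads(read_lengths, fraction, seed=0):
    """Keep each read independently with probability `fraction` (seeded)."""
    rng = random.Random(seed)
    return [length for length in read_lengths if rng.random() < fraction]


if __name__ == "__main__":
    # Hypothetical read set: uniform 15 kb reads at roughly 30x depth.
    n_reads = int(30 * ASSUMED_GENOME_SIZE / 15_000)
    total_bases = n_reads * 15_000
    for target in (3, 6, 12):
        frac = downsample_fraction(total_bases, target)
        print(f"{target}x: keep {frac:.3f} of reads")
```

Per-read random sampling reaches the target only in expectation, so the achieved coverage of each subset should be recomputed from the retained reads before assembly.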

Together, these results suggest that cost-effective strategies for advancing individual genome initiatives can be developed, potentially using PacBio HiFi reads from a single SMRT cell per human genome.
