Tile-X: A vertex reordering approach for scalable long read assembly
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Traditional approaches for long read assembly compute overlapping reads and subsequently use that overlap information to assemble the contigs. Inherent to this approach is the subproblem of ordering the reads as per their (unknown) genomic positions of origin. However, existing approaches are not designed to explicitly target computation of this true ordering during the assembly process; instead the ordering information becomes available only after the assembly is complete. In this paper, we posit that prior computing of a reliable read ordering, even if imperfect, can significantly reduce the computational burden of the assembly process, preserve assembly quality, and enhance parallel scalability. Specifically, we present Tile-X, a novel graph-theoretic vertex reordering-centric approach to compute long read assemblies. The main idea of the approach is to efficiently compute an overlap graph first, use the overlap graph to (re)order the reads (vertices of the graph), and use that ordering to generate a partitioned parallel assembly. We test this idea with two classes of vertex reordering schemes: a) one that uses standard graph vertex reordering schemes that maximize graph locality or bandwidth measures; and b) another class where we custom define a sparsified reordering scheme that exploits sequence characteristics of the underlying graph to reduce the memory and time-footprint for the final assembly step. Using experiments on a combination of real-world and simulated PacBio High Fidelity (HiFi) long reads generated from real genomes, we demonstrate that the Tile-X approach is able to achieve substantial improvements over state-of-the-art long read assemblers, in memory efficiency and runtime, while preserving assembly quality metrics such as NGA50 and largest alignment. On average, across all the inputs, Tile-X achieved an NGA50 between 1.06x and 2.1x larger than state-of-the-art assemblers we compared with, while reducing runtime by up to 3.5x and memory consumption up to 3.3x. Keywords: long read assembly, vertex reordering, sparsified reordering, parallel genome assembly, farthest neighbor