Tile-X: A vertex reordering approach for scalable long read assembly

Oieswarya Bhowmik
Ananth Kalyanaraman

Read the full article

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

Traditional approaches for long read assembly compute overlapping reads and subsequently use that overlap information to assemble the contigs. Inherent to this approach is the subproblem of ordering the reads as per their (unknown) genomic positions of origin. However, existing approaches are not designed to explicitly target computation of this true ordering during the assembly process; instead the ordering information becomes available only after the assembly is complete. In this paper, we posit that prior computing of a reliable read ordering, even if imperfect, can significantly reduce the computational burden of the assembly process, preserve assembly quality, and enhance parallel scalability. Specifically, we present Tile-X, a novel graph-theoretic vertex reordering-centric approach to compute long read assemblies. The main idea of the approach is to efficiently compute an overlap graph first, use the overlap graph to (re)order the reads (vertices of the graph), and use that ordering to generate a partitioned parallel assembly. We test this idea with two classes of vertex reordering schemes: a) one that uses standard graph vertex reordering schemes that maximize graph locality or bandwidth measures; and b) another class where we custom define a sparsified reordering scheme that exploits sequence characteristics of the underlying graph to reduce the memory and time-footprint for the final assembly step. Using experiments on a combination of real-world and simulated PacBio High Fidelity (HiFi) long reads generated from real genomes, we demonstrate that the Tile-X approach is able to achieve substantial improvements over state-of-the-art long read assemblers, in memory efficiency and runtime, while preserving assembly quality metrics such as NGA50 and largest alignment. On average, across all the inputs, Tile-X achieved an NGA50 between 1.06x and 2.1x larger than state-of-the-art assemblers we compared with, while reducing runtime by up to 3.5x and memory consumption up to 3.3x. Keywords: long read assembly, vertex reordering, sparsified reordering, parallel genome assembly, farthest neighbor

Version published to 10.1101/2025.04.21.649853v1 on bioRxiv
Apr 29, 2025

FEDRANN: effective overlap graph construction based on dimensionality reduction and approximate nearest neighbors

This article has 12 authors:
1. Jia-Yuan Zhang
2. Changjiu Miao
3. Teng Qiu
4. Xiaoshuang Xia
5. Lei He
6. Junyi He
7. Yuhui Sun
8. Tao Zeng
9. Yuxiang Li
10. Xun Xu
11. Yijun Ruan
12. Yuliang Dong
This article has no evaluationsLatest version Jun 2, 2025
De Bruijn Graph Partitioning for Scalable and Accurate DNA Storage Processing

This article has 3 authors:
1. Florestan De Moor
2. Olivier Boullé
3. Dominique Lavenier
This article has no evaluationsLatest version May 23, 2025
Movi Color: fast and accurate long-read classification with the move structure

This article has 4 authors:
1. Steven Tan
2. Sina Majidian
3. Ben Langmead
4. Mohsen Zakeri
This article has no evaluationsLatest version May 27, 2025

Listed in

Abstract

Article activity feed

Related articles

FEDRANN: effective overlap graph construction based on dimensionality reduction and approximate nearest neighbors

De Bruijn Graph Partitioning for Scalable and Accurate DNA Storage Processing

Movi Color: fast and accurate long-read classification with the move structure