Fast and Scalable Parallel External-Memory Construction of Colored Compacted de Bruijn Graphs with Cuttlefish 3
Abstract
The rapid growth of genomic data over the past decade has made scalable and efficient sequence analysis algorithms, particularly for constructing de Bruijn graphs and their colored and compacted variants, critical components of many bioinformatics pipelines. Colored compacted de Bruijn graphs condense repetitive sequence information, significantly reducing the data burden on downstream analyses such as assembly, indexing, and pan-genomics. However, these graphs must be constructed directly, because first constructing the original uncompacted graph is essentially infeasible at large scale. In this paper, we introduce Cuttlefish 3, a state-of-the-art parallel, external-memory algorithm for constructing (colored) compacted de Bruijn graphs. Cuttlefish 3 introduces novel algorithmic improvements that provide its scalability and speed, including optimizations that significantly speed up local contractions within subgraphs, a parallel algorithm based on parallel list-ranking to join local solutions, and a sparsification method that vastly reduces the amount of data required to compute the colored graph. Leveraging these algorithmic strategies along with algorithm engineering optimizations in the parallel and external-memory settings, Cuttlefish 3 demonstrates state-of-the-art performance, surpassing existing approaches in speed and scalability across various genomic datasets in both colored and uncolored scenarios.