extgfa: a low-memory on-disk representation of genome graphs
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
The representation of genomes and genomic sequences through graph structures has undergone a period of rapid development in recent years, particularly to accommodate the growing size of genome sequences that are being produced. Genome graphs have been employed extensively for a variety of purposes, including assembly, variance detection, visualization, alignment, and pangenomics. Many tools have been developed to work with and manipulate such graphs. However, the majority of these tools tend to load the complete graph into memory, which results in a significant burden even for relatively straightforward operations such as extracting subgraphs, or executing basic algorithms like breadth-first or depth-first search. In procedurally generated open-world games like Minecraft , it is not feasible to load the complete world into memory. Instead, a mechanism that keeps most of the world on disk and only loads parts when needed is necessary. Accordingly, the world is partitioned into chunks which are loaded or unloaded based on their distance from the player. Furthermore, to conserve memory, the system unloads chunks that are no longer in use based on the player’s movement direction, sending them back to the disk. In this paper, we investigate the potential of employing a similar mechanism on genome graphs. To this end, we have developed a proof-of-concept implementation, which we called “extgfa” (for external GFA). Our implementation applies a similar chunking mechanism to genome graphs, whereby only the necessary parts of the graphs are loaded and the rest stays on disk. We demonstrate that this proof-of-concept implementation improves the memory profile when running an algorithm such as BFS on a large graph, and is able to reduce the memory profile by more than one order of magnitude for certain BFS parameters.
Our implementation is written in Python and available on Github under the MIT license https://github.com/fawaz-dabbaghieh/extgfa
