Movi 2: Fast and Space-Efficient Queries on Pangenomes

Abstract

Space-efficient compressed indexing methods are critical for pangenomics and for avoiding reference bias. In the Movi study, we implemented the move structure index, highlighting its locality of reference and speed. However, Movi had a high memory footprint compared to other compressed indexes. Here we introduce Movi 2 and describe new methods that greatly reduce the size and memory footprint of move-structure-based indexes. The most compressed version of Movi 2 reduces the Movi index space footprint more than fivefold. We also introduce sampling approaches that enable trade-offs between query efficiency and space efficiency. To demonstrate, we show that Movi 2 achieves advantageous time and space trade-offs when applied to large pangenome collections, including both the first and second releases of the Human Pangenome Reference Consortium (HPRC) collection, the latter of which spans over 460 human haplotypes. We show that Movi 2 dominates prior methods on both speed and memory footprint, including both r-index-based methods and our previous move-structure-based method. The methods we developed for Movi 2 are publicly available at https://github.com/mohsenzakeri/Movi.
