FroM Superstring to Indexing: a space-efficient index for unconstrained k -mer sets using the Masked Burrows-Wheeler Transform (MBWT)

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The exponential growth of DNA sequencing data limits the searchable proportion of the data. In this context, tokenization of genomic data via their k -merization provides a path towards efficient algorithms for their compression and search. However, indexing even single k -mer sets still remains a significant bioinformatics challenge, especially if k -mer sets are sketched or subsampled. Here, we develop the FMSI index, a space-efficient data structure for unconstrained k -mer sets, based on approximated shortest superstrings and the Masked Burrows Wheeler Transform (MBWT), an adaptation of the BWT for masked superstrings. We implement this in a program called FMSI, and via extensive evaluations using prokaryotic pan-genomes, we show FMSI substantially improves space efficiency compared to the state of the art, while maintaining a competitive query time. Overall, our work demonstrates that superstring indexing is a highly general, parameter-free approach for modern k -mer sets, without imposing any constraints on their structure.

Article activity feed