FroM Superstring to Indexing: a space-efficient index for unconstrained k -mer sets using the Masked Burrows-Wheeler Transform (MBWT)

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The growing volumes and heterogeneity of genomic data call for scalable and versatile k -mer-set indexes. However, state-of-the-art indexes such as Spectral Burrows-Wheeler Transform (SBWT) and SSHash depend on long non-branching paths in de Bruijn graphs, which limits their efficiency for small k , sampled data, or high-diversity settings. Here, we introduce FMSI, a superstring-based index for arbitrary k -mer sets that supports efficient membership and compressed dictionary queries with strong theoretical guarantees. FMSI builds on recent advances in k -mer superstrings and uses the Masked Burrows-Wheeler Transform (MBWT), a novel extension of the classical BWT that incorporates position masking. Across a range of k values and dataset types – including genomic, pangenomic, and metagenomic – FMSI consistently achieves superior query space efficiency, using up to 2–3× less memory than state-of-the-art methods, while maintaining competitive query times. Only a space-optimized version of SBWT can match the FMSI’s footprint in some cases, but then FMSI is 2–3× faster. Our results establish superstring-based indexing as a robust, scalable, and versatile framework for arbitrary k -mer sets across diverse bioinformatics applications.

Article activity feed