FroM Superstring to Indexing: a space-efficient index for unconstrained k -mer sets using the Masked Burrows-Wheeler Transform (MBWT)
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
The growing volumes and heterogeneity of genomic data call for scalable and versatile k -mer-set indexes. However, state-of-the-art indexes such as Spectral Burrows-Wheeler Transform (SBWT) and SSHash depend on long non-branching paths in de Bruijn graphs, which limits their efficiency for small k , sampled data, or high-diversity settings. Here, we introduce FMSI, a superstring-based index for arbitrary k -mer sets that supports efficient membership and compressed dictionary queries with strong theoretical guarantees. FMSI builds on recent advances in k -mer superstrings and uses the Masked Burrows-Wheeler Transform (MBWT), a novel extension of the classical BWT that incorporates position masking. Across a range of k values and dataset types – including genomic, pangenomic, and metagenomic – FMSI consistently achieves superior query space efficiency, using up to 2–3× less memory than state-of-the-art methods, while maintaining competitive query times. Only a space-optimized version of SBWT can match the FMSI’s footprint in some cases, but then FMSI is 2–3× faster. Our results establish superstring-based indexing as a robust, scalable, and versatile framework for arbitrary k -mer sets across diverse bioinformatics applications.