Towards Efficient k- Mer Set Operations via Function-Assigned Masked Superstrings
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
The design of efficient dynamic data structures for large k -mer sets belongs to central challenges of sequence bioinformatics. Recent advances in compact k -mer set representations via Spectrum-Preserving String Sets (SPSS), culminating with the masked superstring framework, have provided data structures of remarkable space efficiency for wide ranges of k -mer sets. However, the possibility to perform set operations with the resulting indexes has remained limited due to the static nature of the underlying compact representations. Here, we develop f -masked superstrings, a concept combining masked superstrings with custom demasking functions f to enable k -mer set operations based on index merging. Combined with the FMSI index for masked superstrings, we obtain a memory-efficient k -mer membership index and compressed dictionary supporting set operations via Burrows-Wheeler Transform merging. The framework provides a promising theoretical solution to a pressing bioinformatics problem and highlights the potential of f -masked superstrings to become an elementary data type for k -mer sets.