Towards Efficient k- Mer Set Operations via Function-Assigned Masked Superstrings

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The design of efficient dynamic data structures for large k -mer sets belongs to central challenges of sequence bioinformatics. Recent advances in compact k -mer set representations via Spectrum-Preserving String Sets (SPSS), culminating with the masked superstring framework, have provided data structures of remarkable space efficiency for wide ranges of k -mer sets. However, the possibility to perform set operations with the resulting indexes has remained limited due to the static nature of the underlying compact representations. Here, we develop f -masked superstrings, a concept combining masked superstrings with custom demasking functions f to enable k -mer set operations based on index merging. Combined with the FMSI index for masked superstrings, we obtain a memory-efficient k -mer membership index and compressed dictionary supporting set operations via Burrows-Wheeler Transform merging. The framework provides a promising theoretical solution to a pressing bioinformatics problem and highlights the potential of f -masked superstrings to become an elementary data type for k -mer sets.

Article activity feed