Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for k -Mer Sets

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The exponential growth of genome databases calls for novel space-efficient algorithms for data compression and search. State-of-the-art approaches often rely on k -merization for data tokenization, yet efficiently representing and querying k -mer sets remains a significant challenge in bioinformatics. Our recent work has introduced the concept of masked superstring for compactly representing k -mer sets, designed without reliance on common structural assumptions on k -mer data. However, despite their compactness, the practicality of masked superstrings for set operations and membership queries was previously unclear. Here, we propose the f -masked superstring framework, which additionally integrates demasking functions f , enabling efficient k -mer set operations through concatenation. When combined with the FMS-index, a new index for f -masked superstrings based on a simplified FM-index, we obtain a versatile, compact data structure for k -mer sets. We demonstrate its power through the FMSI program, which, when evaluated on bacterial pan-genomic data, achieves memory savings of a factor of 3 to 10 compared to state-of-the-art single k -mer-set indexing methods such as SBWT and CBL. Our work presents a theoretical framework with promising practical advantages such as space-efficiency, demonstrating the potential of f -masked superstrings in k -mer-based methods as a generic data type.

Article activity feed