Fast Set Operations for Compact k -mer Sets
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
The k -mer spectrum of a set of sequences is the set of k -length substrings the sequences contain. This lossy representation of sequence content pervades modern genomics. Recently, the spectral Burrows-Wheeler transform (SBWT) has emerged as a space-efficient representation of k -spectra that also supports efficient k -mer lookup queries and, more generally, easy navigation of the de Bruijn graph of the k -spectrum. In this paper, we examine primitive set operations, such as intersection, union, and set difference, on SBWT-encoded k -spectra and show that these operations can be supported efficiently. Moreover, efficient merging leads directly to a new memory-efficient algorithm for SBWT construction, which was able to build the SBWT for the 661K bacterial dataset containing 88 billion distinct k -mers in 50 hours using 186 GiB of RAM and 112 GiB of disk space. Given the pervasiveness of k -mer sets in genomics and the continued rapid growth of genomic databases, our work opens the door to a wide array of future applications that manipulate and reason about genomic data by dealing directly with simultaneously compact and searchable k -mer set representations offered by the SBWT.
2012 ACM Subject Classification
Theory of computation → Design and analysis of algorithms
Digital Object Identifier
10.4230/LIPIcs.WABI.2026.
Supplementary Material
Software (Source Code) : https://github.com/LoreDepuydt/sbwt-set-operations
Funding
This work has benefited from funding from the French State under the France 2030 program, reference ANR-21-IDES-0006. The European Metropolis of Lille and the University of Lille are also acknowledged for their funding and support of the project WILL-CHAIRES-25-001-BOSSA.