Fast Set Operations for Compact k -mer Sets

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The k -mer spectrum of a set of sequences is the set of k -length substrings the sequences contain. This lossy representation of sequence content pervades modern genomics. Recently, the spectral Burrows-Wheeler transform (SBWT) has emerged as a space-efficient representation of k -spectra that also supports efficient k -mer lookup queries and, more generally, easy navigation of the de Bruijn graph of the k -spectrum. In this paper, we examine primitive set operations, such as intersection, union, and set difference, on SBWT-encoded k -spectra and show that these operations can be supported efficiently. Moreover, efficient merging leads directly to a new memory-efficient algorithm for SBWT construction, which was able to build the SBWT for the 661K bacterial dataset containing 88 billion distinct k -mers in 50 hours using 186 GiB of RAM and 112 GiB of disk space. Given the pervasiveness of k -mer sets in genomics and the continued rapid growth of genomic databases, our work opens the door to a wide array of future applications that manipulate and reason about genomic data by dealing directly with simultaneously compact and searchable k -mer set representations offered by the SBWT.

2012 ACM Subject Classification

Theory of computation → Design and analysis of algorithms

Digital Object Identifier

10.4230/LIPIcs.WABI.2026.

Supplementary Material

Software (Source Code) : https://github.com/LoreDepuydt/sbwt-set-operations

Funding

This work has benefited from funding from the French State under the France 2030 program, reference ANR-21-IDES-0006. The European Metropolis of Lille and the University of Lille are also acknowledged for their funding and support of the project WILL-CHAIRES-25-001-BOSSA.

Article activity feed