The open-closed mod-minimizer algorithm
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Sampling algorithms that deterministically select a subset of k -mers are an important building block in bioinformatics applications. For example, they are used to index large textual collections, like DNA, and to compare sequences quickly. In such applications, a sampling algorithm is required to select one k -mer out of every window of w consecutive k -mers. The folklore and most used scheme is the random minimizer that selects the smallest k -mer in the window according to some random order. This scheme is remarkably simple and versatile, and has a density (expected fraction of selected k -mers) of 2 / ( w + 1). In practice, lower density leads to faster methods and smaller indexes, and it turns out that the random minimizer is not the best one can do. Indeed, some schemes are known to approach optimal density 1 /w when k → ∞ , like the recently introduced mod-minimizer (Groot Koerkamp and Pibiri, WABI 2024).
In this work, we study methods that achieve low density when k ≤ w . In this small- k regime, a practical method with provably better density than the random minimizer is the miniception (Zheng et al., Bioinformatics 2021). This method can be elegantly described as sampling the smallest closed sycnmer (Edgar, PeerJ 2021) in the window according to some random order. We show that extending the miniception to prefer sampling open syncmers yields much better density. This new method – the open-closed minimizer – offers improved density for small k ≤ w while being as fast to compute as the random minimizer. Compared to methods based on decycling sets, that achieve very low density in the small- k regime, our method has comparable density while being computationally simpler and intuitive.
Furthermore, we extend the mod-minimizer to improve density of any scheme that works well for small k to also work well when k > w is large. We hence obtain the open-closed mod-minimizer , a practical method that improves over the mod-minimizer for all k .