Generating low-density minimizers
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Minimizers is the most popular k -mer selection scheme. It is used in many algorithms and data structures analyzing high-throughput sequencing data. In a minimizers scheme, the smallest k -mer by some predefined order is selected as the representative of a sequence window containing w consecutive k -mers, which results in overlapping windows often selecting the same k -mer. Minimizers that achieve the smallest number of selected k -mers over a random DNA sequence, termed the expected density, are desired for improved performance of high-throughput sequencing analyses. Yet, no method to date exists to generate minimizers that achieve minimum expected density. Moreover, existing selection schemes fail to achieve low density for values of k and w that are most practical for high-throughput sequencing algorithms and data structures. Here, we present GreedyMini , a novel greedy algorithm to generate minimizers with low expected density. Moreover, we present innovative techniques to transform minimizers from binary to larger alphabets and to larger k values, an extension of GreedyMini to generate minimizers that achieve low density for a particular DNA sequence, and efficient methods to calculate the exact expected density. We combine these innovations into GreedyMini +, a novel method to generate DNA minimizers for practical values of k and w . We demonstrate over various combinations of practical k and w values that GreedyMini + generates minimizers that achieve expected densities very close to a recent theoretical lower bound, and both expected and particular densities much lower compared to existing selection schemes. We expect GreedyMini + to improve the performance of many high-throughput sequencing algorithms and data structures and advance the research of k -mer selection schemes.