Generating low-density minimizers

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Minimizers is the most popular k -mer selection scheme. It is used in many algorithms and data structures analyzing high-throughput sequencing data. In a minimizers scheme, the smallest k -mer by some predefined order is selected as the representative of a sequence window containing w consecutive k -mers, which results in overlapping windows often selecting the same k -mer. Minimizers that achieve the smallest number of selected k -mers over a random DNA sequence, termed the expected density, are desired for improved performance of high-throughput sequencing analyses. Yet, no method to date exists to generate minimizers that achieve minimum expected density. Moreover, existing selection schemes fail to achieve low density for values of k and w that are most practical for high-throughput sequencing algorithms and data structures. Here, we present GreedyMini , a novel greedy algorithm to generate minimizers with low expected density. Moreover, we present innovative techniques to transform minimizers from binary to larger alphabets and to larger k values, an extension of GreedyMini to generate minimizers that achieve low density for a particular DNA sequence, and efficient methods to calculate the exact expected density. We combine these innovations into GreedyMini +, a novel method to generate DNA minimizers for practical values of k and w . We demonstrate over various combinations of practical k and w values that GreedyMini + generates minimizers that achieve expected densities very close to a recent theoretical lower bound, and both expected and particular densities much lower compared to existing selection schemes. We expect GreedyMini + to improve the performance of many high-throughput sequencing algorithms and data structures and advance the research of k -mer selection schemes.

Article activity feed