GreedyMini: Generating low-density DNA minimizers

Shay Golan
Ido Tziony
Matan Kraus
Yaron Orenstein
Arseny Shur

Abstract

Minimizers is the most popular k-mer selection scheme in algorithms and data structures analyzing high-throughput sequencing (HTS) data. In a minimizers scheme, the smallest k-mer by some predefined order is selected as the representative of a sequence window containing w consecutive k-mers, which results in overlapping windows often selecting the same k-mer. Minimizers that achieve the lowest frequency of selected k-mers over a random DNA sequence, termed the expected density, are desired for improved performance of HTS analyses. Yet, no method to date exists to generate minimizers that achieve minimum expected density. Moreover, for k and w values used by common HTS algorithms and data structures there is a gap between the densities achieved by existing selection schemes and a recent theoretical lower bound. Here, we present GreedyMini, a toolkit of methods to generate minimizers with low expected or particular density, to improve minimizers, to extend minimizers to larger alphabets, k, and w, and to measure the expected density of a given minimizer efficiently. We demonstrate over various combinations of k and w values, including those of popular HTS methods, that GreedyMini can generate DNA minimizers that achieve expected densities very close to the lower bound, and both expected and particular densities much lower compared to existing selection schemes. Additionally, we show that the k-mer rank-retrieval time by GreedyMini is comparable to that of common k-mer hash functions. We expect GreedyMini to improve the performance of many HTS algorithms and data structures and advance the research of k-mer selection schemes.
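To make the selection rule and the notion of density concrete, here is a minimal Python sketch. It is not GreedyMini's code or algorithm: the function names, the leftmost tie-breaking rule, the lexicographic order used in the demo, and the choice to measure density as selected positions divided by the total number of k-mers are all assumptions made for illustration.

```python
import random


def minimizer_positions(seq, k, w, rank):
    """Return the set of k-mer start positions selected by a minimizer scheme.

    `rank` maps a k-mer to a comparable value (hypothetical interface);
    ties inside a window are broken by taking the leftmost k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    ranks = [rank(km) for km in kmers]
    selected = set()
    # Each window holds w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (ranks[i], i))
        selected.add(best)
    return selected


def particular_density(seq, k, w, rank):
    """Fraction of k-mers in `seq` that are selected (assumed definition)."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w, rank)) / n_kmers


def lexicographic(kmer):
    """Simplest possible order: compare k-mers as plain strings."""
    return kmer


if __name__ == "__main__":
    # Estimate the expected density by sampling one long random DNA sequence.
    rng = random.Random(0)
    random_dna = "".join(rng.choice("ACGT") for _ in range(20_000))
    print(particular_density(random_dna, k=5, w=4, rank=lexicographic))
```

Swapping `lexicographic` for a different `rank` function changes which k-mers are selected and therefore the density; finding an order that drives this value down is the problem the abstract describes.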

Version published to 10.1101/2024.10.28.620726v3 on bioRxiv
Feb 3, 2025
Version published to 10.1101/2024.10.28.620726v2 on bioRxiv
Feb 2, 2025
Version published to 10.1101/2024.10.28.620726v1 on bioRxiv
Nov 2, 2024

Related articles

A k-mer-based estimator of the substitution rate between repetitive sequences

This article has 3 authors:
1. Haonan Wu
2. Antonio Blanca
3. Paul Medvedev
This article has no evaluations. Latest version: Jun 25, 2025

Sequence alignment with k-bounded matching statistics

This article has 4 authors:
1. Tommi Mäklin
2. Jarno N. Alanko
3. Elena Biagi
4. Simon J. Puglisi
This article has no evaluations. Latest version: May 26, 2025

OptiK: An Entropy-Driven Framework for Optimal k-mer Size Selection for Bacterial Genomics

This article has 1 author:
1. AJ Gutierrez-Escobar
This article has no evaluations. Latest version: May 26, 2025
