OptiK: An Entropy-Driven Framework for Optimal k-mer Size Selection for Bacterial Genomics

AJ Gutierrez-Escobar

Abstract

K-mer-based approaches have become fundamental to modern computational genomics (Zielezinski et al., 2017), underpinning tools for genome assembly, metagenomic classification, variant calling, and phylogenetic analysis. Despite this ubiquity, the k-mer size (k) is often chosen arbitrarily or heuristically, with little consideration of the signal quality that a given k yields for a given dataset. Here, I introduce OptiK, a novel alignment-free tool that evaluates the information richness of k-mer encodings across a range of k values to identify the optimal k for comparative analysis. OptiK constructs k-mer frequency matrices from genome collections, reduces their dimensionality via truncated singular value decomposition (SVD), and evaluates clustering structure with unsupervised metrics: the Silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index. I validate OptiK on a curated dataset of 1044 Helicobacter pylori genomes with well-characterized population structure. OptiK robustly identifies k = 8 as the optimal k-mer size, yielding latent structures in UMAP space that align with fineSTRUCTURE-defined subpopulations without relying on prior labels or reference alignments. These results demonstrate that OptiK provides a reproducible, alignment-free strategy for optimizing k-mer resolution in bacterial comparative genomics.
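
The first step described above, building a k-mer frequency matrix for each candidate k, can be illustrated with a minimal sketch. This is not OptiK's implementation: the function names, the toy sequences, the normalization to relative frequencies, and the decision to skip windows containing ambiguous bases are all assumptions made for illustration. Reverse-complement (canonical) k-mers are not handled, and the later steps (truncated SVD and the clustering metrics) are not shown.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k):
    """Return the relative frequency of every possible k-mer in one sequence.

    Hypothetical helper: the output has 4**k entries in lexicographic order.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fixed column order so that vectors from different genomes line up.
    columns = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Assumption: windows with ambiguous bases (e.g. N) are ignored,
    # so only pure A/C/G/T k-mers contribute to the total.
    total = sum(counts[c] for c in columns)
    return [counts[c] / total if total else 0.0 for c in columns]


def kmer_matrix(genomes, k):
    """Stack per-genome frequency vectors into a (genomes x 4**k) matrix."""
    return [kmer_frequencies(seq, k) for seq in genomes]


# Toy sequences standing in for real genomes (illustrative only).
toy_genomes = ["ACGTACGTAC", "TTTTACGTTT", "ACGNACGTAA"]

# One matrix per candidate k; a real run would scan a wider range of k.
matrices = {k: kmer_matrix(toy_genomes, k) for k in range(2, 5)}

for k, matrix in matrices.items():
    print(f"k={k}: {len(matrix)} genomes x {len(matrix[0])} k-mers")
```

The number of columns grows as 4^k, so for larger k a sparse representation would likely be needed. Each matrix would then be passed to the dimensionality-reduction and cluster-evaluation steps to compare candidate k values.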

Version published to 10.1101/2025.05.21.655412 on bioRxiv
May 26, 2025

Related articles

META-DIFF: a k-mer-based pipeline that detects differentially abundant sequences in metagenomics whole genome sequencing

This article has 8 authors:
1. Louis-Maël Guéguen
2. Alban Mathieu
3. Simon Pelletier
4. Anthony Woo
5. Namita Misra
6. Magali Moreau
7. Olivier Perin
8. Arnaud Droit
This article has no evaluations
Latest version Jan 29, 2026

Retrieval-Based AI Framework for Viral Genomic Analysis

This article has 3 authors:
1. Ahmed M. Fahmy
2. Melissa Ayad
3. Hassan M. Ahmed
This article has no evaluations
Latest version Jan 29, 2026

Reframing Population Genetic Structure as a Quantum Optimization Problem

This article has 1 author:
1. Andrew Davinack
This article has no evaluations
Latest version Dec 24, 2025
