A k-mer-based estimator of the substitution rate between repetitive sequences

Haonan Wu
Antonio Blanca
Paul Medvedev


Abstract

K-mer-based analysis of genomic data is ubiquitous, but the presence of repetitive k-mers continues to pose problems for the accuracy of many methods. For example, the Mash tool (Ondov et al., 2016) can accurately estimate the substitution rate between two low-repetitive sequences from their k-mer sketches; however, it is inaccurate on repetitive sequences such as the centromere of a human chromosome. Follow-up work by Blanca et al. (2021) has attempted to model how mutations affect k-mer sets, based on the strong assumptions that the sequence is non-repetitive and that mutations do not create spurious k-mer matches. However, the theoretical foundations for extending an estimator like Mash to work in the presence of repeat sequences have been lacking.
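For orientation, the baseline that this paragraph describes can be sketched as follows. This is a minimal illustration of the non-repetitive model only, not the repeat-aware estimator proposed in this work. It assumes the Mash distance formula D = -(1/k) ln(2j / (1 + j)) from Ondov et al. (2016) and the simple model in which a k-mer survives mutation with probability (1 - r)^k; all function names are hypothetical and chosen for this example.

```python
import math


def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers in seq (repeated k-mers collapse)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j: float, k: int) -> float:
    """Mash-style distance from a Jaccard value j and k-mer size k."""
    if j <= 0.0:
        return 1.0  # no shared k-mers: cap the distance at 1
    return -math.log(2.0 * j / (1.0 + j)) / k


def rate_from_shared_fraction(shared: int, total: int, k: int) -> float:
    """Estimate r from the fraction of k-mers that survive unchanged.

    Assumes (hypothetically, as in the non-repetitive model) that each
    k-mer survives with probability (1 - r)^k, so r = 1 - q^(1/k).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    q = shared / total
    return 1.0 - q ** (1.0 / k)
```

Note that `kmer_set("ACGTACGT", 4)` has only four elements even though the sequence has five k-mer positions. Collapsing repeated k-mers in this way is exactly where the non-repetitive assumption stops holding.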

In this work, we relax the non-repetitive assumption and propose a novel estimator for the mutation rate. We derive theoretical bounds on our estimator’s bias. Our experiments show that it remains accurate for repetitive genomic sequences, such as the alpha satellite higher order repeats in centromeres. We demonstrate our estimator’s robustness across diverse datasets and various ranges of the substitution rate and k-mer size. Finally, we show how sketching can be used to avoid dealing with large k-mer sets while retaining accuracy. Our software is available at https://github.com/medvedevgroup/Repeat-Aware_Substitution_Rate_Estimator .
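The abstract does not say which sketching scheme is used. As a generic illustration of how a sketch can stand in for a full k-mer set, the following assumes a bottom-s MinHash sketch in the style of Mash. The hash function, the function names, and the parameter `s` are assumptions made for this example and may differ from the released software.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(kmers: set[str], s: int) -> list[int]:
    """Keep the s smallest distinct hash values of the k-mer set."""
    return sorted({kmer_hash(x) for x in kmers})[:s]


def sketch_jaccard(sk_a: list[int], sk_b: list[int], s: int) -> float:
    """Estimate Jaccard similarity from two bottom-s sketches.

    Takes the s smallest hashes of the merged sketches and counts how
    many of them are present in both inputs.
    """
    merged = sorted(set(sk_a) | set(sk_b))[:s]
    if not merged:
        return 1.0
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

When `s` is at least as large as both k-mer sets, the estimate coincides with the exact Jaccard similarity. Smaller values of `s` trade accuracy for memory.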

Version published to 10.1101/2025.06.19.660607 on bioRxiv
Jun 25, 2025

Kappa-Frameshift Background Mutations and Long-Range Correlations of the DNA Base Sequences

This article has 1 author:
1. Elias Koorambas
This article has no evaluations.
Latest version: Dec 17, 2025
Genome-wide Statistical and Machine Learning Analysis of Long-Range Sequence Memory Across Human Chromosomes 1, 2, and 22

This article has 1 author:
1. khushboo rani
This article has no evaluations.
Latest version: Jan 6, 2026
META-DIFF: a k-mer-based pipeline that detects differentially abundant sequences in metagenomics whole genome sequencing

This article has 8 authors:
1. Louis-Maël Guéguen
2. Alban Mathieu
3. Simon Pelletier
4. Anthony Woo
5. Namita Misra
6. Magali Moreau
7. Olivier Perin
8. Arnaud Droit
This article has no evaluations.
Latest version: Jan 29, 2026
