Efficient reconciliation of genomic datasets of high similarity

Yoshihiro Shibuya
Djamal Belazzougui
Gregory Kucherov

Abstract

We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originating from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating the Jaccard similarity of the underlying k-mer sets than MinHash, a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that the data structures involved require space proportional to the difference between the k-mer sets and independent of the size of the sets themselves. As another application, we show how our ideas can be applied to efficiently compute (an approximation of) the k-mers that differ between two datasets, still using space proportional only to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus pneumoniae genomes).
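To make the space claim concrete, here is a minimal, hedged sketch of how an IBLT can recover the symmetric difference of two sets and how a Jaccard estimate follows from it. It is not the km-peeler implementation: the class `IBLT`, the function `jaccard_from_difference`, the hash choice (BLAKE2b) and the parameters `cells_per_part` and `num_hashes` are all assumptions made for illustration, integers stand in for encoded k-mers, and syncmer sampling is omitted. The Jaccard step uses the identity J = (|A| + |B| - |D|) / (|A| + |B| + |D|), where D is the symmetric difference.

```python
import hashlib


class IBLT:
    """Toy Invertible Bloom Lookup Table over non-negative integer keys."""

    def __init__(self, cells_per_part=32, num_hashes=4):
        self.m = cells_per_part          # cells in each partition
        self.r = num_hashes              # one partition per hash function
        n = self.m * self.r
        self.count = [0] * n             # signed number of keys per cell
        self.key_xor = [0] * n           # XOR of the keys in the cell
        self.chk_xor = [0] * n           # XOR of the key checksums in the cell

    @staticmethod
    def _h(key, salt):
        data = f"{salt}:{key}".encode()
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def _cells(self, key):
        # One cell per partition, so a key never lands twice in the same cell.
        return [i * self.m + self._h(key, i) % self.m for i in range(self.r)]

    def _update(self, key, delta):
        chk = self._h(key, "chk")
        for c in self._cells(key):
            self.count[c] += delta
            self.key_xor[c] ^= key
            self.chk_xor[c] ^= chk

    def insert(self, key):
        self._update(key, 1)

    def subtract(self, other):
        """Cell-wise difference; keys present in both tables cancel out."""
        out = IBLT(self.m, self.r)
        for c in range(self.m * self.r):
            out.count[c] = self.count[c] - other.count[c]
            out.key_xor[c] = self.key_xor[c] ^ other.key_xor[c]
            out.chk_xor[c] = self.chk_xor[c] ^ other.chk_xor[c]
        return out

    def peel(self):
        """Destructively list keys: (only_in_self, only_in_other, success)."""
        mine, theirs = set(), set()
        progress = True
        while progress:
            progress = False
            for c in range(self.m * self.r):
                sign, key = self.count[c], self.key_xor[c]
                # A cell is "pure" if it holds exactly one key, checked via checksum.
                if sign in (1, -1) and self.chk_xor[c] == self._h(key, "chk"):
                    (mine if sign == 1 else theirs).add(key)
                    self._update(key, -sign)   # remove the key from all its cells
                    progress = True
        ok = not (any(self.count) or any(self.key_xor) or any(self.chk_xor))
        return mine, theirs, ok


def jaccard_from_difference(size_a, size_b, diff_size):
    """Jaccard similarity from set sizes and symmetric-difference size."""
    inter = (size_a + size_b - diff_size) / 2
    union = (size_a + size_b + diff_size) / 2
    return inter / union if union else 1.0


# Integers stand in for encoded k-mers; the two sets differ in 8 elements.
a, b = set(range(1000)), set(range(5, 1003))
ta, tb = IBLT(), IBLT()
for x in a:
    ta.insert(x)
for x in b:
    tb.insert(x)
only_a, only_b, ok = ta.subtract(tb).peel()
if ok:
    print(jaccard_from_difference(len(a), len(b), len(only_a) + len(only_b)))
```

Both tables have the same fixed number of cells regardless of how many keys are inserted, so the table only needs to be large enough for the expected number of differing keys. If the difference is too large for the table, peeling stalls and `peel` reports failure instead of a wrong answer (up to checksum collisions).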

Available at: https://github.com/yhhshb/km-peeler.git

Version published to 10.1101/2022.06.07.495186v2 on bioRxiv
Jul 11, 2022
Version published to 10.1101/2022.06.07.495186v1 on bioRxiv
Jun 9, 2022

Related articles

A k-mer-based estimator of the substitution rate between repetitive sequences

This article has 3 authors:
1. Haonan Wu
2. Antonio Blanca
3. Paul Medvedev
This article has no evaluations
Latest version: Jun 25, 2025

Kaminari: a resource-frugal index for approximate colored k-mer queries

This article has 6 authors:
1. Victor Levallois
2. Yoshihiro Shibuya
3. Bertrand Le Gal
4. Rob Patro
5. Pierre Peterlongo
6. Giulio Ermanno Pibiri
This article has no evaluations
Latest version: May 21, 2025

FastGA: Fast Genome Alignment

This article has 3 authors:
1. Gene Myers
2. Richard Durbin
3. Chenxi Zhou
This article has no evaluations
Latest version: Jun 19, 2025