Efficient reconciliation of genomic datasets of high similarity

Abstract

We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originating from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating the Jaccard similarity of the underlying k-mer sets than MinHash, which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that the data structures involved require space proportional to the difference between the k-mer sets and independent of the size of the sets themselves. As another application, we show how our ideas can be applied to efficiently compute (an approximation of) the k-mers that differ between two datasets, still using space proportional only to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus pneumoniae genomes).

Available at: https://github.com/yhhshb/km-peeler.git
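
To make the central idea concrete, the following is a minimal sketch of how an IBLT can recover the symmetric difference of two k-mer sets and how a Jaccard estimate follows from the size of that difference. It is not the km-peeler implementation: all names (encode, kmer_set, IBLT, reconcile, jaccard_from_difference), the 2-bit k-mer encoding, the BLAKE2b-based hashing, and the table parameters are assumptions chosen for illustration, and syncmer sampling is omitted. For brevity both sets are fed into a single table; in practice each dataset would typically get its own table, and the two tables would be subtracted cell by cell.

```python
import hashlib


def _hash(key: int, seed: int) -> int:
    """Deterministic 64-bit hash of a 64-bit key (illustrative choice)."""
    digest = hashlib.blake2b(
        key.to_bytes(8, "little"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def encode(kmer: str) -> int:
    """2-bit encoding of a k-mer over ACGT (assumes k <= 32)."""
    code = 0
    for base in kmer:
        code = (code << 2) | "ACGT".index(base)
    return code


def kmer_set(seq: str, k: int) -> set:
    """Set of encoded k-mers of a sequence (no sampling applied here)."""
    return {encode(seq[i:i + k]) for i in range(len(seq) - k + 1)}


class IBLT:
    """Partitioned IBLT: one sub-table of cells per hash function."""

    def __init__(self, cells_per_hash: int = 64, num_hashes: int = 4):
        self.m = cells_per_hash
        self.r = num_hashes
        size = self.m * self.r
        self.count = [0] * size
        self.key_xor = [0] * size
        self.chk_xor = [0] * size

    def _update(self, key: int, delta: int) -> None:
        check = _hash(key, 0)  # seed 0 is reserved for the checksum
        for i in range(self.r):
            cell = i * self.m + _hash(key, i + 1) % self.m
            self.count[cell] += delta
            self.key_xor[cell] ^= key
            self.chk_xor[cell] ^= check

    def insert(self, key: int) -> None:
        self._update(key, 1)

    def delete(self, key: int) -> None:
        self._update(key, -1)

    def peel(self):
        """Repeatedly extract keys from pure cells.

        Returns (keys only inserted, keys only deleted, success flag).
        """
        only_a, only_b = set(), set()
        progress = True
        while progress:
            progress = False
            for cell in range(len(self.count)):
                sign = self.count[cell]
                key = self.key_xor[cell]
                if sign in (1, -1) and self.chk_xor[cell] == _hash(key, 0):
                    (only_a if sign == 1 else only_b).add(key)
                    self._update(key, -sign)  # remove the key from all its cells
                    progress = True
        empty = not (any(self.count) or any(self.key_xor) or any(self.chk_xor))
        return only_a, only_b, empty


def reconcile(set_a, set_b, cells_per_hash: int = 64, num_hashes: int = 4):
    """Insert A, delete B; shared keys cancel, so only the difference remains."""
    table = IBLT(cells_per_hash, num_hashes)
    for key in set_a:
        table.insert(key)
    for key in set_b:
        table.delete(key)
    return table.peel()


def jaccard_from_difference(size_a: int, size_b: int, diff_size: int) -> float:
    """J = |A and B| / |A or B|, written in terms of |A|, |B| and |A xor B|."""
    return (size_a + size_b - diff_size) / (size_a + size_b + diff_size)
```

Because shared k-mers cancel out, the number of cells only has to scale with the number of differing k-mers, not with the sizes of the sets. If peeling fails (the success flag is False), the table was too small for the actual difference, and a larger one would be needed.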
