BinDash 2.0: New MinHash Scheme Allows Ultra-fast and Accurate Genome Search and Comparisons

Jianshu Zhao
Xiaofei Zhao
Jean Pierre-Both
Konstantinos T. Konstantinidis

Abstract

Motivation

Comparing large numbers of genomes in terms of their genomic distance is becoming increasingly challenging as the number of microbial genomes deposited in public databases grows. Today, we may need to estimate pairwise distances between millions or even billions of genomes. Few software tools can perform such comparisons efficiently.

Results

Here we update the multi-threaded software BinDash by implementing several new MinHash algorithms and computational optimizations (e.g., Single Instruction, Multiple Data, SIMD) for ultra-fast and accurate genome search and comparisons at trillion scale. Specifically, we implemented b-bit one-permutation rolling MinHash with optimal or faster densification, accelerated with SIMD. With BinDash 2, we can perform 0.1 trillion (∼10^11) pairwise genome comparisons in about 1.8 hours on a decent computer cluster, or in several hours on a personal laptop, an improvement of ∼50% or more over the original version. The ANI (average nucleotide identity) estimated by BinDash correlates well with that of accurate but much slower ANI estimators such as FastANI or alignment-based ANI. In line with the findings from comparing 90K genomes (∼10^9 comparisons) via FastANI, the 85% to 95% ANI gap also holds in our study of ∼10^11 prokaryotic genome comparisons via BinDash 2, indicating that fundamental ecological and evolutionary forces keep species-like units (e.g., > 95% ANI) together.
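
To illustrate the general idea behind b-bit one-permutation MinHash with densification, here is a minimal Python sketch. It is not BinDash's implementation: the function names, the use of BLAKE2b as the hash, the simple rehash-until-non-empty densification rule, the simplified b-bit collision correction, and the Mash-style Jaccard-to-ANI conversion are all assumptions chosen for brevity. It also omits rolling hashing, canonical k-mers, and SIMD.

```python
import hashlib
import math


def _hash64(data: bytes) -> int:
    """64-bit hash of a byte string (BLAKE2b is an illustrative choice)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def kmer_hashes(seq: str, k: int) -> set:
    """Hash every k-mer of seq once (no canonical k-mers, for brevity)."""
    return {_hash64(seq[i:i + k].encode()) for i in range(len(seq) - k + 1)}


def oph_sketch(hashes: set, m: int, b: int) -> list:
    """One-permutation sketch with m bins, densified, keeping b bits per bin."""
    if not hashes:
        raise ValueError("cannot sketch an empty k-mer set")
    bins = [None] * m
    for h in hashes:
        idx = (h * m) >> 64  # the high bits of the hash select the bin
        if bins[idx] is None or h < bins[idx]:
            bins[idx] = h  # keep the minimum hash seen in each bin
    dense = []
    for i in range(m):
        # Hypothetical densification rule: an empty bin borrows from a
        # pseudo-randomly chosen bin, retrying until a non-empty one is hit.
        # The choice depends only on (i, attempt), so two sketches agree on it.
        j, attempt = i, 0
        while bins[j] is None:
            attempt += 1
            j = _hash64(f"{i}:{attempt}".encode()) % m
        dense.append(bins[j])
    mask = (1 << b) - 1
    return [v & mask for v in dense]  # b-bit truncation: keep the low b bits


def jaccard_estimate(s1: list, s2: list, b: int) -> float:
    """Matching-bin fraction, corrected for chance b-bit collisions.

    Uses the simplified approximation P = J + (1 - J) / 2^b.
    """
    p = sum(x == y for x, y in zip(s1, s2)) / len(s1)
    c = 1.0 / (1 << b)
    return max(0.0, (p - c) / (1.0 - c))


def ani_from_jaccard(j: float, k: int) -> float:
    """Mash-style conversion (an assumption): ANI ~ 1 + ln(2J / (1 + J)) / k."""
    if j <= 0.0:
        return 0.0
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)
```

In use, each genome would be sketched once, for example `oph_sketch(kmer_hashes(genome, 16), m=2048, b=14)` with illustrative parameter values, and any two sketches built with the same `m` and `b` can then be compared with `jaccard_estimate` followed by `ani_from_jaccard`. Because a comparison reduces to counting equal fixed-width integers across two arrays, this is the kind of step that lends itself to SIMD acceleration.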

Availability and implementation

BinDash is released under the Apache 2.0 license at: https://github.com/zhaoxiaofei/bindash

Contact

kostas.konstantinidis@gatech.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

Version published to 10.1101/2024.03.13.584875 on bioRxiv
Mar 14, 2024
