High-Accuracy, Ultrafast DNA Barcode Identification via Statistical Sketching and Approximate Nearest Neighbor Search
Abstract
High-throughput DNA barcoding, a cornerstone of modern biodiversity and environmental genomics, is critically limited by the computational cost of traditional, alignment-based identification methods. While faster alignment-free approaches have been proposed, first-generation techniques based on k-mer hashing are fundamentally unreliable due to their inherent sensitivity to insertions and deletions (indels), a common form of sequence variation. Here, we introduce DNA-Sketch, a novel alignment-free framework that overcomes this limitation. DNA-Sketch transforms a DNA sequence into a robust statistical fingerprint by vectorizing its binned dinucleotide frequencies. These high-dimensional “sketches” are then indexed for ultrafast similarity search using an Approximate Nearest Neighbor (ANN) library. We benchmarked a single-pass sketch and a “Multi-Sketch Ensemble” against the state-of-the-art aligner VSEARCH on a large, challenging benchmark simulating real-world noise and intra-species variation. The Multi-Sketch Ensemble achieved 100% accuracy, perfectly matching VSEARCH, while delivering a 31-fold speed improvement. The single-pass sketch achieved 99.98% accuracy with a 95-fold speedup. DNA-Sketch resolves the classic speed-versus-accuracy trade-off, demonstrating that by pairing robust feature extraction with high-performance ANN indexing, it is possible to achieve the accuracy of gold-standard alignment at a fraction of the computational cost, providing a powerful and highly scalable solution for modern bioinformatics.
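The two steps named in the abstract, vectorizing binned dinucleotide frequencies and then searching for the nearest reference sketch, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The bin count, the per-bin normalization, the cosine metric, and the function names `sketch` and `nearest` are all assumptions made for illustration. The abstract does not name its ANN library, so an exact brute-force search stands in for the ANN index here.

```python
from itertools import product
from math import sqrt

# Map each of the 16 dinucleotides (AA, AC, ..., TT) to a vector position.
DINUC_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=2))}


def sketch(seq, n_bins=4):
    """Return a unit-length vector of per-bin dinucleotide frequencies.

    The sequence is split into n_bins positional bins (an assumed scheme);
    each bin contributes 16 frequencies, so the result has n_bins * 16 values.
    """
    seq = seq.upper()
    n_pairs = len(seq) - 1
    counts = [[0.0] * 16 for _ in range(n_bins)]
    for i in range(n_pairs):
        j = DINUC_INDEX.get(seq[i:i + 2])
        if j is not None:  # skip pairs containing ambiguous bases such as N
            counts[i * n_bins // n_pairs][j] += 1.0

    flat = []
    for row in counts:
        total = sum(row)
        # Convert counts to frequencies within each bin.
        flat.extend(c / total if total else 0.0 for c in row)

    norm = sqrt(sum(x * x for x in flat))
    return [x / norm for x in flat] if norm else flat


def nearest(query, references):
    """Exact cosine search over unit vectors (stand-in for an ANN index)."""
    best_label, best_sim = None, -1.0
    for label, ref in references.items():
        sim = sum(a * b for a, b in zip(query, ref))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


# Toy reference set with hypothetical labels and synthetic sequences.
references = {
    "species_a": sketch("ACGT" * 25),
    "species_b": sketch("AAAAACCCCCGGGGGTTTTT" * 5),
}

# Query is species_a with one base deleted in the middle.
query = sketch("ACGT" * 12 + "ACG" + "ACGT" * 12)
label, similarity = nearest(query, references)
print(label, round(similarity, 3))
```

A single deletion shifts every downstream k-mer position but changes only a few dinucleotide counts in one bin, so the query stays close to its source sequence in this toy case. In a real pipeline the brute-force loop would be replaced by an ANN index built over the reference sketches.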