MELO-ED: learning locality-sensitive multi-embeddings for edit distance


Abstract

Edit distance is a fundamental metric for quantifying similarity between biological sequences, but its high computational cost limits large-scale applications. Previously, we proposed learned locality-sensitive bucketing (LSB) functions that achieved superior performance and efficiency compared to classical seeding methods for identifying similar and dissimilar sequences. However, each component of an LSB function is represented as a one-dimensional hash value that can only be compared for identity, which constrains the method’s accuracy. Here, we introduce MELO-ED, a multi-embedding locality-sensitive framework that upgrades each hash value to a higher-dimensional embedding capable of efficiently approximating edit distance. MELO-ED employs a Siamese convolutional neural architecture that learns complementary embeddings capturing both global sequence context and fine-grained edit operations. By integrating locality-sensitive bucketing with multi-embedding representations, MELO-ED achieves near-perfect accuracy without increasing the number of buckets required. Leveraging mature indexing methods in the embedding space, MELO-ED transforms time-consuming edit distance computations into scalable similarity searches across massive genomic databases. Comprehensive evaluations on simulated DNA sequences and real barcode datasets demonstrate that MELO-ED outperforms both traditional alignment-free methods and contemporary machine learning approaches, including our previously developed learned LSB functions. These results establish MELO-ED as a state-of-the-art framework for fast and accurate classification of similar and dissimilar sequences. MELO-ED is available at https://github.com/Shao-Group/MELO-ED .
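
To make the trade-off described in the abstract concrete, the following is a minimal illustrative sketch, not the MELO-ED implementation. It contrasts the exact quadratic-time edit distance computation with a cheap comparison in an embedding space. The k-mer count vectors are a hand-made stand-in for the learned Siamese CNN embeddings, and the names `kmer_embedding`, `looks_similar`, the choice `ks=(2, 3)`, the threshold of 2.5, and the "similar if any embedding pair is close" rule are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance via the standard O(len(a) * len(b)) dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


def kmer_embedding(seq: str, k: int) -> Counter:
    """Toy stand-in for a learned embedding: a sparse k-mer count vector.

    ASSUMPTION: this is NOT the MELO-ED network, only a placeholder.
    """
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def l2(u: Counter, v: Counter) -> float:
    """Euclidean distance between two sparse count vectors."""
    return sqrt(sum((u[x] - v[x]) ** 2 for x in set(u) | set(v)))


def looks_similar(a: str, b: str, ks=(2, 3), threshold: float = 2.5) -> bool:
    """Hypothetical decision rule: similar if ANY of the embedding pairs is close."""
    return any(
        l2(kmer_embedding(a, k), kmer_embedding(b, k)) <= threshold for k in ks
    )


if __name__ == "__main__":
    x, y = "ACGTACGTAC", "ACGTTCGTAC"  # one substitution apart
    print(edit_distance(x, y), looks_similar(x, y))
    print(edit_distance("AAAAAAAAAA", "CCCCCCCCCC"),
          looks_similar("AAAAAAAAAA", "CCCCCCCCCC"))
```

In this toy setting each embedding comparison costs time linear in the sequence length, whereas the exact computation is quadratic. Fixed-length dense embeddings would additionally allow the use of standard nearest-neighbor indexes, which is the kind of scalable similarity search the abstract refers to.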