LOCALE: Local-Alignment Embeddings for Noise-Robust DNA Search at SRA Scale


Abstract

Searching petabase-scale repositories of raw sequencing data such as the NIH Sequence Read Archive (SRA) could transform biological discovery, but existing methods either do not scale well or rely on exact k-mer matching that is brittle to sequencing errors and biological divergence. We recast sequence search as dense retrieval: we learn vector embeddings whose inner-product similarity ranks locally aligned sequences above unaligned ones. Our key observation is that effective retrieval does not require accurate regression of global edit distance; it only requires that sequences with better local alignments score higher than sequences with worse ones. We train a DNABERT-2 encoder with an InfoNCE objective on biologically informed augmentations: overlapping crops of parent sequences corrupted with substitutions, insertions, and deletions. On a 50-accession SRA benchmark, LOCALE maintains 62.4% average Recall@R_q at a 10% mutation rate, while every baseline we evaluated falls below 60% Recall@R_q in the noisy-query setting. The advantage holds at scale: on a 500-accession, 15-Gbp benchmark, LOCALE achieves AUPRC 0.508 at 10% mutation versus 0.129 for MetaGraph.
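To make the training signal concrete, the following is a minimal sketch of the two ingredients the abstract names: overlapping crops corrupted with substitutions, insertions, and deletions, and an InfoNCE loss over inner-product similarities. It is an illustration under stated assumptions, not the authors' implementation. The function names, crop length, minimum overlap, temperature, and the uniform choice among the three edit types are all hypothetical, and the DNABERT-2 encoder is not modeled: embeddings are passed in as plain lists of floats.

```python
import math
import random

ALPHABET = "ACGT"

def mutate(seq, rate, rng):
    """Hit each base with probability `rate` by a substitution, insertion,
    or deletion (edit type chosen uniformly; an assumption of this sketch)."""
    out = []
    for base in seq:
        if rng.random() >= rate:
            out.append(base)
            continue
        kind = rng.choice(("sub", "ins", "del"))
        if kind == "sub":
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        elif kind == "ins":
            out.append(base)
            out.append(rng.choice(ALPHABET))
        # "del": the base is dropped, so nothing is appended
    return "".join(out)

def overlapping_crops(parent, crop_len, min_overlap, rng):
    """Return two length-`crop_len` crops sharing at least `min_overlap` bases."""
    max_start = len(parent) - crop_len
    start_a = rng.randint(0, max_start)
    max_shift = crop_len - min_overlap  # largest offset that keeps the overlap
    lo = max(0, start_a - max_shift)
    hi = min(max_start, start_a + max_shift)
    start_b = rng.randint(lo, hi)
    return (parent[start_a:start_a + crop_len],
            parent[start_b:start_b + crop_len])

def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss: queries[i] and keys[i] form the positive pair,
    and every other key in the batch serves as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)

# Build one positive pair from a random parent sequence.
rng = random.Random(0)
parent = "".join(rng.choice(ALPHABET) for _ in range(200))
crop_a, crop_b = overlapping_crops(parent, crop_len=100, min_overlap=50, rng=rng)
query, key = mutate(crop_a, 0.10, rng), mutate(crop_b, 0.10, rng)
```

The loss depends only on the relative ordering of inner products within a batch, which matches the abstract's point that retrieval needs correct ranking rather than accurate regression of edit distance.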
