PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases

Abstract

Motivation

Species identification is a crucial task in fields such as agriculture, food processing, and healthcare. The rapid expansion of genomics databases, especially with the growing focus on investigating new bacterial genomes in clinical microbiology, has outpaced the capabilities of conventional tools like BLAST for basic search and query procedures. A major bottleneck in microbiome studies is building indexes that enable rapid identification and classification of species from assemblies while scaling efficiently to massive bacterial databases such as the AllTheBacteria Database, currently the largest of them, so that large-scale analysis becomes feasible on a common laptop.

Results

We introduce PANSPACE, the first convolutional neural network-based approach that leverages dense vector (embedding) indexing, a technique proven to scale up to 1 billion embeddings, to index and query very large bacterial genome databases. PANSPACE is designed to classify (draft) assemblies of bacteria. Compared to the most recent and competitive tool for this task, our index requires only ∼2 GB of disk space for the AllTheBacteria Database, more than 40× less. Additionally, PANSPACE is ultra-fast in genomic queries, processing over 1,000 queries in under two and a half minutes while maintaining high accuracy compared to the current state-of-the-art tool for the same task.
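To make the idea of embedding-based indexing concrete, the following is a minimal sketch and not the PANSPACE implementation. It assumes each assembly has already been mapped to a fixed-length vector by some encoder (the abstract names a convolutional neural network; the encoder itself is not shown). The class `ToyEmbeddingIndex`, the three-dimensional vectors, and the labels `species_A` and `species_B` are all hypothetical. A brute-force cosine-similarity scan stands in for the scalable vector index a real system would use.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class ToyEmbeddingIndex:
    """Hypothetical in-memory index: stores unit vectors with species labels."""

    def __init__(self):
        self.vectors = []
        self.labels = []

    def add(self, embedding, label):
        # Store the normalized embedding next to its label.
        self.vectors.append(normalize(embedding))
        self.labels.append(label)

    def query(self, embedding, k=3):
        # Exhaustive scan: score every stored vector, keep the k most similar.
        q = normalize(embedding)
        scored = [
            (sum(a * b for a, b in zip(q, v)), label)
            for v, label in zip(self.vectors, self.labels)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    def classify(self, embedding, k=3):
        # Majority vote over the labels of the k nearest neighbors.
        neighbors = self.query(embedding, k)
        if not neighbors:
            raise ValueError("index is empty")
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]


# Made-up embeddings for illustration only.
index = ToyEmbeddingIndex()
index.add([0.9, 0.1, 0.0], "species_A")
index.add([0.8, 0.2, 0.1], "species_A")
index.add([0.0, 0.1, 0.9], "species_B")
index.add([0.1, 0.0, 0.8], "species_B")

predicted = index.classify([0.85, 0.15, 0.05], k=3)
print(predicted)
```

The exhaustive scan costs time linear in the number of stored vectors per query, which is why large collections are usually served by a dedicated vector index rather than a loop like this one. Storing one short vector per genome, instead of sequence-level data, is what keeps such an index compact.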

Availability

PANSPACE is available at https://github.com/pg-space/panspace.
