PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Motivation

Species identification is a critical task in agriculture, food processing, and health-care. The rapid growth of genomic databases — driven in part by the increasing investigation of bacterial genomes in clinical microbiology — has outpaced the capabilities of conventional tools such as BLAST for basic search and query tasks. A key bottleneck in microbiome studies lies in building indexes that allow rapid species identification and classification from assemblies while scaling efficiently to massive resources such as the AllTheBacteria database, thus enabling large-scale analyses to be performed even on a common laptop.

Results

We introduce PanSpace , the first convolutional neural network–based approach that leverages dense vector (embedding) indexing —– scalable to billions of embeddings —– for indexing and querying massive bacterial genome databases. PanSpace is specifically designed to classify bacterial draft assemblies. Compared to the most recent and competitive tool for this task, PanSpace requires only ~2 GB of disk space to index the AllTheBacteria database, an 8 × reduction relative to existing methods. Moreover, it delivers ultra-fast query performance, processing more than 1,000 assemblies in less than two and a half minutes, while preserving the utmost accuracy of state-of-the-art approaches.

Availability

PanSpace is available at https://github.com/pg-space/panspace .

Article activity feed