REINDEER2: practical abundance index at scale

Abstract

Recent advances in biological sequence indexing have enabled efficient queries for sequence presence across massive genomic data repositories. While presence queries have become tractable at petabyte scale, retrieving quantitative information such as sequence abundances remains a significant algorithmic challenge. Existing abundance-aware indexes are mostly static, difficult to scale, and often sacrifice completeness, precision, or updatability. We describe a novel discrete abundance index designed for scalability, dynamic updates, and tunable precision. It combines an inverted index with probabilistic and exact structures to support fast, memory-efficient construction and precise, high-throughput queries across thousands of RNA datasets.
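To make the idea of a discrete abundance index concrete, the following toy sketch stores, for each k-mer, a posting list of (dataset, discretized abundance level) pairs and answers sequence queries per dataset. It is an illustration under stated assumptions, not REINDEER2's implementation: the class and function names, the log-scale discretization, the median aggregation, and the plain-dictionary storage are all hypothetical choices, and the probabilistic structures mentioned in the abstract are replaced here by an exact Python dictionary.

```python
import math
import statistics
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of seq (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(seq, k):
    """Count k-mer occurrences in a single sequence (stand-in for a real k-mer counter)."""
    return Counter(kmers(seq, k))


def quantize(count, base=2.0):
    """Map a positive count to a discrete log-scale level (assumed scheme)."""
    return round(math.log(count) / math.log(base))


def dequantize(level, base=2.0):
    """Recover the representative abundance of a level."""
    return base ** level


class ToyAbundanceIndex:
    """Hypothetical inverted index: k-mer -> {dataset id: abundance level}."""

    def __init__(self, k=5, base=2.0):
        self.k = k
        self.base = base  # larger base = coarser levels, smaller index
        self.postings = defaultdict(dict)
        self.datasets = []

    def add_dataset(self, name, kmer_counts):
        """Append a dataset without rebuilding the postings of earlier ones."""
        ds_id = len(self.datasets)
        self.datasets.append(name)
        for kmer, count in kmer_counts.items():
            if count > 0:
                self.postings[kmer][ds_id] = quantize(count, self.base)
        return ds_id

    def query(self, seq):
        """Per dataset: (median recovered abundance, fraction of query k-mers found)."""
        query_kmers = kmers(seq, self.k)
        per_dataset = defaultdict(list)
        for km in query_kmers:
            for ds_id, level in self.postings.get(km, {}).items():
                per_dataset[ds_id].append(dequantize(level, self.base))
        return {
            self.datasets[ds_id]: (statistics.median(vals), len(vals) / len(query_kmers))
            for ds_id, vals in per_dataset.items()
        }


index = ToyAbundanceIndex(k=3)
index.add_dataset("A", count_kmers("ACGTACGT", 3))
index.add_dataset("B", {"ACG": 8})  # added later, no rebuild needed
result = index.query("ACGT")
```

In this sketch the `base` parameter plays the role of a precision knob: counts are rounded to the nearest power of `base`, so a count of 10 with base 2 is stored as level 3 and recovered as 8. How REINDEER2 actually discretizes abundances and tunes precision is described in the full article, not here.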

Our experiments show that our method, REINDEER2, builds its index one to two orders of magnitude faster than existing methods while using comparable or less memory. Despite relying on approximate structures for scalability, REINDEER2 recovers abundances with less than 1% error, and its estimates correlate strongly with those of reference quantifiers such as Kallisto. It also answers sequence-level queries over thousands of datasets in seconds.

Code and experiments: github.com/Yohan-HernandezCourbevoie/REINDEER2
