REINDEER2: practical abundance index at scale

Yohan Hernandez-Courbevoie
Mikaël Salson
Chloé Bessière
Hao-liang Xue
Daniel Gautheret
Camille Marchet
Antoine Limasset


Abstract

Recent advances in biological sequence indexing have enabled the efficient querying of sequence presence across massive genomic data repositories. While presence queries have become tractable at petabyte scale, retrieving quantitative information such as sequence abundances remains a significant algorithmic challenge. Existing abundance-aware indexes are mostly static, difficult to scale, and often trade off completeness, precision, or updatability. We describe a novel discrete abundance index designed for scalability, dynamic updates, and tunable precision. We combine an inverted index with probabilistic and exact structures to support fast, memory-efficient construction and precise high-throughput queries across thousands of RNA datasets.
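To make the idea of a discrete abundance index concrete, the following is a minimal sketch and not the REINDEER2 implementation. It assumes a plain in-memory dictionary as the inverted index (k-mer to per-dataset abundance bucket) and a log-scale discretization whose `base` parameter stands in for the tunable precision mentioned above. The probabilistic structures described in the abstract are omitted entirely, and every name here (`ToyAbundanceIndex`, `to_bucket`, `from_bucket`) is hypothetical.

```python
from collections import Counter, defaultdict
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq (no canonicalization, for brevity)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def to_bucket(count, base=2):
    """Map a positive count to a log-scale bucket; a smaller base means finer precision."""
    bucket, upper = 0, base
    while count >= upper:
        bucket += 1
        upper *= base
    return bucket


def from_bucket(bucket, base=2):
    """Return the lower bound of a bucket as its representative abundance."""
    return base ** bucket


class ToyAbundanceIndex:
    """Hypothetical toy index: k-mer -> {dataset_id: abundance bucket}."""

    def __init__(self, k=5, base=2):
        self.k, self.base = k, base
        self.postings = defaultdict(dict)

    def add_dataset(self, dataset_id, reads):
        # Datasets can be added at any time, which mimics a dynamic update.
        counts = Counter(km for read in reads for km in kmers(read, self.k))
        for km, count in counts.items():
            self.postings[km][dataset_id] = to_bucket(count, self.base)

    def query(self, seq):
        # Collect the representative abundance of every query k-mer found
        # in each dataset, then summarize each dataset with the median.
        per_dataset = defaultdict(list)
        for km in kmers(seq, self.k):
            for dataset_id, bucket in self.postings.get(km, {}).items():
                per_dataset[dataset_id].append(from_bucket(bucket, self.base))
        return {ds: median(vals) for ds, vals in per_dataset.items()}


index = ToyAbundanceIndex(k=3)
index.add_dataset("A", ["ACGT"] * 5)  # each 3-mer seen 5 times
index.add_dataset("B", ["ACGT"])      # each 3-mer seen once
result = index.query("ACGT")
```

In this toy setting, a count of 5 falls in the bucket covering 4 to 7 and is reported as 4, which illustrates how discretization trades exactness for compactness. Using the median over query k-mers is an assumption made for illustration; the source does not specify the aggregation.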

Our experiments demonstrate that our method, REINDEER2, achieves a one to two orders of magnitude speedup in construction compared to existing methods, while maintaining comparable or better memory use. Despite using approximate structures for scalability, REINDEER2 achieves sub-1% error on abundance recovery and correlates strongly with reference quantifiers like Kallisto. It also supports sequence-level queries in seconds over thousands of datasets.

Code and experiments: github.com/Yohan-HernandezCourbevoie/REINDEER2

Version published to 10.1101/2025.06.16.659990v1 on bioRxiv
Jun 17, 2025

Related articles

Accelerating k-mer-based sequence filtering

This article has 6 authors:
1. Igor Martayan
2. Léa Vandamme
3. Bede Constantinides
4. Bastien Cazaux
5. Charles Paperman
6. Antoine Limasset
This article has no evaluations. Latest version: Jun 20, 2025
Tomtom-lite: Accelerating Tomtom enables large-scale and real-time motif similarity scoring

This article has 1 author:
1. Jacob Schreiber
This article has no evaluations. Latest version: May 31, 2025
MassiveFold data for CASP16-CAPRI: a systematic massive sampling experiment

This article has 3 authors:
1. Nessim Raouraoua
2. Marc F. Lensink
3. Guillaume Brysbaert
This article has no evaluations. Latest version: May 27, 2025
