Compressed inverted indexes for scalable sequence similarity
Abstract
Modern sequencing continues to drive explosive growth of nucleotide sequence archives, pushing MinHash-derived sketching methods to their practical scalability limits. State-of-the-art tools such as Mash, Dashing2, and Bindash2 provide compact sketches and accurate similarity estimates for large collections, yet ultimately rely on forward indexes that materialize sketches as explicit fingerprint vectors. As a result, large-scale similarity search and exhaustive collection-versus-collection comparison still incur quadratic resource usage.
In this work, we revisit the architecture of sketch-based indexes and propose a framework for scalable similarity search over massive sketch collections. Our first contribution is a formal cost model for sketch comparison, within which we prove that inverted indexes on sketch fingerprints, equipped with suitably compressed posting lists, achieve the same asymptotic space complexity as standard forward indexes, thereby eliminating the perceived memory penalty traditionally associated with inverted indexes. Building on this model, we design algorithms for all-vs-all comparison between two inverted indexes with running times proportional to the total number of matching sketch positions, which makes them output-sensitive and enables efficient large-scale similarity comparisons.
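To make the inverted-index view concrete, the following minimal Rust sketch (not Onika's implementation; uncompressed vectors stand in for its compressed posting lists, and the similarity estimator is a simplification) builds an index from fingerprint to document identifiers and performs an all-vs-all comparison whose work is driven only by fingerprints shared between the two indexes:

```rust
use std::collections::HashMap;

// Minimal illustration: an inverted index maps each sketch fingerprint to the
// sorted list of documents whose sketch contains it (a posting list).
// Onika compresses these lists; plain Vec<u32> is used here for clarity.
struct InvertedIndex {
    postings: HashMap<u64, Vec<u32>>,
    sketch_size: usize,
}

impl InvertedIndex {
    fn build(sketches: &[Vec<u64>], sketch_size: usize) -> Self {
        let mut postings: HashMap<u64, Vec<u32>> = HashMap::new();
        for (doc_id, sketch) in sketches.iter().enumerate() {
            for &fp in sketch {
                postings.entry(fp).or_default().push(doc_id as u32);
            }
        }
        InvertedIndex { postings, sketch_size }
    }
}

// All-vs-all comparison between two inverted indexes: only fingerprints
// present in both indexes are touched, so the work is proportional to the
// number of matching (query document, target document) fingerprint pairs.
fn all_vs_all(a: &InvertedIndex, b: &InvertedIndex) -> HashMap<(u32, u32), u32> {
    let mut shared: HashMap<(u32, u32), u32> = HashMap::new();
    for (fp, list_a) in &a.postings {
        if let Some(list_b) = b.postings.get(fp) {
            for &da in list_a {
                for &db in list_b {
                    *shared.entry((da, db)).or_insert(0) += 1;
                }
            }
        }
    }
    shared
}

fn main() {
    // Toy fingerprints; real sketches hold MinHash values.
    let collection_a = vec![vec![1u64, 2, 3, 4], vec![2, 3, 5, 6]];
    let collection_b = vec![vec![1u64, 2, 3, 7], vec![8, 9, 10, 11]];
    let ia = InvertedIndex::build(&collection_a, 4);
    let ib = InvertedIndex::build(&collection_b, 4);
    for ((da, db), m) in all_vs_all(&ia, &ib) {
        // Shared fingerprints over sketch size as a rough similarity estimate.
        println!("A#{da} vs B#{db}: ~{:.2}", m as f64 / ia.sketch_size as f64);
    }
}
```

Compressing the sorted identifier lists (e.g. delta plus variable-byte coding) is what, in the paper's cost model, keeps such an inverted index within the asymptotic space of a forward index.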
Our second contribution leverages the prevalence of similarity thresholds in downstream applications such as clustering, redundancy filtering, and database search. We introduce two early-pruning schemes: an exact criterion that safely eliminates pairs guaranteed not to reach a target Jaccard similarity, and a probabilistic strategy that exploits partial match statistics to discard pairs unlikely to exceed this threshold. Together, these schemes address both time and memory bottlenecks while maintaining rigorous guarantees on retained high-similarity pairs and providing explicit control of the false-rejection probability for the probabilistic variant.
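As an illustration of the exact criterion, the sketch below is a deliberate simplification and an assumption on our part: it presumes positioned sketches compared slot by slot, with the Jaccard similarity estimated as the fraction of matching positions. A pair is abandoned as soon as even the best case over the remaining positions could not reach the threshold tau:

```rust
// Exact early-termination test (illustrative, assuming positioned sketches):
// after scanning `seen` of the `s` positions with `matches` agreements, the
// estimate matches / s can still reach `tau` only if all remaining positions
// could match. This never discards a pair whose estimate could reach tau.
fn can_still_reach(matches: usize, seen: usize, s: usize, tau: f64) -> bool {
    let needed = (tau * s as f64).ceil() as usize;
    matches + (s - seen) >= needed
}

// Compare two positioned sketches with exact early pruning; returns None
// when the pair provably cannot reach the similarity threshold.
fn jaccard_with_pruning(a: &[u64], b: &[u64], tau: f64) -> Option<f64> {
    assert_eq!(a.len(), b.len());
    let s = a.len();
    let mut matches = 0usize;
    for i in 0..s {
        if a[i] == b[i] {
            matches += 1;
        }
        if !can_still_reach(matches, i + 1, s, tau) {
            return None; // exact rejection: threshold is unreachable
        }
    }
    Some(matches as f64 / s as f64)
}

fn main() {
    let a: Vec<u64> = (0..1000).collect();
    let mut b = a.clone();
    for x in b.iter_mut().skip(500) {
        *x += 1_000_000; // make the second half of the sketch disagree
    }
    match jaccard_with_pruning(&a, &b, 0.9) {
        Some(j) => println!("estimated Jaccard {j:.3}"),
        None => println!("pair pruned: cannot reach threshold 0.9"),
    }
}
```

The probabilistic variant described above would replace this best-case bound with a tail bound on the number of remaining matches, discarding pairs earlier at the cost of an explicitly controlled false-rejection probability.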
Finally, we instantiate these ideas in Onika, an open-source Rust implementation based on compressed inverted posting lists, available at github.com/Malfoy/Onika. Onika incorporates a similarity-aware document reordering strategy that restructures sketch identifiers to further shrink index size and improve locality, particularly for redundant collections. Experiments on large bacterial genome repositories, synthetic low-redundancy benchmarks, and long-read HiFi sequencing datasets demonstrate that Onika matches or improves upon the sketch sizes of leading tools while accelerating large-scale search and collection-versus-collection comparisons by up to several orders of magnitude in low-redundancy regimes, without compromising sensitivity at practically relevant similarity thresholds.
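The effect of such a reordering on index size can be illustrated with a toy delta-plus-varint cost model (not Onika's actual encoder or reordering strategy): once documents that share fingerprints receive nearby identifiers, the gaps in each posting list shrink, and so does their encoded size.

```rust
// Bytes needed to store a sorted posting list as varint-coded gaps
// (7 payload bits per byte); smaller gaps mean a smaller index.
fn delta_varint_bytes(postings: &[u32]) -> usize {
    let mut prev = 0u32;
    let mut bytes = 0usize;
    for &id in postings {
        let gap = (id - prev).max(1);
        let bits = 32 - gap.leading_zeros() as usize;
        bytes += (bits + 6) / 7;
        prev = id;
    }
    bytes
}

fn main() {
    // The same posting list before and after a renumbering that placed the
    // documents sharing this fingerprint next to each other.
    let before = vec![3u32, 981, 4050, 77_712];
    let after = vec![120u32, 121, 122, 123];
    println!(
        "posting list: {} bytes before vs {} bytes after reordering",
        delta_varint_bytes(&before),
        delta_varint_bytes(&after)
    );
}
```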