Kaminari: a resource-frugal index for approximate colored k-mer queries

Victor Levallois
Yoshihiro Shibuya
Bertrand Le Gal
Rob Patro
Pierre Peterlongo
Giulio Ermanno Pibiri

Abstract

Motivation

The problem of identifying the set of textual documents from a given database containing a query string has been studied in various fields of computing, e.g., in Information Retrieval, Databases, and Computational Biology. We consider the approximate version of this problem, that is, the result set is allowed to contain some false positive matches (but no false negatives), and focus on the specific case where the indexed documents are DNA strings. In this setting, state-of-the-art solutions rely on Bloom filters as a way to index all k-mers (substrings of length k) in the documents. To answer a query, the k-mers of the query string are tested for membership against the index and documents that contain at least a user-prescribed fraction of them (e.g., 75–80%) are returned.
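
The query procedure described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of any tool mentioned here: each document's k-mers are kept in an exact Python set as a stand-in for a Bloom filter (so this sketch never reports false positives), and the names `kmers`, `build_index`, and `query`, as well as the default threshold, are hypothetical.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> List[str]:
    """Return all substrings of length k of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(docs: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each document name to the set of its k-mers.

    Assumption: an exact set replaces the Bloom filter a real index would use.
    """
    return {name: set(kmers(seq, k)) for name, seq in docs.items()}


def query(index: Dict[str, Set[str]], q: str, k: int,
          threshold: float = 0.75) -> List[str]:
    """Return the documents containing at least `threshold` of the query's k-mers."""
    query_kmers = kmers(q, k)
    if not query_kmers:
        return []  # query shorter than k: nothing to test
    hits = []
    for name, doc_kmers in index.items():
        present = sum(1 for x in query_kmers if x in doc_kmers)
        if present >= threshold * len(query_kmers):
            hits.append(name)
    return hits
```

For example, with `index = build_index({"d1": "ACGTACGTGA", "d2": "TTTTTTTTTT"}, 4)`, the call `query(index, "ACGTACG", 4)` reports only `d1`, because all four k-mers of the query occur in that document and none occur in `d2`.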

Methods and results

Here, we explore an alternative index design based on k-mer minimizers and integer compression methods. We show that a careful implementation of this design outperforms previous solutions based on Bloom filters by a wide margin: the index has lower memory footprint and faster query times, while false positive matches have only a minor impact on the ranking of the documents reported. This trend is robust across genomic datasets of different complexity and query workloads.
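
To make the two ingredients named above concrete, the sketch below shows one common way to compute a minimizer and one simple integer compression scheme. The abstract does not say which minimizer ordering or which codec the software uses, so everything here is an assumption chosen for illustration: lexicographic order for the minimizer, variable-byte coding of gaps for a sorted list of document identifiers, and the hypothetical function names `minimizer`, `delta_vbyte_encode`, and `delta_vbyte_decode`.

```python
from typing import List


def minimizer(kmer: str, m: int) -> str:
    """Return the lexicographically smallest substring of length m of kmer.

    Assumption: lexicographic order; practical tools often order m-mers by a hash.
    """
    assert 0 < m <= len(kmer)
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def delta_vbyte_encode(sorted_ids: List[int]) -> bytes:
    """Encode a non-decreasing list of non-negative ints as variable-byte gaps.

    Each gap is split into 7-bit groups, low bits first; the high bit of a
    byte is set when more bytes of the same gap follow.
    """
    out = bytearray()
    prev = 0
    for x in sorted_ids:
        gap = x - prev
        prev = x
        while gap >= 128:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def delta_vbyte_decode(data: bytes) -> List[int]:
    """Invert delta_vbyte_encode: rebuild the list by summing decoded gaps."""
    ids: List[int] = []
    prev, gap, shift = 0, 0, 0
    for b in data:
        gap |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7  # continuation byte: more bits of this gap follow
        else:
            prev += gap
            ids.append(prev)
            gap, shift = 0, 0
    return ids
```

Consecutive k-mers of a sequence often share the same minimizer, and sorted identifier lists have small gaps that fit in a single byte, which gives some intuition for why a design built on these two ideas can be compact.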

Software

The software is implemented in C++17 and available under the MIT license at github.com/yhhshb/kaminari. Reproducibility information and additional results are provided at github.com/vicLeva/benchmarks_kaminari.

Version published to 10.1101/2025.05.16.654317 on bioRxiv, May 21, 2025

Related articles

Data Structures for Range Sorted Consecutive Occurrence Queries

This article has 2 authors:
1. Waseem Akram
2. Takuya Mieno
This article has no evaluations. Latest version: Jan 21, 2026

topSEARCH: a Comprehensive Tool for the Retrieval and Analysis of Multi-Type Online Resources

This article has 6 authors:
1. Ander Cejudo
2. Yone Tellechea
3. Teresa García-Navarro
4. Amaia Calvo
5. Garazi Artola
6. Nekane Larburu
This article has no evaluations. Latest version: Jan 20, 2026

Lossless Pangenome Indexing Using Tag Arrays

This article has 3 authors:
1. Parsa Eskandar
2. Benedict Paten
3. Jouni Sirén
This article has no evaluations. Latest version: Jan 18, 2026
