Brisk: Exact resource-efficient dictionary for k -mers

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The rapid advancements in DNA sequencing technology have led to an unprecedented increase in the generation of genomic datasets, with modern sequencers now capable of producing up to ten terabases per run. However, the effective indexing and analysis of this vast amount of data pose significant challenges to the scientific community. K-mer indexing has proven crucial in managing extensive datasets across a wide range of applications, including alignment, compression, dataset comparison, error correction, assembly, and quantification. As a result, developing efficient and scalable k -mer indexing methods has become an increasingly important area of research. Despite the progress made, current state-of-the-art indexing structures are predominantly static, necessitating resource-intensive index reconstruction when integrating new data. Recently, the need for dynamic indexing structures has been recognized. However, many proposed solutions are only pseudo-dynamic, requiring substantial updates to justify the costs of adding new datasets. In practice, applications often rely on standard hash tables to associate data with their k -mers, leading to high k -mer encoding rates exceeding 64 bits per k -mer. In this work, we introduce Brisk, a drop-in replacement for most k -mer dictionary applications. This novel hashmap-like data structure provides high throughput while significantly reducing memory usage compared to existing dynamic associative indexes, particularly for large k -mer sizes. Brisk achieves this by leveraging hierarchical minimizer indexing and memory-efficient super- k -mer representation. We also introduce novel techniques for efficiently probing k -mers within a set of super- k -mers and managing duplicated minimizers. We believe that the methodologies developed in this work represent a significant advancement in the creation of efficient and scalable k -mer dictionaries, greatly facilitating their routine use in genomic data analysis.

Article activity feed