DynamicDemiLog: A Single Sketch for Ultrafast Similarity, Frequency, and Cardinality Estimation

Brian Bushnell

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

Probabilistic cardinality estimators (HyperLogLog), similarity sketches (MinHash), and frequency estimators (Count-Min Sketch) are fundamental approximate data structures that each target one primary problem. We present DynamicDemiLog (DDL), a sketch that unifies cardinality estimation, set similarity, containment, element frequency and composition in one tiny data structure built from a single pass over the input stream. Using an inverted index over 200,687 RefSeq sketches (159,567 organisms), DDL performs all-to-all sketch similarity comparison of the full database in 30 seconds (128 threads, indexed) — over 375× faster per query than Mash’s brute-force all-to-all comparison of 91,282 sketches, or 31× faster without the index, at double the sketch resolution. DDL extends the LogLog register with a mantissa : each register stores a floating-point-encoded hash value consisting of an integer exponent (the leading-zero count) and a fractional mantissa (the sub-leading-zero bits), rather than the integer leading-zero count alone. This preserves enough hash information for meaningful register-by-register comparison — a property that standard 6-bit registers lack — while improving on LogLog’s cardinality estimation machinery, including DynamicLogLog’s early exit mask for high-throughput streaming.

With a default 10 mantissa bits (16-bit registers, 2,048 buckets, 4 KB), DDL achieves a per-register false-match rate of 0.018% on unrelated random same-size sets (compared to 17.0% for LL6, a basic HyperLogLog implementation), enabling Weighted Kmer Identity (WKID), Average Nucleotide Identity (ANI), containment, and completeness estimation from register comparison alone. A 16-bit per-register observation counter provides element frequency information at trivial additional computation cost, and an additional byte tracks element composition (GC content, for biological data). Furthermore, DDL’s high-specificity registers enable an inverted index structure (DDLIndex) that answers similarity queries against a database of N sketches in O( B + M ) time, where M is the number of matching index entries, compared to O( N × B ) for pairwise comparison.

DDL achieves a 930× reduction in false register matches compared to LL6 (Section 11.1), accurately estimates ANI between full and partial genomes down to approximately 79% identity (at k=25, B=2,048), and maintains near-zero spurious similarity on unrelated inputs — all at similar construction speed to LL6, and 3.5× faster than SetSketch.

Version published to 10.64898/2026.06.12.731986 on bioRxiv
Jun 16, 2026

Fast Set Operations for Compact k -mer Sets

This article has 4 authors:
1. Jarno N. Alanko
2. Lore Depuydt
3. Camille Marchet
4. Simon J. Puglisi
This article has no evaluationsLatest version May 27, 2026
Accelerating String Comparison in RLZ Compressed Sequences via LCE Jumps

This article has 2 authors:
1. Rahul Varki
2. Christina Boucher
This article has no evaluationsLatest version Jun 16, 2026
FastDedup A fast and memory-efficient tool for read deduplication

This article has 3 authors:
1. Raphaël Ribes
2. Céline Mandier
3. Alice Baniel
This article has no evaluationsLatest version May 4, 2026

Discuss this preprint

Listed in

Abstract

Article activity feed

Related articles

Fast Set Operations for Compact k -mer Sets

Accelerating String Comparison in RLZ Compressed Sequences via LCE Jumps

FastDedup A fast and memory-efficient tool for read deduplication