DynamicDemiLog: A Single Sketch for Ultrafast Similarity, Frequency, and Cardinality Estimation
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Probabilistic cardinality estimators (HyperLogLog), similarity sketches (MinHash), and frequency estimators (Count-Min Sketch) are fundamental approximate data structures that each target one primary problem. We present DynamicDemiLog (DDL), a sketch that unifies cardinality estimation, set similarity, containment, element frequency and composition in one tiny data structure built from a single pass over the input stream. Using an inverted index over 200,687 RefSeq sketches (159,567 organisms), DDL performs all-to-all sketch similarity comparison of the full database in 30 seconds (128 threads, indexed) — over 375× faster per query than Mash’s brute-force all-to-all comparison of 91,282 sketches, or 31× faster without the index, at double the sketch resolution. DDL extends the LogLog register with a mantissa : each register stores a floating-point-encoded hash value consisting of an integer exponent (the leading-zero count) and a fractional mantissa (the sub-leading-zero bits), rather than the integer leading-zero count alone. This preserves enough hash information for meaningful register-by-register comparison — a property that standard 6-bit registers lack — while improving on LogLog’s cardinality estimation machinery, including DynamicLogLog’s early exit mask for high-throughput streaming.
With a default 10 mantissa bits (16-bit registers, 2,048 buckets, 4 KB), DDL achieves a per-register false-match rate of 0.018% on unrelated random same-size sets (compared to 17.0% for LL6, a basic HyperLogLog implementation), enabling Weighted Kmer Identity (WKID), Average Nucleotide Identity (ANI), containment, and completeness estimation from register comparison alone. A 16-bit per-register observation counter provides element frequency information at trivial additional computation cost, and an additional byte tracks element composition (GC content, for biological data). Furthermore, DDL’s high-specificity registers enable an inverted index structure (DDLIndex) that answers similarity queries against a database of N sketches in O( B + M ) time, where M is the number of matching index entries, compared to O( N × B ) for pairwise comparison.
DDL achieves a 930× reduction in false register matches compared to LL6 (Section 11.1), accurately estimates ANI between full and partial genomes down to approximately 79% identity (at k=25, B=2,048), and maintains near-zero spurious similarity on unrelated inputs — all at similar construction speed to LL6, and 3.5× faster than SetSketch.