Count your bits: more subtle similarity measures using larger radius count vectors
Abstract
Quantifying molecular similarity is a cornerstone of cheminformatics, underpinning applications from virtual screening to chemical space visualization. A wide range of molecular fingerprints and similarity metrics, most notably Tanimoto scores, are employed, but their effectiveness is highly context-dependent. In this study, we systematically evaluate several 2D fingerprint types, including circular, path-based, and distance-encoded variants, using both binary and count representations. We highlight the consequences of fingerprint choice, vector folding, and similarity metric selection, revealing critical issues such as fingerprint duplication, mass-dependent score biases, and high bit collision rates. Sparse and count-based fingerprints consistently outperform fixed-size binary vectors in preserving structural distinctions. Furthermore, we introduce percentile-based normalization, propose inverse-document-frequency (IDF) weighting, and benchmark all methods against graph-based MCES similarities. Our results offer practical guidance for selecting molecular similarity measures, emphasizing the need for conscious, task-aware fingerprinting choices in large-scale chemical analyses.
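To make two of the effects named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes sparse fingerprints stored as plain dictionaries mapping a feature ID to its count, uses the min/max form as one common generalization of Tanimoto to count vectors, and folds by taking the feature ID modulo the vector length. The feature IDs, counts, and function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def tanimoto_binary(a, b):
    """Tanimoto (Jaccard) score on the sets of features present in a and b."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(keys_a & keys_b) / len(union)


def tanimoto_count(a, b):
    """Min/max generalization of Tanimoto for count vectors (an assumption)."""
    keys = set(a) | set(b)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    if denominator == 0:
        return 1.0  # same convention as above
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    return numerator / denominator


def fold(fingerprint, n_bits):
    """Fold a sparse fingerprint into n_bits positions via modulo hashing."""
    folded = Counter()
    for feature_id, count in fingerprint.items():
        folded[feature_id % n_bits] += count  # colliding features are merged
    return dict(folded)


# Hypothetical molecules that share the same features but differ in how
# often one feature occurs (for example, a repeated chain fragment).
short_chain = {101: 2, 202: 4, 303: 1}
long_chain = {101: 2, 202: 12, 303: 1}

binary_score = tanimoto_binary(short_chain, long_chain)
count_score = tanimoto_count(short_chain, long_chain)

# Hypothetical molecules whose distinct features 17 and 33 collide
# when folded to 16 positions (17 % 16 == 33 % 16 == 1).
mol_a = {5: 1, 17: 1}
mol_b = {5: 1, 33: 1}

sparse_score = tanimoto_binary(mol_a, mol_b)
folded_score = tanimoto_binary(fold(mol_a, 16), fold(mol_b, 16))

print(f"binary: {binary_score:.3f}, count: {count_score:.3f}")
print(f"sparse: {sparse_score:.3f}, folded to 16: {folded_score:.3f}")
```

In this toy case the binary representation cannot separate the two chain molecules, while the count representation can, and folding makes two different molecules look identical. Real fingerprints would come from a cheminformatics toolkit rather than hand-written dictionaries.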