Count your bits: more subtle similarity measures using larger radius count vectors
Abstract
Quantifying molecular similarity is a cornerstone of cheminformatics, underpinning applications from virtual screening to chemical space visualization. A wide range of molecular fingerprints and similarity metrics, most notably Tanimoto scores, are employed, but their effectiveness is highly context-dependent. In this study, we systematically evaluate several 2D fingerprint types, including circular, path-based, and distance-encoded variants, using both binary and count representations. We highlight the consequences of fingerprint choice, vector folding, and similarity metric selection, revealing critical issues such as fingerprint duplication, mass-dependent score biases, and high bit collision rates. Sparse and count-based fingerprints consistently outperform fixed-size binary vectors in preserving structural distinctions. Furthermore, we introduce percentile-based normalization, propose inverse-document-frequency (IDF) weighting, and benchmark all methods against graph-based MCES similarities. Our results offer practical guidance for selecting molecular similarity measures, emphasizing the need for conscious, task-aware fingerprinting choices in large-scale chemical analyses.
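To make the contrast between binary and count representations concrete, the following is a minimal sketch on toy data, not the study's implementation. It assumes fingerprints are stored as sparse dictionaries mapping an integer feature ID to a count. The feature IDs, the helper names (`tanimoto_binary`, `tanimoto_count`, `fold`, `idf_weights`), the min/max form of the count-based Tanimoto score, modulo folding, and the `log(N / df)` form of the IDF weight are all illustrative assumptions; the study may define these differently.

```python
from collections import Counter
from math import log


def tanimoto_binary(a, b):
    """Tanimoto (Jaccard) score on the sets of features present."""
    sa, sb = set(a), set(b)
    union = len(sa | sb)
    # Convention assumed here: two empty fingerprints count as identical.
    return len(sa & sb) / union if union else 1.0


def tanimoto_count(a, b, weights=None):
    """Count-based Tanimoto in its min/max form, optionally feature-weighted."""
    keys = set(a) | set(b)
    w = (lambda k: 1.0) if weights is None else (lambda k: weights.get(k, 0.0))
    num = sum(w(k) * min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(w(k) * max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 1.0


def fold(fp, n_bits):
    """Fold a sparse fingerprint to n_bits via modulo; colliding counts add up."""
    folded = Counter()
    for feature_id, count in fp.items():
        folded[feature_id % n_bits] += count
    return dict(folded)


def idf_weights(fps):
    """Assumed IDF form: log(N / document frequency) per feature."""
    n = len(fps)
    df = Counter(k for fp in fps for k in fp)
    return {k: log(n / d) for k, d in df.items()}


# Hypothetical molecules: same features, different multiplicities.
mol_a = {1: 1, 2: 3, 17: 1}
mol_b = {1: 1, 2: 1, 17: 1}
# Hypothetical molecules whose distinct features collide after folding to 8 bits.
mol_c = {1: 1, 9: 1}
mol_d = {1: 1, 17: 1}

binary_ab = tanimoto_binary(mol_a, mol_b)   # 1.0: binary vectors are duplicates
count_ab = tanimoto_count(mol_a, mol_b)     # 0.6: counts keep them apart
sparse_cd = tanimoto_count(mol_c, mol_d)
folded_cd = tanimoto_count(fold(mol_c, 8), fold(mol_d, 8))
weights = idf_weights([mol_a, mol_b, mol_c, mol_d])
weighted_cd = tanimoto_count(mol_c, mol_d, weights)
```

In this toy setting the binary score cannot distinguish `mol_a` from `mol_b`, while the count-based score can. Folding makes `mol_c` and `mol_d` indistinguishable even though their sparse fingerprints share only one of three features. Under the assumed IDF weighting, the feature present in every molecule receives zero weight, so it no longer contributes to the similarity of `mol_c` and `mol_d`.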