Count your bits: more subtle similarity measures using larger radius count vectors

Florian Huber
Julian Pollmann

Abstract

Quantifying molecular similarity is a cornerstone of cheminformatics, underpinning applications from virtual screening to chemical space visualization. A wide range of molecular fingerprints and similarity metrics, most notably Tanimoto scores, are employed, but their effectiveness is highly context-dependent. In this study, we systematically evaluate several 2D fingerprint types, including circular, path-based, and distance-encoded variants, using both binary and count representations. We highlight the consequences of fingerprint choice, vector folding, and similarity metric selection, revealing critical issues such as fingerprint duplication, mass-dependent score biases, and high bit collision rates. Sparse and count-based fingerprints consistently outperform fixed-size binary vectors in preserving structural distinctions. Furthermore, we introduce percentile-based normalization, propose inverse-document-frequency (IDF) weighting, and benchmark all methods against graph-based MCES similarities. Our results offer practical guidance for selecting molecular similarity measures, emphasizing the need for conscious, task-aware fingerprinting choices in large-scale chemical analyses.
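To make the contrast between binary and count representations concrete, the following is a minimal sketch in plain Python, not the preprint's implementation. Fingerprints are modelled as sparse dictionaries mapping a feature ID to its count. All function names and feature IDs are hypothetical, folding is simplified to a modulo operation, and the IDF weight is assumed to take the textbook form log(N / df), which may differ from the weighting proposed in the article.

```python
import math
from collections import Counter


def tanimoto_binary(a, b):
    """Tanimoto (Jaccard) score on the sets of features present; counts are ignored."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 1.0


def tanimoto_count(a, b, weights=None):
    """Generalized Tanimoto on counts: weighted sum of minima over sum of maxima."""
    w = weights or {}  # features without a weight default to 1.0 (assumption)
    keys = set(a) | set(b)
    num = sum(w.get(k, 1.0) * min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(w.get(k, 1.0) * max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 1.0


def fold(fp, n_bits):
    """Fold a sparse fingerprint into n_bits slots via modulo; colliding counts add up."""
    folded = Counter()
    for feature_id, count in fp.items():
        folded[feature_id % n_bits] += count
    return dict(folded)


def idf_weights(fingerprints):
    """Assumed IDF form: log(N / df), where df is the number of molecules with the feature."""
    n = len(fingerprints)
    df = Counter(k for fp in fingerprints for k in fp)
    return {k: math.log(n / d) for k, d in df.items()}


# Two hypothetical molecules with the same features but different counts.
mol_a = {101: 1, 202: 3, 303: 1}
mol_b = {101: 1, 202: 1, 303: 1}

print(tanimoto_binary(mol_a, mol_b))  # 1.0: the binary view cannot tell them apart
print(tanimoto_count(mol_a, mol_b))   # 0.6: the counts preserve the difference

# Features 5 and 13 collide when folded to 8 slots (both map to slot 5).
print(fold({5: 1, 13: 1}, 8))         # {5: 2}
```

In this toy case the binary score reports two distinct molecules as identical, which illustrates the kind of fingerprint duplication the abstract mentions, while the count-based score separates them. The folding example shows how two different features can end up in the same slot of a fixed-size vector.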

Version published to 10.1101/2025.06.16.659994 on bioRxiv
Jun 22, 2025
