The Shape of Chemical Space
Abstract
The concept of chemical space is critical in cheminformatics, medicinal chemistry, and machine learning applications. Despite this, the high dimensionality of molecular representations greatly complicates its sampling, analysis, and visualization. A popular approach to overcome this problem is to project these representations onto a “human-manageable” subspace, usually containing only two dimensions. Non-linear dimensionality reduction techniques are by far the preferred strategy, following the reasoning that their flexibility can accommodate any arbitrary distribution originally present in the high-dimensional space. However, this ignores the elevated computational cost of these methods and the difficulty of tuning their hyper-parameters. Here, we show that basic properties of the metrics used in the original space can be used to infer the shape of the chemical space, which in turn suggests an optimal strategy to project chemical information to lower dimensions. The key insight is that, no matter the set of molecules, their fingerprint representation can be considered to lie on a hyper-spherical surface. The smooth nature of this manifold means that we can use clustering to identify locally dense sectors of chemical space, and selectively project them using simple linear (hyper-parameter-free) methods, like principal component analysis. This approach surpasses non-linear techniques in several neighborhood preservation metrics, while requiring only a fraction of the computational cost. This pipeline is implemented in our N-Ary Mapping Interface (NAMI: https://github.com/mqcomplab/NAMI ), which we tested by visualizing 10 million molecules.
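The cluster-then-project idea described above can be illustrated with a minimal sketch. It is not the NAMI implementation and does not use the NAMI API. It assumes random binary vectors as stand-ins for fingerprints, L2 normalization as one simple way to place them on a unit hyper-sphere, and a plain k-means as the clustering step (the abstract does not say which clustering method the authors use). The function names normalize_rows, kmeans_labels, pca_2d, and local_projections are hypothetical.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row to unit length so the points lie on a unit hyper-sphere."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    return X / norms

def kmeans_labels(X, k, n_iter=50, seed=0):
    """Plain k-means (an assumed stand-in for the clustering step)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels

def pca_2d(X):
    """Project rows of X onto their first two principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    comps = Vt[:2]
    coords = np.zeros((X.shape[0], 2))
    # Very small clusters may yield fewer than two components.
    coords[:, :comps.shape[0]] = Xc @ comps.T
    return coords

def local_projections(fingerprints, k):
    """Cluster the normalized fingerprints, then run PCA inside each cluster."""
    X = normalize_rows(np.asarray(fingerprints, dtype=float))
    labels = kmeans_labels(X, k)
    coords = {int(j): pca_2d(X[labels == j]) for j in np.unique(labels)}
    return labels, coords

# Toy data: 200 random 64-bit "fingerprints" (illustrative only).
rng = np.random.default_rng(42)
fps = (rng.random((200, 64)) < 0.2).astype(float)
labels, coords = local_projections(fps, k=3)
```

Each entry of coords holds the two-dimensional coordinates of one cluster, so every locally dense sector gets its own linear map rather than a single global embedding. The only free choice in this sketch is the number of clusters k; the PCA step itself has no hyper-parameters.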