The Shape of Chemical Space

Krisztina Zsigmond
Akash Surendran
Lexin Chen
Ramón Alain Miranda-Quintana


Abstract

The concept of chemical space is critical in cheminformatics, medicinal chemistry, and machine learning applications. Despite this, the high dimensionality of molecular representations greatly complicates its sampling, analysis, and visualization. A popular approach to overcome this problem is to project these representations to a “human-manageable” subspace, usually containing only two dimensions. Non-linear dimensionality reduction techniques are by far the preferred strategy, following the reasoning that their flexibility can accommodate any arbitrary distribution originally present in the high-dimensional space. However, this ignores the elevated computational cost of these methods and the difficulty in tuning their hyper-parameters. Here, we show that basic properties of the metrics used in the original space can be used to infer the shape of the chemical space, which in turn suggests an optimal strategy to project chemical information to lower dimensions. The key insight is to realize that, no matter the set of molecules, their fingerprint representation can be considered to lie on a hyper-spherical surface. The smooth nature of this manifold means that we can use clustering to identify locally dense sectors of chemical space, and selectively project them simply using linear (hyper-parameter-free) methods, like principal component analysis. This approach surpasses non-linear techniques in several neighborhood preservation metrics, while only requiring a fraction of the computational cost. This pipeline is implemented in our N-Ary Mapping Interface (NAMI: https://github.com/mqcomplab/NAMI), which we tested in the visualization of 10 million molecules.
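
The cluster-then-project idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not NAMI's API or the authors' implementation: the function names (`normalize_rows`, `spherical_kmeans`, `pca_2d`, `project_by_cluster`) are hypothetical, the fingerprints are synthetic random bit vectors, and a toy cosine-based k-means stands in for whatever clustering the authors actually use. The sketch assumes an angular (cosine-type) notion of similarity, under which only the direction of a fingerprint matters, so each one can be rescaled onto the unit hypersphere before clustering and per-cluster PCA.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row to unit L2 norm so it lies on the unit hypersphere."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return X / norms

def spherical_kmeans(U, k, n_iter=20, seed=0):
    """Toy k-means on unit vectors with cosine similarity (illustrative stand-in)."""
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(U @ centers.T, axis=1)  # nearest center by cosine
        for j in range(k):
            total = U[labels == j].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # keep the old center if the cluster is empty
                centers[j] = total / norm
    return np.argmax(U @ centers.T, axis=1)

def pca_2d(X):
    """Project rows of X onto their first two principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def project_by_cluster(X, k):
    """Cluster on the hypersphere, then run a separate linear PCA per cluster."""
    U = normalize_rows(X)
    labels = spherical_kmeans(U, k)
    # Clusters with fewer than two members cannot define two components; skip them.
    return {j: pca_2d(U[labels == j]) for j in range(k) if np.sum(labels == j) >= 2}

# Synthetic 64-bit "fingerprints": two groups with different bit densities.
rng = np.random.default_rng(42)
probs_a = np.r_[np.full(32, 0.8), np.full(32, 0.1)]
probs_b = probs_a[::-1]
X = np.vstack([rng.random((100, 64)) < probs_a,
               rng.random((100, 64)) < probs_b]).astype(float)

coords = project_by_cluster(X, k=2)
for j, xy in coords.items():
    print(f"cluster {j}: {xy.shape[0]} molecules projected to 2D")
```

Each cluster gets its own pair of principal axes, so the resulting maps are local views of dense regions rather than one global embedding; no hyper-parameters beyond the number of clusters are involved in the projection step.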

Version published to 10.1101/2025.09.23.678151 on bioRxiv
Sep 25, 2025

Related articles

Are Energy and Forces Really Enough? Using Structure to Evaluate the Accuracy and Transferability of Machine Learning Potentials of Biomolecules

This article has 3 authors:
1. Lejla S. Biberić
2. Nisarg Joshi
3. Jim Pfaendtner
This article has no evaluations. Latest version: Jan 14, 2026

Representation Learning for Long-Chain Hydrocarbon Adsorption in Zeolites

This article has 7 authors:
1. Yachan Liu
2. Ping Yang
3. Gustavo Perez
4. Aaron Sun
5. Wei Fan
6. Subhransu Maji
7. Peng Bai
This article has no evaluations. Latest version: Jan 30, 2026

Particle Swarms in N-Dimensional Simplex Conformations Quantum Mechanical and Topological Problems

This article has 1 author:
1. Ramon Carbó-Dorca
This article has no evaluations. Latest version: Jan 29, 2026
