Efficient Search of Ultra-Large Synthesis On-Demand Libraries with Chemical Language Models
Abstract
Ultra-large building block catalogs provide inexpensive access to billions of synthesis-on-demand molecules, but the combinatorial scale renders conventional virtual screening impractical. We present Vector Virtual Screen (VVS), a score-function-agnostic machine learning framework for efficient navigation of combinatorial libraries and rapid identification of promising molecules for experimental validation. VVS comprises four key innovations: (i) the Embedding Decomposer, which factors molecules into building blocks in latent space; (ii) ChemRank, a correlation-based loss that improves retrieval precision; (iii) BBKNN, an algorithm for nearest-neighbor search directly in building block space; and (iv) a multi-scale hill-climbing algorithm for gradient-based navigation of molecular embedding vector databases. Across diverse scoring functions, VVS consistently outperforms existing methods in retrieving high-scoring molecules while evaluating only a fraction of the library, achieving orders-of-magnitude runtime improvements. By turning ultra-large libraries into tractable search spaces, VVS enables virtual screening to keep pace with the rapid expansion of chemical space and adapt seamlessly to future advances in scoring functions.
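To make the idea of searching an embedding space while scoring only a fraction of the library concrete, the following is a minimal, generic sketch of greedy hill climbing over nearest neighbors. It is not the VVS algorithm: it works at a single scale, uses brute-force neighbor lookup instead of a vector database, and uses no gradients. The names `k_nearest` and `hill_climb`, the 20 x 20 toy grid of "embeddings", and the toy score function are all hypothetical and chosen only for illustration.

```python
import math

def k_nearest(embeddings, query_idx, k):
    """Indices of the k embeddings closest to embeddings[query_idx] (brute force)."""
    q = embeddings[query_idx]
    order = sorted(
        (i for i in range(len(embeddings)) if i != query_idx),
        key=lambda i: math.dist(q, embeddings[i]),
    )
    return order[:k]

def hill_climb(embeddings, score_fn, start_idx, k=8, max_steps=100):
    """Greedy search: score the neighbors of the current best item and move
    to the best item scored so far, until no neighbor improves on it."""
    scores = {start_idx: score_fn(start_idx)}  # cache of evaluated items
    current = start_idx
    for _ in range(max_steps):
        for i in k_nearest(embeddings, current, k):
            if i not in scores:  # evaluate each library item at most once
                scores[i] = score_fn(i)
        best = max(scores, key=scores.get)
        if best == current:  # local optimum reached
            break
        current = best
    return current, scores

# Toy library (assumption): 400 points on a 2-D grid stand in for embeddings.
embeddings = [(x, y) for x in range(20) for y in range(20)]

# Toy score (assumption): higher is better, with a single peak at (15, 4).
def toy_score(i):
    x, y = embeddings[i]
    return -((x - 15) ** 2 + (y - 4) ** 2)

best_idx, evaluated = hill_climb(embeddings, toy_score, start_idx=0)
print("best item:", embeddings[best_idx])
print("fraction of library scored:", len(evaluated) / len(embeddings))
```

In this toy setting the score is smooth in embedding space, so the search reaches the peak after scoring only part of the grid. Real scoring functions are rarely this well behaved, and a single-scale greedy search of this kind can stall in a local optimum.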