Systematic Generation of Drug-Like Molecules via Biologically Safe Fragment-Based Rules Reveals Chemical Space Saturation Using RDKit and PubChem
Abstract
Finding new drug-like compounds remains a major challenge in cheminformatics and pharmaceutical development. Conventional de novo molecular design often relies on complex generative models, yet it remains unclear whether simple, rule-based techniques can produce chemically novel structures. Here, we demonstrate the systematic generation of structurally diverse, drug-like compounds using fragment-based, biologically safe rules that are encoded in Python and validated with RDKit. By combining a small set of atoms (C, N, and O), basic aliphatic and aromatic ring systems, and frequently occurring functional groups such as amides, esters, and alkyl chains, we built a stochastic SMILES generator that mimics key aspects of drug-likeness while avoiding toxic or unstable chemotypes. Generated structures were canonicalized to remove redundancy, filtered for chemical validity, and then evaluated for novelty against the PubChem database through the PubChemPy interface. Despite the combinatorial diversity of the generated structures, the vast majority matched known compounds in PubChem, revealing a very high degree of overlap between randomly constructed drug-like molecules and existing chemical space. This finding implies that even very simple generative rules can reproduce a large portion of known chemistry, a phenomenon we refer to as chemical space saturation. Our results offer a solid baseline for assessing the true novelty of AI-based molecular generators and highlight the importance of benchmarking such systems against existing chemical repositories, in addition to checking structural validity and drug-likeness. This work also underscores the need for more sophisticated rules or higher-order logic to get beyond the limits of the available public datasets and explore truly novel regions of chemical space.
To encourage openness and reproducibility, all code and datasets are made publicly available.
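The workflow summarized in the abstract (stochastic fragment assembly, RDKit validation and canonicalization, PubChem lookup) can be sketched roughly as follows. This is an illustrative sketch, not the authors' released code: the fragment lists, the assembly rule, and the function names `random_smiles`, `canonicalize`, `in_pubchem`, and `unique_valid` are all hypothetical. It also assumes that a non-matching SMILES query comes back from PubChemPy without a usable CID, which should be checked against the PubChemPy documentation.

```python
import random

# Hypothetical fragment vocabulary (illustrative; not the authors' rule set).
RINGS = ["c1ccccc1", "C1CCCCC1", "C1CCNCC1", "C1CCOCC1"]
LINKERS = ["C(=O)N", "C(=O)O", ""]  # amide, ester, or a direct bond
CHAINS = ["C", "CC", "CCC"]         # short alkyl chains


def random_smiles(rng):
    """Join ring + linker + alkyl chain, optionally capped by a second ring."""
    parts = [rng.choice(RINGS), rng.choice(LINKERS), rng.choice(CHAINS)]
    if rng.random() < 0.5:
        parts.append(rng.choice(RINGS))
    return "".join(parts)


def canonicalize(smiles):
    """Return RDKit's canonical SMILES, or None if the string does not parse."""
    from rdkit import Chem  # imported lazily so the generator runs without RDKit
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def in_pubchem(smiles):
    """True if PubChem returns a record with a CID (requires network access).

    Assumption: a query with no match yields no hit carrying a truthy `cid`.
    """
    import pubchempy as pcp
    hits = pcp.get_compounds(smiles, namespace="smiles")
    return any(hit.cid for hit in hits)


def unique_valid(n, seed=0):
    """Generate n candidates and keep the distinct, parseable canonical forms."""
    rng = random.Random(seed)
    canon = (canonicalize(random_smiles(rng)) for _ in range(n))
    return sorted({s for s in canon if s is not None})
```

With RDKit and PubChemPy installed, `unique_valid(1000)` would give a deduplicated candidate set, and the fraction of its members for which `in_pubchem` returns `True` gives a rough estimate of overlap with known chemical space. Seeding the generator keeps such a comparison reproducible.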