Machine Learning-Based QSAR Screening of Colombian Medicinal Flora for Potential Antiviral Compounds Against Dengue Virus: An In Silico Drug Discovery Approach


Abstract

Background/Objectives: Colombia harbors exceptional plant diversity, comprising over 31,000 formally identified species, of which approximately 6000 are classified as useful plants. Among these, 2567 species possess documented food and medicinal applications, with several traditionally utilized for managing febrile illnesses. Despite the global burden of dengue virus infection affecting millions annually, no specific antiviral therapy has been established. This study aimed to identify potential anti-dengue compounds from Colombian medicinal flora through machine learning-based quantitative structure–activity relationship (QSAR) modeling.

Methods: An optimized XGBoost algorithm was developed through Bayesian hyperparameter optimization (Optuna, 50 trials) and trained on 2034 ChEMBL-derived activity records with experimentally validated anti-dengue activity (IC50/EC50). The model incorporated 887 molecular features comprising 43 physicochemical descriptors and 844 ECFP4 fingerprint bits selected via variance-based filtering. IC50 and EC50 endpoints were modeled independently based on their pharmacological distinction and negligible correlation (r = −0.04, p = 0.77). Through a systematic literature review, 2567 Colombian plant species from the Humboldt Institute’s official checklist were evaluated (2501 after removing duplicates and infraspecific taxa), identifying 358 with documented antiviral properties. Phytochemical analysis of 184 characterized species yielded 3267 unique compounds for virtual screening. A dual-endpoint classification strategy categorized compounds into nine activity classes based on combined potency thresholds (Low: pActivity ≤ 5.0, Medium: 5.0 < pActivity ≤ 6.0, High: pActivity > 6.0).

Results: The optimized model achieved robust performance (Matthews correlation coefficient: 0.583; ROC-AUC: 0.896), validated through hold-out testing (MCC: 0.576) and Y-randomization (p < 0.01). Virtual screening identified 276 compounds (8.4%) with high predicted potency for both endpoints (“High-High”). Structural novelty analysis revealed that all 276 compounds exhibited Tanimoto similarity < 0.5 to the training set (median: 0.214), representing 145 unique Murcko scaffolds of which 144 (99.3%) were absent from the training data. Application of drug-likeness filtering (QED ≥ 0.5) and applicability domain assessment identified 15 priority candidates. In silico ADMET profiling revealed favorable pharmaceutical properties, with Incartine (pIC50: 6.84, pEC50: 6.13, QED: 0.83), Bilobalide (pIC50: 6.78, pEC50: 6.07, QED: 0.56), and Indican (pIC50: 6.73, pEC50: 6.11, QED: 0.51) exhibiting the highest predicted potencies.

Conclusions: This systematic computational screening of Colombian medicinal flora demonstrates the untapped potential of regional biodiversity for anti-dengue drug discovery. The identified candidates, representing structurally novel chemotypes, are prioritized for experimental validation.
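
The dual-endpoint classification and the Tanimoto novelty check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The potency thresholds are taken from the abstract; the function names, the "IC50 level first, EC50 level second" label order, the nM input unit, and the representation of a fingerprint as a set of on-bit indices are all assumptions made for illustration.

```python
import math
from itertools import product

# Thresholds as stated in the abstract (pActivity = -log10 of molar potency).
LOW_MAX = 5.0      # Low:    pActivity <= 5.0
MEDIUM_MAX = 6.0   # Medium: 5.0 < pActivity <= 6.0; High: pActivity > 6.0


def p_activity_from_nm(value_nm: float) -> float:
    """Convert an IC50/EC50 given in nM (assumed unit) to pActivity."""
    if value_nm <= 0:
        raise ValueError("potency must be positive")
    return 9.0 - math.log10(value_nm)  # -log10(value_nm * 1e-9)


def potency_level(p_activity: float) -> str:
    """Map a single pActivity value to Low / Medium / High."""
    if p_activity <= LOW_MAX:
        return "Low"
    if p_activity <= MEDIUM_MAX:
        return "Medium"
    return "High"


def dual_endpoint_class(pic50: float, pec50: float) -> str:
    """Combine both endpoints into one of 3 x 3 = 9 classes.

    Label order (IC50 level, then EC50 level) is an assumption.
    """
    return f"{potency_level(pic50)}-{potency_level(pec50)}"


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# All nine possible class labels.
ALL_CLASSES = {f"{a}-{b}" for a, b in product(("Low", "Medium", "High"), repeat=2)}

# Predicted values reported in the abstract for the top three candidates.
candidates = {
    "Incartine": (6.84, 6.13),
    "Bilobalide": (6.78, 6.07),
    "Indican": (6.73, 6.11),
}
labels = {name: dual_endpoint_class(*vals) for name, vals in candidates.items()}
```

With the predicted values reported above, all three named candidates fall in the "High-High" class, since both predicted potencies exceed 6.0. In practice the fingerprints would come from a cheminformatics toolkit rather than hand-built sets; plain sets are used here only to keep the sketch self-contained.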
