Machine Learning-Based QSAR Screening of Colombian Medicinal Flora for Potential Antiviral Compounds Against Dengue Virus: An In Silico Drug Discovery Approach


Abstract

Background/Objectives: Colombia harbors exceptional plant diversity, comprising over 31,000 formally identified species, of which approximately 6,000 are classified as useful plants. Among these, 2,567 species have documented food and medicinal applications, and several are traditionally used to manage febrile illnesses. Although dengue virus infection affects millions of people annually, no specific antiviral therapy has been established. This study aimed to identify potential anti-dengue compounds from Colombian medicinal flora through machine learning-based quantitative structure-activity relationship (QSAR) modeling. Methods: An XGBoost QSAR model was trained on 2,034 ChEMBL-derived activity records with experimentally validated anti-dengue activity (IC₅₀/EC₅₀), with hyperparameters tuned by Bayesian optimization (Optuna, 50 trials) to maximize model performance. The model incorporated 887 molecular descriptors, comprising 43 physicochemical properties and 844 ECFP4 fingerprint bits, selected via variance-based feature selection. Through systematic literature review, 2,567 Colombian plant species were evaluated, identifying 358 with documented antiviral properties. Phytochemical analysis of 184 species generated 3,267 unique compounds for subsequent virtual screening. Compounds were prioritized for future experimental validation based on predicted activity, drug-likeness, and applicability domain assessment. A dual-endpoint classification strategy was employed to evaluate IC₅₀ and EC₅₀ activities simultaneously: each endpoint was assigned one of three potency levels (Low: pActivity ≤ 5.0; Medium: 5.0 < pActivity ≤ 6.0; High: pActivity > 6.0), and the two levels were combined into nine activity classes.
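The dual-endpoint classification can be illustrated with a minimal Python sketch. The function names, the "IC₅₀ level-EC₅₀ level" label order, and the conversion helper are assumptions made for illustration; the helper uses the usual convention pActivity = −log₁₀(concentration in mol/L), which the abstract does not state explicitly.

```python
import math
from itertools import product

LEVELS = ("Low", "Medium", "High")


def p_activity_from_nm(concentration_nm: float) -> float:
    """Convert an IC50/EC50 in nM to pActivity (assumed: -log10 of molar value)."""
    return 9.0 - math.log10(concentration_nm)


def potency_level(p_activity: float) -> str:
    """Map a pActivity value to the thresholds quoted in the abstract."""
    if p_activity > 6.0:
        return "High"
    if p_activity > 5.0:
        return "Medium"
    return "Low"


def dual_endpoint_class(p_ic50: float, p_ec50: float) -> str:
    """Combine both endpoint levels into one label (label order is an assumption)."""
    return f"{potency_level(p_ic50)}-{potency_level(p_ec50)}"


# Three levels per endpoint give 3 x 3 = 9 possible classes.
ALL_CLASSES = [f"{a}-{b}" for a, b in product(LEVELS, LEVELS)]
```

Under these assumptions, `dual_endpoint_class(6.84, 6.13)` returns `"High-High"`, the class in which a compound exceeds pActivity 6.0 on both endpoints.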
Results: The optimized XGBoost model achieved robust performance, with a Matthews correlation coefficient of 0.583 and an area under the receiver operating characteristic curve of 0.896. Virtual screening of 3,267 Colombian phytochemicals identified 276 compounds (8.4%) with high predicted potency (pActivity > 6) for both IC₅₀ and EC₅₀ endpoints (classified as "High-High"). Comprehensive structure-activity relationship (SAR) analysis revealed that 239 of these compounds (86.6%) represented structurally novel chemotypes with low similarity (Tanimoto < 0.5) to the training dataset. Application of drug-likeness filters (QED ≥ 0.5) identified 20 priority candidates (7.2% of high-potency hits), with 12 compounds showing exceptional profiles. Incartine (pIC₅₀: 6.84, pEC₅₀: 6.13, QED: 0.83), Bilobalide (pIC₅₀: 6.78, pEC₅₀: 6.07, QED: 0.56), and Indican (pIC₅₀: 6.73, pEC₅₀: 6.11, QED: 0.51) exhibited the highest predicted potencies. Descriptor-activity correlation analysis identified QED (ρ = 0.14 with EC₅₀), TPSA (ρ = −0.15), and aromatic rings as key modulators of antiviral activity. Conclusions: This pioneering systematic computational screening of Colombian flora for anti-dengue activity demonstrates the untapped potential of regional biodiversity in pharmaceutical discovery. The identified lead compounds represent prioritized candidates for experimental validation and subsequent development of dengue therapeutics, with all computational resources made publicly available to facilitate future research.
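The novelty and drug-likeness filters reported in the results can be sketched in plain Python. This is an illustrative sketch rather than the authors' pipeline: fingerprints are represented as sets of on-bit indices, "low similarity to the training dataset" is assumed to mean that the maximum Tanimoto coefficient against any training compound is below 0.5, and QED values are taken as precomputed inputs. All function names are hypothetical.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as on-bit index sets."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def is_novel(query: Set[int], training: Iterable[Set[int]], cutoff: float = 0.5) -> bool:
    """Assumed rule: novel if the nearest training compound has similarity < cutoff."""
    nearest = max((tanimoto(query, fp) for fp in training), default=0.0)
    return nearest < cutoff


def is_priority(p_ic50: float, p_ec50: float, qed: float,
                potency_cutoff: float = 6.0, qed_cutoff: float = 0.5) -> bool:
    """High-High on both endpoints (pActivity > 6) and drug-like (QED >= 0.5)."""
    return p_ic50 > potency_cutoff and p_ec50 > potency_cutoff and qed >= qed_cutoff
```

With the values reported for Indican (pIC₅₀ 6.73, pEC₅₀ 6.11, QED 0.51), `is_priority` returns `True`; a compound with the same potencies but a QED below 0.5 would be excluded.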
