PhytoExtractQSAR: An Automated Pipeline for Literature-Mined Modeling of Phytochemical Extraction Outcomes with Transparent Generalization Assessment


Abstract

Background Predicting phytochemical extraction outcomes from molecular descriptors and process parameters remains challenging because the literature data are fragmented and inconsistently reported. No centralized database exists for extraction outcomes, and existing QSAR approaches rely on small, manually curated datasets that limit the assessment of generalizability.

Results We developed PhytoExtractQSAR, an automated cheminformatics pipeline that integrates PubMed literature mining, PDF data extraction, molecular descriptor computation via RDKit, and machine learning with transparent data provenance tracking. The pipeline assembled 1,877 literature-verified records across 94 compounds and 15 extraction methods. After removing 188 flagged duplicate records, the deduplicated dataset was used for all modeling. Under record-level nested five-fold cross-validation, the best models achieved Q² = 0.448 for total flavonoid content (TFC; Extra Trees, N = 991) and Q² = 0.281 for crude extraction yield (Extra Trees, N = 1,857), with only exploratory associations for total phenolic content (TPC; Q² = 0.089) and antioxidant activity IC50 (Q² = 0.145). Crucially, leave-one-compound-out (LOCO) cross-validation produced Q² ≤ 0 for all targets, demonstrating that the current models capture within-compound extraction patterns but do not generalize to unseen compounds. A process-only baseline (Q² = 0.154 for yield) confirmed that molecular descriptors add modest predictive value beyond extraction parameters (ΔQ² ≈ 0.07), functioning as study-context proxies. SHAP analysis confirmed that compound identity (proxied by LogP and molecular weight) dominates predictions, which explains the generalization gap. Leverage-based applicability domain analysis confirmed that more than 93% of samples fall within reliable physicochemical boundaries, yet this physicochemical coverage proved insufficient for cross-compound prediction.
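For readers unfamiliar with the leverage-based applicability domain mentioned in the Results, the conventional formulation can be written as below. This is the standard Williams-plot definition and is an assumption here: the abstract does not state which descriptor matrix or warning threshold the pipeline actually uses.

```latex
h_i = \mathbf{x}_i^{\top}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{x}_i,
\qquad
h^{*} = \frac{3\,(p+1)}{n}
```

Here $\mathbf{X}$ is the $n \times p$ training descriptor matrix and $\mathbf{x}_i$ is the descriptor vector of sample $i$. Under this convention a sample is treated as inside the applicability domain when $h_i \le h^{*}$.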
Conclusions PhytoExtractQSAR establishes an open, reproducible framework for automated extraction data assembly with rigorous provenance tracking. The pipeline successfully predicts extraction outcomes for compounds represented in the training data but does not yet achieve cross-compound generalization. A process-only baseline model (Q² = 0.154 for yield using only extraction parameters) confirmed that molecular descriptors contribute modestly beyond process conditions (full model Q² = 0.281), functioning primarily as study-context proxies rather than encoding causal structure–property relationships. The LOCO analysis provides the first quantitative evidence of this generalization gap in extraction QSAR. Comparing compound-grouped with matrix-grouped cross-validation revealed that the generalization failure is driven by memorization of compound identity rather than by plant matrix variation, identifying plant material parameterization and richer structural representations as primary directions for transferable models.

Scientific contribution: This work provides the first open-source pipeline automating the full workflow from PubMed mining through QSAR modeling of phytochemical extraction, assembling 1,877 provenance-tracked records across 94 compounds. The systematic comparison of record-level, compound-grouped, and matrix-grouped cross-validation, together with process-only baseline experiments, reveals that molecular descriptors function primarily as compound identity proxies and are insufficient for cross-compound generalization. These findings establish an important benchmark, demonstrate that plant material parameterization rather than richer molecular fingerprints is the primary pathway to transferable extraction models, and identify concrete directions, including capture of the solid-to-liquid ratio and temperature-corrected solvent properties.
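The gap between record-level and leave-one-compound-out (LOCO) validation can be illustrated with a minimal, standard-library Python sketch. Everything in it is hypothetical: the synthetic records, the helper names, and the toy "memorizing" model (a per-compound mean) that stands in for the Extra Trees models of the preprint. Q² is computed as one minus the ratio of the out-of-fold squared error to the total sum of squares around the overall mean, which is one common convention and may differ from the one the authors use.

```python
from collections import defaultdict
from statistics import mean


def q2(y_true, y_pred):
    """Cross-validated R^2: 1 - PRESS / total sum of squares (overall mean)."""
    y_bar = mean(y_true)
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - press / ss_tot


def fit_compound_mean_model(train):
    """Toy stand-in model (an assumption, not the preprint's method):
    predict the mean outcome of a seen compound, else the global training mean."""
    by_compound = defaultdict(list)
    for compound, y in train:
        by_compound[compound].append(y)
    global_mean = mean(y for _, y in train)
    means = {c: mean(ys) for c, ys in by_compound.items()}
    return lambda compound: means.get(compound, global_mean)


def cross_validated_q2(records, fold_of):
    """Collect out-of-fold predictions; fold_of(index, record) returns a fold id."""
    folds = defaultdict(list)
    for i, rec in enumerate(records):
        folds[fold_of(i, rec)].append(i)
    preds = [0.0] * len(records)
    for test_idx in folds.values():
        held_out = set(test_idx)
        train = [r for i, r in enumerate(records) if i not in held_out]
        model = fit_compound_mean_model(train)
        for i in test_idx:
            preds[i] = model(records[i][0])
    return q2([y for _, y in records], preds)


# Synthetic (compound, outcome) records: four compounds, five records each.
records = [
    (compound, base + delta)
    for compound, base in [("A", 10.0), ("B", 20.0), ("C", 30.0), ("D", 40.0)]
    for delta in (-1.0, -0.5, 0.0, 0.5, 1.0)
]

# Record-level five-fold CV: every compound appears in both train and test.
q2_record = cross_validated_q2(records, lambda i, rec: i % 5)

# LOCO CV: each fold holds out all records of one compound.
q2_loco = cross_validated_q2(records, lambda i, rec: rec[0])

print(f"record-level Q2 = {q2_record:.3f}, LOCO Q2 = {q2_loco:.3f}")
```

On this synthetic data the record-level Q² is close to 1 while the LOCO Q² is negative, which mirrors the qualitative pattern reported above: a model that keys on compound identity scores well only when the held-out compound has already been seen.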
