PhytoExtractQSAR: An Automated Pipeline for Literature-Mined Modeling of Phytochemical Extraction Outcomes with Transparent Generalization Assessment

Sharhabil Eltahir

Abstract

Background: Predicting phytochemical extraction outcomes from molecular descriptors and process parameters remains challenging due to fragmented literature data and inconsistent reporting. No centralized database exists for extraction outcomes, and existing QSAR approaches rely on small, manually curated datasets that limit generalizability assessment.

Results: We developed PhytoExtractQSAR, an automated cheminformatics pipeline that integrates PubMed literature mining, PDF data extraction, molecular descriptor computation via RDKit, and machine learning with transparent data provenance tracking. The pipeline assembled 1,877 literature-verified records across 94 compounds and 15 extraction methods. After removing 188 flagged duplicate records, the deduplicated dataset was used for all modeling. Under record-level nested five-fold cross-validation, the best models achieved Q² = 0.448 for total flavonoid content (TFC; Extra Trees, N = 991), Q² = 0.281 for crude extraction yield (Extra Trees, N = 1,857), and exploratory associations for total phenolic content (TPC; Q² = 0.089) and antioxidant activity IC50 (Q² = 0.145). Crucially, leave-one-compound-out (LOCO) cross-validation produced Q² ≤ 0 for all targets, demonstrating that the current models capture within-compound extraction patterns but do not generalize to unseen compounds. A process-only baseline (Q² = 0.154 for yield) confirmed that molecular descriptors add modest predictive value beyond extraction parameters (ΔQ² ≈ 0.07), functioning as study-context proxies. SHAP analysis confirmed that compound identity, proxied by LogP and molecular weight, dominates predictions, explaining the generalization gap. Leverage-based applicability domain analysis confirmed > 93% of samples fall within reliable physicochemical boundaries, yet this physicochemical coverage proved insufficient for cross-compound prediction.

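
The gap between record-level and leave-one-compound-out (LOCO) validation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy data are invented, `memorizing_predict` is a hypothetical stand-in that simply recalls per-compound training means (it is not the Extra Trees model used in the study), leave-one-record-out replaces the nested five-fold scheme for brevity, and Q² is taken as 1 − PRESS/SS about the overall mean, which is one common convention and may differ from the definition used in the article.

```python
from collections import defaultdict
from statistics import mean


def q2(y_true, y_pred):
    """Q^2 = 1 - PRESS / SS_total, computed on held-out predictions."""
    y_bar = mean(y_true)
    press = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_total = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - press / ss_total


def leave_one_out(groups):
    """Yield (train_idx, test_idx), holding out all records sharing one group value."""
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    for test_idx in members.values():
        held_out = set(test_idx)
        yield [i for i in range(len(groups)) if i not in held_out], test_idx


def memorizing_predict(compounds, y, train_idx, test_idx):
    """Hypothetical model: recall the training mean of the same compound,
    falling back to the global training mean for an unseen compound."""
    seen = defaultdict(list)
    for i in train_idx:
        seen[compounds[i]].append(y[i])
    fallback = mean(y[i] for i in train_idx)
    return [mean(seen[compounds[i]]) if compounds[i] in seen else fallback
            for i in test_idx]


def cross_validated_q2(compounds, y, split_key):
    """Collect held-out predictions over all splits, then score them with q2."""
    y_pred = [None] * len(y)
    for train_idx, test_idx in leave_one_out(split_key):
        preds = memorizing_predict(compounds, y, train_idx, test_idx)
        for i, p in zip(test_idx, preds):
            y_pred[i] = p
    return q2(y, y_pred)


# Invented toy data: three compounds, three records each.
compounds = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
yields = [10, 11, 12, 20, 21, 22, 30, 31, 32]

record_ids = list(range(len(yields)))  # each record is its own group
q2_record = cross_validated_q2(compounds, yields, record_ids)  # record-level
q2_loco = cross_validated_q2(compounds, yields, compounds)     # LOCO

print(f"record-level Q2 = {q2_record:.3f}, LOCO Q2 = {q2_loco:.3f}")
```

On this toy data the record-level score is close to 1 while the LOCO score is negative, because a model that only recalls compound identity has nothing to recall once the whole compound is held out.
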
Conclusions: PhytoExtractQSAR establishes an open, reproducible framework for automated extraction data assembly with rigorous provenance tracking. The pipeline successfully predicts extraction outcomes for compounds represented in the training data but does not yet achieve cross-compound generalization. A process-only baseline model (Q² = 0.154 for yield using only extraction parameters) confirmed that molecular descriptors contribute modestly beyond process conditions (full model Q² = 0.281), functioning primarily as study-context proxies rather than encoding causal structure–property relationships. The LOCO analysis provides the first quantitative evidence of this generalization gap in extraction QSAR. Compound-grouped versus matrix-grouped cross-validation revealed that generalization failure is driven by compound identity memorization rather than plant matrix variation, identifying plant material parameterization and richer structural representations as primary directions for transferable models.

Scientific contribution: This work provides the first open-source pipeline automating the full workflow from PubMed mining through QSAR modeling of phytochemical extraction, assembling 1,877 provenance-tracked records across 94 compounds. The systematic comparison of record-level, compound-grouped, and matrix-grouped cross-validation, together with process-only baseline experiments, reveals that molecular descriptors function primarily as compound identity proxies and are insufficient for cross-compound generalization. These findings establish an important benchmark, demonstrate that plant material parameterization rather than richer molecular fingerprints is the primary pathway to transferable extraction models, and identify concrete directions including solid-to-liquid ratio capture and temperature-corrected solvent properties.
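
The contrast between compound-grouped and matrix-grouped cross-validation comes down to which column is forbidden from straddling a train/test split. The following is a minimal sketch under stated assumptions: `grouped_folds` and `shared_groups` are hypothetical helpers written for this illustration, the greedy fold assignment is one simple choice among many, and the six records are invented placeholders rather than entries from the dataset.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Assign each record a fold index so that all records sharing a group
    value land in the same fold (largest groups first, each placed into the
    currently smallest fold)."""
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    fold_of = [None] * len(groups)
    fold_sizes = [0] * n_folds
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        target = fold_sizes.index(min(fold_sizes))
        for i in members[g]:
            fold_of[i] = target
        fold_sizes[target] += len(members[g])
    return fold_of


def shared_groups(groups, fold_of, fold):
    """Return the group values present on both sides of the split for `fold`."""
    test = {g for g, f in zip(groups, fold_of) if f == fold}
    train = {g for g, f in zip(groups, fold_of) if f != fold}
    return test & train


# Invented placeholder records: (compound, plant matrix).
records = [
    ("compound_A", "matrix_X"),
    ("compound_A", "matrix_Y"),
    ("compound_B", "matrix_Z"),
    ("compound_B", "matrix_X"),
    ("compound_C", "matrix_Y"),
    ("compound_C", "matrix_Z"),
]
compounds = [c for c, _ in records]
matrices = [m for _, m in records]

by_compound = grouped_folds(compounds, n_folds=3)  # compound-grouped folds
by_matrix = grouped_folds(matrices, n_folds=3)     # matrix-grouped folds

# Compound-grouped folds keep compounds apart but still share plant matrices,
# and matrix-grouped folds do the reverse.
print(shared_groups(matrices, by_compound, 0))
print(shared_groups(compounds, by_matrix, 0))
```

Scoring the same model under both fold assignments then separates the two failure modes: a drop under compound grouping alone points to compound identity, while a drop under matrix grouping alone points to plant material.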

Version published to 10.21203/rs.3.rs-8973111/v1 on Research Square
Mar 9, 2026

