Methods for Continuous-Valued Training Data Generation from Genome-Scale Metabolic Models: Partial-Inhibition FBA with Mixed Essentiality Sampling, Applied to ESKAPE Drug Target Curation
Abstract
Background. Computational antimicrobial target discovery faces three methodological limitations: (i) knockout-only flux balance analysis (FBA) yields binary phenotypes unsuitable for regression training, (ii) no experimentally labeled toxicity datasets exist at the gene-target level, and (iii) pipelines rarely report negative validation results.

Methods. We describe a pipeline that addresses each limitation. We introduce partial gene inhibition simulation (10-100% flux reduction) applied to a mixed gene set of 39 essential and 30 non-essential genes drawn from the 1,516 genes of iML1515, generating 945 continuous-valued FBA simulations. These serve as regression training targets, not as independent drug response predictions. We describe two ANN architectures: a subsystem-structured ANN (61.5% fewer parameters than fully connected baselines) and a dual-head ANN for joint potency-toxicity regression. We propose an exploratory toxicity labeling heuristic (sequence homology 35%, pathway overlap 30%, conservation 20%, cross-reactivity 15%); the weights are an initial proposal pending experimental calibration. These components are integrated with a Neo4j knowledge graph, local LLM literature mining (46% effective precision), and AlphaFold structural analysis.

Results. Applied to three ESKAPE pathogen models (iML1515, iYS1720, iYL1228), the pipeline curates 29 targets lacking approved therapeutics from 39 literature-validated essential genes. A sequence-based audit reveals that 11 of 21 targets lack detectable human homologs; folA shows 30% identity to human DHFR2, consistent with known trimethoprim cross-reactivity. In prospective-style temporal validation (2020 cutoff), the composite scoring heuristic did not exceed a random baseline (F1 = 0.519, z = -0.99), which establishes the pipeline as a hypothesis generation tool rather than a predictive model. Double knockouts of essential gene pairs produced indistinguishable lethal phenotypes, indicating that partial-inhibition grids are required for meaningful combination scoring.

Conclusions. The methods (partial-inhibition FBA, two ANN architectures, multi-evidence toxicity labeling, and four-way integration) are individually reusable. The complete pipeline (10-tab dashboard, 40 tests, all code) is released under the MIT license at https://github.com/shoo99/ai-drug-target.
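The toxicity labeling heuristic named in the Methods can be illustrated with a minimal sketch. It assumes that each of the four evidence components has already been normalized to [0, 1] and that the components are combined as a plain weighted sum; the abstract states only the weights, so the function name `toxicity_score`, the dictionary keys, and the example input values are hypothetical and are not taken from the released code.

```python
# Weights as stated in the abstract (an initial proposal pending calibration).
WEIGHTS = {
    "sequence_homology": 0.35,
    "pathway_overlap": 0.30,
    "conservation": 0.20,
    "cross_reactivity": 0.15,
}


def toxicity_score(evidence: dict) -> float:
    """Combine four evidence components into one exploratory toxicity label.

    Assumption: every component is pre-normalized to [0, 1], and the
    combination rule is a simple weighted sum (not confirmed by the source).
    """
    missing = WEIGHTS.keys() - evidence.keys()
    if missing:
        raise ValueError(f"missing evidence components: {sorted(missing)}")
    for name in WEIGHTS:
        value = evidence[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return sum(WEIGHTS[name] * evidence[name] for name in WEIGHTS)


# Hypothetical, illustrative inputs for a single candidate target.
example = {
    "sequence_homology": 0.30,
    "pathway_overlap": 0.50,
    "conservation": 0.20,
    "cross_reactivity": 0.00,
}
score = toxicity_score(example)
print(f"exploratory toxicity score: {score:.3f}")
```

Because the weights sum to 1, the score stays in [0, 1] whenever the inputs do, so any recalibrated weights can be swapped into `WEIGHTS` without changing the scale of the label.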