Chemoinformatics-guided discovery of food-grade anionic stabilizers for phycocyanin under acidic conditions
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Phycocyanin (PC) is the principal natural blue pigment used in functional beverages, but it rapidly loses color and aggregates under acidic conditions (pH ≈ 3). Experimental screening of stabilizers is costly and combinatorially intractable. Here we develop a chemoinformatics framework — descriptor-based QSPR, a chemistry-prior heuristic, and virtual screening — that learns from three rounds of commissioned screening (48 compounds, 6% hit rate) to predict stabilizer efficacy directly from molecular structure. In this genuinely small-data regime (3 positives), a LightGBM classifier built from 10 RDKit descriptors and 11 domain-expert charge/polymer features attained a leave-one-out AUC of 0.73, only marginally above a single-feature charge-density baseline (AUC 0.67); LOO sensitivity was 1/3 at threshold 0.5. A complementary chemistry-prior heuristic encoding anion-type priors substantially outperformed both, reaching AUC 0.95, indicating that explicit chemical knowledge captures information that descriptor-based ML cannot readily recover at this dataset size. SHAP analysis of the QSPR identified effective negative-charge density per unit, log molecular weight, polyphosphate identity, and functional-group density as the dominant features (jointly ≈97% of mean |SHAP|), recovering the electrostatic-complexation mechanism without it being supplied as a prior. Virtual screening of 30 generally recognized as safe (GRAS) food additives nominated the pyrophosphate family — led by sodium pyrophosphate decahydrate and sodium hexametaphosphate (SHMP), both at P ≈ 0.99 — and the heuristic additionally flagged sodium phytate (IP□), which the descriptor model under-ranked at P = 0.045. Experimental validation at pH 3 and 46 °C for 7 days confirmed SHMP 2:1 (78.1 ± 11.3% color retention), TSPP 2:1 (54.1 ± 10.6%) and IP□ 1:1 (52.7 ± 9.0%), while a ternary IP□ + STPP combination reached 83.8 ± 11.9%, surpassing all single-component formulations. ζ-Potential measurements indicated a predominantly electrostatic origin for the protection (Pearson r = −0.82 between ζ and CR□ □ □; n = 24; p = 1 × 10□ □). The framework, dataset and code are released to accelerate stabilizer discovery for other acid-sensitive food colorants and to provide a candid small-data benchmark.