A GWAS–machine learning framework reveals protein-synthesis pathway signals for yield in Theobroma cacao after population-structure correction
Abstract
Improving yield is a primary goal of post-domestication cacao improvement, but progress is often hindered by the confounding effects of population structure. To address this, we analyzed 346 diverse cacao accessions using a machine learning (ML)-based association mapping framework, run with and without population-structure adjustment, alongside a phenotype-only ML model for yield prediction. After correcting for population structure, our Bootstrap Forest-based GWAS revealed association signals that were consistently enriched for ribosome and protein-synthesis functions, and a recurrent subset of high-importance SNPs appeared across multiple yield components, including pod index and seed number. In parallel, a neural network model identified cotyledon mass and length as the strongest predictors of total wet bean mass (R² = 0.715 by repeated five-fold cross-validation), suggesting a practical, low-cost screening proxy for breeding. Collectively, this study delivers a robust genetic framework and a novel predictive tool to accelerate the development of high-yielding cacao varieties through the early identification of elite clones.
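For readers unfamiliar with the validation scheme, the sketch below illustrates how an R² from repeated five-fold cross-validation can be computed. It is not the authors' pipeline: a one-predictor least-squares line stands in for the neural network, the number of repeats (10) and the pooling of out-of-fold predictions within each repeat are assumptions, and the function names and the data in the usage example are hypothetical and synthetic.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a stand-in for the neural network)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def repeated_kfold_r2(xs, ys, k=5, repeats=10, seed=0):
    """Mean out-of-fold R^2 over `repeats` random k-fold partitions.

    Assumption: within each repeat, predictions from all k held-out folds
    are pooled and scored once; per-fold scoring is an equally common choice.
    """
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        # Split the shuffled indices into k disjoint folds.
        folds = [idx[i::k] for i in range(k)]
        preds = [0.0] * n
        for held_out in folds:
            held = set(held_out)
            train = [i for i in idx if i not in held]
            # Fit on the other k-1 folds, predict only the held-out fold.
            a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
            for i in held_out:
                preds[i] = a + b * xs[i]
        scores.append(r_squared(ys, preds))
    return mean(scores)


if __name__ == "__main__":
    # Synthetic illustration only; these are not the study's measurements.
    rng = random.Random(1)
    cotyledon_mass = [rng.uniform(0.5, 2.0) for _ in range(100)]
    wet_bean_mass = [3.0 * m + rng.gauss(0, 0.5) for m in cotyledon_mass]
    print(round(repeated_kfold_r2(cotyledon_mass, wet_bean_mass), 3))
```

Because every prediction comes from a model that never saw the accession being predicted, the resulting R² estimates out-of-sample performance rather than goodness of fit, and repeating the partitioning reduces the dependence on any single random split.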