A GWAS–machine learning framework reveals protein-synthesis pathway signals for yield in Theobroma cacao after population-structure correction

Abstract

Improving yield is a key objective in post-domestication crop improvement and remains a primary goal for cacao breeders, but progress is often hindered by the confounding effects of population structure. To address this, we analyzed 346 diverse cacao accessions using an ML-based association mapping framework (run with and without population-structure adjustment) alongside a phenotype-only ML prediction of yield. After correcting for population structure, our Bootstrap Forest-based GWAS revealed association signals that were consistently enriched for ribosome and protein-synthesis functions, and a recurrent subset of high-importance SNPs appeared across multiple yield components, including pod index and seed number. In parallel, a Neural Network model identified cotyledon mass and length as the strongest predictors of total wet bean mass (R² = 0.715 by repeated five-fold cross-validation), suggesting a practical, low-cost screening proxy for breeding. Collectively, this study delivers a robust genetic framework and a novel predictive tool to accelerate the development of high-yielding cacao varieties through the early identification of elite clones.
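For readers unfamiliar with the evaluation scheme, the following is a minimal sketch of how an R² value can be estimated by repeated five-fold cross-validation. It is not the authors' pipeline: the study used a Neural Network on measured traits, whereas this sketch swaps in a one-predictor least-squares line and purely synthetic data so that it runs with the Python standard library alone. The variable names (`cotyledon_mass`, `wet_bean_mass`), the simulated effect size, and the number of repeats are all assumptions made for illustration.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out observations."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def repeated_kfold_r2(xs, ys, k=5, repeats=10, seed=0):
    """Mean held-out R² over `repeats` independent shuffles of k-fold CV."""
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k disjoint test folds
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            a, b = fit_line([xs[i] for i in train_idx],
                            [ys[i] for i in train_idx])
            preds = [a + b * xs[i] for i in test_idx]
            scores.append(r_squared([ys[i] for i in test_idx], preds))
    return mean(scores), scores


# Synthetic stand-in data (assumption): 346 rows to mirror the number of
# accessions, one hypothetical predictor, and a noisy linear response.
data_rng = random.Random(42)
cotyledon_mass = [data_rng.gauss(1.0, 0.2) for _ in range(346)]
wet_bean_mass = [30.0 * x + data_rng.gauss(0.0, 3.0) for x in cotyledon_mass]

mean_r2, fold_scores = repeated_kfold_r2(cotyledon_mass, wet_bean_mass)
print(f"mean held-out R² over {len(fold_scores)} folds: {mean_r2:.3f}")
```

Averaging over several reshuffled partitions reduces the dependence of the estimate on any single random split, which is the usual motivation for repeating k-fold cross-validation rather than running it once.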
