Predicting the natural yeast phenotypic landscape with machine learning


Abstract

Most organisms’ traits result from the complex interplay of many genetic and environmental factors, making their prediction from genotypes difficult. Here, we used machine learning models to explore genotype-phenotype connections for 223 life history traits measured across 1011 genome-sequenced Saccharomyces cerevisiae strains. First, we used genome-wide association studies to connect genetic variants with the phenotypes. Next, we benchmarked an automated machine learning pipeline that includes preprocessing, feature selection, and hyperparameter optimization in combination with multiple linear and complex machine learning methods. Gradient boosting machines performed best in 65% of predictions, and the pangenome was the best predictor, suggesting a considerable contribution of the accessory genome in controlling phenotypes. Accuracy varied broadly among the phenotypes (r = 0.2-0.9), consistent with varying levels of complexity, with stress resistance being easier to predict than growth across carbon and nitrogen nutrients. While no specific genomic features could be linked to the predictions for most phenotypes, machine learning identified high-impact variants with established relationships to phenotypes despite their being rare in the population. Near-perfect accuracies (r > 0.95) were achieved when other phenomics data were used to aid predictions, suggesting that useful information is shared across phenotypes. Overall, our study underscores the power of machine learning to interpret the functional outcome of genetic variants.
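
To make the workflow in the abstract concrete, here is a minimal sketch of the general pattern: select features on training strains, fit a model, and score held-out strains by the correlation between predicted and measured trait values. It is not the authors' pipeline. The data are synthetic, the gene indices and effect sizes are invented for illustration, the simple additive marginal-effect model stands in for the gradient boosting machines named in the abstract, and reading r as a Pearson correlation is an assumption the abstract does not state explicitly.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def select_features(X, y, k):
    """Keep the k columns with the largest |r| against the phenotype."""
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        if len(set(col)) < 2:  # a constant column carries no signal
            continue
        scores.append((abs(pearson_r(col, y)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def fit_marginal_effects(X, y, features):
    """Per-gene effect: mean phenotype with the gene minus mean without it."""
    effects = {}
    for j in features:
        present = [yi for row, yi in zip(X, y) if row[j] == 1]
        absent = [yi for row, yi in zip(X, y) if row[j] == 0]
        effects[j] = sum(present) / len(present) - sum(absent) / len(absent)
    return effects


def predict(X, effects):
    """Additive score; the missing intercept does not affect Pearson r."""
    return [sum(e * row[j] for j, e in effects.items()) for row in X]


# Synthetic stand-in for a gene presence/absence matrix (hypothetical data).
rng = random.Random(0)
n_strains, n_genes = 200, 50
X = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(n_strains)]
causal = {3: 2.0, 17: -1.5, 42: 1.0}  # invented gene indices and effects
y = [sum(w * row[j] for j, w in causal.items()) + rng.gauss(0, 0.5) for row in X]

# Hold out strains so that feature selection never sees the test phenotypes.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

selected = select_features(X_train, y_train, k=5)
effects = fit_marginal_effects(X_train, y_train, selected)
r_test = pearson_r(predict(X_test, effects), y_test)
print("selected genes:", sorted(selected))
print("held-out Pearson r:", round(r_test, 3))
```

Feature selection runs on the training strains only. Selecting features on the full dataset before splitting would leak information from the held-out phenotypes and inflate r.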
