Inferring genotype-phenotype maps using attention models

Abstract

Predicting phenotype from genotype is a central challenge in genetics. Traditional approaches in quantitative genetics typically analyze this problem using methods based on linear regression. These methods generally assume that the genetic architecture of complex traits can be parameterized in terms of an additive model, where the effects of loci are independent, plus (in some cases) pair-wise epistatic interactions between loci. However, these models struggle to analyze more complex patterns of epistasis or subtle gene-environment interactions. Recent advances in machine learning, particularly attention-based models, offer a promising alternative. Initially developed for natural language processing, attention-based models excel at capturing context-dependent interactions and have shown exceptional performance in predicting protein structure and function. Here, we apply attention-based models to quantitative genetics. We analyze the performance of this attention-based approach in predicting phenotype from genotype using simulated data across a range of models with increasing epistatic complexity, and using experimental data from a recent quantitative trait locus mapping study in budding yeast. We find that our model demonstrates superior out-of-sample predictions in epistatic regimes compared to standard methods. We also explore a more general multi-environment attention-based model to jointly analyze genotype-phenotype maps across multiple environments and show that such architectures can be used for “transfer learning” – predicting phenotypes in novel environments with limited training data.

Article activity feed

  1. ATTENTION-BASED ARCHITECTURE FOR G-P MAPPING

    The model is a stack of attention layers, but I was surprised to see it omit all the typical components that brought attention into the limelight via transformers: multi-head attention, residual connections, layer norm, and position-wise FFNs. These have become standard and widely adopted, largely for good reason, as they have been shown to be very effective across many distinct domains.

    Was there a particular reason this specific custom architecture was preferred over implementing or at least comparing to a standard transformer encoder?
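
    For reference, a minimal sketch of the standard encoder block this comment has in mind, assuming a PyTorch implementation with hypothetical hyperparameters (d_model, n_heads, d_ff); this is not the authors' architecture:

    ```python
    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        """Standard transformer encoder block: multi-head self-attention
        + residual + LayerNorm, then a position-wise FFN + residual + LayerNorm."""
        def __init__(self, d_model=64, n_heads=4, d_ff=256, dropout=0.1):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads,
                                              dropout=dropout, batch_first=True)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.drop = nn.Dropout(dropout)

        def forward(self, z):                           # z: (batch, L, d_model)
            a, _ = self.attn(z, z, z)                   # multi-head self-attention
            z = self.norm1(z + self.drop(a))            # residual + layer norm
            z = self.norm2(z + self.drop(self.ffn(z)))  # FFN + residual + layer norm
            return z
    ```

    (PyTorch's built-in nn.TransformerEncoderLayer bundles these same components into a single module.)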

  2. Genotype vectors are converted to one-hot embeddings X^(g) and transformed into d-dimensional embeddings Z^(g)

    Constructing X^(g) is an extremely expensive way to associate an embedding with each locus. You should simply use a lookup table (i.e. nn.Embedding).
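
    A minimal sketch of the suggested change, assuming X^(g) one-hot encodes (locus, allele) pairs and genotypes arrive as integer allele codes per locus (variable names here are hypothetical):

    ```python
    import torch
    import torch.nn as nn

    L, d, n_alleles = 1164, 32, 2            # loci, embedding dim, allele states

    # One-hot route: build X^(g) explicitly and multiply by a weight matrix,
    # paying memory and compute proportional to the one-hot width.
    # Lookup route: nn.Embedding indexes the same weight rows directly.
    embed = nn.Embedding(L * n_alleles, d)    # one row per (locus, allele) pair

    g = torch.randint(0, n_alleles, (8, L))   # batch of 8 integer-coded genotypes
    locus_offset = torch.arange(L) * n_alleles  # distinct rows for each locus
    Z = embed(g + locus_offset)               # (8, L, d), no one-hot tensor built
    ```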

  3. Such stacking of attention layers is commonly used in large language models, including those used to model proteins. We use three layers because they collectively capture both pairwise and higher-order interactions, and empirical tests showed that adding more layers did not improve performance.

    Did you consider the use of (potentially gated) residual skip connections (Savarese & Figueiredo 2017)? This (or a related approach) would likely improve the expressivity of these attention layers and prevent oversmoothing by preserving a more persistent signal from first- and second-order epistatic interactions, potentially enabling the use of additional layers (if necessary).
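
    One possible form of such gating is sketched below, with a learnable scalar gate around each attention layer; this is a generic gated residual for illustration, not necessarily the exact formulation of Savarese & Figueiredo 2017:

    ```python
    import torch
    import torch.nn as nn

    class GatedResidualAttention(nn.Module):
        """Mixes an attention layer's output with its input through a learnable
        gate, so lower-order signal can pass through deeper stacks instead of
        being smoothed away."""
        def __init__(self, d_model=64, n_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.gate = nn.Parameter(torch.tensor(-2.0))  # biased toward identity at init
            self.norm = nn.LayerNorm(d_model)

        def forward(self, z):                           # z: (batch, L, d_model)
            a, _ = self.attn(z, z, z)
            g = torch.sigmoid(self.gate)                # g in (0, 1)
            return self.norm(g * a + (1.0 - g) * z)     # gated skip connection
    ```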

  4. As expected, the performance of the attention-based model, as characterized by R² on the test dataset, is much better than that of the linear model (see Fig. 3)

    It would have been interesting to see how a simpler approach, say a vanilla MLP, would stack up here, to really sell the advantage of attention over other deep learning approaches.
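
    A vanilla MLP baseline of the kind suggested here could be as simple as the following sketch (hypothetical layer sizes; input is the flattened dosage- or one-hot-coded genotype):

    ```python
    import torch.nn as nn

    def mlp_baseline(n_features, hidden=(512, 256, 64)):
        """Plain feed-forward baseline mapping genotype features to phenotype."""
        layers, d_in = [], n_features
        for d_out in hidden:
            layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(0.1)]
            d_in = d_out
        layers.append(nn.Linear(d_in, 1))     # scalar phenotype prediction
        return nn.Sequential(*layers)

    model = mlp_baseline(n_features=1164)     # e.g. L = 1,164 dosage-coded loci
    ```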

  5. Predicting phenotype from genotype is a central challenge in genetics. Traditional approaches in quantitative genetics typically analyze this problem using methods based on linear regression.

    I greatly enjoyed reading this paper. The rigorous and rational approach to testing model performance on simulated data, reasonable model architecture, and smart dataset choice are a much-needed advance beyond haphazardly applying deep learning networks to G-P datasets with minimal performance gain.

  6. With this in mind, we subsample the loci (effectively combining highly correlated loci) to create a representative set of L = 1,164

    Did you experiment with how LD-based pruning affects model performance? For linear genomic prediction models, the relationship between marker number and predictive performance is well characterized: as long as the LD structure is captured well, marker number is not very critical. However, this has not been characterized well for deep learning models in this context. Epistatic interactions in particular depend on products of LD between markers and causal QTLs, which could cause performance degradation if the causal QTLs are not well tagged.
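
    For concreteness, a minimal sketch of the kind of greedy r²-based pruning this comment refers to, assuming a dosage-coded genotype matrix G of shape (individuals, markers); this is an illustration, not the authors' subsampling procedure:

    ```python
    import numpy as np

    def greedy_ld_prune(G, r2_max=0.95):
        """Keep a marker only if its squared correlation with every
        previously kept marker is below r2_max."""
        n, m = G.shape
        Gs = (G - G.mean(0)) / (G.std(0) + 1e-12)   # standardize columns
        kept = []
        for j in range(m):
            if all(np.mean(Gs[:, j] * Gs[:, k]) ** 2 < r2_max for k in kept):
                kept.append(j)
        return kept

    # Example: prune simulated biallelic genotypes, then fit on G[:, kept]
    G = np.random.binomial(1, 0.5, size=(500, 2000)).astype(float)
    kept = greedy_ld_prune(G, r2_max=0.95)
    ```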

  7. higher-order epistatic interactions

    I was curious why you chose to simulate fourth-order epistatic interactions. Statistically, one expects higher-order epistatic interactions to contribute progressively less to genetic variance, so most studies tend to focus on pairwise epistasis.
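
    To make the order-four case concrete, a simulated phenotype with fourth-order epistatic terms might look like the sketch below (hypothetical illustration; not the authors' simulation code):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, L, n_terms = 1000, 100, 50

    G = rng.choice([-1.0, 1.0], size=(n, L))     # centered biallelic genotypes

    # Each term multiplies the states of four distinct loci, so the phenotype
    # depends on the joint configuration of each quadruple, not on any locus alone.
    quads = np.array([rng.choice(L, size=4, replace=False) for _ in range(n_terms)])
    coefs = rng.normal(0, 1, size=n_terms)

    y = sum(c * G[:, q].prod(axis=1) for c, q in zip(coefs, quads))
    y += rng.normal(0, 0.5, size=n)              # environmental noise
    ```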