Biophysics-based protein language models for protein engineering

Abstract

Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose mutational effect transfer learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure and energetics. We fine-tune METL on experimental sequence–function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL’s ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.

Article activity feed

  1. We agree it’s worth exploring how to combine evolutionary and biophysical signals, as they contribute complementary information. We’ve started experimenting with ways to combine ESM2 and METL. Our initial experiments fine-tuning ESM2, up to the 150M parameter model, on the METL-Global simulation dataset did not substantially alleviate the overfitting problem we observed (Figure S1). We believe combining these two types of information will require many changes to our simulation and training process, so we are exploring that as future work.
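
    As a rough illustration of what that kind of experiment looks like, below is a minimal sketch of fine-tuning an ESM2 checkpoint as a regressor on simulated scores using the Hugging Face transformers API. The sequences and score values are toy placeholders rather than METL-Global data, and this is not the exact training setup from our experiments.

    ```python
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # 150M-parameter ESM2 checkpoint; smaller checkpoints follow the same pattern
    MODEL_NAME = "facebook/esm2_t30_150M_UR50D"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    # num_labels=1 with problem_type="regression" gives a single-output head trained with MSE
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=1, problem_type="regression"
    )

    # Toy variant sequences and placeholder simulated scores (not real data)
    sequences = ["MSKGEELFTGVVPILVELDGDVNGHK", "MSKGEELFTAVVPILVELDGDVNGHK"]
    scores = torch.tensor([[-1.2], [0.4]])

    batch = tokenizer(sequences, return_tensors="pt", padding=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    model.train()
    outputs = model(**batch, labels=scores)  # loss is MSE against the simulated scores
    outputs.loss.backward()
    optimizer.step()
    ```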

  2. We added new files to https://github.com/gitter-lab/metl-pub/tree/main/gfp/output to guide this discussion.

    Figure S20 of the low-N GFP training data gives some insight into these patterns, which we examined more closely here (a short script for reproducing the mutation tallies below is sketched at the end of this reply). Positions 175 and 26 contain mutations that score higher than wild type (red in Figure S20; see also https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/mutations_of_importance_in_designs.png). These mutations correspond to variants E140V,Q175L and S26R,I126T,I169T, the first and second highest-scoring variants in the training set, respectively (https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/df_dms_train_gfp_experiment.tsv). That is most likely why S26R shows up in 8/10 observed designs and Q175L shows up in 10/10 observed designs. However, the other mutations in these variants, E140V (1/10 designs), I126T (1/10 designs), and I169T (0/10 designs), show up far less often in the observed designs. These positions were also present in variants that scored poorly in the training set. Strangely, I126V was in the second-to-worst variant (a variant with nine mutations) but showed up in 4/10 designs. The reason for this needs further investigation but could be due to the Rosetta pretraining.

    The fine-tuned model did not select only the highest-performing variants. Mutation N196Y (7/10 designs, position in orange in https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/mutations_of_importance_in_designs.png) is part of variant F6L,N196Y, which has slightly lower than wild type fitness in the training data (ranked 13 out of 64 variants). This mutation was likely included because mutations at position F6 show up in three other training variants (including F6L again in one variant), all with much lower than wild type activity. The model therefore appears to have attributed the double mutant's relatively high activity to N196Y and prioritized it in designs, despite the variant itself having lower than wild type activity.

    In the observed designs, the bias toward positions near the C-terminus is due to high-scoring training variants with mutations near the end of the sequence (Figure S20). In the unobserved designs, position 230 (zero-indexed, colored purple in https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/mutations_of_importance_in_designs.png) was highly prioritized due to the Rosetta pretraining, which strongly favored this position.

    We visualized sequence positions 170-237, which were frequently selected in both observed and unobserved designs, on the GFP structure (https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/highlighted_domain_gfp.png). This region does not overlap with the chromophore, and we cannot immediately identify a biophysical reason why it would be prioritized over the other beta sheets in the beta barrel.

    That new figure (https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/highlighted_domain_gfp.png) shows positions 170-237 in red; this region contained a higher number of suggested mutations than the rest of the sequence in both the observed and unobserved designs. It contains three beta sheets, two disordered linker regions, and one small alpha helix. The other new figure (https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/mutations_of_importance_in_designs.png) highlights the following positions (zero-indexed) with their side chains: 26 (blue), 175 (red), 196 (orange), 226 (yellow), and 230 (purple). Position 226 was not discussed above but appeared in 9/10 observed designs.
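
    As referenced above, here is a minimal sketch of how tallies like "Q175L in 10/10 observed designs" can be reproduced, assuming each design is stored as a comma-separated mutation string; the example designs below are placeholders, not our actual designs.

    ```python
    from collections import Counter

    # Hypothetical stand-ins for the observed designs; each design is a
    # comma-separated list of mutations in wild-type/position/mutant notation.
    observed_designs = [
        "S26R,Q175L,N196Y",
        "E140V,Q175L",
    ]

    mutation_counts = Counter()
    position_counts = Counter()
    for design in observed_designs:
        for mutation in design.split(","):
            mutation_counts[mutation] += 1
            position_counts[int(mutation[1:-1])] += 1  # strip the amino acid letters

    # e.g. how many designs contain Q175L, and which positions are mutated most often
    print(mutation_counts["Q175L"], position_counts.most_common(5))
    ```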

  3. We agree that METL models are modest in size compared to other, much larger models, which also makes them more accessible. Our Discussion was already fairly long, so we did not emphasize this again in v3, but we appreciate you pointing it out here.

  4. We did some limited exploration of dropout rates, but we didn’t perform a systematic sweep to assess their impact on overfitting in METL-Global pretraining. Note that we did apply dropout to the attention weights in multi-headed attention, as well as to the full outputs of the self-attention and feed-forward blocks that make up the transformer encoder layer (see relative_attention.py, classes RelativeTransformerEncoderLayer and RelativeMultiHeadAttention). It’s possible that increasing dropout or applying it differently could improve generalization. However, we suspect the primary factor driving overfitting might be the relatively small number of base proteins used during pretraining.
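
    For concreteness, the sketch below shows where those dropout layers sit in a generic PyTorch encoder layer. It mirrors the structure of RelativeTransformerEncoderLayer but omits the relative position terms, so it is an illustration rather than the code in relative_attention.py.

    ```python
    import torch
    import torch.nn as nn

    class SimpleEncoderLayer(nn.Module):
        """Illustrative encoder layer showing the three places dropout is applied:
        on the attention weights, on the self-attention block output, and on the
        feed-forward block output. Relative position terms are omitted."""

        def __init__(self, d_model=512, nhead=8, dim_feedforward=2048, dropout=0.1):
            super().__init__()
            # nn.MultiheadAttention's dropout argument acts on the attention weights
            self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                                   batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, dim_feedforward),
                nn.ReLU(),
                nn.Linear(dim_feedforward, d_model),
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout1 = nn.Dropout(dropout)  # dropout on the self-attention output
            self.dropout2 = nn.Dropout(dropout)  # dropout on the feed-forward output

        def forward(self, x):
            attn_out, _ = self.self_attn(x, x, x)
            x = self.norm1(x + self.dropout1(attn_out))
            x = self.norm2(x + self.dropout2(self.ff(x)))
            return x

    layer = SimpleEncoderLayer()
    tokens = torch.randn(2, 10, 512)  # (batch, sequence length, embedding dimension)
    out = layer(tokens)
    ```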

  5. This occurred for a relatively small number of variants: 343 in the global dataset and 645 in the GB1 dataset, corresponding to 0.0012% and 0.0051% of each dataset, respectively. We added this information to the Methods section in v3 of the manuscript.
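
    A minimal sketch of this filtering step with pandas, assuming the simulation results sit in a data frame with one column per biophysical attribute (the column names and values here are hypothetical):

    ```python
    import pandas as pd

    # Hypothetical table of simulated variants; in practice each attribute column
    # holds a Rosetta-derived biophysical score for every variant.
    df = pd.DataFrame({
        "variant": ["E140V,Q175L", "S26R", "N196Y"],
        "total_score": [-1.2, 0.3, None],  # None stands in for a failed simulation
        "solvation": [-850.1, -848.9, -851.2],
    })

    attribute_cols = ["total_score", "solvation"]
    filtered = df.dropna(subset=attribute_cols)  # drop variants with any NaN attribute

    dropped = len(df) - len(filtered)
    print(f"dropped {dropped} variants ({100 * dropped / len(df):.4f}% of the dataset)")
    ```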

  6. We agree, and we modified the text in v3 of the manuscript to specify “sampled” functional landscape. Although it’s not exactly the scenario you pose in your comment, in Figure S4 we assessed “Train 1” regime extrapolation (where we train on single-substitution variants and evaluate on variants with 2+ substitutions) and “Train 1+2” regime extrapolation (where we train on variants with single or double substitutions and evaluate on variants with 3+ substitutions). For the GFP dataset, there was a notable increase in linear regression performance when training on singles and doubles versus training on singles only. We think it would be interesting to explore regime extrapolation performance broken down by mutational distance as you suggest. You may also be interested in a paper by some of our co-authors in which they explored mutational distance extrapolation: https://doi.org/10.1038/s41467-024-50712-3.
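
    For reference, these regime extrapolation splits reduce to a filter on the number of substitutions per variant, roughly as sketched below (the data frame and column names are hypothetical):

    ```python
    import pandas as pd

    # Hypothetical sequence-function table; "variant" lists comma-separated substitutions.
    df = pd.DataFrame({
        "variant": ["S26R", "Q175L", "E140V,Q175L", "S26R,I126T,I169T"],
        "score": [1.1, 1.3, 1.4, 0.9],
    })
    df["num_mutations"] = df["variant"].str.count(",") + 1

    # "Train 1" regime: train on single substitutions, evaluate on 2+ substitutions
    train_1 = df[df["num_mutations"] == 1]
    test_1 = df[df["num_mutations"] >= 2]

    # "Train 1+2" regime: train on singles and doubles, evaluate on 3+ substitutions
    train_12 = df[df["num_mutations"] <= 2]
    test_12 = df[df["num_mutations"] >= 3]
    ```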

  7. 8 attention heads, a 2048 feed forward hidden size, and 0.1 dropout.

    Earlier in the paper you highlight that the METL-Global model tends to overfit fairly strongly to the in-distribution training data. I can't help but wonder whether the heavier use of multi-headed attention here, combined with the same but still relatively low dropout rate, may be contributing to this. Did you explore the impact of increasing dropout on the ability of this model to generalize? Relatedly, did you explore whether applying dropout to the attention heads themselves improved the model's capacity to generalize? I fully appreciate this can be tricky to nail down, but I can't help but wonder if doing so might help somewhat for out-of-distribution predictions, even if it is to the slight detriment of in-distribution prediction.

  8. dropping variants with NaN values for any of the biophysical attributes

    For how many variants did this occur? Was it concentrated in specific proteins in the global dataset? I'm curious whether these NaNs might be associated with some specific, biologically relevant phenomenon, leading to potential acquisition bias in the simulated training data, or whether they can instead be interpreted as an unbiased occurrence.

  9. The strong performance of linear regression, which relies on additive assumptions, suggests the functional landscape is dominated by additive effects.

    I agree with this interpretation insofar as it suggests that the sampled functional landscape for these proteins is dominated by additive effects. But even in cases where multi-mutants have been sampled, these datasets have typically only sampled multi-mutants at a handful of sites and at small Hamming distances from WT.

    For instance, in the case of the avGFP dataset, which includes mutants up to an impressive Hamming distance of 15, the sampling density of multi-mutants drops off steeply at increasing distances. Taking into account the fraction of possible multi-mutants sampled at those increasing distances, it's apparent that we've barely scratched the surface of these proteins' sequence spaces and, presumably, their functional landscapes.

    This is all to say that these datasets (naturally, given how they are generated) sample the functional landscape in a strongly biased way, being extremely localized around a single WT protein. It may well be that these WT sequences reside on fitness peaks around which local mutational effects are largely additive.

    I would be interested in seeing the regime extrapolation performance of these methods broken down by mutational distance from WT, i.e., by the count of mutations. I can't help but suspect that the strong performance of linear models decays sharply with mutational distance. Considering that these datasets are often biased toward shorter mutational distances, it seems likely that the overall correlation performance is dominated, or confounded, by the high abundance of "easier," local multi-mutants.
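
    One way to produce that breakdown is to bin the test-set predictions by the number of mutations from WT and compute the rank correlation within each bin; a minimal sketch with made-up values:

    ```python
    import pandas as pd
    from scipy.stats import spearmanr

    # Hypothetical test-set results: measured scores, model predictions, and the
    # number of mutations separating each variant from wild type.
    results = pd.DataFrame({
        "num_mutations": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "true_score": [1.0, 0.2, 0.6, 0.9, 0.1, 0.4, 0.8, 0.3, 0.5],
        "predicted_score": [0.9, 0.3, 0.5, 0.7, 0.2, 0.3, 0.4, 0.6, 0.5],
    })

    def spearman(group):
        rho, _ = spearmanr(group["true_score"], group["predicted_score"])
        return rho

    # Spearman correlation within each mutational-distance bin
    by_distance = results.groupby("num_mutations").apply(spearman)
    print(by_distance)
    ```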

  10. This is a really cool approach to bringing biophysical information into protein mutation prediction. It does seem worth exploring whether including the evolutionary information gleaned from LLMs like ESM2 improves the performance of METL. Combining these two approaches seems like it has real potential to leverage different types of information. Have you thought about ways to use embeddings from models like ESM2 in the METL pretraining to try to improve generalizability? It would be cool to see if these embeddings actually improve performance, especially with small experimental training sets. Great work!
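
    To make the suggestion concrete, one simple baseline would be to mean-pool ESM2's per-residue representations into fixed-length embeddings and fit a lightweight regressor on a small experimental training set; the sketch below illustrates that idea with placeholder sequences and scores (it is not an experiment from the paper).

    ```python
    import torch
    from transformers import AutoTokenizer, EsmModel
    from sklearn.linear_model import Ridge

    tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t30_150M_UR50D")
    esm = EsmModel.from_pretrained("facebook/esm2_t30_150M_UR50D")
    esm.eval()

    def embed(sequences):
        """Mean-pool the final-layer ESM2 representations into one vector per sequence."""
        batch = tokenizer(sequences, return_tensors="pt", padding=True)
        with torch.no_grad():
            hidden = esm(**batch).last_hidden_state   # (batch, length, hidden_dim)
        mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding positions
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

    # Toy low-N setting: a handful of labeled variants with placeholder scores
    train_seqs = ["MSKGEELFTGV", "MSKGEELFTAV", "MSKGEELFSGV"]
    train_scores = [1.0, 0.4, 0.7]

    regressor = Ridge(alpha=1.0).fit(embed(train_seqs), train_scores)
    predictions = regressor.predict(embed(["MSKGEELFCGV"]))
    ```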

    Figure S20. The 20 engineered GFP variants. [Table listing the mutated positions, wild-type residues, and variant mutations for the O5, U5, O10, and U10 designs; not reproduced here.]

    It's interesting to see how many mutated positions these variants share, even across observed versus unobserved designs, as well as the preference for later residues in the primary sequence. I'm curious whether the model is learning something more significant about the features driving activity, and particularly whether this pattern holds true for the other ten datasets.

  12. The efficacy of these 3D distance-based relative position embeddings is pretty amazing. I have to read up on relative position embeddings.

  13. With a smaller training size of ∼1M examples and just a single GPU, training times ranged from 6-26 hours for 100 epochs for most proteins (4 to 16 minutes per epoch). Pretraining METL-Global with 20M parameters took ∼50 hours on 4x A100s and ∼142 hours with 50M parameters.

    Given the performance, I'm impressed with the affordability of the models' pre-training.

    In a world of foundation models that cost millions of dollars to train, I think it's definitely worth mentioning the frugality of these models in the discussion (if it's not already mentioned).

  14. We implemented relative position embeddings as described by Shaw et al.

    Thank you for this paragraph and the two that precede it. A very useful learning resource!
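
    For readers following along, the core idea from Shaw et al. is to add a learned embedding for each clipped pairwise offset into the attention logits. A stripped-down, single-head sketch (sequence offsets only, without METL's 3D distance-based variant or the value-side term) might look like:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RelativeSelfAttention(nn.Module):
        """Single-head self-attention with Shaw-style relative position embeddings.
        Pairwise offsets are clipped to +/- max_distance, and each offset gets a
        learned key embedding that is added into the attention logits."""

        def __init__(self, d_model=64, max_distance=8):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)
            self.max_distance = max_distance
            # one embedding per clipped relative offset in [-max_distance, max_distance]
            self.rel_key = nn.Embedding(2 * max_distance + 1, d_model)

        def forward(self, x):                       # x: (batch, length, d_model)
            q, k, v = self.q(x), self.k(x), self.v(x)
            length = x.size(1)
            offsets = torch.arange(length)[None, :] - torch.arange(length)[:, None]
            offsets = offsets.clamp(-self.max_distance, self.max_distance) + self.max_distance
            rel_k = self.rel_key(offsets)           # (length, length, d_model)

            # content-content term plus the content-position term from Shaw et al.
            logits = q @ k.transpose(-2, -1)
            logits = logits + torch.einsum("bid,ijd->bij", q, rel_k)
            weights = F.softmax(logits / x.size(-1) ** 0.5, dim=-1)
            return weights @ v

    attn = RelativeSelfAttention()
    out = attn(torch.randn(2, 10, 64))              # (batch, length, d_model)
    ```

    In METL, the relative positions are derived from 3D distances in the protein structure rather than from sequence offsets, but the mechanism of injecting learned pairwise embeddings into the attention computation is the same.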