Biophysics-based protein language models for protein engineering

Abstract

Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose mutational effect transfer learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure and energetics. We fine-tune METL on experimental sequence–function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL’s ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.

Article activity feed

  1. We agree it’s worth exploring how to combine evolutionary and biophysical signals, as they contribute complementary information. We’ve started experimenting with ways to combine ESM2 and METL. Our initial experiments fine-tuning ESM2, up to the 150M parameter model, on the METL-Global simulation dataset did not substantially alleviate the overfitting problem we observed (Figure S1). We believe combining these two types of information will require many changes to our simulation and training process, so we are exploring that as future work.
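
    As a rough illustration of what that kind of experiment looks like, below is a minimal sketch of fine-tuning an ESM2 checkpoint as a regressor on simulated scores using the Hugging Face transformers API. The sequences and score values are toy placeholders rather than METL-Global data, and this is not the exact training setup from our experiments.

    ```python
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # 150M-parameter ESM2 checkpoint; smaller checkpoints follow the same pattern
    MODEL_NAME = "facebook/esm2_t30_150M_UR50D"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    # num_labels=1 with problem_type="regression" gives a single-output head trained with MSE
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=1, problem_type="regression"
    )

    # Toy variant sequences and placeholder simulated scores (not real data)
    sequences = ["MSKGEELFTGVVPILVELDGDVNGHK", "MSKGEELFTAVVPILVELDGDVNGHK"]
    scores = torch.tensor([[-1.2], [0.4]])

    batch = tokenizer(sequences, return_tensors="pt", padding=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    model.train()
    outputs = model(**batch, labels=scores)  # loss is MSE against the simulated scores
    outputs.loss.backward()
    optimizer.step()
    ```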

  2. We added new files to https://github.com/gitter-lab/metl-pub/tree/main/gfp/output to guide this discussion.

    Figure S20 of the low-N GFP training data gives some insight into these patterns, which we examined more closely here (a short script for reproducing the mutation tallies below is sketched at the end of this reply). Positions 175 and 26 contain mutations that score higher than wild type (red in Figure S20; see also https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/mutations_of_importance_in_designs.png). These mutations correspond to variants E140V,Q175L and S26R,I126T,I169T, the first and second highest-scoring variants in the training set, respectively (https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/df_dms_train_gfp_experiment.tsv). That is most likely why S26R shows up in 8/10 observed designs and Q175L shows up in 10/10 observed designs. However, the other mutations in these variants, E140V (1/10 designs), I126T (1/10 designs), and I169T (0/10 designs), show up far less often in the observed designs. These positions were also present in variants that scored poorly in the training set. Strangely, I126V was in the second-to-worst variant (a variant with nine mutations) but showed up in 4/10 designs. The reason for this needs further investigation but could be due to the Rosetta pretraining.

    The fine-tuned model did not select only the highest-performing variants. Mutation N196Y (7/10 designs, position in orange in https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/mutations_of_importance_in_designs.png) is part of variant F6L,N196Y, which has slightly lower than wild type fitness in the training data (ranked 13 out of 64 variants). This mutation was likely included because mutations at position F6 show up in three other training variants (including F6L again in one variant), all with much lower than wild type activity. The model therefore appears to have attributed the double mutant's relatively high activity to N196Y and prioritized it in designs, despite the variant itself having lower than wild type activity.

    In the observed designs, the bias toward positions near the C-terminus is due to high-scoring training variants with mutations near the end of the sequence (Figure S20). In the unobserved designs, position 230 (zero-indexed, colored purple in https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/mutations_of_importance_in_designs.png) was highly prioritized due to the Rosetta pretraining, which strongly favored this position.

    We visualized sequence positions 170-237, which were frequently selected in both observed and unobserved designs, on the GFP structure (https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/highlighted_domain_gfp.png). This region does not overlap with the chromophore, and we cannot immediately identify a biophysical reason why it would be prioritized over the other beta sheets in the beta barrel.

    That new figure (https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/highlighted_domain_gfp.png) shows positions 170-237 in red; this region contained a higher number of suggested mutations than the rest of the sequence in both the observed and unobserved designs. It contains three beta sheets, two disordered linker regions, and one small alpha helix. The other new figure (https://github.com/gitter-lab/metl-pub/blob/main/gfp/output/mutations_of_importance_in_designs.png) highlights the following positions (zero-indexed) with their side chains: 26 (blue), 175 (red), 196 (orange), 226 (yellow), and 230 (purple). Position 226 was not discussed above but appeared in 9/10 observed designs.
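
    As referenced above, here is a minimal sketch of how tallies like "Q175L in 10/10 observed designs" can be reproduced, assuming each design is stored as a comma-separated mutation string; the example designs below are placeholders, not our actual designs.

    ```python
    from collections import Counter

    # Hypothetical stand-ins for the observed designs; each design is a
    # comma-separated list of mutations in wild-type/position/mutant notation.
    observed_designs = [
        "S26R,Q175L,N196Y",
        "E140V,Q175L",
    ]

    mutation_counts = Counter()
    position_counts = Counter()
    for design in observed_designs:
        for mutation in design.split(","):
            mutation_counts[mutation] += 1
            position_counts[int(mutation[1:-1])] += 1  # strip the amino acid letters

    # e.g. how many designs contain Q175L, and which positions are mutated most often
    print(mutation_counts["Q175L"], position_counts.most_common(5))
    ```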

  3. We agree that METL models are modest in size compared to other, much larger models, which also makes them more accessible. Our Discussion was already fairly long, so we did not emphasize this again in v3, but we appreciate you pointing it out here.

  4. We did some limited exploration of dropout rates, but we didn’t perform a systematic sweep to assess their impact on overfitting in METL-Global pretraining. Note that we did apply dropout to the attention weights in multi-headed attention, as well as to the full outputs of the self-attention and feed-forward blocks that make up the transformer encoder layer (see relative_attention.py, classes RelativeTransformerEncoderLayer and RelativeMultiHeadAttention). It’s possible that increasing dropout or applying it differently could improve generalization. However, we suspect the primary factor driving overfitting might be the relatively small number of base proteins used during pretraining.
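
    For concreteness, the sketch below shows where those dropout layers sit in a generic PyTorch encoder layer. It mirrors the structure of RelativeTransformerEncoderLayer but omits the relative position terms, so it is an illustration rather than the code in relative_attention.py.

    ```python
    import torch
    import torch.nn as nn

    class SimpleEncoderLayer(nn.Module):
        """Illustrative encoder layer showing the three places dropout is applied:
        on the attention weights, on the self-attention block output, and on the
        feed-forward block output. Relative position terms are omitted."""

        def __init__(self, d_model=512, nhead=8, dim_feedforward=2048, dropout=0.1):
            super().__init__()
            # nn.MultiheadAttention's dropout argument acts on the attention weights
            self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                                   batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, dim_feedforward),
                nn.ReLU(),
                nn.Linear(dim_feedforward, d_model),
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout1 = nn.Dropout(dropout)  # dropout on the self-attention output
            self.dropout2 = nn.Dropout(dropout)  # dropout on the feed-forward output

        def forward(self, x):
            attn_out, _ = self.self_attn(x, x, x)
            x = self.norm1(x + self.dropout1(attn_out))
            x = self.norm2(x + self.dropout2(self.ff(x)))
            return x

    layer = SimpleEncoderLayer()
    tokens = torch.randn(2, 10, 512)  # (batch, sequence length, embedding dimension)
    out = layer(tokens)
    ```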

  5. This occurred for a relatively small number of variants: 343 in the global dataset and 645 in the GB1 dataset, corresponding to 0.0012% and 0.0051% of each dataset, respectively. We added this information to the Methods section in v3 of the manuscript.
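
    A minimal sketch of this filtering step with pandas, assuming the simulation results sit in a data frame with one column per biophysical attribute (the column names and values here are hypothetical):

    ```python
    import pandas as pd

    # Hypothetical table of simulated variants; in practice each attribute column
    # holds a Rosetta-derived biophysical score for every variant.
    df = pd.DataFrame({
        "variant": ["E140V,Q175L", "S26R", "N196Y"],
        "total_score": [-1.2, 0.3, None],  # None stands in for a failed simulation
        "solvation": [-850.1, -848.9, -851.2],
    })

    attribute_cols = ["total_score", "solvation"]
    filtered = df.dropna(subset=attribute_cols)  # drop variants with any NaN attribute

    dropped = len(df) - len(filtered)
    print(f"dropped {dropped} variants ({100 * dropped / len(df):.4f}% of the dataset)")
    ```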

  6. We agree, and we modified the text in v3 of the manuscript to specify “sampled” functional landscape. Although it’s not exactly the scenario you pose in your comment, in Figure S4 we assessed “Train 1” regime extrapolation (where we train on single-substitution variants and evaluate on variants with 2+ substitutions) and “Train 1+2” regime extrapolation (where we train on variants with single or double substitutions and evaluate on variants with 3+ substitutions). For the GFP dataset, there was a notable increase in linear regression performance when training on singles and doubles versus training on singles only. We think it would be interesting to explore regime extrapolation performance broken down by mutational distance as you suggest. You may also be interested in a paper by some of our co-authors in which they explored mutational distance extrapolation: https://doi.org/10.1038/s41467-024-50712-3.
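
    For reference, these regime extrapolation splits reduce to a filter on the number of substitutions per variant, roughly as sketched below (the data frame and column names are hypothetical):

    ```python
    import pandas as pd

    # Hypothetical sequence-function table; "variant" lists comma-separated substitutions.
    df = pd.DataFrame({
        "variant": ["S26R", "Q175L", "E140V,Q175L", "S26R,I126T,I169T"],
        "score": [1.1, 1.3, 1.4, 0.9],
    })
    df["num_mutations"] = df["variant"].str.count(",") + 1

    # "Train 1" regime: train on single substitutions, evaluate on 2+ substitutions
    train_1 = df[df["num_mutations"] == 1]
    test_1 = df[df["num_mutations"] >= 2]

    # "Train 1+2" regime: train on singles and doubles, evaluate on 3+ substitutions
    train_12 = df[df["num_mutations"] <= 2]
    test_12 = df[df["num_mutations"] >= 3]
    ```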

  7. 8 attention heads, a 2048 feed forward hidden size, and 0.1 dropout.

    Earlier in the paper you highlight that the METL-Global model tends to overfit fairly strongly to the in-distribution training data. I can't help but wonder whether the heavier use of multi-headed attention here, combined with the same but still relatively low dropout rate, may be contributing to this. Did you explore the impact of increasing dropout on the ability of this model to generalize? Relatedly, did you explore whether applying dropout to the attention heads themselves improved the model's capacity to generalize? I fully appreciate this can be tricky to nail down, but I can't help but wonder if doing so might help somewhat for out-of-distribution predictions, even if it is to the slight detriment of in-distribution prediction.

  8. dropping variants with NaN values for any of the biophysical attributes

    For how many variants did this occur? Was it concentrated in specific proteins in the global dataset? I'm curious whether these NaNs might be associated with some specific, biologically relevant phenomenon, leading to potential acquisition bias in the simulated training data, or whether they can instead be interpreted as an unbiased occurrence.

  9. The strong performance of linear regression, which relies on additive assumptions, suggests the functional landscape is dominated by additive effects.

    I agree with this interpretation insofar as it suggests that the sampled functional landscape for these proteins is dominated by additive effects. But even in cases where multi-mutants have been sampled, these datasets have typically only sampled multi-mutants at a handful of sites and at small Hamming distances from WT.

    For instance, in the case of the avGFP dataset, which includes mutants up to an impressive Hamming distance of 15, the sampling density of multi-mutants drops off steeply at increasing distances. Taking into account the fraction of possible multi-mutants sampled at those increasing distances, it's apparent that we've barely scratched the surface of these proteins' sequence spaces and, presumably, their functional landscapes.

    This is all to say that these datasets (naturally, given how they are generated) sample the functional landscape in a strongly biased way, being extremely localized around a single WT protein. It may well be that these WT sequences reside on fitness peaks around which local mutational effects are largely additive.

    I would be interested in seeing the regime extrapolation performance of these methods broken down by mutational distance from WT, i.e., by the count of mutations. I can't help but suspect that the strong performance of linear models decays sharply with mutational distance. Considering that these datasets are often biased toward shorter mutational distances, it seems likely that the overall correlation performance is dominated, or confounded, by the high abundance of "easier," local multi-mutants.
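
    One way to produce that breakdown is to bin the test-set predictions by the number of mutations from WT and compute the rank correlation within each bin; a minimal sketch with made-up values:

    ```python
    import pandas as pd
    from scipy.stats import spearmanr

    # Hypothetical test-set results: measured scores, model predictions, and the
    # number of mutations separating each variant from wild type.
    results = pd.DataFrame({
        "num_mutations": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "true_score": [1.0, 0.2, 0.6, 0.9, 0.1, 0.4, 0.8, 0.3, 0.5],
        "predicted_score": [0.9, 0.3, 0.5, 0.7, 0.2, 0.3, 0.4, 0.6, 0.5],
    })

    def spearman(group):
        rho, _ = spearmanr(group["true_score"], group["predicted_score"])
        return rho

    # Spearman correlation within each mutational-distance bin
    by_distance = results.groupby("num_mutations").apply(spearman)
    print(by_distance)
    ```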

  10. This is a really cool approach to bringing biophysical information into protein mutation prediction. It does seem worth exploring whether including the evolutionary information gleaned from LLMs like ESM2 improves the performance of METL. Combining these two approaches seems like it has real potential to leverage different types of information. Have you thought about ways to use embeddings from models like ESM2 in the METL pretraining to try to improve generalizability? It would be cool to see if these embeddings actually improve performance, especially with small experimental training sets. Great work!
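
    To make the suggestion concrete, one simple baseline would be to mean-pool ESM2's per-residue representations into fixed-length embeddings and fit a lightweight regressor on a small experimental training set; the sketch below illustrates that idea with placeholder sequences and scores (it is not an experiment from the paper).

    ```python
    import torch
    from transformers import AutoTokenizer, EsmModel
    from sklearn.linear_model import Ridge

    tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t30_150M_UR50D")
    esm = EsmModel.from_pretrained("facebook/esm2_t30_150M_UR50D")
    esm.eval()

    def embed(sequences):
        """Mean-pool the final-layer ESM2 representations into one vector per sequence."""
        batch = tokenizer(sequences, return_tensors="pt", padding=True)
        with torch.no_grad():
            hidden = esm(**batch).last_hidden_state   # (batch, length, hidden_dim)
        mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding positions
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

    # Toy low-N setting: a handful of labeled variants with placeholder scores
    train_seqs = ["MSKGEELFTGV", "MSKGEELFTAV", "MSKGEELFSGV"]
    train_scores = [1.0, 0.4, 0.7]

    regressor = Ridge(alpha=1.0).fit(embed(train_seqs), train_scores)
    predictions = regressor.predict(embed(["MSKGEELFCGV"]))
    ```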

    Figure S20. The 20 engineered GFP variants. [Table listing the mutated positions, wild-type residues, and variant mutations for the O5, U5, O10, and U10 designs; not reproduced here.]

    It's interesting to see how many mutated positions these variants share, even across observed versus unobserved designs, as well as the preference for later residues in the primary sequence. I'm curious whether the model is learning something more significant about the features driving activity, and particularly whether this pattern holds true for the other ten datasets.

  12. The efficacy of these 3D distance-based relative position embeddings is pretty amazing. I have to read up on relative position embeddings.

  13. With a smaller training size of ∼1M examples and just a single GPU, training times ranged from 6-26 hours for 100 epochs for most proteins (4 to 16 minutes per epoch). Pretraining METL-Global with 20M parameters took ∼50 hours on 4x A100s and ∼142 hours with 50M parameters.

    Given the performance, I'm impressed with the affordability of the models' pre-training.

    In a world of foundation models that cost millions of dollars to train, I think it's definitely worth mentioning the frugality of these models in the discussion (if it's not already mentioned).

  14. We implemented relative position embeddings as described by Shaw et al.

    Thank you for this paragraph and the two that precede it. A very useful learning resource!
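
    For readers following along, the core idea from Shaw et al. is to add a learned embedding for each clipped pairwise offset into the attention logits. A stripped-down, single-head sketch (sequence offsets only, without METL's 3D distance-based variant or the value-side term) might look like:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RelativeSelfAttention(nn.Module):
        """Single-head self-attention with Shaw-style relative position embeddings.
        Pairwise offsets are clipped to +/- max_distance, and each offset gets a
        learned key embedding that is added into the attention logits."""

        def __init__(self, d_model=64, max_distance=8):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)
            self.max_distance = max_distance
            # one embedding per clipped relative offset in [-max_distance, max_distance]
            self.rel_key = nn.Embedding(2 * max_distance + 1, d_model)

        def forward(self, x):                       # x: (batch, length, d_model)
            q, k, v = self.q(x), self.k(x), self.v(x)
            length = x.size(1)
            offsets = torch.arange(length)[None, :] - torch.arange(length)[:, None]
            offsets = offsets.clamp(-self.max_distance, self.max_distance) + self.max_distance
            rel_k = self.rel_key(offsets)           # (length, length, d_model)

            # content-content term plus the content-position term from Shaw et al.
            logits = q @ k.transpose(-2, -1)
            logits = logits + torch.einsum("bid,ijd->bij", q, rel_k)
            weights = F.softmax(logits / x.size(-1) ** 0.5, dim=-1)
            return weights @ v

    attn = RelativeSelfAttention()
    out = attn(torch.randn(2, 10, 64))              # (batch, length, d_model)
    ```

    In METL, the relative positions are derived from 3D distances in the protein structure rather than from sequence offsets, but the mechanism of injecting learned pairwise embeddings into the attention computation is the same.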