Generative power of a protein language model trained on multiple sequence alignments

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This valuable paper proposes an innovative iterative masking approach that enables models such as the MSA Transformer to generate new protein sequence designs, which are validated using a wide-ranging set of computational experiments. A key strength of the MSA Transformer is the ability to learn and generalize across protein families, enabling impressive performance across a range of downstream tasks. However, to date, these models have not been used to generate new protein sequence designs. The approach proposed in this paper is quite novel, and a number of metrics are used to examine the resulting performance of the MSA Transformer at generating new protein sequences from specific families.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

Abstract

Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences on homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties than sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also reproduces the higher-order statistics of natural data, and the distribution of natural sequences in sequence space, more accurately than Potts models do. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    Current generative models of protein sequences, such as Potts models, variational autoencoders, or autoregressive models, must be trained on MSA data from scratch. Therefore, they cannot learn common substitution or coevolution patterns shared between families, and they require a substantial number of sequences, making them less suitable for small protein families (e.g., those conserved only in eukaryotes or viruses). MSA Transformers are promising alternatives, as they can generalize across protein families, but there is no established method to generate samples from them. Here, Sgarbossa et al. propose a simple recursive sampling procedure based on iterative masking to generate novel sequences from an input MSA. The sampling method has three hyperparameters (masking frequency, sampling temperature, and number of iterations), which are set by rigorous benchmarking. The authors compare their approach to bmDCA and evaluate i) single-sample quality metrics, ii) sample diversity and similarity to native sequences, iii) similarity between the original and generated sequence distributions, and iv) phylogeny/topology in sequence space of the generated distribution.
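
    To make the procedure concrete, here is a minimal sketch of one possible implementation of such an iterative masking loop, assuming the open-source esm package's pre-trained MSA Transformer; the hyperparameter values and the helper name are illustrative, not taken from the authors' code.

    ```python
    import torch
    import esm  # facebookresearch/esm provides the pre-trained MSA Transformer

    # Illustrative hyperparameter values; the paper sets these by benchmarking.
    P_MASK = 0.1    # fraction of positions masked per iteration
    T_SAMPLE = 1.0  # softmax sampling temperature
    N_ITER = 200    # number of masking/re-sampling iterations

    model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
    model.eval()
    batch_converter = alphabet.get_batch_converter()

    def iterative_masking(msa):
        """One possible iterative masking loop: repeatedly mask a random
        fraction of tokens in the MSA and replace them by sampling from the
        model's masked-token distribution. `msa` is a list of
        (label, aligned_sequence) pairs."""
        _, _, tokens = batch_converter([msa])  # (1, depth, length + 1)
        for _ in range(N_ITER):
            mask = torch.rand(tokens.shape) < P_MASK
            mask[:, :, 0] = False  # keep the beginning-of-sequence token
            masked = tokens.masked_fill(mask, alphabet.mask_idx)
            with torch.no_grad():
                logits = model(masked)["logits"]  # (1, depth, length + 1, vocab)
            # Sample each masked token from the temperature-scaled softmax.
            probs = torch.softmax(logits / T_SAMPLE, dim=-1)
            sampled = torch.distributions.Categorical(probs=probs).sample()
            tokens = torch.where(mask, sampled, tokens)
        return tokens
    ```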

    Strengths:

    • The proposed sampling approach is simple.
    • The computational benchmarking is thorough.
    • The code is well organized and looks easy to use.

    Weaknesses:

    • There is no experimental data to back up the methodology.
    • It is not clear whether the sampling hyperparameters used are optimal for all protein sizes.
    • I am unsure that the bmDCA baseline method was trained appropriately and that the sampling method was adequate for protein design purposes (regular sampling).
    • Quality assessment of predicted structures is incomplete.
    • The proposed metrics for evaluating the diversity of generated sequences are fairly technical.

    We respond to each of these points in the section titled "Recommendations for the authors" below, where the reviewer raised these questions in more detail.

    Impact assessment: The claim that MSA Transformer could be useful for protein design is supported by the computational benchmark. This work will be useful for researchers interested in applying MSA-Transformer models to protein design.

    We thank the reviewer for this encouraging assessment of our work, and for their very interesting suggestions which helped us improve our manuscript.

    Reviewer #2 (Public Review):

    The manuscript by Sgarbossa et al. proposes the use of a machine learning technique used in language models (LMs) and adapted to protein sequences (PLMs) as a means to generate synthetic sequences that retain functional properties contained in the original multiple sequence alignment (MSA) of natural sequences. This technique (or a similar one), called MSA Transformer, is also a component of the supervised learning methodology AlphaFold, which has been successful in predicting protein structures and complexes of proteins. The premise of this study is that an iterative masking approach can be used as a sampling technique to create a diverse set of sequences that still preserve important properties of the original natural sequences. For example, such samples retain homology properties, score well in terms of retaining relevant pairwise or epistatic interactions, and produce "foldable" sequences when used as input for AlphaFold and scored via its confidence metric pLDDT. In order to provide support for this claim, the authors compare against Direct Coupling Analysis (DCA), a global sequence modeling technique that has been shown to be successful in many aspects of the structure and function of proteins, and particularly in generating and sampling sequences analogous to the input MSA. Most importantly, DCA and its generative version bmDCA have been shown to produce functional sequences experimentally. The authors then establish that sequences generated by MSA Transformer with iterative masking have, in general, better scores in terms of homology, statistical energies, and pLDDT than those from bmDCA, and have spectral, statistical, and similarity properties more akin to the natural sequences than those from the bmDCA methodology, except for the reproduction of single and pairwise statistics. The sequences from MSA Transformer, however, better replicate the three-body statistics of the natural sequences. The authors conclude that MSA Transformer with iterative masking is a valid technique for sequence design and an important alternative to the use of DCA, de novo physics-based methods, or supervised learning techniques.
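
    To make "single, pairwise, and three-body statistics" operational: these are the empirical frequencies of amino acids at one, two, or three MSA columns, compared between natural and generated alignments. The sketch below (illustrative, not the authors' evaluation code) computes the one- and two-body versions.

    ```python
    import numpy as np

    def msa_frequencies(msa, q=21):
        """One- and two-body frequencies of an integer-encoded MSA of shape
        (n_sequences, length), with q possible states (20 amino acids + gap).
        Comparing these, and their connected correlations, between natural
        and generated MSAs quantifies how well a model reproduces the data."""
        n, L = msa.shape
        onehot = np.eye(q)[msa]                              # (n, L, q)
        f1 = onehot.mean(axis=0)                             # f_i(a)
        f2 = np.einsum("nia,njb->iajb", onehot, onehot) / n  # f_ij(a,b)
        # Connected two-point correlations: C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b)
        c2 = f2 - np.einsum("ia,jb->iajb", f1, f1)
        # Three-body frequencies f_ijk(a,b,c) generalize this directly but
        # scale as (L*q)^3, so they are usually estimated on position subsets.
        return f1, f2, c2
    ```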

    Given the success of language models in machine learning and their contributions to the structure prediction of proteins and complexes, I see this study as a required follow-up to the breadth of work on amino acid coevolution spearheaded by DCA methodologies. In general, I believe this is a useful and relevant study for the community, and it opens up several avenues for research connecting Transformers with unsupervised protein design. Although the study provides support for this technique being potentially useful for protein design, I was not completely convinced that it will yield more transformative results than those obtained using Potts models. The differences, although consistent across the study, seem to be within "the margin of error" compared to bmDCA.

    We thank the reviewer for this positive assessment of our work, and for their cogent remarks which helped us improve our manuscript.

    We agree that in the case of large protein families, the main message is that our sequence generation method based on MSA Transformer scores at least as well as bmDCA. Given that bmDCA has been experimentally validated as a generative model, we believe that this is a valuable result. Our revised manuscript strengthens this point by showing that our method yields sequences that score similarly to those generated by bmDCA at low sampling temperature, while retaining substantially more sequence diversity.

    In addition, following the reviewer's suggestion below, we now present results for smaller protein families, whose shallow MSAs make it difficult to accurately fit Potts models. These results are presented in a new section of Results, titled "Sequence generation by the iterative masking procedure is successful for small protein families", including the new Figure 3. As mentioned there, "Fig. 3 reports all four scores discussed above in the case of these 7 small families, listed in Table S1 (recall that the families considered so far were large, see Table 1). We observe that MSA-Transformer–generated sequences have similar HMMER scores and structural scores to natural sequences. MSA-Transformer–generated sequences also generally have better HMMER scores and structural scores than those generated by bmDCA with default parameters. While low-temperature bmDCA yields better statistical energy scores (as expected), and also gives HMMER scores and structural scores comparable to natural sequences, it in fact generates sequences that are almost exact copies of natural ones (see Fig. 3, bottom row). By contrast, MSA Transformer produces sequences that are quite different from natural ones, and have very good scores." This shows that our method not only performs as well as bmDCA for large families, but also has a broader scope, as it is less limited by MSA depth than bmDCA.
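
    As a concrete illustration of this kind of diversity check (not the authors' actual analysis code), the distance from each generated sequence to its closest natural counterpart can be computed as follows; the function name is hypothetical.

    ```python
    import numpy as np

    def min_hamming_to_natural(generated, natural):
        """Normalized Hamming distance from each generated sequence to its
        closest natural sequence. generated: (n_gen, L) and natural:
        (n_nat, L), both integer-encoded. Values near 0 flag near-copies of
        natural sequences; larger values indicate genuinely novel sequences.
        (For very large MSAs, process `generated` in chunks to limit the
        memory used by the broadcasted comparison.)"""
        diff = generated[:, None, :] != natural[None, :, :]  # (n_gen, n_nat, L)
        return diff.mean(axis=2).min(axis=1)
    ```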

    I also have certain comments related to the use of these three metrics to analyze the performance of the sampling. On the one hand, HMMER, which has been of great utility for Pfam and the community in general, yields a score that does not necessarily reflect the global properties of the sequences. In other words, we might be using a simpler statistical model to evaluate the performance of two other models (MSA Transformer and bmDCA) which are richer and capture more sequence dependencies than the hidden Markov model.

    We agree with the reviewer that HMMER scores are associated with simpler statistical models, which cannot fully represent the data. We nevertheless believe that these scores remain useful to assess homology. In the framework of our study, they show that the sequences we generate are deemed "good homologs" by HMMER, much as natural sequences extracted from a database by this widely used tool would be. This said, we agree with the reviewer that one should not overinterpret HMMER scores, and we have reduced our discussion of their correlations with Hamming distances to avoid giving too much importance to this point.
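
    For context, homology scores of this kind are typically obtained with the standard HMMER command-line tools, roughly as sketched below; the file names are placeholders, and this is not necessarily the authors' exact pipeline.

    ```python
    import subprocess

    # Build a profile HMM from the family's natural MSA (Stockholm format),
    # then score generated sequences against it. hmmbuild and hmmsearch are
    # the standard HMMER tools; file names here are placeholders.
    subprocess.run(["hmmbuild", "family.hmm", "natural_msa.sto"], check=True)
    subprocess.run(
        ["hmmsearch", "--tblout", "scores.tbl", "family.hmm", "generated.fasta"],
        check=True,
    )
    # scores.tbl now lists, for each generated sequence, its bit score and
    # E-value under the family profile HMM.
    ```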

    Moreover, we now present new scores that give a more complete picture of the quality of our generated sequences:

    • Regarding structure, in addition to the AlphaFold pLDDT score, we now also report the RMSD between a reference experimental structure of the relevant family (see Table 1) and the AlphaFold structure predicted for each sequence studied. The results from the RMSD analysis corroborate those obtained with pLDDT and show that the predicted structures are indeed similar to the native ones (a sketch of such an RMSD computation is shown after this list). These results are now discussed in the main text. We believe that this point strengthens our conclusions, and we thank the reviewer for suggesting this analysis.

    • We also performed a retrospective validation using published experimental results. For chorismate mutase, a protein family which was experimentally studied in [Russ et al 2020] using bmDCA, we now report estimated relative enrichments for our generated sequences in Figure S8, in addition to our four usual scores now shown for this family in Figure S7. In addition, for protein families PF00595 and PF13354, we now report deep mutational scanning scores for our generated sequences in Figure S9. These results strengthen our conclusion that our sequence generation method based on MSA Transformer is highly promising.
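
    As referenced in the first point above, here is a minimal sketch of how such an RMSD can be computed with Biopython's Bio.PDB module, assuming placeholder file names and a simplified residue pairing; the authors' actual procedure may differ.

    ```python
    from Bio.PDB import PDBParser, Superimposer

    parser = PDBParser(QUIET=True)
    ref = parser.get_structure("ref", "reference_experimental.pdb")   # placeholder path
    pred = parser.get_structure("pred", "alphafold_prediction.pdb")   # placeholder path

    # Collect C-alpha atoms; this naive pairing assumes both structures have
    # the same residues in the same order (real pipelines first match
    # residues via sequence alignment).
    ref_ca = [res["CA"] for res in ref.get_residues() if "CA" in res]
    pred_ca = [res["CA"] for res in pred.get_residues() if "CA" in res]
    assert len(ref_ca) == len(pred_ca)

    sup = Superimposer()
    sup.set_atoms(ref_ca, pred_ca)  # optimal least-squares superposition
    print(f"C-alpha RMSD: {sup.rms:.2f} A")
    ```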

    For the case of the statistical energy score, the authors decided to use a sampling temperature T=1, but they note that this temperature can be reduced, as was done in the experimental paper, to produce sequences with better energies; this metric can therefore easily be improved by modifying the temperature. The authors mention that they did try reducing the temperature and that this also improved their HMMER scores, but that they decided against it because the pairwise statistics were affected. However, pairwise statistics were precisely the only factor where bmDCA seemed superior to MSA Transformer, so sacrificing them somewhat should be an acceptable trade-off in order to optimize the other two important metrics.

    We thank both reviewers for raising this very interesting point. As mentioned above in our response to the first reviewer, we have now performed a comprehensive comparison of our MSA Transformer-generated data not only to bmDCA-generated data at sampling temperature T=1, but also to bmDCA-generated data at lower sampling temperatures. We considered the two temperature values chosen in [Russ et al 2020], namely T=0.33 and T=0.66. For completeness, we also considered the two values of regularization strength λ from [Russ et al 2020] for these three temperatures, in the case of family PF00072, as reported in Table S5. Given the relatively small impact of λ observed there, we kept only one value of λ for each value of T in the rest of our manuscript: namely, λ=0.01 for T=1, to match the parameters in [Figliuzzi et al 2018], and λ=0.001 for T=0.33 and T=0.66, as it gave slightly better scores in Table S5. Note that for our additional study of small protein families, we employed λ=0.01 throughout, because it is better suited to small families. In particular, we now include results obtained for bmDCA at λ=0.001 and T=0.33 in all figures of the revised manuscript.

    Our general findings, which are discussed in the revised manuscript, are that decreasing T indeed improves the scores of bmDCA-generated sequences. However, the main improvement concerns statistical energy (as expected from lowering T), while the improvements in other scores (HMMER score and, more importantly, structural scores) are more modest. Even using T=0.33 for bmDCA, our MSA Transformer-generated sequences have similar or better scores than bmDCA-generated sequences, apart from statistical energy (see Figure 1 and Tables S2 and S3). Moreover, we find that decreasing T with bmDCA substantially decreases MSA diversity, while MSA Transformer-generated sequences do not suffer from this issue (see Figure S1). In fact, at low T, bmDCA concentrates on local minima of the statistical energy landscape (see Figures 2, 5 and S5), resulting in low diversity.
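
    For reference, sampling a Potts model at temperature T means drawing sequences from P_T(s) ∝ exp(-E(s)/T), where E(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j) is the DCA statistical energy. The Metropolis sketch below illustrates why lowering T trades diversity for energy; the array layout and function names are illustrative, not bmDCA's actual implementation.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def delta_energy(seq, i, a_new, h, J):
        """Energy change when mutating position i to state a_new, with
        E(s) = -sum_i h[i, s_i] - sum_{i<j} J[i, j, s_i, s_j]
        (J assumed stored symmetrically: J[i, j, a, b] == J[j, i, b, a])."""
        a_old = seq[i]
        dE = -(h[i, a_new] - h[i, a_old])
        for j in range(len(seq)):
            if j != i:
                dE -= J[i, j, a_new, seq[j]] - J[i, j, a_old, seq[j]]
        return dE

    def metropolis_sample(seq, h, J, T, n_steps):
        """Metropolis sampling from P_T(s) ∝ exp(-E(s)/T). Lowering T
        concentrates the distribution on low-energy sequences, which improves
        statistical energy scores but reduces sequence diversity."""
        L, q = h.shape
        for _ in range(n_steps):
            i = rng.integers(L)
            a_new = rng.integers(q)
            dE = delta_energy(seq, i, a_new, h, J)
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                seq[i] = a_new
        return seq
    ```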

    Overall, these new results confirm that our procedure for generating sequences using MSA Transformer is promising, featuring scores comparable with low-temperature bmDCA sequences and high diversity.

    Finally, the use of pLDDT could also present some biases: since AlphaFold itself uses transformers, I wonder whether sequences obtained with transformers might simply perform better by definition.

    We thank the reviewer for raising this intriguing point. It is true that MSA Transformer has an architecture that is very similar to that of the EvoFormer module of AlphaFold. However, AlphaFold couples the EvoFormer module to a structural module, and is trained in a supervised way to predict protein structure, which makes it significantly different from MSA Transformer.

    Nevertheless, we agree that the AlphaFold pLDDT score does not give a complete view of structure. As mentioned above, to improve this, in addition to pLDDT, we now also report the RMSD between a reference experimental structure of the relevant family (see Table 1) and the AlphaFold structure predicted for each sequence studied. The results from the RMSD analysis corroborate those obtained with pLDDT and show that predicted structures are indeed similar to the native ones. These results are now discussed in the main text.

    The authors should try to address all these concerns. My assessment is that these concerns do not detract from the relevance and timeliness of this study, but I would like to see a fairer comparison on these metrics, with more optimizations applied to bmDCA (e.g., lower T), to allow a more accurate comparison of the methods, even if that is reflected in lower performance on pairwise statistics.

    We did our best to address all these points. We believe that the additions mentioned above have substantially improved our manuscript.

    My assessment is that this manuscript's main strength lies in introducing a state-of-the-art technique, already extremely successful in computer science and artificial intelligence, into the field of amino acid coevolution. By adapting this technique and creating a sampling version that is compatible with other successful methodologies, this work will lead to many other studies dealing with function and the effects of sequence variation in biomolecules.

    Again, we thank the reviewer for their encouraging assessment.
