Protein Language Model Fitness Is a Matter of Preference

Abstract

Leveraging billions of years of evolution, scientists have trained protein language models (pLMs) to understand the sequence and structure space of proteins, aiding the design of more functional proteins. Although pLMs have shown the ability to improve efficiency in protein engineering, it remains unclear whether such models capture true biological patterns or artifacts of the training data. We aim to predict the circumstances in which pLMs can successfully perform zero-shot fitness estimation. Our work studies trends observed over hundreds of deep mutational scans across multiple fitness objectives. We find that the likelihood, or abstractly, the implicit preference for a certain protein sequence imbued during pretraining, is predictive of fitness prediction capability. Both over-preferred and under-preferred wild-type sequences harm performance. Using influence functions to causally understand how individual data points increase protein likelihoods, we find that there exists a power-law tail due to sequence homology. Lastly, under-performance on low-likelihood wild-type proteins can be remedied by unsupervised finetuning. The finding that pLM zero-shot fitness estimation can be predicted from the likelihood of the engineered sequence can motivate and improve pLMs’ deployment in protein maturation campaigns.

Article activity feed

  1. Protein Language Model Fitness Is a Matter of Preference

    I really enjoyed reading your paper and thought it contained many interesting and insightful gems.

    • As someone who has calculated many PLLs, which take time and money, I was very interested in your O(1) method for PLL.
    • The predictive power being predicated on wild-type PLL is a very important result.
    • I found Figure 5 to be a beautiful illustration of how homology in the training data influences preference.
    • In Figure 6, it was incredible to see just how much the Spearman correlation can be increased for the low-likelihood DMS datasets, and surprising to see that the low-likelihood DMS datasets do worse in the first place. Clearly there is more to learn.

    More broadly, I would be curious to hear your thoughts on alternative PLM training objectives. Specifically, I'm interested in approaches that maintain the BERT-style masked language modeling objective while incorporating additional training signals. One key idea would be to include explicit feedback about sequence fitness ('good' vs 'bad' sequences) alongside the traditional masked prediction task.

    This approach could help move away from preference-oriented behavior. When models are trained solely on naturally occurring proteins, they implicitly learn that all training examples represent 'good' or 'valid' proteins. By incorporating direct fitness measurements as an additional training objective, we could potentially guide the model to learn more nuanced distinctions between functional and non-functional sequences, rather than simply modeling the distribution of extant proteins; a rough sketch of one such combined objective is given at the end of this comment.

    Thanks again for the insightful paper.
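
    To make the combined-objective suggestion above concrete, here is a rough PyTorch sketch. It is not from the paper; the module and function names (MLMWithFitnessHead, combined_loss, fitness_weight) and the mean-pooled fitness head are illustrative assumptions about how fitness supervision could sit alongside the masked-token loss.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MLMWithFitnessHead(nn.Module):
        """Keeps the BERT-style masked-token head and adds a scalar fitness head."""

        def __init__(self, encoder: nn.Module, hidden_dim: int, vocab_size: int):
            super().__init__()
            self.encoder = encoder                        # any pLM trunk returning per-token states
            self.mlm_head = nn.Linear(hidden_dim, vocab_size)
            self.fitness_head = nn.Linear(hidden_dim, 1)  # per-sequence fitness score

        def forward(self, tokens: torch.Tensor):
            hidden = self.encoder(tokens)                 # (batch, length, hidden_dim)
            return self.mlm_head(hidden), self.fitness_head(hidden.mean(dim=1))

    def combined_loss(mlm_logits, masked_labels, fitness_pred, fitness_labels, fitness_weight=0.1):
        # Standard masked-LM cross-entropy; label -100 marks positions that were not masked.
        mlm_loss = F.cross_entropy(mlm_logits.transpose(1, 2), masked_labels, ignore_index=-100)
        # Supervised fitness term; a BCE loss would suit binary 'good' vs 'bad' labels instead.
        fitness_loss = F.mse_loss(fitness_pred.squeeze(-1), fitness_labels)
        return mlm_loss + fitness_weight * fitness_loss
    ```

    The weighting between the two terms would need tuning so that the fitness signal refines, rather than overrides, what the model learns from the masked prediction task.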

  2. Naive usage of sequence databases and scaling will magnify the biases in the training data, leading to miscalibrated preferences

    This study does a great job of illuminating this. Do you foresee a method for creating a more balanced and less biased training dataset? It seems there is an opportunity to do more with less.

  3. Unlike autoregressive language models, masked language models don’t have a natural way to immediately compute the joint likelihood of a sequence. As a result, Wang and Cho (2019) proposed to mask every index of a sequence one at a time, then average to derive a PLL: $\mathrm{PLL}(x) = \frac{1}{L}\sum_{i=1}^{L} \log p_\theta\left(x_i \mid x_{\setminus i}\right)$. This formulation suffers from the need to run 𝒪(L) forward passes to compute a perplexity or log likelihood. In response to this, the community only considers autoregressive pLMs when computing fitness values for proteins containing insertions or deletions.

    There is a lot of overlap between this paragraph and the next.
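
    For readers who want to reproduce the 𝒪(L) baseline described in the passage above, here is a minimal sketch assuming the HuggingFace transformers library and a small ESM-2 checkpoint (facebook/esm2_t6_8M_UR50D). These choices are illustrative, and the paper's own O(1) estimator is not reproduced here.

    ```python
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Small ESM-2 checkpoint chosen purely for illustration; any masked pLM works.
    MODEL_NAME = "facebook/esm2_t6_8M_UR50D"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

    @torch.no_grad()
    def pseudo_log_likelihood(sequence: str) -> float:
        """Wang & Cho (2019) style PLL: mask each residue one at a time,
        score the true token, and average; this costs O(L) forward passes."""
        enc = tokenizer(sequence, return_tensors="pt")
        input_ids = enc["input_ids"]
        scores = []
        # Positions 1 .. L (skip the special tokens the ESM tokenizer adds).
        for i in range(1, input_ids.shape[1] - 1):
            masked = input_ids.clone()
            true_token = masked[0, i].item()
            masked[0, i] = tokenizer.mask_token_id
            logits = model(input_ids=masked, attention_mask=enc["attention_mask"]).logits
            scores.append(torch.log_softmax(logits[0, i], dim=-1)[true_token].item())
        return sum(scores) / len(scores)  # drop the 1/L normalization for a summed PLL

    print(pseudo_log_likelihood("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
    ```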