Protein Language Model Fitness Is a Matter of Preference

Abstract

Leveraging billions of years of evolution, scientists have trained protein language models (pLMs) to understand the sequence and structure space of proteins, aiding the design of more functional proteins. Although pLMs have shown the ability to improve efficiency in protein engineering, it remains unclear whether such models capture true biological patterns or artifacts of the training data. We aim to predict the circumstances in which pLMs can successfully perform zero-shot fitness estimation. Our work studies trends observed over hundreds of deep mutational scans across multiple fitness objectives. We find that the likelihood, or abstractly, the implicit preference for a certain protein sequence imbued during pretraining, is predictive of fitness prediction capability. Both over-preferred and under-preferred wild-type sequences harm performance. Using influence functions to causally understand how individual data points increase protein likelihoods, we find that there exists a power-law tail due to sequence homology. Lastly, under-performance on low-likelihood wild-type proteins can be remedied by unsupervised finetuning. The finding that pLM zero-shot fitness estimation can be predicted from the likelihood of the engineered sequence can motivate and improve pLMs’ deployment in protein maturation campaigns.

Article activity feed

  1. Protein Language Model Fitness Is a Matter of Preference

    I really enjoyed reading your paper and thought it contained many interesting and insightful gems.

    • As someone who has calculated many PLLs, which take time and money, I was very interested in your O(1) method for PLL.
    • The predictive power being predicated on wild-type PLL is a very important result.
    • I found Figure 5 to be a beautiful illustration of how homology in the training data influences preference.
    • In Figure 6, it was incredible to see just how much the Spearman correlation can be increased for the low-likelihood DMS datasets, and surprising to see that the low-likelihood DMS datasets do worse in the first place. Clearly there is more to learn.

    More broadly, I would be curious to hear your thoughts on alternative PLM training objectives. Specifically, I'm interested in approaches that maintain the BERT-style masked language modeling objective while incorporating additional training signals. One key idea would be to include explicit feedback about sequence fitness ('good' vs 'bad' sequences) alongside the traditional masked prediction task.

    This approach could help move away from preference-oriented behavior. When models are trained solely on naturally occurring proteins, they implicitly learn that all training examples represent 'good' or 'valid' proteins. By incorporating direct fitness measurements as an additional training objective, we could potentially guide the model to learn more nuanced distinctions between functional and non-functional sequences, rather than simply modeling the distribution of extant proteins; a rough sketch of one such combined objective is given at the end of this comment.

    Thanks again for the insightful paper.
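
    To make the combined-objective suggestion above concrete, here is a rough PyTorch sketch. It is not from the paper; the module and function names (MLMWithFitnessHead, combined_loss, fitness_weight) and the mean-pooled fitness head are illustrative assumptions about how fitness supervision could sit alongside the masked-token loss.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MLMWithFitnessHead(nn.Module):
        """Keeps the BERT-style masked-token head and adds a scalar fitness head."""

        def __init__(self, encoder: nn.Module, hidden_dim: int, vocab_size: int):
            super().__init__()
            self.encoder = encoder                        # any pLM trunk returning per-token states
            self.mlm_head = nn.Linear(hidden_dim, vocab_size)
            self.fitness_head = nn.Linear(hidden_dim, 1)  # per-sequence fitness score

        def forward(self, tokens: torch.Tensor):
            hidden = self.encoder(tokens)                 # (batch, length, hidden_dim)
            return self.mlm_head(hidden), self.fitness_head(hidden.mean(dim=1))

    def combined_loss(mlm_logits, masked_labels, fitness_pred, fitness_labels, fitness_weight=0.1):
        # Standard masked-LM cross-entropy; label -100 marks positions that were not masked.
        mlm_loss = F.cross_entropy(mlm_logits.transpose(1, 2), masked_labels, ignore_index=-100)
        # Supervised fitness term; a BCE loss would suit binary 'good' vs 'bad' labels instead.
        fitness_loss = F.mse_loss(fitness_pred.squeeze(-1), fitness_labels)
        return mlm_loss + fitness_weight * fitness_loss
    ```

    The weighting between the two terms would need tuning so that the fitness signal refines, rather than overrides, what the model learns from the masked prediction task.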

  2. Naive usage of sequence databases and scaling will magnify the biases in the training data, leading to miscalibrated preferences

    This study does a great job of illuminating this. Do you foresee a method for creating a more balanced and less biased training dataset? It seems there is an opportunity to do more with less.

  3. Unlike autoregressive language models, masked language models don’t have a natural way to immediately compute the joint likelihood of a sequence. As a result, Wang and Cho (2019) proposed to mask every index of a sequence one at a time, then average to derive a PLL: $\mathrm{PLL}(x) = \frac{1}{L}\sum_{i=1}^{L} \log p_\theta\left(x_i \mid x_{\setminus i}\right)$. This formulation suffers from the need to run 𝒪(L) forward passes to compute a perplexity or log likelihood. In response to this, the community only considers autoregressive pLMs when computing fitness values for proteins containing insertions or deletions.

    There is a lot of overlap between this paragraph and the next.
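
    For readers who want to reproduce the 𝒪(L) baseline described in the passage above, here is a minimal sketch assuming the HuggingFace transformers library and a small ESM-2 checkpoint (facebook/esm2_t6_8M_UR50D). These choices are illustrative, and the paper's own O(1) estimator is not reproduced here.

    ```python
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Small ESM-2 checkpoint chosen purely for illustration; any masked pLM works.
    MODEL_NAME = "facebook/esm2_t6_8M_UR50D"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

    @torch.no_grad()
    def pseudo_log_likelihood(sequence: str) -> float:
        """Wang & Cho (2019) style PLL: mask each residue one at a time,
        score the true token, and average; this costs O(L) forward passes."""
        enc = tokenizer(sequence, return_tensors="pt")
        input_ids = enc["input_ids"]
        scores = []
        # Positions 1 .. L (skip the special tokens the ESM tokenizer adds).
        for i in range(1, input_ids.shape[1] - 1):
            masked = input_ids.clone()
            true_token = masked[0, i].item()
            masked[0, i] = tokenizer.mask_token_id
            logits = model(input_ids=masked, attention_mask=enc["attention_mask"]).logits
            scores.append(torch.log_softmax(logits[0, i], dim=-1)[true_token].item())
        return sum(scores) / len(scores)  # drop the 1/L normalization for a summed PLL

    print(pseudo_log_likelihood("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
    ```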