From Likelihood to Fitness: Improving Variant Effect Prediction in Protein and Genome Language Models
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Generative models trained on natural sequences are increasingly used to predict the effects of genetic variation, enabling progress in therapeutic design, disease risk prediction, and synthetic biology. In the zero-shot setting, variant impact is estimated by comparing the likelihoods of sequences, under the assumption that likelihood serves as a proxy for fitness. However, this assumption often breaks down in practice: sequence likelihood reflects not only evolutionary fitness constraints, but also phylogenetic structure and sampling biases, especially as model capacity increases. We introduce Likelihood-Fitness Bridging (LFB), a simple and general strategy that improves variant effect prediction by averaging model scores across sequences subject to similar selective pressures. Assuming an Ornstein-Uhlenbeck model of evolution, LFB can be viewed as a way to marginalize the effects of genetic drift, although its benefits appear to extend more broadly. LFB applies to existing protein and genomic language models without requiring retraining, and incurs only modest computational overhead. Evaluated on large-scale deep mutational scans and clinical benchmarks, LFB consistently improves predictive performance across model families and sizes. Notably, it reverses the performance plateau observed in larger protein language models, making the largest models the most accurate when combined with LFB. These results suggest that accounting for phylogenetic and sampling biases is essential to realizing the full potential of large sequence models in variant effect prediction.
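The averaging step described in the abstract can be illustrated with a minimal sketch. It assumes that the "model score" is a per-site log-likelihood ratio and that homologs are supplied as rows of an alignment; neither detail is stated in the abstract. The names `lfb_score` and `log_prob` are hypothetical placeholders, not the authors' implementation or any library's API.

```python
from typing import Callable, Sequence


def lfb_score(
    homologs: Sequence[str],
    position: int,
    wt: str,
    mut: str,
    log_prob: Callable[[str, int, str], float],
) -> float:
    """Average a per-sequence log-likelihood ratio over aligned homologs.

    homologs: aligned sequences of equal length (assumed input format);
        the focal sequence is expected to be one of them.
    position: alignment column of the variant.
    log_prob: hypothetical model call returning log p(residue at position | sequence).
    """
    scores = []
    for seq in homologs:
        if seq[position] == "-":
            # Skip homologs that have a gap at this column.
            continue
        # Zero-shot score of the substitution in the context of this homolog.
        scores.append(log_prob(seq, position, mut) - log_prob(seq, position, wt))
    if not scores:
        raise ValueError("no homolog covers this position")
    return sum(scores) / len(scores)
```

In practice `log_prob` would wrap a protein or genome language model; because the model is only queried and never updated, a sketch of this kind requires no retraining, in line with the abstract.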
Article activity feed
-
We found a simple minimum percentage identity threshold of 30% performed best
The figure reports an average, but my gut feeling is that this threshold should perhaps be specific to each protein or protein family. I imagine the overall shape of the phylogeny and the distribution of branch lengths around a focal protein will influence how much predictive gain LFB provides. For example, it might make sense to set this threshold higher for a protein with many homologs of intermediate divergence than for one with few. An explicit analysis of which features of a protein family's phylogeny favour different thresholds could itself be very useful for guiding the application of LFB and LFB-like methods to PLM improvement.
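To make the threshold concrete, here is a minimal sketch of a per-family identity filter. It assumes that percent identity is computed over alignment columns where both sequences are ungapped, which is only one of several common definitions and may differ from the one used in the paper. The function names and the `min_identity` parameter are hypothetical.

```python
from typing import List, Sequence


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where both aligned sequences are ungapped."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def filter_homologs(
    focal: str, homologs: Sequence[str], min_identity: float = 30.0
) -> List[str]:
    """Keep homologs whose identity to the focal sequence meets the threshold."""
    return [h for h in homologs if percent_identity(focal, h) >= min_identity]
```

Passing a different `min_identity` for each protein family would be one simple way to test whether a family-specific threshold outperforms the global 30% value.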
-
The LFB estimators proposed in this work are intentionally simple and serve as a starting point for more sophisticated inference strategies
One alternative or complementary 'modify the model' strategy that might be useful to compare against this method is protein-family-specific PLM fine-tuning. One could test how fine-tuning on a much narrower region of protein space affects a PLM's ability to soak up phylogenetic signal by checking whether a fine-tuned model is similarly improved by LFB in zero-shot fitness prediction tasks.
-
Sketch proof of lower variance under OU model
In addition to this theoretical calculation, would it be possible to use your data to look at the empirical variance and check that it behaves as expected?