VespaG: Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction

Abstract

Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast single amino acid variant effect predictor, leveraging embeddings of protein Language Models as input to a minimal deep learning model. To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from the human proteome applying the multiple sequence alignment-based effect predictor GEMME as a pseudo standard-of-truth. Assessed against the ProteinGym Substitution Benchmark (217 multiplex assays of variant effect with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48±0.01, matching state-of-the-art methods such as GEMME, TranceptEVE, PoET, AlphaMissense, and VESPA. VespaG reached its top-level performance several orders of magnitude faster, predicting all mutational landscapes of the human proteome in 30 minutes on a consumer laptop (12-core CPU, 16 GB RAM).
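
To make the headline metric concrete, the following is a minimal sketch of how a per-assay Spearman correlation could be averaged over a benchmark. It is not the authors' evaluation code. The function names, the toy assays, and the choice of the standard error of the mean as the spread are all assumptions; the abstract does not state what the ± denotes.

```python
from math import sqrt
from statistics import mean, stdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(measured))


def benchmark_summary(assays):
    """Mean per-assay Spearman and its standard error (assumed spread)."""
    rhos = [spearman(pred, meas) for pred, meas in assays.values()]
    sem = stdev(rhos) / sqrt(len(rhos)) if len(rhos) > 1 else 0.0
    return mean(rhos), sem


# Hypothetical toy data: (predicted scores, measured fitness) per assay.
toy_assays = {
    "assay_a": ([0.1, 0.4, 0.35, 0.8], [1.0, 2.0, 3.0, 4.0]),
    "assay_b": ([0.9, 0.2, 0.5], [3.0, 1.0, 2.0]),
}
mean_rho, sem_rho = benchmark_summary(toy_assays)
print(f"mean Spearman = {mean_rho:.2f} ± {sem_rho:.2f}")
```

Averaging per assay, rather than pooling all variants, keeps large assays from dominating the score; whether the benchmark additionally groups assays before averaging is not covered by this sketch.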

Availability

VespaG is freely available at https://github.com/JSchlensok/VespaG
