pLM-SAV: A Δ-Embedding Approach for Predicting Pathogenic Single Amino Acid Variants
Abstract
Predicting whether single amino acid variants (SAVs) in proteins lead to pathogenic outcomes is a critical challenge in molecular biology and precision medicine. Experimental determination of all possible mutation effects is infeasible, and while state-of-the-art tools such as AlphaMissense show promise, their diagnostic performance is insufficient and they are often difficult to run locally. We developed pLM-SAV, a simple yet effective predictor that leverages protein language models (pLMs). Δ-embeddings, computed as the difference between wild-type and mutant sequence embeddings, are used as input for a convolutional neural network. To prevent data leakage, we trained our model on a well-characterized, labeled set of Eff10k and evaluated it on a non-homologous subset of ClinVar data. Our results demonstrate that this approach performs exceptionally well on the Eff10k test folds and reasonably well on ClinVar test sets. Notably, pLM-SAV excels in resolving ambiguous predictions by AlphaMissense. We also found that an ensemble method, REVEL, outperforms both AlphaMissense and pLM-SAV; we therefore integrated REVEL-enhanced predictions into our widely used AlphaMissense web application. Our results demonstrate that an SAV predictor trained on labeled data can achieve high predictive performance. Unlike previous methods such as VESPA, pLM-SAV uses no handcrafted features or substitution matrices, relying solely on pLM-derived representations. We anticipate that incorporating Δ-embeddings into other mutation effect predictors or mutant structure prediction methods will further enhance their accuracy and utility in diverse biological contexts.
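The Δ-embedding idea described in the abstract can be illustrated with a minimal sketch. This is not the pLM-SAV implementation: `toy_embed` is a hypothetical stand-in for a real protein language model, the embedding dimension of 8 is arbitrary, and the sign convention (wild-type minus mutant) is an assumption, since the abstract only states that a difference is taken. The helper names `apply_sav` and `delta_embedding` are likewise illustrative.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(sequence: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a pLM: returns (len(sequence), dim) embeddings.

    A fixed random lookup table is averaged over a 3-residue window so that
    a substitution also changes the embeddings of its direct neighbours,
    loosely mimicking the context dependence of a real language model.
    """
    rng = np.random.default_rng(0)  # fixed seed: same table on every call
    table = rng.normal(size=(len(AMINO_ACIDS), dim))
    per_residue = table[[AMINO_ACIDS.index(aa) for aa in sequence]]
    padded = np.pad(per_residue, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0


def apply_sav(sequence: str, position: int, mutant_aa: str) -> str:
    """Return the sequence with the residue at a 1-based position replaced."""
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    if mutant_aa not in AMINO_ACIDS:
        raise ValueError("unknown amino acid: " + mutant_aa)
    return sequence[: position - 1] + mutant_aa + sequence[position:]


def delta_embedding(wild_type: str, position: int, mutant_aa: str,
                    embed=toy_embed) -> np.ndarray:
    """Delta-embedding, assumed here to be embed(wild type) - embed(mutant)."""
    mutant = apply_sav(wild_type, position, mutant_aa)
    return embed(wild_type) - embed(mutant)


if __name__ == "__main__":
    # Arbitrary 10-residue example sequence; Y5A substitution.
    delta = delta_embedding("MKTAYIAKQR", 5, "A")
    print(delta.shape)
```

In this sketch the resulting matrix has one row per residue and would be the input passed to a downstream classifier, such as the convolutional neural network mentioned above. With a real pLM, `embed` would be replaced by a function that returns that model's per-residue embeddings.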
Availability and Implementation
Freely available at https://doi.org/10.5281/zenodo.15502498 and https://alphamissense.hegelab.org.