VUS.Life: Leveraging Vector Embeddings for Rapid and Accurate Pathogenicity Prediction of Genetic Variants

Abstract

Background: Interpreting the pathogenicity of genetic variants remains a critical bottleneck in genomic medicine. Millions of variants of uncertain significance (VUS) hinder the clinical application of genetic findings. Traditional computational approaches often rely on hand-engineered features and fail to fully capture the complexity of multidimensional genomic annotations.

Methods: We developed VUS.Life, a multi-modal framework that combines semantic text embeddings of biological and clinical annotations with protein language modeling. We transformed variant annotations from the Variant Effect Predictor (VEP) into natural language descriptions, which were then converted into vector embeddings using established embedding models, namely all-mpnet-base-v2, MedEmbed-large-v0.1, and text-embedding-004. The pathogenicity of a variant of interest is predicted from its proximity, in the embedding space, to variants of known pathogenicity. We further extended VUS.Life with residue-level delta embeddings from the ESMC-600M model to capture both clinical context and biophysical constraints.

Results: We evaluated the framework on more than 10,000 variants across the BRCA1/2, FBN1, ATM, and PALB2 genes. Using VEP annotations alone, VUS.Life achieved greater than 96% accuracy across all variant types and disease genes evaluated. Additionally, our unsupervised FBN1 structural analysis using ESMC-600M revealed that delta embeddings disentangled distinct pathogenic mechanisms, topologically separating disulfide bond disruptions from calcium-binding defects. These structural clusters correlated strongly with zero-shot log-likelihood ratio (LLR) scores, validating evolutionary fitness as a proxy for pathogenicity.

Conclusions: The VUS.Life semantic embedding framework accurately captures pathogenicity-relevant features from complex variant annotations, enabling high-accuracy (> 96%) automated classification across multiple genes and models. The approach generalizes beyond well-curated genes and supports scalable, interpretable, representation-based classification of VUS. It holds significant promise for alleviating the variant interpretation bottleneck in clinical genomics.
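
The proximity-based prediction described in the Methods can be illustrated with a minimal sketch. The abstract does not specify the distance measure or the voting rule, so the cosine similarity, the k-nearest-neighbor majority vote, the function names, the annotation field names, and the two-dimensional toy vectors below are all assumptions made for illustration; in the actual framework the vectors would come from one of the named embedding models.

```python
import math
from collections import Counter


def describe_variant(annotation):
    """Render a VEP-style annotation record as one plain-English sentence.

    The field names (gene, consequence, impact) are hypothetical.
    """
    return (
        f"Variant in gene {annotation['gene']} with consequence "
        f"{annotation['consequence']} and predicted impact {annotation['impact']}."
    )


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_label(query, ref_vectors, ref_labels, k=3):
    """Majority label among the k reference variants closest to the query."""
    ranked = sorted(
        range(len(ref_vectors)),
        key=lambda i: cosine_similarity(query, ref_vectors[i]),
        reverse=True,
    )
    votes = Counter(ref_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy stand-ins for embeddings of variants with known classification.
REF_VECTORS = [
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.0],   # known pathogenic
    [0.1, 1.0], [0.2, 0.9], [0.0, 0.8],   # known benign
]
REF_LABELS = ["pathogenic"] * 3 + ["benign"] * 3

if __name__ == "__main__":
    text = describe_variant(
        {"gene": "FBN1", "consequence": "missense_variant", "impact": "MODERATE"}
    )
    print(text)
    # In practice `text` would be embedded by a model; here the query is a toy vector.
    print(predict_label([0.95, 0.05], REF_VECTORS, REF_LABELS))
```

An odd `k` avoids tied votes when there are only two classes; a real reference set would also need a policy for variants whose neighbors are far away or evenly split.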
