Enhancing Clinical Classification of Protein Variants using ESM2 and UMAP
Abstract
Protein sequences may vary due to mutations in their coding DNA sequence. The same protein may therefore exist in multiple variant forms, each potentially leading to distinct phenotypic consequences depending on how the alteration affects its structure, function, or expression. Missense variants are single nucleotide substitutions in the DNA sequence that result in the replacement of one amino acid with another in the corresponding protein, potentially altering its structure, stability, or function. The clinical interpretation of missense variants in protein-coding regions remains a fundamental challenge in genomic medicine. Recent advances in protein language models and manifold learning provide new opportunities for unsupervised extraction of biologically relevant information from protein sequences. In this work, we integrate representations derived from ESM2, a transformer-based protein language model, with nonlinear dimensionality reduction via UMAP (Uniform Manifold Approximation and Projection) to improve the classification of variants of uncertain significance (VUS) in disease-associated proteins. Our results suggest that this approach improves the separability of benign and pathogenic variants, offering a scalable and interpretable strategy for variant prioritization in precision medicine.
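The pipeline the abstract describes (apply a missense substitution, embed the resulting sequence with ESM2, project the embeddings with UMAP) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' stated method: the helper names (`apply_missense`, `embed_sequences`, `project_umap`), the small `facebook/esm2_t6_8M_UR50D` checkpoint, mean pooling over tokens, and the UMAP parameters. The embedding and projection helpers require the `torch`, `transformers`, and `umap-learn` packages, which are imported lazily.

```python
import re

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def apply_missense(sequence: str, variant: str) -> str:
    """Apply a missense variant written like 'A4V' (1-based position)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant)
    if match is None:
        raise ValueError(f"unrecognized variant notation: {variant!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if ref not in AMINO_ACIDS or alt not in AMINO_ACIDS:
        raise ValueError(f"non-standard amino acid in {variant!r}")
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"position {pos} is outside the sequence")
    if sequence[pos - 1] != ref:
        raise ValueError(f"reference residue mismatch at position {pos}")
    # Replace exactly one residue and leave the rest untouched.
    return sequence[: pos - 1] + alt + sequence[pos:]


def embed_sequences(sequences, model_name="facebook/esm2_t6_8M_UR50D"):
    """Return one mean-pooled ESM2 embedding per sequence (assumed pooling)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(list(sequences), return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, dim)
    # Average over non-padding tokens (this includes the special
    # start/end tokens added by the tokenizer).
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).numpy()


def project_umap(embeddings, n_neighbors=15, random_state=0):
    """Reduce embeddings to two dimensions with UMAP (assumed settings)."""
    import umap

    reducer = umap.UMAP(
        n_components=2, n_neighbors=n_neighbors, random_state=random_state
    )
    return reducer.fit_transform(embeddings)
```

In use, one would build the variant sequences with `apply_missense`, pass them together with the wild-type sequence to `embed_sequences`, and feed the resulting matrix to `project_umap`. The two-dimensional coordinates could then be inspected or passed to a downstream classifier alongside known benign and pathogenic labels.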