Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods
Abstract
Background: Variants of Uncertain Significance (VUS) represent a critical bottleneck in clinical genetics, with 20–41% of genetic test results yielding inconclusive VUS classifications. Current computational prediction tools, including AlphaMissense, achieve incomplete coverage and show systematic weaknesses in intrinsically disordered protein regions, where traditional structure-based features fail.

Methods: We developed a machine learning framework that integrates ESM-2 protein language model embeddings (1,280 dimensions) with AlphaMissense scores and 34 additional engineered genomic features, including gene constraint metrics, amino acid physicochemical properties, and evolutionary conservation scores. An XGBoost classifier was trained on 40,773 ClinVar variants with gene-level clustering to prevent data leakage, and evaluated on a held-out test set of 12,180 variants.

Results: Our integrated model achieved an AUC-ROC of 0.978 (95% CI: 0.973–0.982), representing a 66% reduction in classification error compared to AlphaMissense alone (0.934, p < 0.001 by DeLong test). Critically, ablation analysis confirmed that ESM-2 embeddings provide independent predictive value: the model without AlphaMissense achieved AUC-ROC of 0.929, still exceeding AlphaMissense alone (p < 0.0001). Temporal validation on 7,891 variants classified after AlphaMissense publication (September 2023) demonstrated robust generalization (AUC-ROC 0.968). The model showed consistent improvement across protein contexts, maintaining performance in both ordered regions (AUC 0.965) and intrinsically disordered regions (AUC 0.982). At 90% sensitivity, our model achieved 55% fewer false positives than AlphaMissense. Applied to 22,927 VUS, 52.5% could potentially be reclassified at conservative probability thresholds.

Conclusions: Synergistic integration of protein language models with structure-based predictions creates a framework with substantial clinical utility. ESM-2 embeddings provide complementary sequence-based signals that enhance predictions consistently across protein structural contexts.
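
Two steps in the abstract can be illustrated with a short sketch: gene-level grouping of the train/test split, and the arithmetic behind the reported error reduction. This is not the authors' pipeline. The record layout, the `gene_grouped_split` and `relative_error_reduction` helpers, the toy gene names, and the reading of "classification error" as 1 − AUC are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def gene_grouped_split(variants, test_fraction=0.25, seed=0):
    """Split variant records so that no gene contributes to both sets.

    Hypothetical helper: each record is assumed to be a dict with a "gene" key.
    """
    by_gene = defaultdict(list)
    for variant in variants:
        by_gene[variant["gene"]].append(variant)

    # Shuffle genes (not variants) so whole genes move together.
    genes = sorted(by_gene)
    random.Random(seed).shuffle(genes)

    target = test_fraction * len(variants)
    train, test = [], []
    for gene in genes:
        # Fill the test set gene by gene until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(by_gene[gene])
    return train, test


def relative_error_reduction(auc_model, auc_baseline):
    """Relative drop in (1 - AUC); an assumed reading of 'classification error'."""
    baseline_error = 1.0 - auc_baseline
    model_error = 1.0 - auc_model
    return (baseline_error - model_error) / baseline_error


# Toy records with made-up gene labels, only to exercise the split.
variants = [
    {"gene": g, "id": f"{g}_{i}"}
    for g, n in [("GENE_A", 3), ("GENE_B", 2), ("GENE_C", 2), ("GENE_D", 1)]
    for i in range(n)
]
train, test = gene_grouped_split(variants)
shared_genes = {v["gene"] for v in train} & {v["gene"] for v in test}

# AUC values taken from the abstract: 0.978 (integrated) vs 0.934 (AlphaMissense).
reduction = relative_error_reduction(0.978, 0.934)
```

Under the 1 − AUC reading, the rounded AUCs give a reduction of about two thirds, in line with the 66% figure reported in the abstract. Splitting by gene rather than by variant is what keeps variants from the same gene from appearing on both sides of the evaluation.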