From Text to Translation: Using Language Models to Prioritize Variants for Clinical Review

Weijiang Li
Xiaomin Li
Ethan Lavallee
Alice Saparov
Marinka Zitnik
Christopher Cassa

Abstract

Background

Despite rapid advances in genomic sequencing, most rare genetic variants remain insufficiently characterized for clinical use, limiting the potential of personalized medicine. When classifying whether a variant is pathogenic, clinical labs adhere to diagnostic guidelines that comprehensively evaluate many forms of evidence including case data, computational predictions, and functional screening. While a substantial amount of clinical evidence has been developed for many of these variants, the majority cannot be definitively classified as ‘pathogenic’ or ‘benign’, and thus persist as ‘Variants of Uncertain Significance’ (VUS).

Methods

We processed over 2.4 million plaintext variant summaries from ClinVar, employing sentence-level classification to remove content that does not contain evidence and removing uninformative or highly similar summaries. We then trained ClinVar-BERT to discern clinical evidence within these summaries by fine-tuning a BioBERT-based model with labeled records.
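The filtering step described above can be illustrated with a minimal sketch. The abstract does not say how "uninformative" or "highly similar" summaries were detected, so everything here is an assumption for illustration: the function names, the Jaccard similarity over word sets, the 0.8 similarity threshold, and the five-token minimum are all hypothetical choices, not the authors' method.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a summary and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(summaries, threshold=0.8, min_tokens=5):
    """Keep the first summary of each near-duplicate group.

    Assumptions (not from the paper): summaries with fewer than
    `min_tokens` distinct tokens count as uninformative, and two summaries
    count as highly similar when their Jaccard similarity >= `threshold`.
    """
    kept, kept_tokens = [], []
    for summary in summaries:
        tokens = tokenize(summary)
        if len(tokens) < min_tokens:
            continue  # too short to carry evidence
        if any(jaccard(tokens, seen) >= threshold for seen in kept_tokens):
            continue  # near-duplicate of a summary already kept
        kept.append(summary)
        kept_tokens.append(tokens)
    return kept


# Toy inputs, invented for illustration only.
examples = [
    "This variant has been observed in multiple individuals with breast cancer.",
    "This variant has been observed in multiple individuals with breast cancer",
    "Not provided.",
    "Functional studies show that this missense change impairs protein activity.",
]
filtered = drop_near_duplicates(examples)
```

The pairwise comparison is quadratic in the number of summaries, so a corpus of millions of records would likely need an approximate method such as MinHash with locality-sensitive hashing instead.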

Results

We validated ClinVar-BERT model predictions for variant summaries that are classified as uncertain (VUS) using orthogonal functional screening data. ClinVar-BERT significantly separated estimates of functional impact in clinically actionable genes, including BRCA1 ($p = 1.90 \times 10^{-20}$), TP53 ($p = 1.14 \times 10^{-47}$), and PTEN ($p = 3.82 \times 10^{-7}$), and achieved an AUROC of 0.927 when classifying whether variants result in loss of function or have uncertain effects.
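For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. The function name, the label encoding (1 for loss of function, 0 otherwise), and the toy scores are assumptions for illustration and are not data from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 values (assumed encoding: 1 = loss of function).
    scores: model scores, where higher means more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = auroc(toy_labels, toy_scores)  # 0.75
```

This pairwise form is equivalent to the Mann-Whitney U statistic divided by the number of positive-negative pairs; it is quadratic in the sample size, so rank-based implementations are preferable for large datasets.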

Conclusion

These findings suggest that ClinVar-BERT is capable of discerning evidence from diagnostic reports and can be useful for prioritizing variants for re-assessment by diagnostic laboratories and expert curation panels.

Version published to 10.1101/2024.12.31.24319792 on medRxiv
Dec 31, 2024

Related articles

Large Language Models Enhance Molecular Diagnoses of Mendelian Disorders via A Novel Logic

This article has 15 authors:
1. Zefu Chen
2. Jihao Cai
3. Yongxin Yang
4. Sen Zhao
5. Guozhuang Li
6. Kexin Xu
7. Qing Li
8. Timothy Hospedales
9. Lina Zhao
10. Zhongmin Zhang
11. Zhihong Wu
12. Guixing Qiu
13. Terry Jianguo Zhang
14. Pengfei Liu
15. Nan Wu
This article has no evaluations. Latest version: Dec 22, 2025

Benchmarking RNA-seq Tools for Real-World Diagnostic Applications

This article has 15 authors:
1. Sarah Silverstein
2. Kaushik Ganapathy
3. Sandra Donkervoort
4. Veronique Bolduc
5. Ying Hu
6. Justin Moy
7. Prech Uapinyoying
8. Svetlana Gorokhova
9. Vijay Ganesh
10. Ben Weisburd
11. Rotem OrBach
12. A. Reghan Foley
13. Pejman Mohammadi
14. David Adams
15. Carsten Bonnemann
This article has no evaluations. Latest version: Jan 29, 2026

VUS. Life: Leveraging Vector Embeddings for Rapid and Accurate Pathogenicity Prediction of Genetic Variants

This article has 6 authors:
1. Jiawei Wu
2. Marissa Stutzman
3. Michael Muriello
4. Joy Lincoln
5. Donald G. Basel
6. Xiaowu Gai
This article has no evaluations. Latest version: Jan 21, 2026
