A fine-tuned genomic language model adds complementary nucleotide-context information to missense variant interpretation

Yaqi Su
Yu-Jen Lin

Abstract

Missense variant interpretation remains a central challenge in clinical genomics. Existing missense pathogenicity predictors achieve strong performance, but many emphasize protein-level consequences or rely on overlapping annotation priors. Whether genomic language models add non-redundant nucleotide-context signal to missense interpretation remains unclear. Here, we systematically adapted genomic language models to ClinVar missense pathogenicity prediction across backbone architectures, representation strategies, classifier heads, and adaptation regimes. In our analysis, variant-position embeddings consistently outperformed pooled sequence representations, multi-species pretraining provided the strongest backbone-level advantage, and low-rank adaptation generalized better than full fine-tuning. The resulting fine-tuned model, GLM-Missense, substantially outperformed zero-shot scoring from the same pretrained model.
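The contrast between variant-position embeddings and pooled sequence representations can be illustrated with a minimal sketch. It assumes the model yields one embedding vector per token and that the variant maps to a single known token index; both function names are hypothetical and are not taken from the preprint.

```python
def variant_position_embedding(token_embeddings, variant_index):
    """Return the embedding of the token that covers the variant.

    Assumption: token_embeddings is a list of equal-length vectors,
    one per token, and variant_index is the token holding the variant.
    """
    return list(token_embeddings[variant_index])


def mean_pooled_embedding(token_embeddings):
    """Return the element-wise mean over all token embeddings."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [
        sum(row[d] for row in token_embeddings) / n_tokens
        for d in range(dim)
    ]


# Toy input: 3 tokens, 2 embedding dimensions (illustrative values only).
toy = [[0.0, 1.0], [2.0, 3.0], [7.0, 5.0]]
at_variant = variant_position_embedding(toy, variant_index=1)
pooled = mean_pooled_embedding(toy)
```

Pooling averages the variant token together with every flanking token, so a single-nucleotide change can be diluted in a long window, whereas the variant-position vector keeps it in focus. Models that tokenize several nucleotides per token would need an extra step, not shown here, to map the genomic coordinate to a token index.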

To test whether GLM-Missense contributes information beyond existing methods, we built MetaMissense, an XGBoost ensemble combining GLM-Missense with AlphaMissense, ESM1b, REVEL, CADD, SIFT, and PolyPhen-2. GLM-Missense showed the lowest concordance with other predictors, retained the strongest partial association with pathogenicity after controlling for the other predictors, and ranked as the most informative non-ensemble input to MetaMissense. MetaMissense achieved the best performance in both cross-validation and held-out testing. Analyses of variants correctly classified by GLM-Missense but misclassified by several established predictors suggested two patterns. First, part of the GLM-Missense signal may reflect splice-relevant exonic context. Second, GLM-Missense appears to add value in settings where other predictors may overweight allele frequency, gene-level constraint, or amino-acid-change severity. However, these features explained only about 10% of the distinction between the GLM-Missense-correct subset and the background. Together, our results demonstrate that fine-tuned genomic language models contribute complementary nucleotide-context information to missense variant interpretation.
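One way to picture a "partial association after controlling for other predictors" is a first-order partial correlation. The sketch below is an illustration only: it controls for a single covariate using the textbook formula, whereas the preprint controls for several predictors and does not state its exact statistic here. The function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z.

    Assumption: a single control variable z; undefined when z is
    perfectly correlated with x or y (division by zero).
    """
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy values: z is uncorrelated with both x and y, so controlling
# for it leaves the plain correlation between x and y unchanged.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
z = [1.0, -1.0, -1.0, 1.0]
result = partial_corr(x, y, z)
```

In this reading, x would be a predictor's score, y the pathogenicity label, and z a competing predictor. A score that stays strongly associated with the label after such adjustment carries signal that the control does not already capture.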

Version published to 10.64898/2026.05.06.723362 on bioRxiv
May 11, 2026

