Genomic heterogeneity inflates the performance of variant pathogenicity predictions

Baiyu Lu
Xueshen Liu
Po-Yu Lin
Nadav Brandes

Abstract

Recent studies have reported unprecedented accuracy in predicting pathogenic variants across the genome, including in noncoding regions, using large AI models trained on vast genomic data. We present a comprehensive evaluation of these frontier models, showing that their reported performance is inflated by differences in the prevalence of pathogenic variants across genomic contexts. We identify the best-performing models for each variant type and establish a benchmark to guide future progress.
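The inflation effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' evaluation pipeline; the context names, variant counts, and scores are all hypothetical. It builds a "predictor" that assigns one fixed score per genomic context, so it cannot separate pathogenic from benign variants within any context, and then compares the pooled AUROC with the per-context AUROC.

```python
from itertools import product


def auroc(labels, scores):
    """AUROC as the probability that a random pathogenic variant outscores
    a random benign one, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, chosen only for illustration:
# context -> (number pathogenic, number benign, score given to every variant)
contexts = {
    "context_A": (50, 50, 0.9),  # high prevalence of pathogenic variants
    "context_B": (10, 90, 0.1),  # low prevalence of pathogenic variants
}


def build(names):
    """Return (labels, scores) for all variants in the named contexts."""
    labels, scores = [], []
    for name in names:
        n_path, n_benign, score = contexts[name]
        labels += [1] * n_path + [0] * n_benign
        scores += [score] * (n_path + n_benign)
    return labels, scores


# Pooled evaluation mixes both contexts in one ranking.
pooled = auroc(*build(contexts))

# Stratified evaluation ranks variants only against others in the same context.
per_context = {name: auroc(*build([name])) for name in contexts}

print(f"pooled AUROC: {pooled:.3f}")
for name, value in per_context.items():
    print(f"{name} AUROC: {value:.3f}")
```

With these made-up numbers the pooled AUROC is about 0.738, while the AUROC within each context is exactly 0.5. The score carries no information beyond which context a variant belongs to, yet the pooled metric looks informative because the contexts differ in how often variants are pathogenic.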

Version published to 10.1101/2025.09.05.674459 on bioRxiv
Sep 8, 2025

Related articles

VUS. Life: Leveraging Vector Embeddings for Rapid and Accurate Pathogenicity Prediction of Genetic Variants

This article has 6 authors:
1. Jiawei Wu
2. Marissa Stutzman
3. Michael Muriello
4. Joy Lincoln
5. Donald G. Basel
6. Xiaowu Gai
This article has no evaluations. Latest version: Jan 21, 2026

Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods

This article has 1 author:
1. Hayden Farquhar
This article has no evaluations. Latest version: Feb 4, 2026

Decoding Complex Genotype-Phenotype Interactions by Discretizing the Genome

This article has 6 authors:
1. Jędrzej Kubica
2. Hetvi Jethwani
3. Krzysztof H. Banecki
4. Mauricio Moldes
5. Dariusz Plewczynski
6. Ben Busby
This article has no evaluations. Latest version: Dec 17, 2025
