PubMind: Literature-Based Genetic Variant Extraction and Functional Annotation Using Large Language Models

Peng Wang
Kai Wang

Abstract

The rapid growth of biomedical literature has produced extensive functional knowledge on genetic variants, much of which remains buried in unstructured text. Databases such as ClinVar and the Human Gene Mutation Database (HGMD) attempt to catalog this knowledge but have significant limitations: ClinVar depends on voluntary submissions and covers only a fraction of the published literature, while the academic version of HGMD is updated infrequently and provides limited functional annotation. To address these gaps, we developed PubMind, an AI-driven multi-layer framework that uses large language models (LLMs) to extract variant–function–disease associations and supporting evidence from text. PubMind integrates a fine-tuned BERT model for input triage with instruction-tuned GPT models for inferring disease associations and functional annotations. The system captures diverse variant types, including SNVs, CNVs, SVs, and gene fusions, and normalizes records to genome and transcriptome coordinates. Benchmarking demonstrates >90% accuracy in variant recognition and 99% precision in disease extraction. Applying PubMind to >41 million PubMed abstracts and >5 million open-access full-text articles produced PubMind-DB, a database containing ∼1.3 million unique variants with rich contextual annotations, accessible via a web interface and an API. Only ∼10% of PubMind’s variants overlapped with ClinVar entries, yet >80% of these showed concordant pathogenicity labels, including full agreement with ClinVar’s expert-reviewed variants. Case studies demonstrate PubMind-DB’s ability to uncover supporting evidence for variant pathogenicity that manual searches might otherwise miss. Together, these findings establish PubMind as a scalable LLM-based framework that transforms unstructured biomedical text into structured genomic knowledge, advancing variant interpretation for precision medicine.

Version published to 10.1101/2025.10.13.682183 on bioRxiv
Oct 15, 2025

Related articles

Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods

This article has 1 author:
1. Hayden Farquhar
This article has no evaluations. Latest version: Feb 4, 2026

VUS. Life: Leveraging Vector Embeddings for Rapid and Accurate Pathogenicity Prediction of Genetic Variants

This article has 6 authors:
1. Jiawei Wu
2. Marissa Stutzman
3. Michael Muriello
4. Joy Lincoln
5. Donald G. Basel
6. Xiaowu Gai
This article has no evaluations. Latest version: Jan 21, 2026

Benchmarking RNA-seq Tools for Real-World Diagnostic Applications

This article has 15 authors:
1. Sarah Silverstein
2. Kaushik Ganapathy
3. Sandra Donkervoort
4. Veronique Bolduc
5. Ying Hu
6. Justin Moy
7. Prech Uapinyoying
8. Svetlana Gorokhova
9. Vijay Ganesh
10. Ben Weisburd
11. Rotem OrBach
12. A. Reghan Foley
13. Pejman Mohammadi
14. David Adams
15. Carsten Bonnemann
This article has no evaluations. Latest version: Jan 29, 2026
