PubMind: Literature-Based Genetic Variant Extraction and Functional Annotation Using Large Language Models
Abstract
The rapid growth of biomedical literature has produced extensive functional knowledge on genetic variants, much of which remains buried in unstructured text. Current databases such as ClinVar and the Human Gene Mutation Database (HGMD) attempt to catalog this knowledge but have significant limitations: ClinVar depends on voluntary submissions and covers only a fraction of the published literature, while the academic version of HGMD is updated infrequently and provides limited functional annotation. To address these gaps, we developed PubMind, an AI-driven multi-layer framework that uses large language models (LLMs) to extract variant–function–disease associations and supporting evidence from text. PubMind integrates a fine-tuned BERT model for input triage with instruction-tuned GPT models for inferring disease associations and functional annotations. The system captures diverse variant types, including SNVs, CNVs, SVs, and gene fusions, and normalizes records to genome and transcriptome coordinates. Benchmarking demonstrates >90% accuracy in variant recognition and 99% precision in disease extraction. Applying PubMind to >41 million PubMed abstracts and >5 million open-access full-text articles produced PubMind-DB, a database containing ∼1.3 million unique variants with rich contextual annotations, accessible via a web interface and an API. Only ∼10% of PubMind’s variants overlapped with ClinVar entries, yet >80% of these showed concordant pathogenicity labels, including full agreement with ClinVar’s expert-reviewed variants. Case studies demonstrate PubMind-DB’s ability to uncover supporting evidence for variant pathogenicity that might otherwise be missed by manual searches. Together, these findings establish PubMind as a scalable LLM-based framework that transforms unstructured biomedical text into structured genomic knowledge, advancing variant interpretation for precision medicine.
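The two comparison figures quoted for ClinVar (the share of extracted variants that also appear in the reference, and the share of those shared variants whose pathogenicity labels agree) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `overlap_and_concordance`, the variant keys, and the labels are hypothetical placeholders, and the sketch assumes each variant has already been normalized to a single comparable key with one label per source.

```python
from typing import Dict, Tuple


def overlap_and_concordance(
    extracted: Dict[str, str], reference: Dict[str, str]
) -> Tuple[float, float]:
    """Return (overlap, concordance).

    overlap: fraction of extracted variants that also occur in the reference.
    concordance: fraction of the shared variants whose labels are identical.
    """
    if not extracted:
        return 0.0, 0.0
    shared = extracted.keys() & reference.keys()
    overlap = len(shared) / len(extracted)
    if not shared:
        return overlap, 0.0
    agree = sum(1 for key in shared if extracted[key] == reference[key])
    return overlap, agree / len(shared)


# Made-up placeholder records, not real variants or real database labels.
extracted = {
    "GENE_A:c.10A>G": "pathogenic",
    "GENE_A:c.25C>T": "benign",
    "GENE_B:c.7del": "pathogenic",
    "GENE_C:c.100G>A": "pathogenic",
}
reference = {
    "GENE_A:c.10A>G": "pathogenic",
    "GENE_A:c.25C>T": "pathogenic",
    "GENE_D:c.3G>C": "benign",
}

overlap, concordance = overlap_and_concordance(extracted, reference)
print(f"overlap={overlap:.2f}, concordance={concordance:.2f}")
```

Note that concordance is computed only over the shared variants, so a low overlap and a high concordance can hold at the same time.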