Large Language Models Enhance Molecular Diagnoses of Mendelian Disorders via A Novel Logic
Abstract
Mendelian disorders are a class of heritable conditions caused by mutations in a single gene. To date, 7,623 distinct Mendelian disorders and 4,989 related genes have been identified, and these disorders account for over 18% of pediatric hospitalizations. An accurate molecular diagnosis is critical for guiding clinical management. The conventional diagnostic logic involves classifying a patient's clinical phenotype into a well-defined disorder before identifying the causative gene. However, the expanding knowledge of the intricate relationships between phenotypes, diseases, and genes has created a many-to-many mapping challenge, which introduces significant ambiguity into the phenotype-driven prioritization of candidate genes. To address this challenge, we propose a novel diagnostic logic that bypasses predefined disease entities. By directly comparing a proband’s phenotypic profile with those of molecularly diagnosed patients documented in peer-reviewed literature, our approach identifies “phenotypic twins” to prioritize candidate genes. This logic is enabled by three key advances: 1) we fine-tuned sentence transformers to perform semantic searches across medical databases, retrieving a total of 2,305,927 publications on Mendelian disorders; 2) we developed a large language model-based pipeline to extract information from full-text publications, constructing a database of 1,252,565 phenotype-genotype associations; and 3) we designed a retrieval augmentation-based prioritization method, PhenoGemini, to identify “phenotypic twins” among the 382,474 individuals in the database.
Using the ranking of the correct gene within the top 10 as the evaluation metric, PhenoGemini outperformed existing orthogonal phenotype-driven gene prioritization tools, achieving a relative performance increase ranging from 68.01% to 215.87% across diverse real-world, multicenter validation cohorts (N = 160,000; N = 20,641; N = 1,589; N = 1,191, spanning 108 countries/regions and encompassing 2,534 genes). Furthermore, when combined with sequencing data, PhenoGemini ranked the correct gene within the top 10 in 98.41% of cases (N = 189). To support integration into clinical practice, we have further developed PhenoGemini to function as an AI agent that can extract phenotypes from clinical notes, prioritize candidate genes, and provide a detailed rationale to enhance interpretability. PhenoGemini has been deployed at PhenoGemini.org.
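The "phenotypic twin" logic and the top-10 metric described above can be illustrated with a minimal sketch. It is not the PhenoGemini implementation: it assumes a toy in-memory database, substitutes plain Jaccard overlap between phenotype term sets for the fine-tuned sentence-transformer similarity, and uses hypothetical names throughout (`LITERATURE_PATIENTS`, `prioritize_genes`, `top_k_hit_rate`, `relative_increase`, and the placeholder genes `GENE_A` to `GENE_C`).

```python
from collections import defaultdict

# Hypothetical toy database: each record stands for one molecularly
# diagnosed patient from the literature (phenotype terms, causal gene).
LITERATURE_PATIENTS = [
    ({"seizure", "microcephaly", "developmental delay"}, "GENE_A"),
    ({"seizure", "developmental delay", "hypotonia"}, "GENE_A"),
    ({"short stature", "scoliosis"}, "GENE_B"),
    ({"hearing loss", "retinal dystrophy"}, "GENE_C"),
]


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed stand-in for embedding similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prioritize_genes(proband, patients, n_twins=3):
    """Rank genes by the summed similarity of the proband's closest 'twins'."""
    scored = sorted(
        ((jaccard(proband, phenotypes), gene) for phenotypes, gene in patients),
        key=lambda pair: pair[0],
        reverse=True,
    )
    gene_scores = defaultdict(float)
    for similarity, gene in scored[:n_twins]:
        if similarity > 0:  # ignore patients with no shared phenotype
            gene_scores[gene] += similarity
    # Highest score first; gene name breaks ties deterministically.
    return sorted(gene_scores, key=lambda g: (-gene_scores[g], g))


def top_k_hit_rate(cases, patients, k=10):
    """Fraction of cases whose true gene appears in the top-k ranked list."""
    hits = sum(
        true_gene in prioritize_genes(phenotypes, patients)[:k]
        for phenotypes, true_gene in cases
    )
    return hits / len(cases)


def relative_increase(new, baseline):
    """Relative performance increase of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100
```

In this sketch a gene's score is simply the sum of similarities over the nearest matching patients, so a gene supported by several moderately similar published cases can outrank one supported by a single case. How PhenoGemini actually aggregates evidence across "phenotypic twins" is not specified in the abstract.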