Large Language Models Enhance Molecular Diagnoses of Mendelian Disorders via A Novel Logic

Abstract

Mendelian disorders are a class of heritable conditions caused by mutations in a single gene. To date, a total of 7,623 distinct Mendelian disorders and 4,989 related genes have been identified, and these disorders account for over 18% of pediatric hospitalizations. An accurate molecular diagnosis is critical for guiding clinical management. The conventional diagnostic logic involves classifying a patient's clinical phenotype into a well-defined disorder prior to identifying the causative gene. However, the expanding knowledge of intricate relationships between phenotypes, diseases, and genes has created a many-to-many mapping challenge. This complexity introduces significant ambiguity into the phenotype-driven prioritization of candidate genes. To address these challenges, we propose a novel diagnostic logic that bypasses predefined disease entities. By directly comparing a proband’s phenotypic profile with those of molecularly diagnosed patients documented in peer-reviewed literature, our approach identifies “phenotypic twins” to prioritize candidate genes. This novel logic is enabled by three key advances: 1) We fine-tuned sentence transformers to perform semantic searches across medical databases, retrieving a total of 2,305,927 publications on Mendelian disorders; 2) We developed a large language model-based pipeline to extract information from full-text publications, constructing a database with 1,252,565 phenotype-genotype associations; 3) We designed a retrieval augmentation-based prioritization method, PhenoGemini, to identify “phenotypic twins” among 382,474 individuals in the database.
Using the proportion of cases in which the correct gene ranks within the top 10 as the evaluation metric, PhenoGemini outperformed existing orthogonal phenotype-driven gene prioritization tools, achieving a relative performance increase ranging from 68.01% to 215.87% across diverse real-world, multicenter validation cohorts (N = 160,000; N = 20,641; N = 1,589; N = 1,191, spanning 108 countries/regions and encompassing 2,534 genes). Furthermore, when combined with sequencing data, PhenoGemini successfully ranked the correct gene within the top 10 in 98.41% of cases (N = 189). To support integration into clinical practice, we have further developed PhenoGemini to function as an AI agent, which can extract phenotypes from clinical notes, prioritize candidate genes, and provide a detailed rationale to enhance interpretability. PhenoGemini has already been deployed at PhenoGemini.org.
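
The "phenotypic twin" logic and the top-10 metric described in the abstract can be illustrated with a minimal sketch. This is not the PhenoGemini implementation: it assumes phenotypes are plain term sets and substitutes Jaccard overlap for the fine-tuned sentence-transformer similarity, and the function names (`rank_genes`, `top_k_hit`), gene labels, and toy cases are all hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Set overlap used here as a stand-in for embedding similarity (assumption)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_genes(proband, cases, top_k=10):
    """Rank genes by the best-matching documented patient ("phenotypic twin").

    `cases` is a list of (phenotype_set, causal_gene) pairs standing in for
    literature-derived, molecularly diagnosed patients.
    """
    best = defaultdict(float)
    for phenotypes, gene in cases:
        score = jaccard(proband, phenotypes)
        if score > best[gene]:
            best[gene] = score
    # Sort by descending similarity, breaking ties alphabetically by gene.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def top_k_hit(ranked, true_gene):
    """True if the diagnosed gene appears in the returned top-k list."""
    return any(gene == true_gene for gene, _ in ranked)


# Toy data: hypothetical genes and phenotype terms, not from the paper.
cases = [
    ({"seizures", "microcephaly", "hypotonia"}, "GENE_A"),
    ({"seizures", "ataxia"}, "GENE_B"),
    ({"short stature", "hearing loss"}, "GENE_C"),
    ({"seizures", "microcephaly"}, "GENE_A"),
]
proband = {"seizures", "microcephaly", "developmental delay"}

ranking = rank_genes(proband, cases)
print(ranking)
print(top_k_hit(ranking, "GENE_A"))
```

Scoring each gene by its single closest patient, rather than by an average over all patients with that gene, reflects the idea of matching a proband to an individual case instead of a predefined disease entity. Averaging the `top_k_hit` result over a cohort gives the top-10 rate used as the evaluation metric.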
