PheCode-guided multi-modal topic modeling of electronic health records improves disease incidence prediction and GWAS discovery from UK Biobank

Abstract

Phenome-wide association studies (PheWAS) rely on disease definitions derived from diagnostic codes, often failing to leverage the full richness of electronic health records (EHR). We present MixEHR-SAGE, a PheCode-guided multi-modal topic model that integrates diagnoses, procedures, and medications to enhance phenotyping from large-scale EHRs. By combining expert-informed priors with probabilistic inference, MixEHR-SAGE identifies over 1,000 interpretable phenotype topics from UK Biobank (UKB) data. Applied to 350,000 individuals with high-quality genetic data, MixEHR-SAGE-derived risk scores accurately predict future diagnoses of Type 2 Diabetes (T2D) and leukemia. Subsequent genome-wide association studies (GWAS) using these continuous risk scores uncovered novel loci, including PPP1R15A for T2D and JMJD6 / SRSF2 for leukemia, that were missed by traditional binary case definitions. These results highlight the potential of probabilistic phenotyping from multi-modal EHRs to improve genetic discovery. The MixEHR-SAGE software is publicly available at: https://github.com/li-lab-mcgill/MixEHR-SAGE .
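As a rough illustration of what "PheCode-guided" priors could look like, the sketch below builds a per-patient prior over phenotype topics in which a topic receives extra weight when its anchoring PheCode is observed in that patient's record. This is a simplified assumption made for exposition, not the MixEHR-SAGE implementation: the function names, the topic labels, and the `base` and `boost` values are all hypothetical, and the actual model is described in the full article and the linked repository.

```python
from typing import List, Set


def guided_topic_prior(
    patient_phecodes: Set[str],
    topic_phecodes: List[str],
    base: float = 0.01,   # hypothetical weight for topics with no supporting code
    boost: float = 1.0,   # hypothetical extra weight when the PheCode is observed
) -> List[float]:
    """Return one unnormalized prior weight per topic for a single patient.

    Each topic is anchored to exactly one PheCode label. A topic whose
    label appears in the patient's record gets `base + boost`; every
    other topic gets `base`.
    """
    return [
        base + (boost if code in patient_phecodes else 0.0)
        for code in topic_phecodes
    ]


def normalize(weights: List[float]) -> List[float]:
    """Scale the weights so they sum to 1 (the prior mean topic mixture)."""
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical topic labels standing in for real PheCodes.
topics = ["phecode_T2D", "phecode_leukemia", "phecode_hypertension"]

# A patient whose record contains only the T2D-anchored code.
patient = {"phecode_T2D"}

prior = guided_topic_prior(patient, topics)
prior_mean = normalize(prior)
```

In this toy setup the T2D topic dominates the patient's prior mixture, while unobserved topics keep a small nonzero weight. Under these assumptions, evidence from procedures and medications could still shift the posterior toward a phenotype that was never coded, which is the intuition behind a continuous risk score.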
