Aladynoulli: A Bayesian approach to disease progression modeling for genomic discovery and clinical prediction

Abstract

Understanding how disease patterns evolve over a lifetime remains a key challenge in medicine. While electronic health records provide rich longitudinal data, existing models typically analyze each disease in isolation, missing the complex interplay between conditions and genetic factors. Here, we present aladynoulli, a dynamic Bayesian framework that integrates longitudinal health records with genetic data to identify latent disease signatures while modeling individual-specific trajectories. Applied to three biobanks with up to 52 years of follow-up, our model discovers clinically interpretable disease signatures that show remarkable cross-population consistency (median 80% composition preservation) and reveal distinct biological subtypes within traditional diagnostic categories, with large effect sizes for signature differences between patient clusters (Cohen's d up to 4.25, p < 1 x 10^-8 for 95% of comparisons). Genetic validation demonstrates biological relevance through multiple complementary approaches: enrichment in known risk populations (familial hypercholesterolemia carriers, clonal hematopoiesis carriers), 151 genome-wide significant loci including novel cardiovascular associations in our dataset, rare variant associations with established disease genes (LDLR, TTN, BRCA2), and heritability exceeding that of the component diseases. We also include a non-specific low-incidence signature that captures resistance across many disease conditions. The model's explicit likelihood formulation enables principled corrections for selection bias through inverse probability weighting while preserving biological signal. For clinical prediction, aladynoulli substantially outperforms established risk scores (PCE, PREVENT, GAIL) across 28 conditions over both 1-year and 10-year horizons. By jointly modeling genetics and longitudinal diagnoses, aladynoulli achieves enhanced biological discovery and improved disease prediction through a unified, interpretable framework.
Code and interactive results are available at https://surbut.github.io/aladynoulli2/index.html, with an application at http://aladynoulli.hms.harvard.edu.
