Machine learning models for dementia risk prediction: Evidence from the Sydney Memory and Ageing Study
Abstract
Early dementia risk stratification remains challenging despite advances in biomarker development. We evaluated machine learning approaches for predicting incident dementia using routinely available clinical measures from the Sydney Memory and Ageing Study. Of 1037 community-dwelling Australians aged ≥70 years at baseline, 119 developed dementia and 313 remained dementia-free at 10-year follow-up. We compared logistic regression, LASSO-penalized regression, random forest, and XGBoost using baseline demographic, cognitive, cardiovascular, metabolic, and inflammatory markers. Models were trained on 70% of participants and evaluated on a 30% held-out test set. LASSO regression achieved the highest discrimination (AUC = 0.752), ahead of logistic regression (AUC = 0.707), random forest (AUC = 0.657), and XGBoost (AUC = 0.589). The LASSO model retained only four predictors: age, global cognition score, glucose level, and cardiovascular disease risk score. At the Youden-optimal threshold, LASSO showed balanced sensitivity (0.698) and specificity (0.736), with favourable positive and negative predictive values. Decision-curve analysis indicated that the LASSO model gave the greatest net clinical benefit across relevant risk thresholds. Notably, incorporating APOE ε4 carrier status did not improve prediction (AUC = 0.704), suggesting that genetic testing may be unnecessary for initial risk stratification. The final model equation can be implemented directly in clinical settings using a standard Excel calculator, can be recalibrated to different populations and age groups, and can support prediction at the individual level. These findings demonstrate that a parsimonious machine learning model using four routinely collected variables can meaningfully predict dementia risk a decade before onset, offering a pragmatic approach to population-level screening that requires neither specialized biomarkers nor genetic testing.
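The two evaluation steps named in the abstract, the Youden-optimal threshold and decision-curve net benefit, can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`confusion_counts`, `youden_threshold`, `net_benefit`) and the eight toy outcome/risk pairs are hypothetical and chosen only to make the arithmetic easy to follow. Youden's J is sensitivity + specificity − 1, and net benefit at threshold t is TP/n − (FP/n) · t/(1 − t).

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], risk: Sequence[float],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) when risk >= threshold is called positive."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, risk):
        positive = p >= threshold
        if positive and y == 1:
            tp += 1
        elif positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def youden_threshold(y_true: Sequence[int],
                     risk: Sequence[float]) -> Tuple[float, float]:
    """Scan observed risks and return (threshold, J) maximising Youden's J."""
    best_t, best_j = 0.0, -1.0
    for t in sorted(set(risk)):
        tp, fp, tn, fn = confusion_counts(y_true, risk, t)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1.0
        if j > best_j:  # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j


def net_benefit(y_true: Sequence[int], risk: Sequence[float],
                threshold: float) -> float:
    """Decision-curve net benefit: TP/n - FP/n * t / (1 - t)."""
    tp, fp, _, _ = confusion_counts(y_true, risk, threshold)
    n = len(y_true)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# Hypothetical toy data: 1 = developed dementia, 0 = remained dementia-free.
y_toy = [0, 0, 0, 0, 1, 0, 1, 1]
risk_toy = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]

t_star, j_star = youden_threshold(y_toy, risk_toy)
print("Youden-optimal threshold:", t_star, "J:", round(j_star, 3))
print("Net benefit at that threshold:",
      round(net_benefit(y_toy, risk_toy, t_star), 3))
```

A full decision curve would repeat `net_benefit` over a grid of thresholds and compare the model against the "treat all" and "treat none" strategies. The sketch evaluates a single threshold only.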