Machine Learning for Missing Data Imputation in Alzheimer’s Research: Predicting Medial Temporal Lobe Flexibility

Abstract

BACKGROUND

Alzheimer’s disease (AD) begins years before symptoms appear, making early detection essential. The medial temporal lobe (MTL) is one of the earliest regions affected, and its network flexibility, a dynamic measure of brain connectivity, may serve as a sensitive biomarker of early decline. Cognitive (acquisition, generalization), genetic (APOE, ABCA7), and biochemical (P-tau217) markers may predict MTL dynamic flexibility. Given the high rate of missing data in AD research, this study uses machine learning with advanced imputation methods to predict MTL dynamic flexibility from multimodal predictors in an aging cohort.

METHODS

We used data from 656 participants in an ongoing study at Rutgers’s Aging and Brain Health Alliance, including cognitive assessments, genetic and blood-derived biomarkers, and demographics. Due to MRI-related constraints, only 34.15% of participants had measurable MTL dynamic flexibility from resting-state fMRI. To estimate MTL dynamic flexibility from the available data, we evaluated four methods for handling missing data (case deletion, MICE, MissForest, and GAIN) and trained five families of regression models: Ridge, k-NN, SVR, regression trees (bagging, random forest, boosting), and ANN. Hyperparameters were optimized via grid search with 3-fold cross-validation. Model performance was assessed using mean absolute error (MAE), root mean squared error (RMSE), and runtime, estimated through 5-fold cross-validation repeated 25 times so that results do not depend on a single random split of a small clinical dataset.
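
The evaluation protocol described above (5-fold cross-validation repeated 25 times, scored with MAE and RMSE) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's code: the function names are hypothetical, the target values are synthetic, and a mean predictor stands in for the actual regression models. The sample size of 224 is only an approximation of 34.15% of 656 participants.

```python
import math
import random
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def repeated_kfold(n_samples, n_splits=5, n_repeats=25, seed=0):
    """Yield (train_idx, test_idx) pairs; indices are reshuffled on every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_splits roughly equal folds.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, test_idx


def evaluate_mean_baseline(y, n_splits=5, n_repeats=25, seed=0):
    """Score a predict-the-training-mean baseline on every split (placeholder model)."""
    maes, rmses = [], []
    for train_idx, test_idx in repeated_kfold(len(y), n_splits, n_repeats, seed):
        prediction = mean(y[i] for i in train_idx)
        y_test = [y[i] for i in test_idx]
        y_pred = [prediction] * len(y_test)
        maes.append(mae(y_test, y_pred))
        rmses.append(rmse(y_test, y_pred))
    return maes, rmses


# Synthetic targets (assumption): 224 values, not real flexibility measurements.
data_rng = random.Random(42)
y = [data_rng.gauss(0.5, 0.2) for _ in range(224)]

maes, rmses = evaluate_mean_baseline(y)
print(f"splits scored: {len(maes)}")
print(f"mean MAE:  {mean(maes):.3f}")
print(f"mean RMSE: {mean(rmses):.3f}")
```

Each of the 25 repeats contributes 5 held-out scores, so every model and imputation combination is summarized by 125 MAE and RMSE values rather than by one split. In a real pipeline the imputer would also need to be fitted inside each training fold to avoid leaking information from the held-out fold.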

RESULTS

A total of 1,866 missing values (25.86%) were identified in the dataset, and only 42 complete cases (6.40%) remained after listwise deletion, highlighting the need for effective imputation. In the initial analysis using only complete cases, support vector regression (SVR) achieved the lowest MAE (0.184), though overall performance was limited by the small sample size. In the second phase, the three imputation techniques (MICE, MissForest, and GAIN) were applied, significantly improving model accuracy. MissForest combined with Random Forest produced the best results (MAE = 0.083), representing a 54.7% improvement over case deletion. Statistical analysis confirmed significant differences in performance across imputation methods (p < 0.001), with MissForest outperforming GAIN and MICE. GAIN was the fastest imputation method.
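
The reported percentages can be cross-checked with a few lines of arithmetic. The sketch below makes two assumptions that the abstract does not state: that the 25.86% missingness rate is computed over all cells of a participants-by-variables table, and that the improvement is the relative reduction in MAE. Variable names are illustrative only.

```python
# Figures quoted in the abstract.
n_participants = 656
n_missing = 1866
missing_fraction = 0.2586
complete_cases = 42
mae_case_deletion = 0.184      # SVR on complete cases
mae_missforest_rf = 0.083      # MissForest + Random Forest

# Assumption: missingness is counted per cell of an n_participants x n_variables table.
implied_cells = n_missing / missing_fraction
implied_variables = implied_cells / n_participants

complete_case_pct = 100 * complete_cases / n_participants

# Assumption: "improvement" means relative reduction in MAE.
improvement_pct = 100 * (mae_case_deletion - mae_missforest_rf) / mae_case_deletion

# Both MAEs are rounded to three decimals, so the true improvement lies in a small range.
improvement_low = 100 * (1 - 0.0835 / 0.1835)
improvement_high = 100 * (1 - 0.0825 / 0.1845)

print(f"implied variables per participant: {implied_variables:.2f}")
print(f"complete cases: {complete_case_pct:.2f}%")
print(f"improvement from rounded MAEs: {improvement_pct:.1f}%")
print(f"range allowed by rounding: {improvement_low:.1f}% to {improvement_high:.1f}%")
```

Under these assumptions the missingness rate implies about 11 variables per participant, and the rounded MAEs give an improvement close to, but not exactly, the reported 54.7%. The reported value falls inside the range that three-decimal rounding allows, so the small difference is presumably a rounding effect.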

DISCUSSION

The findings underscore the importance of using robust imputation strategies to maximize data utility and model reliability in studies with high missingness. Further research is needed, particularly incorporating additional neuroimaging measures, to localize the brain regions most affected by biomarker-driven changes and to refine predictive models for clinical applications.
