Integrated Bioinformatics and Ensemble Learning Reveal Diagnostic Modeling and Drug Discovery in Alzheimer’s Disease
Abstract
Background: Alzheimer’s disease (AD) is driven by complex molecular and immune dysregulation, yet reliable diagnostic biomarkers and druggable targets remain limited. This study aimed to identify key AD-associated regulatory genes, characterize their immune and spatial expression features, and prioritize small-molecule compounds with therapeutic potential.

Methods: Multiple AD-related transcriptomic datasets (bulk RNA-seq, microarray, and spatial transcriptomic profiles) were retrieved from GEO and partitioned into discovery (GSE5281, GSE66333), validation (GSE110226, GSE28146, GSE29378), independent testing (GSE29378), and spatial validation (GSE147047) cohorts. Differential expression analysis and weighted gene co-expression network analysis (WGCNA) were used to construct co-expression networks and define AD-associated gene modules. Protein–protein interaction (PPI) analysis and multiple network centrality measures were then applied to prioritize candidate key genes. Twelve machine-learning algorithms were combined into 127 classification models, and SHAP-based interpretability analysis was used to quantify feature contributions and identify diagnostic genes. Single-cell and spatial transcriptomic data were further used to validate the cell-type specificity and spatial localization of the hub genes. Drug–gene enrichment analysis (DSigDB), compound retrieval (PubChem), ADMET and drug-likeness profiling, and blind molecular docking were integrated to screen and evaluate potential lead compounds.

Results: We identified 2,534 differentially expressed genes (DEGs) between AD and control samples, and their intersection with WGCNA-derived modules yielded 848 candidate genes. PPI-based network analysis prioritized 15 key genes, on which the 127 machine-learning models were built; the random forest model achieved the best overall performance, with an average AUC of 0.957. SHAP analysis identified 11 key diagnostic genes, among which IGF1R and SPP1 emerged as stable hub genes with AUCs greater than 0.70 across multiple external cohorts. Immune infiltration, single-cell, and spatial transcriptomic analyses demonstrated distinct immune associations and cell type– and region-specific expression patterns for these hub genes. Drug–gene enrichment identified 176 drug signatures and 445 related compounds, of which 37 grade-A molecules remained after ADMET and drug-likeness filtering. Molecular docking revealed four top-ranked compounds with binding energies below −9.0 kcal/mol, including one ligand with a minimum binding energy of −10.5 kcal/mol and extensive non-covalent interactions with the target protein.

Conclusion: This study developed a systematic methodological framework spanning gene discovery, diagnostic modeling, and lead-compound screening. IGF1R and SPP1 were identified as stable and biologically interpretable AD hub genes that may serve as diagnostic markers, and several high-affinity small-molecule compounds identified from these hub genes provide new drug candidates for targeted AD therapy.
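The stability criterion reported for the hub genes (a single-gene AUC above 0.70 in each external cohort) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `auc` and `stable_across_cohorts`, the cohort labels, and the expression values below are all hypothetical and chosen only for illustration. The sketch also assumes that higher expression indicates AD; for a down-regulated gene the expression values would need to be negated first.

```python
from typing import Dict, Sequence, Tuple


def auc(expression: Sequence[float], labels: Sequence[int]) -> float:
    """Single-gene AUC: probability that a random AD sample (label 1) has
    higher expression than a random control (label 0); ties count as 0.5."""
    cases = [x for x, y in zip(expression, labels) if y == 1]
    controls = [x for x, y in zip(expression, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for case_value in cases:
        for control_value in controls:
            if case_value > control_value:
                wins += 1.0
            elif case_value == control_value:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def stable_across_cohorts(
    cohorts: Dict[str, Tuple[Sequence[float], Sequence[int]]],
    threshold: float = 0.70,
) -> bool:
    """True only if the gene's AUC exceeds the threshold in every cohort."""
    return all(auc(expr, labels) > threshold for expr, labels in cohorts.values())


# Hypothetical toy data: (expression values, labels) per cohort, 1 = AD, 0 = control.
toy_cohorts = {
    "cohort_A": ([5.1, 6.3, 7.0, 4.2, 4.8, 5.0], [1, 1, 1, 0, 0, 0]),
    "cohort_B": ([3.0, 2.5, 3.8, 2.9, 2.0, 3.1], [1, 1, 1, 0, 0, 0]),
}

for name, (expr, labels) in toy_cohorts.items():
    print(name, round(auc(expr, labels), 3))
print("stable:", stable_across_cohorts(toy_cohorts))
```

Requiring the threshold to hold in every cohort, rather than on average, is one plausible reading of "stable"; the abstract does not specify how stability was aggregated across cohorts.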