A Reproducible and Unified Benchmark of Deep Learning Feature Selection Across Simulations and Multi-Omics datasets
Abstract
Reliable feature selection is critical for extracting small, interpretable biomarker panels from high-dimensional omics data, yet deep learning-based feature selection methods have rarely been compared systematically. We present a unified and reproducible benchmark of deep learning feature selection methods: post-hoc attribution explainers (GradientSHAP and DeepLIFT), perturbation explainers (LIME, Feature Ablation, and Occlusion), and embedded selectors (CancelOut, EAR-FS, and GRACES). Using standardized preprocessing, a shared neural network architecture, and Optuna hyperparameter tuning, we evaluate feature selection performance and computational efficiency on simulations, and downstream predictive accuracy on diverse real datasets: gene expression datasets with binary outcomes, five TCGA projects spanning mRNA, methylation, and SNP modalities with multi-class outcomes, and ADNI gene expression with continuous neuroimaging phenotypes. EAR-FS and GRACES consistently perform best: GRACES is the most robust but computationally intensive, whereas EAR-FS achieves similar accuracy at much lower computational cost. Classical methods remain competitive when signals are sparse and near-linear. Post-hoc explainers contribute most as interpretability tools and model auditors rather than as primary subset selectors. To enable reproducibility and broad adoption, we provide a software hub and website implementing all of these methods with standardized pipelines and evaluation routines, facilitating efficient feature selection under typical constraints of budget, sample size, and turnaround time.
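To make the perturbation family concrete, the following is a minimal, self-contained sketch of Feature Ablation as a selector: each feature is replaced in turn by a baseline value, and the resulting increase in prediction error is taken as that feature's importance. The synthetic data, the least-squares model standing in for the benchmark's shared neural network, and the baseline choice are all illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10

# Synthetic "omics" matrix: only features 0 and 3 carry signal.
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=n)

# Fit a linear model by least squares (a stand-in for the shared network).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
predict = lambda Z: Z @ w

def ablation_importance(X, y, predict, baseline=0.0):
    """Score each feature by the rise in MSE when it is set to a baseline."""
    base_err = np.mean((predict(X) - y) ** 2)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xa = X.copy()
        Xa[:, j] = baseline          # ablate one feature at a time
        scores[j] = np.mean((predict(Xa) - y) ** 2) - base_err
    return scores

scores = ablation_importance(X, y, predict)
top2 = sorted(np.argsort(scores)[-2:].tolist())
print(top2)  # the two informative features, 0 and 3, rank highest
```

In practice the same loop applies to any fitted model via its `predict` function; its cost grows linearly with the number of features, one reason perturbation explainers become expensive on full omics panels.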