CausalDRIFT: Causal Dimensionality Reduction via Inference of Feature Treatments for Robust Healthcare Machine Learning
Abstract
Feature selection in high-dimensional medical datasets is challenging because traditional methods often prioritize spurious correlations over causally relevant variables, compromising model interpretability and clinical utility. We introduce CausalDRIFT, a causal feature selection algorithm grounded in the Frisch-Waugh-Lovell theorem and Double Machine Learning, which estimates the Average Treatment Effect (ATE) of each feature on the clinical outcome while adjusting for confounders. We evaluated CausalDRIFT against seven baseline methods (including PCA, ICA, Elastic Net, and RFE) on four datasets (Heart Disease, Diabetes, Breast Cancer, and PCOS) using XGBoost classifiers, measuring accuracy, precision, recall, and F1-score. CausalDRIFT achieved competitive performance and did best on datasets with strong causal structure (e.g., 90% accuracy and an F1-score of 0.90 on PCOS, outperforming most other methods). It was also the most consistent method (lowest standard deviation, 1.19, on Breast Cancer; lowest recall spread on Heart Disease) and was robust to confounding, though on correlation-dominated datasets it traded marginal predictive gains for interpretability. It performed well in high-dimensional, low-sample-size (HDLSS) settings such as the Breast Cancer dataset (569×32), where it achieved 93.9% accuracy and a 91.8% F1-score. Statistical analysis (ANOVA, Tukey HSD) found no significant difference in recall between CausalDRIFT and the top-performing methods (all p > 0.15), while unsupervised techniques such as ICA significantly underperformed (p < 0.05). CausalDRIFT bridges the gap between causal inference and scalable feature selection, offering clinically interpretable and generalizable models. Its ability to prioritize causally actionable features is critical for high-stakes decision-making and makes it a promising tool for healthcare AI, particularly for conditions such as PCOS, where etiology is complex.
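To make the partialling-out idea behind the Frisch-Waugh-Lovell theorem concrete, the following is a minimal sketch and not the authors' implementation. It assumes a continuous outcome, purely linear nuisance models, and that every other feature is treated as a confounder of the feature under study; the abstract does not specify any of these choices. The helper names (`residualize`, `fwl_effect`, `rank_features`) and the synthetic data are hypothetical. A Double Machine Learning variant would typically swap the linear fits for flexible learners and add cross-fitting.

```python
import numpy as np


def residualize(target, controls):
    """Return the part of `target` not linearly explained by `controls` (plus an intercept)."""
    Z = np.column_stack([np.ones(len(controls)), controls])
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ coef


def fwl_effect(X, y, j):
    """Partialling-out estimate of the effect of feature j on y.

    Assumption (not from the source): all remaining columns act as confounders.
    """
    others = np.delete(X, j, axis=1)
    t_res = residualize(X[:, j], others)  # feature with confounders removed
    y_res = residualize(y, others)        # outcome with confounders removed
    # Regress residual on residual: slope = <t_res, y_res> / <t_res, t_res>
    return float(t_res @ y_res / (t_res @ t_res))


def rank_features(X, y):
    """Rank features by the absolute size of their estimated effect (largest first)."""
    effects = np.array([fwl_effect(X, y, j) for j in range(X.shape[1])])
    order = np.argsort(-np.abs(effects))
    return order, effects


# Synthetic illustration: `spurious` correlates with y only through the confounder.
rng = np.random.default_rng(0)
n = 2000
confounder = rng.normal(size=n)
causal = 0.8 * confounder + rng.normal(size=n)
spurious = 0.8 * confounder + rng.normal(size=n)
y = 2.0 * causal + 1.5 * confounder + rng.normal(size=n)

X = np.column_stack([confounder, causal, spurious])
order, effects = rank_features(X, y)

for idx in order:
    print(f"feature {idx}: estimated effect {effects[idx]:+.3f}")
```

In this toy setup the spurious feature is marginally correlated with the outcome but receives an estimated effect near zero once the confounder is partialled out, which is the behavior a causal selector relies on to discard it. Ranking by absolute effect is only one possible selection rule; a threshold on the estimate or on its confidence interval would be an equally plausible choice.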