A Basic Trustworthy Machine Learning Framework for Early Diabetes Detection
Abstract
This research presents a comprehensive trustworthy machine learning framework for early diabetes detection, addressing critical gaps in reliability, interpretability, and fairness in clinical AI systems. The study integrates causal inference, modern ensemble methods (LightGBM, XGBoost-DART, HistGBM), and TabNet for tabular deep learning to enhance predictive performance while ensuring transparency. A novel Causal-guided Stacking Classifier (CGSC) is introduced, utilizing LightGBM as a meta-learner trained on causally relevant features identified through Causal Forests. The framework emphasizes interpretability through SHAP-based global and local explanations and leverages TabNet’s intrinsic attention mechanism for feature-level insights. Counterfactual reasoning (DiCE) enables personalized risk mitigation strategies by identifying minimal feature changes to alter predictions. To promote fairness, gender is excluded as a direct feature, reducing demographic bias. Experimental results demonstrate robust performance: CGSC achieves the highest recall (0.81), critical for early warning systems, while TabNet attains superior precision (0.79). Uncertainty quantification reveals stable F1-scores (0.73 ± 0.03) across ensemble models. Key causal drivers include general health (ATE = 0.1392) and cardiovascular factors, while counterintuitive findings like alcohol consumption’s negative association (ATE = -0.1875) warrant further investigation. The framework’s emphasis on causal feature selection, model transparency, and actionable explanations aligns with healthcare requirements for trustworthy AI, offering a reproducible solution for diabetes risk stratification with potential clinical applicability. All experiments are fully reproducible, with resources available at the GitHub repository.