Explainability in action: A metric-driven assessment of local explanations for healthcare tabular models

Abstract

Explainable AI (XAI) is essential in clinical machine learning, yet quantitative evaluation of explanation quality is rarely reported in a reproducible and comparable way. We address this gap with a reproducible, metric-driven evaluation framework for comparing XAI methods on healthcare tabular data, consolidating six established, family-specific quantitative metrics (fidelity, simplicity, consistency, robustness, precision, and coverage) into explicit equations, pairing them with a pre-specified focal-model selection protocol, and releasing open-source code together with a method–metric applicability map. All quantitative results are summarized as mean ± standard deviation over test instances to capture both average behavior and instance-level variability. Global summaries (e.g., aggregated SHAP importances, EBM main-effect shapes, or TabNet aggregated importances) are reported descriptively only. Using the framework, we evaluate five widely used approaches, LIME, SHAP, Anchors, EBM, and TabNet, on four healthcare tabular datasets; the approaches span post-hoc feature attribution (LIME, SHAP), post-hoc rule extraction (Anchors), and inherently interpretable models (EBM, TabNet). For tree ensembles, we additionally report Random Forest global importances (Gini/MDI and permutation) as descriptive cross-checks alongside EBM/SHAP/TabNet global profiles. Empirically, SHAP (TreeSHAP) attains exact score fidelity (1.0) and near-perfect decision fidelity for tree ensembles; LIME yields simpler but less robust, lower-fidelity explanations with substantially higher instance-level variability in decision fidelity; TabNet most often produces the simplest explanations across thresholds, though with high variance on some datasets; EBM and TabNet offer the most robust explanations under small perturbations; and Anchors returns high-precision, human-readable rules whose coverage decreases as precision thresholds tighten. LIME and SHAP show moderate-to-high agreement on salient features, and global profiles (reported descriptively) align with known risk factors. Why this matters: the framework enables apples-to-apples comparisons, reduces confounds, and turns narrative guidance into testable, quantitative practice, helping practitioners choose XAI methods by application priority (e.g., fidelity, robustness, rule precision/coverage). Although demonstrated in healthcare, the framework generalizes to other high-stakes tabular machine learning domains.
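
To make the metric pipeline concrete, the sketch below is a minimal illustration, not the authors' implementation: the dataset, the Random Forest focal model, and the exact score-fidelity formula are assumptions chosen for demonstration. It computes TreeSHAP attributions for a tree ensemble, checks the additive reconstruction that underlies exact score fidelity, summarizes the result as mean ± standard deviation over test instances, and reports Gini/MDI and permutation importances as descriptive global cross-checks.

```python
import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical setup: a public tabular dataset stands in for the paper's
# healthcare datasets, and a Random Forest stands in for the focal model.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# TreeSHAP attributions; shap's return shape differs across versions
# (list of per-class arrays vs. a single (n, features, classes) array).
explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X_test)
attributions = sv[1] if isinstance(sv, list) else sv[:, :, 1]
base_value = np.asarray(explainer.expected_value).ravel()[-1]

# For tree ensembles, base value + summed attributions reconstruct the
# predicted class-1 probability, which is why score fidelity is exact.
reconstructed = base_value + attributions.sum(axis=1)
predicted = model.predict_proba(X_test)[:, 1]

# One plausible per-instance score-fidelity formulation (an assumption,
# not the paper's equation), summarized as mean +/- std over test instances.
score_fidelity = 1.0 - np.abs(reconstructed - predicted)
print(f"score fidelity: {score_fidelity.mean():.3f} +/- {score_fidelity.std():.3f}")

# Descriptive global cross-checks: Gini/MDI and permutation importances.
mdi = model.feature_importances_
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print("top MDI feature index:", int(np.argmax(mdi)))
print("top permutation feature index:", int(np.argmax(perm.importances_mean)))
```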

Source code

https://github.com/matifq/XAI_Tab_Health
