Ensemble Post-hoc Explainable AI in Multivariate Time Series: Identifying Medical Features Driving Disease Prediction

Abstract

Despite the growing success of deep learning (DL) in multivariate time-series classification, such as 12-lead electrocardiography (ECG), widespread integration into clinical practice has yet to be achieved. This limited transparency of DL hinders clinical adoption, because understanding model decisions is crucial for trust and for compliance with regulations such as the General Data Protection Regulation (GDPR) and the EU AI Act.

To tackle this challenge, we implemented a state-of-the-art 1D-ResNet in PyTorch, trained on the large-scale Brazilian CODE dataset to classify six ECG abnormalities. We then applied the model to the German PTB-XL dataset and evaluated its decision-making using 16 post-hoc explainable AI (XAI) methods. To assess the clinical relevance of the model's attributions, we conducted a Wilcoxon signed-rank test to identify features with significantly higher relevance for each XAI method, and used an ensemble majority vote across methods to validate whether the model had learned clinically meaningful features for each abnormality. Additionally, a Mann–Whitney U test was employed to detect significant differences in relevance attributions between correctly and incorrectly classified ECGs.
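To illustrate the kind of pipeline this describes, the following is a minimal sketch combining Captum attributions with SciPy significance tests. The tiny stand-in classifier, input shapes, occlusion window, annotated-region indices, and correct/incorrect split are all hypothetical placeholders, not the authors' actual 1D-ResNet or evaluation protocol.

```python
import torch
import torch.nn as nn
from captum.attr import DeepLift, Occlusion
from scipy.stats import wilcoxon, mannwhitneyu

# Tiny stand-in classifier; the original work used a 1D-ResNet trained on CODE.
model = nn.Sequential(
    nn.Conv1d(12, 16, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 6),              # six ECG abnormality classes
).eval()

ecgs = torch.randn(8, 12, 4096)    # hypothetical batch of 12-lead ECGs
target_class = 0                   # abnormality under study

# Post-hoc attributions (DeepLift and Occlusion shown; the study compared 16 methods).
attr_dl = DeepLift(model).attribute(ecgs, target=target_class)
attr_occ = Occlusion(model).attribute(
    ecgs, target=target_class,
    sliding_window_shapes=(1, 50), strides=(1, 25),   # window/stride are assumptions
)

# Wilcoxon signed-rank test: is mean |relevance| inside a clinically annotated
# window (placeholder sample indices) higher than outside it, per ECG?
inside = attr_dl[:, :, 1000:1200].abs().mean(dim=(1, 2)).detach().numpy()
outside = attr_dl[:, :, :1000].abs().mean(dim=(1, 2)).detach().numpy()
w_stat, p_wilcoxon = wilcoxon(inside, outside, alternative="greater")

# Mann–Whitney U test: do relevance scores differ between correctly and
# incorrectly classified ECGs? (the split here is a placeholder)
u_stat, p_mwu = mannwhitneyu(inside[:4], inside[4:], alternative="two-sided")
print(p_wilcoxon, p_mwu)
```

In this reading, the per-method Wilcoxon result feeds the ensemble majority vote over XAI methods, while the Mann–Whitney comparison contrasts attribution distributions between correct and incorrect predictions.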

Overall, the model achieved sensitivity scores above 0.9 for most abnormalities in the PTB-XL dataset. However, our XAI analysis showed that the model struggled to capture clinically relevant features for some diseases. Certain XAI methods, including DeepLift, DeepLiftShap, and Occlusion, consistently highlighted clinically meaningful features across abnormalities, while others, such as LIME, KernelShap, and LRP, failed to do so. Moreover, some XAI methods demonstrated significant differences in attributions between correctly and incorrectly classified ECGs, highlighting their potential for enhancing model robustness and interpretability.

In conclusion, our findings underscore the importance of selecting suitable XAI methods tailored to specific model architectures and data types to ensure transparency and reliability. By identifying effective XAI techniques, this study contributes to closing the gap between DL advancements and their clinical implementation, paving the way for more trustworthy AI-driven healthcare solutions.
