Evaluating the Predictive Accuracy of Deep Learning Algorithms for Postoperative Mortality in Cardiac Surgery: A Systematic Review and Meta-Analysis
Abstract
Background: Risk stratification in cardiac surgery has long depended on logistic regression models built from a fixed set of preoperative variables, an approach that, while extensively validated, cannot capture the complexity of real patient physiology. Deep learning (DL) offers a fundamentally different paradigm, one capable of detecting non-linear interactions across high-dimensional datasets. We conducted this systematic review and meta-analysis to quantify whether that theoretical advantage translates into measurably better prediction of postoperative mortality after cardiac surgery.

Methods: We searched PubMed/MEDLINE, Embase, and IEEE Xplore following PRISMA 2020 and Cochrane Prognosis Methods Group guidelines. Eligible studies directly compared DL architectures against established risk scores, namely EuroSCORE II or STS-PROM, for short-term mortality in adult cardiac surgery populations. Methodological quality was assessed with PROBAST+AI. Because raw AUC values are bounded and violate the normality assumptions required for standard pooling, all estimates were logit-transformed prior to meta-analysis using a restricted maximum likelihood (REML) random-effects model.

Results: Six studies met the inclusion criteria, representing 250,560 patients across markedly different clinical settings. Deep learning models achieved a pooled AUC of 0.856 (95% CI: 0.774–0.913). This came with a caveat: between-study heterogeneity was substantial (I² = 91.3%), reflecting the diversity of architectures, cohort sizes, and institutional contexts included. Traditional risk scores yielded a pooled AUC of 0.815 (95% CI: 0.754–0.864; I² = 77.9%).

Conclusion: DL models outperform conventional risk scores on discrimination. The gap, however, sits alongside serious unresolved questions: heterogeneity is high, calibration data are largely absent from the primary literature, and most evidence comes from retrospective single-centre cohorts.
Standardized reporting frameworks are a prerequisite, not a recommendation, before these models enter routine clinical practice.
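As a minimal sketch of the pooling approach described in the Methods, the snippet below logit-transforms per-study AUCs (using the delta method for standard errors), fits a random-effects model, and back-transforms the pooled estimate. The AUC and standard-error inputs are illustrative values, not the six included studies' data, and for self-containment it uses the simpler DerSimonian–Laird estimator of between-study variance rather than the REML estimator used in the analysis.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical per-study AUCs and their standard errors (illustration only)
aucs = [0.80, 0.86, 0.90, 0.83]
ses = [0.03, 0.02, 0.04, 0.025]

# Delta method: SE on the logit scale is se / (p * (1 - p))
y = [logit(a) for a in aucs]
v = [(se / (a * (1 - a))) ** 2 for a, se in zip(aucs, ses)]

# DerSimonian-Laird estimate of between-study variance tau^2
w = [1 / vi for vi in v]
ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(y) - 1)) / c)

# Random-effects pooled estimate, back-transformed to the AUC scale
w_re = [1 / (vi + tau2) for vi in v]
pooled_logit = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
se_pooled = math.sqrt(1 / sum(w_re))
lo = pooled_logit - 1.96 * se_pooled
hi = pooled_logit + 1.96 * se_pooled

# I^2: proportion of total variability due to between-study heterogeneity
i2 = max(0.0, (q - (len(y) - 1)) / q) * 100 if q > 0 else 0.0

print(f"Pooled AUC: {inv_logit(pooled_logit):.3f} "
      f"(95% CI {inv_logit(lo):.3f}-{inv_logit(hi):.3f}), I2 = {i2:.1f}%")
```

Pooling on the logit scale keeps the confidence interval inside the (0, 1) bound of the AUC, which a normal interval computed on the raw scale cannot guarantee for estimates near 1.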