TOWARDS AN AI-DRIVEN REGISTRY FOR POSTOPERATIVE COMPLICATIONS: A PROOF-OF-CONCEPT STUDY EVALUATING THE OPPORTUNITIES AND CHALLENGES OF AI MODELS
Abstract
Background
Continuous quality improvement is essential in surgery, with clinical registries and quality improvement programs (QIPs) playing a key role. Postoperative complications (PCs) require substantial resources to manage, yet traditional QIPs are expensive and often place a significant data-collection burden on clinicians. Artificial intelligence (AI), particularly natural language processing (NLP), offers a potential solution by automating and streamlining these processes, but models can be optimized for either sensitivity or positive predictive value. This study aimed to develop a mock-up automated registry for PCs using NLP algorithms and to evaluate the effects of optimization strategies on surgical quality control. We hypothesized that using NLP to obtain longitudinal overviews of key quality metrics is feasible, but that the optimization strategy affects the observed rate of PCs and thus how quality management and surveillance would play out in a real-world setting.

Methods
We analyzed 100,505 surgical cases from 12 Danish hospitals between 2016 and 2022. Previously validated NLP models were applied to detect seven types of PCs using two threshold settings: one set of thresholds optimized for positive predictive value (PPV, or precision), referred to as F-score of 0.5, and one set optimized for sensitivity, referred to as F-score of 2. Trends in PC rates over time were assessed, and hospital-level variations were examined using logistic regression models adjusted for age, sex, and comorbidity.

Results
The NLP models detected 8,512 or 15,892 PCs, depending on threshold selection, corresponding to total PC rates of 9.14% and 17.1%, respectively. Most PCs showed stable or increasing trends over time, regardless of threshold setting. Hospital-level analyses similarly revealed stable or rising PC rates in most institutions. Regression analyses demonstrated that threshold selection significantly influenced the findings, including comparisons between hospitals.

Conclusion
This study demonstrates that NLP can be used for automated PC detection in surgical quality monitoring. However, threshold selection and additional performance metrics, such as precision-recall curves (PPV-sensitivity curves), must be carefully considered to ensure reliable and meaningful results beyond the traditional evaluation by area under the receiver operating characteristic curve (ROC AUC).
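
To make the effect of threshold selection concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the two settings correspond to picking the decision threshold that maximizes the F-beta score, F_beta = (1 + beta^2) * PPV * sensitivity / (beta^2 * PPV + sensitivity), with beta = 0.5 (favoring PPV) and beta = 2 (favoring sensitivity). The function names, scores, and labels are hypothetical toy values chosen for illustration only.

```python
def fbeta(precision, recall, beta):
    """F-beta score: beta < 1 favors precision (PPV), beta > 1 favors recall (sensitivity)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(scores, labels, beta):
    """Return the (threshold, F-beta) pair that maximizes F-beta over the observed scores."""
    best = (None, -1.0)
    for t in sorted(set(scores)):
        # A case is flagged as a complication when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = fbeta(precision, recall, beta)
        if f > best[1]:
            best = (t, f)
    return best


def detected_rate(scores, threshold):
    """Fraction of cases flagged as having a complication at the given threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Hypothetical model scores and true labels for ten surgical cases (toy data).
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

t_ppv, f_ppv = best_threshold(scores, labels, beta=0.5)   # PPV-oriented setting
t_sens, f_sens = best_threshold(scores, labels, beta=2.0)  # sensitivity-oriented setting

print("F0.5 threshold:", t_ppv, "detected rate:", detected_rate(scores, t_ppv))
print("F2 threshold:", t_sens, "detected rate:", detected_rate(scores, t_sens))
```

On this toy data the sensitivity-oriented threshold is lower than the PPV-oriented one, so the same model flags a larger share of cases. This is the mechanism by which the observed PC rate in a registry can shift with the optimization strategy alone.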