Feasibility of GPT-3.5 versus Machine Learning for Automated Surgical Decision-Making Determination: A Multicenter Study on Suspected Appendicitis

Sebastian Sanduleanu
Koray Ersahin
Johannes Bremm
Narmin Talibova
Tim Damer
Merve Erdogan
Jonathan Kottlors
Lukas Goertz
Christiane Bruns
David Maintz
Nuran Abdullayev

Abstract

Background: Nonsurgical treatment of uncomplicated appendicitis is a reasonable option in many cases, despite the scarcity of robust, easily accessible, externally validated, and multimodally informed clinical decision support systems (CDSSs). Developed by OpenAI, the Generative Pre-trained Transformer 3.5 model (GPT-3.5) may provide enhanced decision support for surgeons in less certain appendicitis cases or in those posing a higher risk of (relative) operative contraindications. Our objective was to determine whether GPT-3.5, when provided with high-throughput clinical, laboratory, and radiological text-based information, reaches clinical decisions similar to those of a machine learning model and a board-certified surgeon (reference standard) when choosing between appendectomy and conservative treatment.
Methods: In this cohort study, we randomly selected patients presenting at the emergency department (ED) of two German hospitals (GFO, Troisdorf, and University Hospital Cologne) with right abdominal pain between October 2022 and October 2023. Statistical analysis was performed using R, version 3.6.2, on RStudio, version 2023.03.0+386. Overall agreement between the GPT-3.5 output and the reference standard was assessed by means of inter-observer kappa values as well as accuracy, sensitivity, specificity, and positive and negative predictive values, computed with the “caret” and “irr” packages. Statistical significance was defined as p < 0.05.
Results: The surgeon’s decision and GPT-3.5 agreed in 102 of 113 cases, and all cases in which the surgeon decided upon conservative treatment were correctly classified by GPT-3.5. The machine learning model had an estimated training accuracy of 83.3% (95% CI: 74.0, 90.4) and a validation accuracy of 87.0% (95% CI: 66.4, 97.2). GPT-3.5 reached an accuracy of 90.3% (95% CI: 83.2, 95.0), which was not significantly better than that of the machine learning model (p = 0.21).
Conclusions: To our knowledge, this is the first study of the “intended use” of GPT-3.5 for surgical treatment decisions that compares a surgeon’s decision-making with that of an algorithm. It found a high degree of agreement between board-certified surgeons and GPT-3.5 for surgical decision-making in patients presenting to the emergency department with lower abdominal pain.
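
As a rough check on the reported GPT-3.5 accuracy, the sketch below recomputes it from the 102 of 113 agreeing cases and attaches an exact (Clopper-Pearson) binomial confidence interval. It is written in Python rather than the R used in the study, and it assumes the reported interval is an exact binomial interval; the abstract does not say which interval type was used. The function names `binom_cdf`, `bisect_decreasing`, and `clopper_pearson` are illustrative and do not come from the article.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_decreasing(f, lo=0.0, hi=1.0, iters=100):
    """Find the root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies to the left of mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion k/n."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else bisect_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else bisect_decreasing(
        lambda p: binom_cdf(k, n, p) - alpha / 2
    )
    return lower, upper


# Counts taken from the Results: agreement in 102 of 113 cases.
agree, total = 102, 113
accuracy = agree / total
ci_low, ci_high = clopper_pearson(agree, total)
print(f"accuracy = {accuracy:.1%}, 95% CI = ({ci_low:.1%}, {ci_high:.1%})")
```

The abstract does not report how the surgery and conservative cases split, so sensitivity, specificity, predictive values, and kappa cannot be recomputed from it.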

Version published to 10.3390/ai5040096
Oct 16, 2024
Version published to 10.20944/preprints202409.2358.v1
Sep 30, 2024

A machine learning-based predictive model for long-term complications of totally implantable venous access ports in cancer patients: a tool for risk-stratified nursing care

This article has 4 authors:
1. Mengjuan Yu
2. Zhiyong Tao
3. Weipeng Yan
4. Jing Jing
This article has no evaluations. Latest version: Apr 3, 2026

Develop and validate clinical-radiomics models to predict the risk of postoperative bleeding after percutaneous nephrolithotomy for single stone

This article has 6 authors:
1. Dan Zeng
2. HongJin Shi
3. Ming Qiu
4. Haifeng Wang
5. Bing Hai
6. Jinsong Zhang
This article has no evaluations. Latest version: Mar 31, 2026

Predictors of ISUP upgrading from biopsy to radical prostatectomy: development and internal validation of a preoperative model in a single-center cohort

This article has 1 author:
1. Eduardo Rodríguez Araujo
This article has no evaluations. Latest version: Apr 15, 2026
