Performance of ChatGPT in Simulated Anesthesia Scenarios: A Prospective Comparison with Expert Clinicians

Agah Abdullah Kahramanlar
Ramazan Ince
Habip Burak Ozgodek


Abstract

Background: This study aimed to evaluate the diagnostic accuracy and clinical validity of ChatGPT’s responses in standardized anesthesia-related scenarios by directly comparing them with expert anesthesiologists' assessments.

Methods: A prospective comparative study was conducted using sixteen hypothetical clinical scenarios reflecting common and critical perioperative conditions (e.g., anaphylaxis, malignant hyperthermia, pulmonary embolism). Two anesthesiologists independently evaluated the scenarios, and their responses were compared with those generated by ChatGPT (OpenAI, San Francisco, USA). A structured framework assessed diagnosis accuracy, treatment appropriateness, and compliance with international guidelines. Ratings were assigned using a 4-point Likert scale. Inter-rater agreement was analyzed using Cohen’s kappa and weighted kappa statistics. Descriptive statistics were used for categorical variables, and a p-value < 0.05 was considered statistically significant.

Results: ChatGPT correctly identified the diagnosis in 88% (14/16) of scenarios, recognized treatment necessity in 93% (15/16), and recommended the correct first-line treatment in 81% (13/16), yielding an overall concordance of 87%. Inter-rater reliability between the two experts was almost perfect (κ = 0.82). Substantial agreement was observed between ChatGPT and Expert 1 (κ = 0.74) and Expert 2 (κ = 0.71). ChatGPT performed best in life-threatening emergencies but showed limitations in therapeutic sequencing and drug dosage specification.

Conclusions: ChatGPT demonstrated substantial agreement with expert anesthesiologists in high-stakes scenarios, suggesting potential as an adjunctive tool for education and simulation. However, its current limitations in therapeutic nuance and prioritization indicate that it should not be used as an independent clinical decision-making resource in anesthesia practice.
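For readers unfamiliar with the agreement statistics named in the abstract, the following is a minimal sketch of how Cohen's kappa and a linearly weighted kappa can be computed for two raters on a 4-point Likert scale, and of how the reported percentages follow from the scenario counts (for instance, 15/16 is 93.75%, which the abstract reports as 93%). The ratings below are hypothetical and are not the study's data; the function names, the choice of linear weights, and the handling of the degenerate case are assumptions, since the abstract does not specify them.

```python
def percent(correct, total):
    """Share of scenarios, in percent."""
    return 100.0 * correct / total


def cohen_kappa(rater_a, rater_b, categories, weighted=False):
    """Cohen's kappa for two raters over ordered categories.

    With weighted=True, disagreements are penalized linearly by their
    distance on the scale (an assumed weighting scheme).
    """
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions of (rater_a, rater_b) ratings.
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(rater_a, rater_b):
        obs[index[x]][index[y]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    observed = sum(disagreement(i, j) * obs[i][j]
                   for i in range(k) for j in range(k))
    expected = sum(disagreement(i, j) * row[i] * col[j]
                   for i in range(k) for j in range(k))
    if expected == 0.0:
        # Both raters used a single identical category; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return 1.0 - observed / expected


# Counts taken from the abstract (out of 16 scenarios).
counts = {"diagnosis": 14, "treatment necessity": 15, "first-line treatment": 13}
for name, correct in counts.items():
    print(f"{name}: {percent(correct, 16):.2f}%")
print(f"overall: {percent(sum(counts.values()), 16 * len(counts)):.2f}%")

# Hypothetical Likert ratings (1-4) for 16 scenarios, for illustration only.
LIKERT = [1, 2, 3, 4]
expert1 = [4, 4, 3, 4, 2, 4, 3, 4, 4, 1, 3, 4, 2, 4, 3, 4]
chatgpt = [4, 4, 3, 3, 2, 4, 3, 4, 4, 2, 3, 4, 2, 4, 4, 4]

print("kappa:", cohen_kappa(expert1, chatgpt, LIKERT))
print("weighted kappa:", cohen_kappa(expert1, chatgpt, LIKERT, weighted=True))
```

The weighted variant is the natural choice for ordinal ratings, because a 3-versus-4 disagreement counts for less than a 1-versus-4 disagreement. Under the commonly used Landis and Koch bands, values of 0.61 to 0.80 are read as substantial agreement and values above 0.80 as almost perfect, which matches the wording of the abstract.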

Version published to 10.21203/rs.3.rs-8384638/v1 on Research Square
Mar 20, 2026

Related articles

Positive effects of early clinical exposure on medical students: a comparative study

This article has 6 authors:
1. Sihan Yang
2. Yongyou Ye
3. Zijun Zhang
4. Zhiwei Wang
5. Shuting Xian
6. Zhendong Jiang
This article has no evaluations. Latest version: Apr 8, 2026

Diagnostic Performance and Cost-Efficiency of Large Language Models in Secondary Hypertension: A Blinded Comparative Study

This article has 4 authors:
1. Asena Gökçay Canpolat
2. Özge Baş Aksu
3. Rıfat Emral
4. Uğur Canpolat
This article has no evaluations. Latest version: Mar 18, 2026

Diagnostic Accuracy of Large Language Models Versus Clinicians in Severe Preeclampsia: A Cross-Sectional Study

This article has 3 authors:
1. Diah Putri
2. Ferry Achmad Firdaus
3. Akhmad Yogi Pramatirta
This article has no evaluations. Latest version: Apr 9, 2026
