Performance of ChatGPT in Simulated Anesthesia Scenarios: A Prospective Comparison with Expert Clinicians

Abstract

Background: This study evaluated the diagnostic accuracy and clinical validity of ChatGPT’s responses to standardized anesthesia-related scenarios by comparing them directly with the assessments of expert anesthesiologists.

Methods: A prospective comparative study was conducted using sixteen hypothetical clinical scenarios reflecting common and critical perioperative conditions (e.g., anaphylaxis, malignant hyperthermia, pulmonary embolism). Two anesthesiologists independently evaluated the scenarios, and their responses were compared with those generated by ChatGPT (OpenAI, San Francisco, USA). A structured framework assessed diagnostic accuracy, treatment appropriateness, and compliance with international guidelines. Ratings were assigned on a 4-point Likert scale. Inter-rater agreement was analyzed using Cohen’s kappa and weighted kappa statistics. Descriptive statistics were used for categorical variables, and a p-value < 0.05 was considered statistically significant.

Results: ChatGPT correctly identified the diagnosis in 88% (14/16) of scenarios, recognized treatment necessity in 94% (15/16), and recommended the correct first-line treatment in 81% (13/16), yielding an overall concordance of 87%. Inter-rater reliability between the two experts was almost perfect (κ = 0.82). Agreement was substantial between ChatGPT and Expert 1 (κ = 0.74) and between ChatGPT and Expert 2 (κ = 0.71). ChatGPT performed best in life-threatening emergencies but showed limitations in therapeutic sequencing and drug dosage specification.

Conclusions: ChatGPT showed substantial agreement with expert anesthesiologists in high-stakes scenarios, suggesting potential as an adjunctive tool for education and simulation. However, its current limitations in therapeutic nuance and prioritization indicate that it should not be used as an independent clinical decision-making resource in anesthesia practice.
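For readers unfamiliar with the agreement statistics named in the Methods, the following is a minimal sketch of how Cohen's kappa and a linearly weighted kappa could be computed for two raters scoring sixteen scenarios on a 4-point scale. It is not the authors' analysis code: the abstract does not say which weighting scheme or software was used, so the linear weights are an assumption, and the ratings in `expert_1` and `chatgpt` are invented placeholders, not study data.

```python
def cohen_kappa(rater_a, rater_b, categories, weighted=False):
    """Cohen's kappa for two raters over ordered categories.

    With weighted=True, disagreements are penalized in proportion to their
    distance on the scale (linear weights; an assumption, since the
    abstract does not name the weighting scheme).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Cross-tabulate the two raters' scores.
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        counts[index[a]][index[b]] += 1
    row_totals = [sum(counts[i]) for i in range(k)]
    col_totals = [sum(counts[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    # Observed and chance-expected disagreement.
    observed = sum(disagreement(i, j) * counts[i][j]
                   for i in range(k) for j in range(k)) / n
    expected = sum(disagreement(i, j) * row_totals[i] * col_totals[j]
                   for i in range(k) for j in range(k)) / (n * n)
    if expected == 0:
        # Both raters used a single identical category; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return 1.0 - observed / expected


# Hypothetical 4-point Likert ratings for 16 scenarios (NOT study data).
scale = [1, 2, 3, 4]
expert_1 = [4, 4, 3, 4, 2, 4, 3, 4, 4, 1, 3, 4, 4, 2, 4, 3]
chatgpt = [4, 4, 3, 3, 2, 4, 3, 4, 4, 2, 3, 4, 4, 2, 4, 4]

print("unweighted kappa:", round(cohen_kappa(expert_1, chatgpt, scale), 2))
print("weighted kappa:  ", round(cohen_kappa(expert_1, chatgpt, scale, weighted=True), 2))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is reported alongside the simple proportions such as 14/16. The weighted form suits an ordinal Likert scale because a rating of 3 against 4 counts as a smaller disagreement than 1 against 4.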
