Evaluation of ChatGPT-5 Responses in Obstetric and Gynecological Emergencies: Concordance, Readability, and Clinical Reliability
Abstract
Background
This study aimed to evaluate the guideline compliance, clinical safety, and applicability of ChatGPT-5 responses in obstetric and gynecological emergency scenarios. With the increasing role of AI-powered large language models (LLMs) in healthcare, there is a need to systematically examine their performance in obstetric emergencies.

Methods
This study was designed as a prospective, scenario-based, double-blind study. A total of 15 obstetric and gynecologic emergency scenarios were created based on the literature and current international guidelines (ACOG, RCOG, WHO). Five standard questions were posed to ChatGPT-5 for each scenario: (1) most likely diagnosis, (2) investigations to confirm the diagnosis, (3) hemodynamic stability assessment, (4) initial treatment approach, and (5) advanced management options. The same scenarios were independently answered by two obstetricians, an emergency medicine specialist, and an anesthesiologist, and their consensus responses served as the gold standard. ChatGPT-5 responses were scored for guideline compliance, patient safety, and critical information gaps. In addition, quality and understandability were evaluated with the modified DISCERN (mDISCERN) instrument, the Global Quality Score (GQS), and readability indexes: the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI).

Results
A total of 75 responses were reviewed. High agreement (5/5) was observed in 5 scenarios (33.3%), moderate agreement (4/5) in 7 scenarios (46.7%), and low agreement (≤3/5) in 3 scenarios (20.0%). High agreement was particularly evident for conditions with well-defined guideline algorithms, such as postpartum hemorrhage, eclampsia, HELLP syndrome, shoulder dystocia, and ruptured ectopic pregnancy. Deficiencies in the moderate-agreement scenarios included insufficient emphasis on mortality risk, omission of scoring systems, incomplete steps in sepsis management, and inadequate specification of fertility-sparing approaches. The low-agreement scenarios were severe vaginal hemorrhage, acute bleeding due to malignancy, and traumatic gynecologic emergencies. The mean mDISCERN score of the responses was 4.0 ± 0.7, and the mean GQS was 4.1 ± 0.7. Readability analyses showed that responses contained a moderate amount of technical language (FRES = 40.5 ± 2.5; FKGL = 11.6 ± 1.2; SMOG = 10.9 ± 0.8; CLI = 10.9 ± 0.8). The mean lexical density was 0.63.

Conclusions
ChatGPT-5 generally produced guideline-compliant and confident responses of moderate-to-good quality in obstetric and gynecological emergency scenarios. However, its performance was limited in complex cases requiring a multidisciplinary approach. The findings suggest that AI-powered large language models can serve as a complementary tool in obstetric emergency management but should not be used alone, without expert clinician supervision. Larger, comparative, and multidisciplinary studies will provide more reliable evidence for the clinical integration of these technologies.
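To help readers interpret the readability scores above, the conventional definitions of the four indexes and of lexical density are reproduced below. These are the standard published formulas; the abstract does not state whether the study used these exact definitions or a software variant, so this is an interpretive aid rather than the authors' stated method. Here W denotes total words, S total sentences, Y total syllables, P the number of polysyllabic words (three or more syllables), L the mean number of letters per 100 words, and S_{100} the mean number of sentences per 100 words.

% Standard readability formulas (conventional definitions; the study's
% exact implementation is not specified in the abstract).
\begin{align}
  \text{FRES} &= 206.835 - 1.015\,\frac{W}{S} - 84.6\,\frac{Y}{W} \\
  \text{FKGL} &= 0.39\,\frac{W}{S} + 11.8\,\frac{Y}{W} - 15.59 \\
  \text{SMOG} &= 1.0430\,\sqrt{P \cdot \frac{30}{S}} + 3.1291 \\
  \text{CLI}  &= 0.0588\,L - 0.296\,S_{100} - 15.8 \\
  \text{Lexical density} &= \frac{\text{content words}}{\text{total words}}
\end{align}

On these scales, the reported FRES of 40.5 falls in the conventional "difficult" (college-level) band, which is consistent with the FKGL of 11.6, corresponding to roughly an 11th- to 12th-grade reading level.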