Diagnostic Accuracy of Large Language Models Versus Clinicians in Severe Preeclampsia: A Cross-Sectional Study
Abstract
Background: Preeclampsia remains one of the most significant causes of maternal and perinatal morbidity and mortality. The multiorgan dysfunction of severe preeclampsia demands early and accurate assessment. Artificial intelligence in medicine, and large language models (LLMs) in particular, could improve diagnostic precision, but validation in obstetric emergencies remains scarce.

Objective: To establish the concordance between clinical judgment and three LLMs (ChatGPT, DeepSeek, and Gemini) in the diagnosis of severe preeclampsia using standardized clinical case data.

Methods: A cross-sectional analytic study was performed on 133 de-identified suspected cases of severe preeclampsia. Each case was diagnosed independently by clinicians and subsequently by the three LLMs. Agreement was estimated with Cohen's kappa, and diagnostic disagreement was evaluated with McNemar's test.

Results: ChatGPT showed moderate agreement with clinicians (kappa = 0.593, p < 0.001), indicating statistically significant diagnostic concordance. DeepSeek showed very low agreement (kappa = 0.178, p = 0.037), while Gemini showed negative agreement (kappa = -0.240, p = 0.006), suggesting systematic disagreement. McNemar's test found no statistically significant differences between clinicians and any of the LLMs (ChatGPT p = 0.122; DeepSeek p = 0.105; Gemini p = 0.824), indicating similar overall diagnostic proportions despite the variation in agreement. ChatGPT also had the highest sensitivity (76.6%) and specificity (83.9%).

Conclusion: Of the evaluated LLMs, only ChatGPT consistently diagnosed severe preeclampsia in alignment with clinical evaluations, supporting its prospective use as a clinical decision support tool. The lack of diagnostic concordance from DeepSeek and Gemini underscores the need for further algorithmic refinement and validation before use in obstetrics.