Diagnostic Accuracy of Large Language Models Versus Clinicians in Severe Preeclampsia: A Cross-Sectional Study
Abstract
Background: Preeclampsia remains one of the most significant causes of maternal and perinatal morbidity and mortality. The multiorgan dysfunction of severe preeclampsia demands early and accurate assessment. Artificial intelligence in medicine, and large language models (LLMs) in particular, could improve diagnostic precision, but validation in obstetric emergencies remains scarce.

Objective: To establish the concordance between clinical judgment and three LLMs (ChatGPT, DeepSeek, and Gemini) in the diagnosis of severe preeclampsia using standardized clinical case data.

Methods: A cross-sectional analytic study was performed on 133 de-identified suspected cases of severe preeclampsia. Each case was diagnosed independently by clinicians and subsequently by the three LLMs. Agreement was estimated with Cohen's kappa, and diagnostic disagreement was evaluated with McNemar's test.

Results: ChatGPT showed moderate agreement with clinicians (kappa = 0.593, p < 0.001), indicating statistically significant diagnostic concordance. DeepSeek showed very low agreement (kappa = 0.178, p = 0.037), while Gemini showed negative agreement (kappa = -0.240, p = 0.006), suggesting systematic disagreement. McNemar's test found no statistically significant differences between clinicians and any of the LLMs (ChatGPT p = 0.122; DeepSeek p = 0.105; Gemini p = 0.824), indicating similar overall diagnostic proportions despite the variation in agreement. ChatGPT also had the highest sensitivity (76.6%) and specificity (83.9%).

Conclusion: Of the evaluated LLMs, only ChatGPT consistently diagnosed severe preeclampsia in alignment with clinical evaluations, supporting its prospective use as a clinical decision support tool. The lack of diagnostic concordance from DeepSeek and Gemini underscores the need for further algorithmic refinement and validation before use in obstetrics.