Performance of Next-Generation AI Chatbots in Gynecological Knowledge Assessment: A Comparative Pilot Study of ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus
Abstract
Purpose: As artificial intelligence (AI) models advance into their next generations, their application in specialized medical fields requires rigorous validation. While large language models (LLMs) have shown promise in general medicine, their reliability in complex gynecological clinical reasoning remains underexplored. This pilot study aimed to comparatively assess the knowledge retention, safety, and reasoning limitations of advanced AI chatbots in gynecology using a constrained zero-shot multiple-choice question (MCQ) format.

Methods: A total of 70 text-based MCQs covering seven core gynecological modules were adapted from USMLE Step 2 CK standards and administered to four advanced AI models: ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus. To simulate a rapid-retrieval clinical scenario, models were tested under "zero-shot" conditions with a constrained prompt prohibiting explicit reasoning steps. We performed both quantitative statistical analysis (Kruskal–Wallis, Cochran's Q) and qualitative error analysis to identify specific failure modes.

Results: Contrary to expectations for advanced models, overall accuracy was unsatisfactory: Gemini-3 (32.86%), DeepSeek-V3.2 (30.00%), ChatGPT-5 (25.71%), and Claude-4.5-Opus (21.43%). Significant performance disparities were observed across modules; notably, ChatGPT-5 scored 0.00% in Infertility, while DeepSeek-V3.2 reached 70.00% in Common Benign Conditions. Qualitative analysis revealed three critical failure patterns: (1) Semantic Association Bias (confusing high-probability diseases with symptom-specific diagnoses), (2) Spatial Anatomy Confusion, and (3) Genetic Logic Reversal. No significant correlation was found between item difficulty and accuracy (p > 0.05).

Conclusion: Under constrained non-reasoning prompts, even next-generation AI chatbots demonstrate unsatisfactory performance in gynecology. The qualitative analysis suggests that models often rely on probabilistic keyword matching rather than physiological simulation, leading to clinically dangerous errors (e.g., misidentifying adrenal enzyme deficiencies). While the technology shows potential, its current reliability is insufficient for unsupervised use in gynecological education. These findings highlight the critical need for "Chain-of-Thought" prompting and human expert oversight.
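Statistical note: the sketch below is a minimal, illustrative Python version of the two tests named in the Methods, not the authors' analysis code. The 70 x 4 per-item score matrix, the random placeholder data, and all variable names are assumptions for demonstration only.

import numpy as np
from scipy.stats import kruskal
from statsmodels.stats.contingency_tables import cochrans_q

# Hypothetical 70 x 4 binary matrix: rows = MCQ items, columns = models
# (1 = correct). Random placeholder data stands in for the real results.
models = ["ChatGPT-5", "Gemini-3", "DeepSeek-V3.2", "Claude-4.5-Opus"]
rng = np.random.default_rng(42)
scores = rng.integers(0, 2, size=(70, len(models)))

# Kruskal-Wallis H: compares the four score distributions as independent samples
h_stat, kw_p = kruskal(*(scores[:, j] for j in range(scores.shape[1])))

# Cochran's Q: paired test for k related binary outcomes on the same 70 items
q_result = cochrans_q(scores)

print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {kw_p:.4f}")
print(f"Cochran's Q = {q_result.statistic:.2f}, p = {q_result.pvalue:.4f}")

Because all four models answered the same 70 items, the binary outcomes are paired rather than independent, which is why Cochran's Q complements the Kruskal-Wallis comparison here.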