Evaluation of ChatGPT-4o’s and DeepSeek R1’s responses to urological problems: A comparative study


Abstract

Background: Urology presents unique challenges for AI systems, requiring both extensive medical knowledge and advanced reasoning. While large language models (LLMs) like GPT-4 have shown promise in medical education and decision support, their performance in urology remains underexplored.

Objective: To compare the performance of two advanced LLMs, ChatGPT-4o and DeepSeek R1, in answering urology-related single-choice questions, and to evaluate their accuracy, stability, and reasoning capability across different response configurations.

Methods: A total of 809 single-choice questions from the Chinese National Qualification Examination for Attending Physicians in Urology were administered to ChatGPT-4o and DeepSeek R1. Each model was tested under three configurations: standard mode, advanced reasoning mode, and retrieval-augmented generation (RAG). Accuracy was calculated for each configuration, and statistical comparisons were performed using McNemar's test, with effect sizes expressed as Cohen's h. Stability across reasoning modes was assessed by comparing performance variability. Additional analyses examined performance differences between short-answer and case-based clinical questions.

Results: ChatGPT-4o achieved accuracy rates of 78.12%, 73.79%, and 78.99% in standard, advanced reasoning, and RAG modes, respectively. DeepSeek R1 outperformed ChatGPT-4o across all configurations, with accuracy rates of 83.19%, 81.46%, and 84.55%, respectively. All between-model differences were statistically significant (p < 0.001), with small effect sizes (Cohen's h = 0.129, 0.185, and 0.144). DeepSeek R1 demonstrated substantially greater internal stability across reasoning modes, whereas ChatGPT-4o showed notable variability. In subgroup analyses, DeepSeek R1 exhibited a more pronounced advantage in complex, case-based clinical questions. Both models performed consistently across urological disease categories, and findings were limited to the Chinese-language context in which the evaluation was conducted.

Conclusion: DeepSeek R1 showed superior performance compared with ChatGPT-4o in both accuracy and stability when answering urology-related examination questions, particularly in complex case-based scenarios. These results suggest that optimized LLMs may serve as valuable tools in medical education and clinical decision support, especially within Chinese-language environments. Further research is needed to assess their generalizability across languages, clinical settings, and more diverse task formats.
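The abstract names the statistical tools (McNemar's test on paired correctness, Cohen's h for proportion differences) but not an implementation. As a minimal sketch, assuming per-question correctness records for each model and using statsmodels for the test, the comparison could be reproduced roughly as follows; the function names here are illustrative, not from the paper:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def cohens_h(p1: float, p2: float) -> float:
    # Cohen's h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2)):
    # the standard effect size for a difference between two proportions.
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

def compare_models(correct_a, correct_b):
    # McNemar's test on paired per-question correctness: both models
    # answered the same 809 questions, so only discordant pairs (one
    # model right, the other wrong) carry information about the gap.
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    table = [[int(np.sum(a & b)), int(np.sum(a & ~b))],
             [int(np.sum(~a & b)), int(np.sum(~a & ~b))]]
    res = mcnemar(table, exact=False, correction=True)
    return res.statistic, res.pvalue

# Reported standard-mode accuracies:
# DeepSeek R1 83.19% vs ChatGPT-4o 78.12%.
print(round(cohens_h(0.8319, 0.7812), 3))  # 0.129, matching the paper
```

The final line recovers the reported standard-mode effect size (h = 0.129), which agrees with the authors' figures; the p-values themselves require the item-level correctness data, which the abstract does not provide.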
