Evaluation of ChatGPT-4o’s and DeepSeek R1’s responses to urological problems: A comparative study


Abstract

Background: Urology presents unique challenges for AI systems, requiring both extensive medical knowledge and advanced reasoning. While large language models (LLMs) like GPT-4 have shown promise in medical education and decision support, their performance in urology remains underexplored.

Objective: To compare the performance of two advanced LLMs, ChatGPT-4o and DeepSeek R1, in answering urology-related single-choice questions, and to evaluate their accuracy, stability, and reasoning capability across different response configurations.

Methods: A total of 809 single-choice questions from the Chinese National Qualification Examination for Attending Physicians in Urology were administered to ChatGPT-4o and DeepSeek R1. Each model was tested under three configurations: standard mode, advanced reasoning mode, and retrieval-augmented generation (RAG). Accuracy was calculated for each configuration, and statistical comparisons were performed using McNemar's test, with effect sizes expressed as Cohen's h. Stability across reasoning modes was assessed by comparing performance variability. Additional analyses examined performance differences between short-answer and case-based clinical questions.

Results: ChatGPT-4o achieved accuracy rates of 78.12%, 73.79%, and 78.99% in standard, advanced reasoning, and RAG modes, respectively. DeepSeek R1 outperformed ChatGPT-4o across all configurations, with accuracy rates of 83.19%, 81.46%, and 84.55%, respectively. All between-model differences were statistically significant (p < 0.001), with small effect sizes (Cohen's h = 0.129, 0.185, and 0.144). DeepSeek R1 demonstrated substantially greater internal stability across reasoning modes, whereas ChatGPT-4o showed notable variability. In subgroup analyses, DeepSeek R1 exhibited a more pronounced advantage in complex, case-based clinical questions. Both models performed consistently across urological disease categories, and findings were limited to the Chinese-language context in which the evaluation was conducted.

Conclusion: DeepSeek R1 showed superior performance compared with ChatGPT-4o in both accuracy and stability when answering urology-related examination questions, particularly in complex case-based scenarios. These results suggest that optimized LLMs may serve as valuable tools in medical education and clinical decision support, especially within Chinese-language environments. Further research is needed to assess their generalizability across languages, clinical settings, and more diverse task formats.
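The abstract names the statistical tools (McNemar's test on paired correctness, Cohen's h for proportion differences) but not an implementation. As a minimal sketch, assuming per-question correctness records for each model and using statsmodels for the test, the comparison could be reproduced roughly as follows; the function names here are illustrative, not from the paper:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def cohens_h(p1: float, p2: float) -> float:
    # Cohen's h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2)):
    # the standard effect size for a difference between two proportions.
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

def compare_models(correct_a, correct_b):
    # McNemar's test on paired per-question correctness: both models
    # answered the same 809 questions, so only discordant pairs (one
    # model right, the other wrong) carry information about the gap.
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    table = [[int(np.sum(a & b)), int(np.sum(a & ~b))],
             [int(np.sum(~a & b)), int(np.sum(~a & ~b))]]
    res = mcnemar(table, exact=False, correction=True)
    return res.statistic, res.pvalue

# Reported standard-mode accuracies:
# DeepSeek R1 83.19% vs ChatGPT-4o 78.12%.
print(round(cohens_h(0.8319, 0.7812), 3))  # 0.129, matching the paper
```

The final line recovers the reported standard-mode effect size (h = 0.129), which agrees with the authors' figures; the p-values themselves require the item-level correctness data, which the abstract does not provide.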
