ChatGPT vs DeepSeek: A Comparative Study of Diagnostic Accuracy and Clinical Reasoning in Rare and Complex Diseases



Abstract

Diagnostic errors in rare and complex diseases contribute significantly to morbidity and mortality. The ability of large language models (LLMs) to improve diagnostic performance in such cases remains uncertain. This study compared the diagnostic accuracy, clinical reasoning quality, and inference efficiency of three ChatGPT variants (o3-mini, o3-mini-high, o1) and DeepSeek-R1 on 30 English-language case reports of rare and complex diseases spanning 26 specialties across 15 countries, sourced from the PubMed and Web of Science Core Collection databases. Cases were selected to avoid overlap with model training data. Each case was processed once by each model, and the outputs were anonymized and evaluated in a double-blind manner by two board-certified physicians (each with >15 years’ clinical experience) and ChatGPT-4o. Diagnostic accuracy, the primary outcome, ranged from 30.0% to 40.0%, with no significant differences among models (Cochran’s Q test, P = 0.16). ChatGPT-o1 achieved the highest accuracy (12/30, 40.0%; 95% CI, 24.6%-57.7%), followed by ChatGPT-o3-mini and o3-mini-high (each 11/30, 36.7%) and DeepSeek-R1 (9/30, 30.0% for both English and Chinese inputs). Mean reasoning scores differed significantly (P < 0.05): ChatGPT-o1, 4.08 ± 0.82; DeepSeek-R1 (English), 3.86 ± 0.86; ChatGPT-o3-mini, 3.71 ± 0.90; ChatGPT-o3-mini-high, 3.69 ± 0.80; DeepSeek-R1 (Chinese), 3.67 ± 0.84. Inter-evaluator agreement was high (ICC = 0.84; 95% CI, 0.80-0.88). Inference times varied significantly (P < 0.001): ChatGPT-o3-mini was fastest (7.0 ± 3.8 s) and DeepSeek-R1 (English) slowest (46.5 ± 32.5 s). Advanced LLMs show potential to support the diagnosis of rare and complex diseases, and their transparent reasoning processes may aid clinical decision-making and medical education. Further domain-specific refinement and prospective clinical validation are essential for safe and effective integration into clinical practice.
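
For readers unfamiliar with the statistics reported above, the sketch below illustrates on simulated placeholder data how the two headline analyses could be computed in Python: a Cochran's Q test, which is the appropriate omnibus test here because the same 30 cases are graded correct/incorrect under each model (paired dichotomous outcomes), and an intraclass correlation coefficient (ICC) for agreement among the three evaluators. The statsmodels and pingouin calls, the seed, and all data values are illustrative assumptions, not the authors' actual analysis code or results.

    # Hedged sketch: placeholder data, NOT the study's actual results.
    import numpy as np
    import pandas as pd
    import pingouin as pg
    from statsmodels.stats.contingency_tables import cochrans_q

    rng = np.random.default_rng(0)
    n_cases = 30
    models = ["o3-mini", "o3-mini-high", "o1", "R1-en", "R1-zh"]

    # Paired design: one correct(1)/incorrect(0) verdict per case per model.
    correct = rng.integers(0, 2, size=(n_cases, len(models)))

    # Cochran's Q: omnibus test for equality of proportions among
    # k related dichotomous outcomes (same cases across all models).
    q = cochrans_q(correct, return_object=True)
    print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.3f}")

    # Inter-evaluator agreement on reasoning scores via ICC,
    # with each case rated by all three evaluators (long format).
    scores = pd.DataFrame({
        "case":  np.repeat(np.arange(n_cases), 3),
        "rater": np.tile(["physician_1", "physician_2", "gpt4o"], n_cases),
        "score": rng.uniform(1, 5, size=n_cases * 3).round(1),
    })
    icc = pg.intraclass_corr(data=scores, targets="case",
                             raters="rater", ratings="score")
    print(icc[["Type", "ICC", "CI95%"]])

The abstract does not specify which ICC model the authors used; pingouin reports all common variants (ICC1 through ICC3k), so the relevant row would be chosen to match the study's two-way design.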

Highlights

  • While the LLMs showed similar diagnostic accuracy (30.0%-40.0%) on rare and complex diseases, ChatGPT-o1 scored significantly higher on the quality of its clinical reasoning.

  • Inference speeds varied widely (7 s to 47 s), highlighting a trade-off between model performance and real-world utility.

  • The transparent reasoning of LLMs shows clear promise as a tool to support clinical decision-making and medical education.

  • Safe clinical implementation depends on further domain-specific refinement and prospective validation.
