ChatGPT vs DeepSeek: A Comparative Study of Diagnostic Accuracy and Clinical Reasoning in Rare and Complex Diseases



Abstract

Diagnostic errors in rare and complex diseases contribute significantly to morbidity and mortality. The ability of large language models (LLMs) to improve diagnostic performance in such cases remains uncertain. This study compared the diagnostic accuracy, clinical reasoning quality, and inference efficiency of three ChatGPT variants (o3-mini, o3-mini-high, o1) and DeepSeek-R1 on 30 English-language case reports of rare and complex diseases spanning 26 specialties across 15 countries, sourced from the PubMed and Web of Science Core Collection databases. Cases were selected to avoid overlap with model training data. Each case was processed once by each model, and the outputs were anonymized and evaluated in a double-blind manner by two board-certified physicians (each with >15 years’ clinical experience) and ChatGPT-4o. Diagnostic accuracy, the primary outcome, ranged from 30.0% to 40.0%, with no significant differences among models (Cochran’s Q test, P = 0.16). ChatGPT-o1 achieved the highest accuracy (12/30, 40.0%; 95% CI, 24.6%-57.7%), followed by ChatGPT-o3-mini and o3-mini-high (each 11/30, 36.7%) and DeepSeek-R1 (9/30, 30.0% for both English and Chinese inputs). Mean reasoning scores differed significantly (P < 0.05): ChatGPT-o1, 4.08 ± 0.82; DeepSeek-R1 (English), 3.86 ± 0.86; ChatGPT-o3-mini, 3.71 ± 0.90; ChatGPT-o3-mini-high, 3.69 ± 0.80; DeepSeek-R1 (Chinese), 3.67 ± 0.84. Inter-evaluator agreement was high (ICC = 0.84; 95% CI, 0.80-0.88). Inference times varied significantly (P < 0.001): ChatGPT-o3-mini was fastest (7.0 ± 3.8 s) and DeepSeek-R1 (English) slowest (46.5 ± 32.5 s). Advanced LLMs show potential to support the diagnosis of rare and complex diseases, and their transparent reasoning processes may aid clinical decision-making and medical education. Further domain-specific refinement and prospective clinical validation are essential for safe and effective integration into clinical practice.
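
For readers unfamiliar with the statistics reported above, the sketch below illustrates on simulated placeholder data how the two headline analyses could be computed in Python: a Cochran's Q test, which is the appropriate omnibus test here because the same 30 cases are graded correct/incorrect under each model (paired dichotomous outcomes), and an intraclass correlation coefficient (ICC) for agreement among the three evaluators. The statsmodels and pingouin calls, the seed, and all data values are illustrative assumptions, not the authors' actual analysis code or results.

    # Hedged sketch: placeholder data, NOT the study's actual results.
    import numpy as np
    import pandas as pd
    import pingouin as pg
    from statsmodels.stats.contingency_tables import cochrans_q

    rng = np.random.default_rng(0)
    n_cases = 30
    models = ["o3-mini", "o3-mini-high", "o1", "R1-en", "R1-zh"]

    # Paired design: one correct(1)/incorrect(0) verdict per case per model.
    correct = rng.integers(0, 2, size=(n_cases, len(models)))

    # Cochran's Q: omnibus test for equality of proportions among
    # k related dichotomous outcomes (same cases across all models).
    q = cochrans_q(correct, return_object=True)
    print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.3f}")

    # Inter-evaluator agreement on reasoning scores via ICC,
    # with each case rated by all three evaluators (long format).
    scores = pd.DataFrame({
        "case":  np.repeat(np.arange(n_cases), 3),
        "rater": np.tile(["physician_1", "physician_2", "gpt4o"], n_cases),
        "score": rng.uniform(1, 5, size=n_cases * 3).round(1),
    })
    icc = pg.intraclass_corr(data=scores, targets="case",
                             raters="rater", ratings="score")
    print(icc[["Type", "ICC", "CI95%"]])

The abstract does not specify which ICC model the authors used; pingouin reports all common variants (ICC1 through ICC3k), so the relevant row would be chosen to match the study's two-way design.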

Highlights

  • While the LLMs showed similar diagnostic accuracy (30.0%-40.0%) on rare and complex diseases, ChatGPT-o1 scored significantly higher on the quality of its clinical reasoning.

  • Inference speeds varied widely (7 s to 47 s), highlighting a trade-off between model performance and real-world utility.

  • The transparent reasoning of LLMs shows clear promise as a tool to support clinical decision-making and medical education.

  • Safe clinical implementation depends on further domain-specific refinement and prospective validation.
