DeepSeek Outperforms GPT-4o in Multispecialty Ophthalmic Diagnosis: A Blinded Expert Evaluation of 33 Complex Cases
Abstract
Purpose: To compare the diagnostic and treatment performance of two large language models (LLMs), DeepSeek (DS) and GPT-4o, in ophthalmology using standardized residency examination cases.

Design: Cross-sectional comparative study.

Participants: Thirty-three representative cases drawn from the Chinese Ophthalmology Residency Examination Database, covering 8 subspecialties.

Methods: Each case was submitted to DS and GPT-4o with identical prompts instructing the model to act as a senior ophthalmologist. Three independent ophthalmologists conducted double-blind evaluations of each model’s outputs. For each of three tasks (diagnosis, differential diagnosis, and treatment), accuracy was scored on a 10-point Likert scale and completeness on a 6-point Likert scale. Mean scores were compared using paired statistical tests and two-way ANOVA.

Main Outcome Measures: Accuracy and completeness scores across diagnostic, differential diagnostic, and treatment tasks.

Results: Across all cases, DS achieved significantly higher accuracy than GPT-4o for diagnosis (8.04 vs 6.46, P < 0.0001), differential diagnosis (7.52 vs 5.50, P < 0.0001), and treatment (7.62 vs 6.65, P = 0.002). Completeness scores were also higher for DS in diagnosis (4.86 vs 3.69, P < 0.0001), differential diagnosis (4.44 vs 3.24, P < 0.0001), and treatment (4.61 vs 3.90, P = 0.0001). Subspecialty analyses showed the largest advantage for DS in retinal diseases, glaucoma, strabismus and amblyopia, and optic nerve disorders.

Conclusions: In standardized ophthalmology case evaluations, DS outperformed GPT-4o in both accuracy and completeness, particularly in subspecialties requiring complex reasoning. These findings support the potential role of domain-optimized LLMs as adjuncts in ophthalmic education and clinical decision support, with further research warranted in multimodal and real-world clinical settings.
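The abstract reports that mean scores were compared using paired statistical tests but does not name the test. As a minimal sketch, assuming a paired t-test on per-case mean scores (a Wilcoxon signed-rank test would be a reasonable alternative for Likert data), the comparison could look like the following. The function name `paired_t` and the example scores are hypothetical placeholders and are not data from the study.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples.

    Assumption: each position holds the two models' scores for the same case,
    for example the mean of the three raters' accuracy scores.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, t_stat, n - 1


# Hypothetical per-case accuracy scores on a 10-point scale (NOT study data).
model_a = [8, 9, 7, 8, 9, 6]
model_b = [6, 7, 7, 5, 8, 6]

mean_diff, t_stat, df = paired_t(model_a, model_b)
print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.2f}, df = {df}")
```

The P value would then be read from a t distribution with `df` degrees of freedom; with 33 cases, as in this study, `df` would be 32.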