Open-source DeepSeek-R1 Outperforms Proprietary Non-Reasoning Large Language Models With and Without Retrieval-Augmented Generation
Abstract
Objective
To compare reasoning versus non-reasoning large language models (LLMs), and open-source DeepSeek models versus proprietary LLMs, in answering ophthalmology board-style questions, and to quantify the impact of retrieval-augmented generation (RAG).
Design
Cross-sectional evaluation of LLM performance before and after RAG integration.
Subjects
Seven LLMs: Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4 Turbo, GPT-4o, DeepSeek-V3, OpenAI-o1, and DeepSeek-R1.
Methods
A RAG-integrated LLM workflow was developed using the American Academy of Ophthalmology’s Basic and Clinical Science Course (Section 12: Retina and Vitreous) as an external knowledge source. The text was embedded into a Faiss vector database for retrieval. A curated set of 250 retina-related multiple-choice questions from OphthoQuestions was used for evaluation. Each model was tested under both pre-RAG (question only) and post-RAG (question plus retrieved context) conditions across four independent runs on the question set. Accuracy was calculated as the proportion of correct answers. Statistical analysis included paired t-tests, two-way ANOVA, and Tukey’s HSD test.
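For readers reproducing a similar pipeline, the sketch below illustrates the retrieval step only, under stated assumptions: the BCSC text is assumed to be pre-chunked into passages, and the embedding model (a sentence-transformers model), index type (flat L2), and number of retrieved passages (k = 3) are illustrative choices rather than the exact configuration used in this study.

```python
# Minimal sketch of a RAG retrieval step with a Faiss vector index.
# Assumptions (not taken from the paper): sentence-transformers embeddings,
# a flat L2 index, and top-3 retrieved passages per question.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model


def build_index(passages: list[str]) -> faiss.IndexFlatL2:
    """Embed pre-chunked text passages and store them in a flat L2 Faiss index."""
    vectors = embedder.encode(passages, convert_to_numpy=True).astype("float32")
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index


def retrieve_context(question: str, index: faiss.IndexFlatL2,
                     passages: list[str], k: int = 3) -> str:
    """Return the k nearest passages to the question as a single context string."""
    query = embedder.encode([question], convert_to_numpy=True).astype("float32")
    _, idx = index.search(query, k)
    return "\n\n".join(passages[i] for i in idx[0])


def build_prompt(question: str, context: str | None = None) -> str:
    """Pre-RAG condition: question only. Post-RAG condition: question plus retrieved context."""
    if context is None:
        return question
    return f"Context:\n{context}\n\nQuestion:\n{question}"
```

In this sketch, the pre-RAG and post-RAG conditions differ only in whether build_prompt receives retrieved context; the downstream call to each LLM and the accuracy calculation are unchanged between conditions.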
Main Outcome Measures
Accuracy (percentage of correct answers).
Results
RAG integration significantly improved accuracy across all models (p < 0.01). Two-way ANOVA confirmed significant effects of LLM choice (p < 0.001) and RAG status (p < 0.001) on model accuracy. Pre-RAG accuracy ranged from 56.8% (Gemini 1.5 Pro) to 87.5% (OpenAI-o1), improving post-RAG to 76.3% and 89.8%, respectively. Reasoning models (OpenAI-o1, DeepSeek-R1) significantly outperformed non-reasoning models. Open-source models achieved near parity with proprietary counterparts: DeepSeek-V3 with RAG (80.7%) performed comparably to GPT-4o with RAG (80.9%). DeepSeek-R1 with RAG slightly underperformed OpenAI-o1 with RAG (86.0% vs. 89.8%) but outperformed all other evaluated models (p < 0.001).
Conclusion
Our findings demonstrate that reasoning models significantly outperformed non-reasoning models, and RAG significantly enhanced accuracy across all models. Open-source models, trained at significantly lower cost, achieved near parity with proprietary systems. The performance of DeepSeek-V3 and DeepSeek-R1 highlighted the viability of cost-efficient, customizable, locally deployable LLMs for clinical applications. Future research should explore model fine-tuning, prompt engineering, and alternative retrieval methods to further improve LLM accuracy and reliability in medicine.