Open-source DeepSeek-R1 Outperforms Proprietary Non-Reasoning Large Language Models With and Without Retrieval-Augmented Generation
Abstract
Objective
To compare reasoning versus non-reasoning large language models (LLMs), and open-source DeepSeek models versus proprietary LLMs, in answering ophthalmology board-style questions, and to quantify the impact of retrieval-augmented generation (RAG).
Design
Cross-sectional evaluation of LLM performance before and after RAG integration.
Subjects
Seven LLMs: Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4 Turbo, GPT-4o, DeepSeek-V3, OpenAI-o1, and DeepSeek-R1.
Methods
A RAG-integrated LLM workflow was developed using the American Academy of Ophthalmology’s Basic and Clinical Science Course (Section 12: Retina and Vitreous) as an external knowledge source. The text was embedded into a Faiss vector database for retrieval. A curated set of 250 retina-related multiple-choice questions from OphthoQuestions was used for evaluation. Each model was tested under both pre-RAG (question only) and post-RAG (question plus retrieved context) conditions across four independent runs on the question set. Accuracy was calculated as the proportion of correct answers. Statistical analysis included paired t-tests, two-way ANOVA, and Tukey’s HSD test.
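For readers reproducing a similar pipeline, the sketch below illustrates the retrieval step only, under stated assumptions: the BCSC text is assumed to be pre-chunked into passages, and the embedding model (a sentence-transformers model), index type (flat L2), and number of retrieved passages (k = 3) are illustrative choices rather than the exact configuration used in this study.

```python
# Minimal sketch of a RAG retrieval step with a Faiss vector index.
# Assumptions (not taken from the paper): sentence-transformers embeddings,
# a flat L2 index, and top-3 retrieved passages per question.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model


def build_index(passages: list[str]) -> faiss.IndexFlatL2:
    """Embed pre-chunked text passages and store them in a flat L2 Faiss index."""
    vectors = embedder.encode(passages, convert_to_numpy=True).astype("float32")
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index


def retrieve_context(question: str, index: faiss.IndexFlatL2,
                     passages: list[str], k: int = 3) -> str:
    """Return the k nearest passages to the question as a single context string."""
    query = embedder.encode([question], convert_to_numpy=True).astype("float32")
    _, idx = index.search(query, k)
    return "\n\n".join(passages[i] for i in idx[0])


def build_prompt(question: str, context: str | None = None) -> str:
    """Pre-RAG condition: question only. Post-RAG condition: question plus retrieved context."""
    if context is None:
        return question
    return f"Context:\n{context}\n\nQuestion:\n{question}"
```

In this sketch, the pre-RAG and post-RAG conditions differ only in whether build_prompt receives retrieved context; the downstream call to each LLM and the accuracy calculation are unchanged between conditions.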
Main Outcome Measures
Accuracy (percentage of correct answers).
Results
RAG integration significantly improved accuracy across all models (p < 0.01). Two-way ANOVA confirmed significant effects of LLM choice (p < 0.001) and RAG status (p < 0.001) on model accuracy. Pre-RAG accuracy ranged from 56.8% (Gemini 1.5 Pro) to 87.5% (OpenAI-o1), improving post-RAG to 76.3% and 89.8%, respectively. Reasoning models (OpenAI-o1, DeepSeek-R1) significantly outperformed non-reasoning models. Open-source models achieved near parity with proprietary counterparts: DeepSeek-V3 with RAG (80.7%) performed comparably to GPT-4o with RAG (80.9%). DeepSeek-R1 with RAG slightly underperformed OpenAI-o1 with RAG (86.0% vs. 89.8%) but outperformed all other evaluated models (p < 0.001).
Conclusion
Our findings demonstrate that reasoning models significantly outperformed non-reasoning models, and RAG significantly enhanced accuracy across all models. Open-source models, trained at significantly lower cost, achieved near parity with proprietary systems. The performance of DeepSeek-V3 and DeepSeek-R1 highlighted the viability of cost-efficient, customizable, locally deployable LLMs for clinical applications. Future research should explore model fine-tuning, prompt engineering, and alternative retrieval methods to further improve LLM accuracy and reliability in medicine.