Evaluating a Large Reasoning Model’s Performance on Open-Ended Medical Scenarios
Abstract
Large language models (LLMs) have emerged as a dominant form of generative artificial intelligence (GenAI) across multiple domains. In early 2025, DeepSeek R1 was released, a new large reasoning model (LRM) that combines chain-of-thought (CoT) reasoning, a Mixture of Experts (MoE) architecture, and reinforcement learning. As these technologies continue to improve, evaluating the accuracy and reliability of LLMs and LRMs in medicine remains a crucial challenge. This paper reports a follow-up study using DeepSeek R1 to evaluate medical scenarios from the MMLU-Pro benchmark, an enhanced benchmark designed to test language understanding models on broader and more challenging tasks. In the previously reported study, the accuracy rate was 96% when the MMLU-Pro multiple-choice answer options were provided. In the current study, we evaluated DeepSeek R1 on 162 medical scenarios without the multiple-choice options, and the overall accuracy was 92%. This open-ended approach mirrors a more realistic clinical setting, in which the clinician must arrive at the most likely diagnosis and differential diagnoses without answer choices as clues. Given these high accuracy rates, both with and without answer options provided, further research is needed to determine how to deploy LRMs in clinical medicine.
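To make the open-ended evaluation protocol concrete, the sketch below illustrates one way such a study could be scored: each scenario is posed without its answer options, and the model's free-text response is graded against the keyed answer. This is a minimal illustration, not the authors' implementation; `query_model` (e.g., a call to DeepSeek R1) and `is_match` (the free-text grader) are hypothetical placeholders supplied by the caller.

```python
from typing import Callable


def evaluate_open_ended(
    scenarios: list[dict],                 # each item: {"question": str, "answer": str}
    query_model: Callable[[str], str],     # hypothetical LRM call, e.g. DeepSeek R1
    is_match: Callable[[str, str], bool],  # grader: free-text response vs. keyed answer
) -> float:
    """Return accuracy over scenarios posed WITHOUT multiple-choice options."""
    correct = 0
    for case in scenarios:
        # Present only the scenario text; no answer choices are shown,
        # mirroring the open-ended clinical setting described above.
        prompt = (
            "Provide the most likely diagnosis and key differential "
            "diagnoses for the following scenario:\n" + case["question"]
        )
        response = query_model(prompt)
        if is_match(response, case["answer"]):
            correct += 1
    return correct / len(scenarios)
```

Under this kind of scoring, roughly 149 correct of 162 scenarios would yield the ~92% overall accuracy reported for the open-ended condition.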