Evaluating Large Reasoning Model Performance on Complex Medical Scenarios in the MMLU-Pro Benchmark
Abstract
Large language models (LLMs) have emerged as a major force in artificial intelligence, demonstrating remarkable capabilities in natural language processing, comprehension, and text and image generation. Recent advances have led to the development of LLMs designed specifically for medical applications, showcasing their potential to revolutionize healthcare. These models can analyze complex medical scenarios, assist in diagnosis, and provide treatment recommendations. However, evaluating the accuracy and reliability of LLMs in medicine remains a crucial challenge: model output may not be current and can contain inaccurate or fabricated information, known as hallucinations. In early 2025, DeepSeek R1 was released, a large reasoning model (LRM) whose exposed "chain of thought" makes its reasoning more transparent than that of the LLMs that preceded it. This study utilized the new MMLU-Pro benchmark, a more challenging question-and-answer (Q&A) dataset than the original Massive Multitask Language Understanding (MMLU) benchmark. DeepSeek R1 was evaluated on the dataset primarily for accuracy, recognizing that medical scenario Q&As are only one facet of a comprehensive assessment. The study found that DeepSeek R1 achieved an accuracy of 95.1% on 162 medical scenarios after 23 questions were reconciled with subject matter experts. Our findings contribute to the growing body of knowledge on LLM applications in healthcare and provide insights into the strengths and limitations of DeepSeek R1 in this domain. DeepSeek R1 demonstrates excellent accuracy along with unique transparency. Our analysis also highlights the need for multifaceted evaluation methods that go beyond simple accuracy metrics to ensure the safe and effective deployment of LLMs in medical settings.
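For readers who want to sanity-check the headline figure, the short sketch below reproduces the accuracy arithmetic. Note that the count of 154 correct answers is an assumption inferred from the reported 95.1% on 162 scenarios; the raw count is not stated in the abstract.

```python
# Minimal sketch of the accuracy arithmetic behind the headline figure.
# ASSUMPTION: 154 correct answers is inferred from the reported 95.1%
# on 162 scenarios; the study does not state the raw count directly.

TOTAL_SCENARIOS = 162    # medical scenario Q&As drawn from MMLU-Pro
CORRECT_ANSWERS = 154    # assumed final count after expert reconciliation
RECONCILED = 23          # questions reviewed with subject matter experts

accuracy = CORRECT_ANSWERS / TOTAL_SCENARIOS
print(f"Accuracy on {TOTAL_SCENARIOS} scenarios: {accuracy:.1%}")
# -> Accuracy on 162 scenarios: 95.1%
```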