A Novel Framework for Evaluating the Clinical Reasoning Process of Large Language Models: A Comparative Study in Nephrology

Yuichiro Yano
Hiroaki Kakizaki
Hajime Nagasu
Seiji Kishi
Takeo Koshida
Yoshihito Nihei
Akira Hirano
Masaomi Nangaku
Hirotake Mori
Toshio Naito
Mizuki Ohashi
Shoichi Maruyama
Isao Matsui
Yoshitaka Isaka
Yusuke Suzuki
Naoki Kashihara

Abstract

Although interest in applying large language models (LLMs) in medicine is growing, accuracy evaluations have largely relied on static knowledge tests, and discussion of clinical reasoning, the process most critical to real-world practice, remains limited. In this study, we propose a novel framework that evaluates not the final diagnosis generated by AI, but the reasoning process itself.

The framework systematically evaluates the capabilities of LLMs (OpenAI GPT-o3, Gemini 2.5 Pro, DeepSeek-R1, Llama 4 Maverick) by deconstructing the clinical reasoning process into discrete cognitive steps. We focused on nephrology cases, which often involve multiple organ systems and diverse pathologies and therefore require a high level of reasoning. Four nephrologists independently evaluated the outputs. Our evaluation of the four leading LLMs revealed that while Gemini 2.5 Pro demonstrated the best overall performance, all models exhibited common weaknesses in advanced, synthetic tasks such as “formulating differential diagnoses with rationale” and “treatment planning,” particularly in dynamically changing clinical scenarios. Furthermore, the highest-performing model was not the most computationally intensive, demonstrating that reasoning quality and computational efficiency do not stand in a simple trade-off.
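As a rough illustration of how step-wise ratings of this kind might be aggregated, the sketch below averages independent rater scores for each model and reasoning step, then reports the weakest step per model. It is not the authors' analysis code. The 1-5 scale, the model and rater labels, the `problem_identification` step, and every score are assumptions made up for the example; only the two step names taken from the abstract come from the source.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings: (model, reasoning_step, rater, score on an assumed 1-5 scale).
# All values are placeholders for illustration, NOT results from the study.
ratings = [
    ("model_a", "problem_identification", "rater_1", 5),
    ("model_a", "problem_identification", "rater_2", 4),
    ("model_a", "problem_identification", "rater_3", 5),
    ("model_a", "problem_identification", "rater_4", 4),
    ("model_a", "differential_diagnosis_with_rationale", "rater_1", 3),
    ("model_a", "differential_diagnosis_with_rationale", "rater_2", 4),
    ("model_a", "differential_diagnosis_with_rationale", "rater_3", 3),
    ("model_a", "differential_diagnosis_with_rationale", "rater_4", 4),
    ("model_a", "treatment_planning", "rater_1", 2),
    ("model_a", "treatment_planning", "rater_2", 3),
    ("model_a", "treatment_planning", "rater_3", 2),
    ("model_a", "treatment_planning", "rater_4", 3),
]


def mean_score_by_step(rows):
    """Average the independent rater scores for each (model, step) pair."""
    grouped = defaultdict(list)
    for model, step, _rater, score in rows:
        grouped[(model, step)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


def weakest_step(step_means, model):
    """Return the reasoning step with the lowest mean score for one model."""
    per_step = {step: m for (mdl, step), m in step_means.items() if mdl == model}
    return min(per_step, key=per_step.get)


step_means = mean_score_by_step(ratings)
lowest = weakest_step(step_means, "model_a")
print(lowest, step_means[("model_a", lowest)])
```

Keeping scores per step, rather than collapsing them into one accuracy figure, is what lets a weakness in a single stage such as treatment planning remain visible.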

In conclusion, our step-by-step evaluation method is an effective approach for identifying the specific strengths and weaknesses in an LLM’s clinical reasoning. The weaknesses identified, particularly in formulating a differential diagnosis with a clear rationale and developing comprehensive treatment plans for dynamic scenarios, should become primary targets for future model development and for the creation of support systems.

Version published to 10.1101/2025.09.04.25334460 on medRxiv
Sep 7, 2025

Related articles

A statistical framework for evaluating repeatability and reproducibility of large language models in diagnostic reasoning

This article has 8 authors:
1. Cathy Shyr
2. Boyu Ren
3. Chih-Yuan Hsu
4. Rory J. Tinker
5. Rizwan Hamid
6. Adam Wright
7. Bradley A. Malin
8. Hua Xu
This article has no evaluations
Latest version Aug 8, 2025

Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology

This article has 5 authors:
1. Vanessa D’Amario
2. Randy Daniel
3. Dhruv Edamadaka
4. Nitya Alaparthy
5. Joshua Tarkoff
This article has no evaluations
Latest version Aug 27, 2025

CLEVER: Clinical Large Language Model Evaluation by Expert Review

This article has 4 authors:
1. Veysel Kocaman
2. Mustafa Kaya
3. Andrei Ferrer
4. David Talby
This article has no evaluations
Latest version Jul 23, 2025
