Do Language Models Think Like Doctors?
Abstract
Background
While large language models (LLMs) are increasingly deployed for clinical decision support, existing evaluation methods such as medical licensing exams fail to capture critical aspects of clinical reasoning, including reasoning in dynamic clinical circumstances. Script Concordance Testing (SCT), a decades-old medical assessment tool, offers a nuanced way to assess how new information influences diagnostic and therapeutic decisions under uncertainty.
Methods
We developed a comprehensive and publicly available benchmark comprising 750 SCT questions from 10 internationally diverse medical datasets (9 previously unreleased), spanning multiple specialties and institutions. Each question presents a clinical scenario and then asks how new information affects the likelihood of a diagnosis or management decision, scored against expert panels (Figure 1). We evaluated four state-of-the-art LLMs against the combined responses of 1,070 medical students, 193 resident physicians, and 300 attending physicians across all datasets.
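To illustrate how scoring against an expert panel typically works, the sketch below implements the conventional SCT aggregate (partial-credit) scheme, in which a response earns credit proportional to the number of panelists who chose it, normalized by the modal response. This is an assumption about the scoring method based on standard SCT practice; the function names, the 5-point Likert scale, and the example panel data are illustrative and not drawn from the benchmark itself.

```python
from collections import Counter

def sct_question_score(panel_responses: list[int], examinee_response: int) -> float:
    """Partial credit (0..1) for one SCT question under aggregate scoring."""
    counts = Counter(panel_responses)
    modal_count = max(counts.values())
    # Credit = (panelists choosing this response) / (panelists choosing the modal response)
    return counts.get(examinee_response, 0) / modal_count

def sct_test_score(panel_by_question: list[list[int]], examinee: list[int]) -> float:
    """Mean partial credit across questions, expressed as a percentage."""
    scores = [
        sct_question_score(panel, answer)
        for panel, answer in zip(panel_by_question, examinee)
    ]
    return 100.0 * sum(scores) / len(scores)

# Hypothetical example: 15 panelists rate how a new finding changes a hypothesis
# on a 5-point scale (-2 = much less likely ... +2 = much more likely).
panel = [
    [+1] * 8 + [+2] * 5 + [0] * 2,   # question 1
    [-1] * 9 + [-2] * 4 + [0] * 2,   # question 2
]
model_answers = [+2, -1]
print(f"SCT score: {sct_test_score(panel, model_answers):.1f}%")  # 81.2%
```

Under this scheme, a response that diverges from the modal expert answer can still earn partial credit if a minority of the panel endorsed it, which is what makes SCT sensitive to reasoning under uncertainty rather than to a single keyed answer.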
Results
LLMs demonstrated markedly lower performance on SCTs than they typically achieve on medical multiple-choice benchmarks. GPT-4o achieved the highest performance (63.6% ± 1.2%), significantly outperforming the other models (Claude-3.5 Sonnet: 58.8% ± 1.2%, o1-preview: 58.5% ± 1.3%, Gemini-1.5-Pro: 54.4% ± 1.4%). Models matched or exceeded student performance on multiple examinations but did not reach the level of senior residents or attending physicians (Figure 2). Surprisingly, the integrated chain-of-thought o1-preview model underperformed GPT-4o, in contrast with their relative performance on other medical benchmarks.
Conclusions
SCT represents a challenging and distinctive benchmark for evaluating LLM clinical reasoning capabilities, revealing limitations not apparent in traditional MCQ-based assessments. This work demonstrates the value of SCT in providing a more nuanced evaluation of medical AI systems and highlights specific areas where current models may fall short in clinical reasoning tasks. We are making our benchmark publicly available in a secure format to foster collaborative improvement of clinical reasoning capabilities in LLMs.
Brief Summary
While large language models excel at traditional medical knowledge tests, their performance on our new public Script Concordance Test benchmark reveals important limitations in clinical reasoning capabilities, particularly in processing new information under uncertainty.