Evaluating Accuracy and Reasoning Capabilities of Large Language Models for Acute Ischemic Stroke Management
Abstract
Background
Acute ischemic stroke (AIS) management has evolved substantially over the past two decades, with mechanical thrombectomy adding complexity that requires specialized centers. As many patients initially present to primary care facilities, rapid and accurate triage is critical. Large language models (LLMs) may help bridge expertise gaps, especially where stroke specialists are not immediately available. This study evaluates the diagnostic accuracy and reasoning quality of four LLMs in determining eligibility for intravenous thrombolysis (IVT) and mechanical thrombectomy (MT), compared with experienced clinicians and real-world treatment decisions.
Methods
We retrospectively collected 80 acute ischemic stroke cases from two stroke centers. Cases were presented to both LLMs and clinicians as clinical vignettes containing demographic, clinical, and imaging data. Four LLMs (DeepSeek R1, OpenAI o3 mini, Gemini 2.0, LLaMA 3.3) and six stroke experts (two neurologists, four neuroradiologists) independently reviewed the cases and recommended one or more treatment strategies, including IVT and MT. The ground truth was defined as the institutional treatment decision. Accuracy for MT and IVT recommendations was calculated for both LLMs and clinicians. Additionally, a qualitative error analysis evaluated the reasoning ability of the LLMs.
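As an illustration of the accuracy measure described above, the following minimal sketch computes per-treatment accuracy as the fraction of cases in which a rater's recommendation matches the institutional decision. The `CaseDecision` type, the `accuracy` function, and the binary coding of each treatment are assumptions made for this example; the abstract does not specify how the study's analysis was implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseDecision:
    """Hypothetical binary coding of one case: was each treatment recommended/given?"""
    ivt: bool  # intravenous thrombolysis
    mt: bool   # mechanical thrombectomy


def accuracy(recommended, ground_truth, treatment):
    """Fraction of cases where the rater agrees with the ground truth for one treatment.

    `treatment` is the attribute name to compare, either "ivt" or "mt".
    """
    if treatment not in ("ivt", "mt"):
        raise ValueError("treatment must be 'ivt' or 'mt'")
    if len(recommended) != len(ground_truth) or not ground_truth:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(
        getattr(r, treatment) == getattr(g, treatment)
        for r, g in zip(recommended, ground_truth)
    )
    return matches / len(ground_truth)
```

With this coding, MT and IVT are scored separately for each rater, so a rater who recommends both treatments in a case can be correct on one and incorrect on the other.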
Results
The open-source reasoning model DeepSeek R1 outperformed all other LLMs and clinicians for MT (87% accuracy) and achieved 78% accuracy for IVT. Across models, accuracy was generally higher for MT than for IVT. Neurologists reached 81% (MT) and 80% (IVT), while neuroradiologists achieved 84% (MT) and 76% (IVT). Reasoning analysis for MT recommendations showed that most errors were clinically reasonable but differed from the real-world decision, whereas IVT errors were primarily due to guideline non-adherence.
Conclusions
LLMs can match or even exceed expert clinician performance in MT and IVT eligibility decisions, while providing transparent reasoning. These findings support prospective evaluation of LLM-based decision support in acute stroke care, especially in settings without immediate specialist expertise.