Cross-Model Variability in Large Language Model Triage Behavior for Potential Stroke Symptoms

Daniel A Dworkis
Jon Stenstrom
Ayan Sen
Richard T Lucarelli

Abstract

Background

Stroke is a time-sensitive neurological emergency in which early EMS activation and prompt presentation to definitive care are cornerstones of effective therapy. Large language models (LLMs) are increasingly consulted by the public for medical advice, but the reliability of the guidance that commercially available models give in response to potential stroke symptoms is not well understood.

Methods

We performed a cross-model benchmarking study comparing the triage choices of three frontier LLMs (Claude Sonnet 4.6, GPT-4o, and Llama 3.3-70b-versatile) on first-person vignettes describing a unilateral arm symptom on waking. Vignettes varied across 10 symptom descriptors and two clinical phases (before and after a partially reassuring self-examination), and were presented with or without a clinical distractor (n=50 per condition).
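The factorial design described above can be sketched as a grid of conditions. This is only an illustration, not the authors' code: the `Condition` type, the `build_grid` helper, and the label "pre-examination" are assumptions introduced here. The descriptor list holds the nine words named in the Results; the abstract reports 10 descriptors, and the tenth is not given, so it is left out rather than guessed.

```python
from itertools import product
from typing import NamedTuple


class Condition(NamedTuple):
    """One cell of the design (hypothetical structure, for illustration)."""
    model: str
    descriptor: str
    phase: str
    distractor: bool


MODELS = ["Claude Sonnet 4.6", "GPT-4o", "Llama 3.3-70b-versatile"]
# Nine descriptors named in the abstract; the study used 10 (tenth not stated).
DESCRIPTORS = ["weak", "limp", "heavy", "clumsy", "numb",
               "tingly", "odd", "strange", "weird"]
# "pre-examination" is an assumed label for the phase before self-examination.
PHASES = ["pre-examination", "post-examination"]
DISTRACTOR_LEVELS = [False, True]


def build_grid(models, descriptors, phases, distractor_levels):
    """Return every combination of model, descriptor, phase, and distractor."""
    return [
        Condition(*combo)
        for combo in product(models, descriptors, phases, distractor_levels)
    ]


grid = build_grid(MODELS, DESCRIPTORS, PHASES, DISTRACTOR_LEVELS)
print(len(grid))  # 3 models x 9 descriptors x 2 phases x 2 distractor levels
```

With the full set of 10 descriptors the same grid would have 120 cells. The abstract does not say exactly what a "condition" is for the purposes of n=50, so the sketch makes no claim about the total number of runs.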

Results

Claude sought emergency care most often, Llama least, and GPT-4o fell in between. The models diverged most sharply in the post-examination phase, where Claude called 911 in 100% of runs, Llama called for non-emergency help in 100% of runs, and GPT-4o's choice depended on the symptom descriptor. A distractor shifted behavior away from emergency care in almost all conditions: in the post-examination vignette, calling 911 fell from 37.9% to 14.6% and waiting rose from 0% to 45.9%. Responses were also sensitive to the symptom word: weak, limp, heavy, and clumsy generated higher alarm, whereas numb, tingly, odd, strange, and weird generated less urgent responses.
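Percentages like those above are per-condition shares of each triage choice. A minimal sketch of that tally follows, assuming each run has already been coded to a single label. The label names (`call_911`, `non_emergency`, `wait`) and the example runs are invented for illustration and are not data from the study.

```python
from collections import Counter


def triage_proportions(labels):
    """Return the share of runs that received each triage label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # no runs recorded for this condition
    return {label: count / total for label, count in counts.items()}


# Made-up example: 4 runs of one hypothetical condition.
example_runs = ["call_911", "call_911", "call_911", "wait"]
shares = triage_proportions(example_runs)
print(shares)
```

Comparing the share for one label across two conditions (for example, with and without a distractor) gives the kind of shift reported in the Results.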

Conclusions

The increasing use of LLMs for medical advice has significant public health implications. Commercially available LLMs show substantial model-to-model variability and framing sensitivity when confronted with potential stroke symptoms, including under-recognition of canonical CDC warning descriptors. These findings underscore the need for systematic benchmarking as these tools become de facto first points of contact for patients experiencing neurological emergencies.

Version published to 10.64898/2026.05.22.26353904 on medRxiv
May 25, 2026

Related articles

A real-world feasibility evaluation of LLM-based clinical prediction: emergency department return visit admission across two academic medical centers

This article has 12 authors:
1. Jinsong Liu
2. Katherine Brown
3. Michelle J. Ma
4. Arindam RoyChoudhury
5. Bradley A Malin
6. Allison McCoy
7. Adam Wright
8. Jessica S. Ancker
9. Tony Rosen
10. Jin Ho Han
11. Peter A D Steel
12. Yiye Zhang
This article has no evaluations. Latest version: Jun 2, 2026

Uncertainty-aware extraction of clinical findings from Finnish EHRs using open large language models

This article has 5 authors:
1. Jussi Leinonen
2. Juha Knuuttila
3. Siina Pamilo
4. Samu Kurki
5. Miika Koskinen
This article has no evaluations. Latest version: Jul 9, 2026

Role-Prompting in Frontier Large Language Models Influences Clinical Reasoning in Complex Medical Cases

This article has 8 authors:
1. Chintan Dave
2. Adrianna Diviero
3. Tashni Dassanayake
4. Salman J. Alshahrani
5. Anas Al Mardini
6. Widad Khadir
7. Ashaki D. Patel
8. Adithya Srivastava
This article has no evaluations. Latest version: Jul 1, 2026
