Clinically guided self-refining large language model for automated code stroke activation decision support in the emergency department
Abstract
Background
Code stroke activations are widely implemented to accelerate the diagnostic evaluation of suspected acute stroke in the emergency department (ED). Although effective in expediting care, false-positive activations disrupt workflows, consume resources, and delay evaluation for other patients. We evaluated whether adaptations of a large language model (LLM) could support code stroke activation decisions using information from initial ED documentation, and compared the results with those of conventional machine learning models.

Methods
We analyzed initial ED clinical notes from Seoul National University Hospital (January 2018 to December 2022). The notes contained code-mixed (Korean and English), semi-structured and unstructured free text describing the presenting illness, past medical history, and neurological deficits of patients presenting with acute neurological symptoms within 24 hours of onset. LLaMA-3.1-70B-Instruct-GPTQ-INT4 was used to translate each note into English and to augment negative labels, yielding a balanced dataset. LLaMA-3.1-8B-Instruct was fine-tuned using (1) quantized low-rank adaptation and (2) a self-refinement instruction-tuning strategy, each applied with and without clinical rule-of-thumb augmentation. Performance was evaluated using AUROC, AUPRC, F1 score, accuracy, precision, recall, specificity, and Brier score, averaged across 25 evaluations (five-fold cross-validation with five random seeds). To enhance interpretability, we conducted both token-level and section-level ablation experiments to assess the contribution of different input components to model performance.

Results
Machine learning classifiers yielded F1 scores ranging from 0.7100 (95% CI, 0.6992–0.7208) to 0.8251 (95% CI, 0.8172–0.8330). Self-refinement instruction tuning of quantized low-rank adaptation with clinical rule-of-thumb augmentation yielded an F1 score of 0.8880 (95% CI, 0.8823–0.8937) and a recall of 0.9526 (95% CI, 0.9427–0.9625).
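The evaluation protocol described above (metrics averaged over five-fold cross-validation repeated with five random seeds, 25 evaluations in total) can be sketched as follows. This is an illustrative stand-alone sketch, not the authors' code: the function names and the `train_and_predict` callback are hypothetical, and only a subset of the reported metrics (F1, recall, specificity, Brier score) is shown.

```python
import random
from statistics import mean

def binary_metrics(y_true, y_prob, threshold=0.5):
    """F1, recall, specificity, and Brier score for binary labels (0/1)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    brier = mean((p - t) ** 2 for t, p in zip(y_true, y_prob))
    return {"f1": f1, "recall": recall, "specificity": specificity, "brier": brier}

def cross_validated_scores(data, train_and_predict, n_folds=5, seeds=(0, 1, 2, 3, 4)):
    """Average metrics over n_folds x len(seeds) evaluations.

    data: list of (note, label) pairs.
    train_and_predict(train_pairs, test_notes) -> predicted probabilities.
    """
    all_scores = []
    for seed in seeds:
        shuffled = data[:]
        random.Random(seed).shuffle(shuffled)
        for k in range(n_folds):
            test = shuffled[k::n_folds]  # every n_folds-th note held out
            train = [d for i, d in enumerate(shuffled) if i % n_folds != k]
            y_true = [label for _, label in test]
            y_prob = train_and_predict(train, [note for note, _ in test])
            all_scores.append(binary_metrics(y_true, y_prob))
    # average each metric across all 25 fold-by-seed evaluations
    return {m: mean(s[m] for s in all_scores) for m in all_scores[0]}
```

In practice `train_and_predict` would wrap the fine-tuned LLM or a conventional classifier; here it is left abstract so the averaging logic stands on its own.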
Token-level and section-level ablation experiments suggested that the model relied on higher-level clinical concepts rather than isolated lexical cues, consistent with physician reasoning.

Conclusion
Self-refinement instruction tuning combined with quantized low-rank adaptation and clinical rule-of-thumb guidance offers a scalable approach to automated binary classification of code stroke activation decisions from initial emergency department documentation. Although the model outperformed traditional machine learning approaches, external validation is required before deployment as a bedside clinical decision support tool. Our approach may improve emergency department efficiency and help prioritize patients requiring urgent evaluation.
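The token-level ablation experiments mentioned above can be illustrated with a minimal occlusion-style sketch: each token is deleted in turn and the drop in the model's positive-class probability is recorded as that token's contribution. Everything here is hypothetical, the `toy_scorer` is a keyword stand-in for the fine-tuned LLM, and real notes would need proper tokenization rather than whitespace splitting.

```python
def token_ablation(note, predict_proba):
    """Return (token, importance) pairs for a single note.

    importance = baseline probability minus the probability after
    deleting that token (occlusion-style ablation).
    """
    tokens = note.split()
    baseline = predict_proba(" ".join(tokens))
    importances = []
    for i, token in enumerate(tokens):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        importances.append((token, baseline - predict_proba(ablated)))
    return importances

def toy_scorer(text):
    # Stand-in classifier: keyword-weighted probability, NOT the real model.
    score = 0.5
    if "hemiparesis" in text:
        score += 0.3
    if "aphasia" in text:
        score += 0.1
    return min(score, 1.0)
```

For example, `token_ablation("sudden right hemiparesis with aphasia", toy_scorer)` assigns a large importance to "hemiparesis" and near-zero importance to filler tokens; a section-level variant would delete whole note sections (presenting illness, past medical history, neurological deficits) instead of single tokens.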