Clinically guided self-refining large language model for automated code stroke activation decision support in the emergency department
Abstract
Background
Code stroke activations are widely implemented to accelerate the diagnostic evaluation of suspected acute stroke in the emergency department (ED). Although effective in expediting care, false-positive activations disrupt workflows, consume resources, and delay evaluation for other patients. We evaluated whether adaptations of a large language model (LLM) could support code stroke activation decisions using information from initial ED documentation, and compared the results with those of conventional machine learning models.

Methods
We analyzed initial ED clinical notes from Seoul National University Hospital (January 2018 to December 2022). The notes contained code-mixed (Korean and English), semi-structured and unstructured free text describing the presenting illness, past medical history, and neurological deficits of patients presenting with acute neurological symptoms within 24 hours of onset. LLaMA-3.1-70B-Instruct-GPTQ-INT4 was used to translate each note into English and to augment negative labels, yielding a balanced dataset. LLaMA-3.1-8B-Instruct was fine-tuned using (1) quantized low-rank adaptation and (2) a self-refinement instruction-tuning strategy, each applied with and without clinical rule-of-thumb augmentation. Performance was evaluated using AUROC, AUPRC, F1 score, accuracy, precision, recall, specificity, and Brier score, averaged across 25 evaluations (five-fold cross-validation with five random seeds). To enhance interpretability, we conducted both token-level and section-level ablation experiments to assess the contribution of different input components to model performance.

Results
Machine learning classifiers yielded F1 scores ranging from 0.7100 (95% CI, 0.6992–0.7208) to 0.8251 (95% CI, 0.8172–0.8330). Self-refinement instruction tuning of quantized low-rank adaptation with clinical rule-of-thumb augmentation yielded an F1 score of 0.8880 (95% CI, 0.8823–0.8937) and a recall of 0.9526 (95% CI, 0.9427–0.9625).
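The evaluation protocol described above (metrics averaged over five-fold cross-validation repeated with five random seeds, 25 evaluations in total) can be sketched as follows. This is an illustrative stand-alone sketch, not the authors' code: the function names and the `train_and_predict` callback are hypothetical, and only a subset of the reported metrics (F1, recall, specificity, Brier score) is shown.

```python
import random
from statistics import mean

def binary_metrics(y_true, y_prob, threshold=0.5):
    """F1, recall, specificity, and Brier score for binary labels (0/1)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    brier = mean((p - t) ** 2 for t, p in zip(y_true, y_prob))
    return {"f1": f1, "recall": recall, "specificity": specificity, "brier": brier}

def cross_validated_scores(data, train_and_predict, n_folds=5, seeds=(0, 1, 2, 3, 4)):
    """Average metrics over n_folds x len(seeds) evaluations.

    data: list of (note, label) pairs.
    train_and_predict(train_pairs, test_notes) -> predicted probabilities.
    """
    all_scores = []
    for seed in seeds:
        shuffled = data[:]
        random.Random(seed).shuffle(shuffled)
        for k in range(n_folds):
            test = shuffled[k::n_folds]  # every n_folds-th note held out
            train = [d for i, d in enumerate(shuffled) if i % n_folds != k]
            y_true = [label for _, label in test]
            y_prob = train_and_predict(train, [note for note, _ in test])
            all_scores.append(binary_metrics(y_true, y_prob))
    # average each metric across all 25 fold-by-seed evaluations
    return {m: mean(s[m] for s in all_scores) for m in all_scores[0]}
```

In practice `train_and_predict` would wrap the fine-tuned LLM or a conventional classifier; here it is left abstract so the averaging logic stands on its own.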
Token-level and section-level ablation experiments suggested that the model relied on higher-level clinical concepts rather than isolated lexical cues, consistent with physician reasoning.

Conclusion
Self-refinement instruction tuning combined with quantized low-rank adaptation and clinical rule-of-thumb guidance offers a scalable approach to automated binary classification of code stroke activation decisions from initial emergency department documentation. Although the model outperformed traditional machine learning approaches, external validation is required before deployment as a bedside clinical decision support tool. Our approach may improve emergency department efficiency and help prioritize patients requiring urgent evaluation.
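The token-level ablation experiments mentioned above can be illustrated with a minimal occlusion-style sketch: each token is deleted in turn and the drop in the model's positive-class probability is recorded as that token's contribution. Everything here is hypothetical, the `toy_scorer` is a keyword stand-in for the fine-tuned LLM, and real notes would need proper tokenization rather than whitespace splitting.

```python
def token_ablation(note, predict_proba):
    """Return (token, importance) pairs for a single note.

    importance = baseline probability minus the probability after
    deleting that token (occlusion-style ablation).
    """
    tokens = note.split()
    baseline = predict_proba(" ".join(tokens))
    importances = []
    for i, token in enumerate(tokens):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        importances.append((token, baseline - predict_proba(ablated)))
    return importances

def toy_scorer(text):
    # Stand-in classifier: keyword-weighted probability, NOT the real model.
    score = 0.5
    if "hemiparesis" in text:
        score += 0.3
    if "aphasia" in text:
        score += 0.1
    return min(score, 1.0)
```

For example, `token_ablation("sudden right hemiparesis with aphasia", toy_scorer)` assigns a large importance to "hemiparesis" and near-zero importance to filler tokens; a section-level variant would delete whole note sections (presenting illness, past medical history, neurological deficits) instead of single tokens.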