Cloud vs. On-Premise Large Language Models for Urgent Patient-Portal Message Screening: A Comparative Evaluation

Abstract

Importance: Patient portal messaging has become a core feature of outpatient care, particularly in neurology. In epilepsy care, timely triage of urgent symptoms, such as breakthrough seizures or adverse medication effects, and efficient evaluation of urgency level are critical to patient safety. However, increasing message volume and a nationwide neurologist shortage have intensified clinician burden and delayed response times. Large language models (LLMs) may offer a scalable solution. A key step toward this goal is to compare performance across cloud-based and locally deployable models and to estimate the impact of their differences in high-stakes clinical contexts.

Objective: To evaluate the urgency and message-type classification performance of six LLMs, three commercial cloud-hosted (GPT-4o, GPT-5, GPT-5 Mini) and three locally deployable open-weight models (Llama 4 Scout, GPT-OSS 20B, Gemma 3 27B), against a reference standard in outpatient epilepsy care.

Design, Setting, and Participants: Retrospective diagnostic accuracy study of 503 de-identified patient portal messages from adult outpatients at a tertiary epilepsy clinic. Five epilepsy fellowship-trained neurologists independently annotated each message using a standard operating procedure (SOP) with high inter-rater reliability (Fleiss' κ ≥ 0.80). Analyses were stratified by three non-mutually exclusive levels of physician consensus: Unanimous (5/5), Majority (≥ 3/5), and Any MD Match.

Main Outcomes and Measures: Primary outcomes were sensitivity and negative predictive value (NPV) for urgency classification under the Unanimous and Majority reference strata. Secondary outcomes were specificity, positive predictive value (PPV), overall accuracy, and message-type classification accuracy.

Results: Under the Unanimous reference standard, five of six models achieved perfect sensitivity and NPV, indicating safe rule-out performance. Under the Majority standard, GPT-5 achieved the highest sensitivity (0.98) and NPV (1.00), while GPT-4o and Llama 4 Scout offered balanced performance with strong specificity (0.87–0.88) and NPV (≥ 0.97). GPT-OSS 20B demonstrated high specificity (0.95) but lower sensitivity (0.57), while Gemma 3 27B provided intermediate performance and supports full on-premise deployment. GPT-5 Mini offered a cost-efficient cloud alternative with solid overall performance, though reproducibility was limited by non-configurable decoding.

Conclusions and Relevance: In high-risk outpatient neurology, both cloud-hosted and locally deployed LLMs demonstrated screening-level performance comparable to that of epilepsy fellowship-trained neurologists. Trade-offs between sensitivity and specificity allow institutions to tailor model selection to operational goals, whether minimizing false negatives, reducing alert burden, or ensuring Protected Health Information (PHI) containment. These results support the safe, scalable, and privacy-preserving deployment of LLM-powered triage systems across digitally burdened clinical neurology settings.
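The abstract's outcome measures (sensitivity, specificity, PPV, NPV, accuracy) and the Fleiss' κ reliability statistic are all computable from first principles. The sketch below illustrates the standard definitions only; it is not the authors' analysis code, and the confusion-matrix counts and rating table are hypothetical examples.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard screening metrics from a 2x2 confusion matrix
    (urgent = positive class)."""
    return {
        "sensitivity": tp / (tp + fn),   # fraction of urgent messages caught
        "specificity": tn / (tn + fp),   # fraction of non-urgent ruled out
        "ppv": tp / (tp + fp),           # precision of "urgent" flags
        "npv": tn / (tn + fn),           # confidence in a "not urgent" call
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def fleiss_kappa(ratings):
    """Fleiss' kappa for multi-rater agreement.

    `ratings[i][j]` = number of raters assigning item i to category j;
    every item must be rated by the same number of raters.
    """
    n = len(ratings)       # number of items
    r = sum(ratings[0])    # raters per item
    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings
    ) / n
    # Chance agreement from marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_e = sum((t / (n * r)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical counts for illustration only (not the study's data):
m = diagnostic_metrics(tp=45, fp=10, tn=440, fn=5)
# Three items rated by five raters with perfect agreement -> kappa = 1.0
kappa = fleiss_kappa([[5, 0], [0, 5], [5, 0]])
```

In a screening context like this study's, sensitivity and NPV are the safety-critical quantities, since a false negative means an urgent message goes untriaged.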
