Cloud vs. On-Premise Large Language Models for Urgent Patient-Portal Message Screening: A Comparative Evaluation
Abstract
Importance
Patient portal messaging has become a core feature of outpatient care, particularly in neurology. In epilepsy care, timely triage of urgent symptoms, such as breakthrough seizures or adverse medication effects, and efficient evaluation of urgency are critical to patient safety. However, increasing message volume and a nationwide neurologist shortage have intensified clinician burden and delayed response times. Large language models (LLMs) may offer a scalable solution. A key step toward this goal is to compare performance across cloud-based and locally deployable models and to estimate the impact of their differences in high-stakes clinical contexts.

Objective
To evaluate the urgency and message-type classification performance of six LLMs, three commercial cloud-hosted models (GPT-4o, GPT-5, GPT-5 Mini) and three locally deployable open-weight models (Llama 4 Scout, GPT-OSS 20B, Gemma 3 27B), against a reference standard in outpatient epilepsy care.

Design, Setting, and Participants
Retrospective diagnostic accuracy study of 503 de-identified patient portal messages from adult outpatients at a tertiary epilepsy clinic. Five epilepsy fellowship-trained neurologists independently annotated each message using a standard operating procedure (SOP), with high inter-rater reliability (Fleiss' κ ≥ 0.80). Analyses were stratified by three non-mutually exclusive levels of physician consensus: Unanimous (5/5), Majority (≥ 3/5), and Any MD Match.

Main Outcomes and Measures
Primary outcomes were sensitivity and negative predictive value (NPV) for urgency classification under the Unanimous and Majority reference strata. Secondary outcomes included specificity, positive predictive value (PPV), overall accuracy, and message-type classification accuracy.

Results
Under the Unanimous reference standard, five of six models achieved perfect sensitivity and NPV, indicating safe rule-out performance.
Under the Majority reference standard, GPT-5 achieved the highest sensitivity (0.98) and NPV (1.00), while GPT-4o and Llama 4 Scout offered balanced performance with strong specificity (0.87–0.88) and NPV (≥ 0.97). GPT-OSS 20B demonstrated high specificity (0.95) but lower sensitivity (0.57), while Gemma 3 27B provided intermediate performance and supports full on-premise deployment. GPT-5 Mini offered a cost-efficient cloud alternative with solid overall performance, though reproducibility was limited by non-configurable decoding.

Conclusions and Relevance
In high-risk outpatient neurology, both cloud-hosted and locally deployed LLMs demonstrated screening-level performance comparable to that of epilepsy fellowship-trained neurologists. Trade-offs between sensitivity and specificity allow institutions to tailor model selection to operational goals, whether minimizing false negatives, reducing alert burden, or ensuring protected health information (PHI) containment. These results support the safe, scalable, and privacy-preserving deployment of LLM-powered triage systems across digitally burdened clinical neurology settings.
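For readers less familiar with the diagnostic-accuracy metrics reported above, the following minimal Python sketch shows how sensitivity, specificity, PPV, NPV, and overall accuracy are derived from confusion-matrix counts. The counts used in the example are hypothetical illustrations, not the study's data.

```python
# Illustrative computation of screening metrics from confusion-matrix counts.
# tp/fp/fn/tn = true positive, false positive, false negative, true negative.

def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return standard diagnostic-accuracy metrics for a binary screen."""
    return {
        "sensitivity": tp / (tp + fn),  # rule-out safety: few missed urgent messages
        "specificity": tn / (tn + fp),  # alert burden: few false urgent flags
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts for a 503-message set (56 urgent, 447 non-urgent)
m = screening_metrics(tp=55, fp=54, fn=1, tn=393)
print({k: round(v, 2) for k, v in m.items()})
```

A high NPV with high sensitivity, as reported for the top models, is what supports "safe rule-out" use: a negative screen rarely conceals a truly urgent message.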