Cloud vs. On-Premise Large Language Models for Urgent Patient-Portal Message Screening: A Comparative Evaluation
Abstract
Importance
Patient portal messaging has become a core feature of outpatient care, particularly in neurology. In epilepsy care, timely triage of urgent symptoms, such as breakthrough seizures or adverse medication effects, and efficient evaluation of urgency are critical to patient safety. However, increasing message volume and a nationwide neurologist shortage have intensified clinician burden and delayed response times. Large language models (LLMs) may offer a scalable solution. A key step toward this goal is to compare performance across cloud-based and locally deployable models and to estimate the impact of their differences in high-stakes clinical contexts.

Objective
To evaluate the urgency and message-type classification performance of six LLMs, three commercial cloud-hosted models (GPT-4o, GPT-5, GPT-5 Mini) and three locally deployable open-weight models (Llama 4 Scout, GPT-OSS 20B, Gemma 3 27B), against a reference standard in outpatient epilepsy care.

Design, Setting, and Participants
Retrospective diagnostic accuracy study of 503 de-identified patient portal messages from adult outpatients at a tertiary epilepsy clinic. Five epilepsy fellowship-trained neurologists independently annotated each message using a standard operating procedure (SOP), with high inter-rater reliability (Fleiss' κ ≥ 0.80). Analyses were stratified by three non-mutually exclusive levels of physician consensus: Unanimous (5/5), Majority (≥ 3/5), and Any MD Match.

Main Outcomes and Measures
Primary outcomes were sensitivity and negative predictive value (NPV) for urgency classification under the Unanimous and Majority reference strata. Secondary outcomes included specificity, positive predictive value (PPV), overall accuracy, and message-type classification accuracy.

Results
Under the Unanimous reference standard, five of six models achieved perfect sensitivity and NPV, indicating safe rule-out performance.
Under the Majority reference standard, GPT-5 achieved the highest sensitivity (0.98) and NPV (1.00), while GPT-4o and Llama 4 Scout offered balanced performance with strong specificity (0.87–0.88) and NPV (≥ 0.97). GPT-OSS 20B demonstrated high specificity (0.95) but lower sensitivity (0.57), while Gemma 3 27B provided intermediate performance and supports full on-premise deployment. GPT-5 Mini offered a cost-efficient cloud alternative with solid overall performance, though reproducibility was limited by non-configurable decoding.

Conclusions and Relevance
In high-risk outpatient neurology, both cloud-hosted and locally deployed LLMs demonstrated screening-level performance comparable to that of epilepsy fellowship-trained neurologists. Trade-offs between sensitivity and specificity allow institutions to tailor model selection to operational goals, whether minimizing false negatives, reducing alert burden, or ensuring protected health information (PHI) containment. These results support the safe, scalable, and privacy-preserving deployment of LLM-powered triage systems across digitally burdened clinical neurology settings.
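For readers less familiar with the diagnostic-accuracy metrics reported above, the following minimal Python sketch shows how sensitivity, specificity, PPV, NPV, and overall accuracy are derived from confusion-matrix counts. The counts used in the example are hypothetical illustrations, not the study's data.

```python
# Illustrative computation of screening metrics from confusion-matrix counts.
# tp/fp/fn/tn = true positive, false positive, false negative, true negative.

def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return standard diagnostic-accuracy metrics for a binary screen."""
    return {
        "sensitivity": tp / (tp + fn),  # rule-out safety: few missed urgent messages
        "specificity": tn / (tn + fp),  # alert burden: few false urgent flags
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts for a 503-message set (56 urgent, 447 non-urgent)
m = screening_metrics(tp=55, fp=54, fn=1, tn=393)
print({k: round(v, 2) for k, v in m.items()})
```

A high NPV with high sensitivity, as reported for the top models, is what supports "safe rule-out" use: a negative screen rarely conceals a truly urgent message.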