Poetic or Prosaic? Evaluating the Linguistic Quality of AI-Generated Draft Replies to Patient Portal Messages
Abstract
Background: The use of generative artificial intelligence (genAI) in healthcare is increasing, including GPT-generated draft replies (GDRs) to patient messages in Epic Systems’ electronic health record (EHR). We evaluated GDR use, quality, and impact in a large academic health system.

Methods: Thirty primary care physicians received GDRs from September 2023 to August 2024 during a staged rollout. Messages were grouped into baseline (GDRs not shown) and intervention (GDRs used). We evaluated messages using BLEU, ROUGE, cosine similarity, BERTScore, token counts, and Flesch Reading Ease. We compared the baseline and intervention groups, and compared later prompt refinement phases against the first (Phases 2–4 vs. Phase 1). Blinded evaluations of message quality were conducted via surveys, and BERTScores were correlated with physician evaluations of effectiveness, misunderstanding, and harm.

Results: Of 66,200 GDRs generated, 21,073 were presented and 2,264 (11% of those presented) were used. Used GDRs showed alignment with final messages [BLEU 0.49 (95% CI: 0.43–0.56), ROUGE-L 0.60 (0.54–0.66)], with high BERTScores (F1 > 0.9). Final messages were longer and more readable. Prompt refinements increased token retention. GDR usage declined over time, yet providers reported time savings and reduced cognitive load. BERTScores correlated strongly with physician feedback on effectiveness and safety in the intervention group.

Conclusions: GPT-generated drafts show strong semantic alignment with physician messages and may support efficient communication. However, usage trends and readability challenges underscore the need for improved prompt design and better workflow integration. Quantitative metrics such as BERTScore, when paired with physician feedback, offer a scalable framework for evaluating AI-assisted messaging in healthcare.
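To make the lexical metrics named in the Methods more concrete, the following is a minimal Python sketch of how a draft and a final message might be compared with ROUGE-L, bag-of-words cosine similarity, and Flesch Reading Ease. The abstract does not say how the study computed these metrics, so everything here is an illustrative assumption: the function names, the simple regex tokenizer, and the vowel-group syllable heuristic are all hypothetical. BLEU and BERTScore are omitted because they usually rely on external libraries and pretrained models.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase and split into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(draft, final):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    d, f = tokenize(draft), tokenize(final)
    if not d or not f:
        return 0.0
    lcs = lcs_length(d, f)
    if lcs == 0:
        return 0.0
    precision = lcs / len(d)  # share of draft tokens kept in order
    recall = lcs / len(f)     # share of final tokens covered by the draft
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(draft, final):
    """Cosine similarity between raw term-count vectors (no TF-IDF, no embeddings)."""
    a, b = Counter(tokenize(draft)), Counter(tokenize(final))
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def count_syllables(word):
    """Rough heuristic: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)."""
    words = tokenize(text)
    if not words:
        return 0.0
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))


if __name__ == "__main__":
    draft = "Please take your medication twice a day with food."
    final = "Please take your medication twice daily with food, and call us if symptoms persist."
    print(round(rouge_l_f1(draft, final), 3))
    print(round(cosine_similarity(draft, final), 3))
    print(round(flesch_reading_ease(final), 1))
```

Higher ROUGE-L and cosine values indicate that more of the draft survived into the sent message, and a higher Flesch score indicates easier reading. The two example messages are invented for illustration and are not drawn from the study data.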