Physician- versus Large Language Model-Generated Clinical Summaries in the Emergency Department


Abstract

Background

As part of routine practice and documentation, emergency department (ED) clinicians construct "one-liner" summaries—brief, information-rich statements distilling a patient's history and presentation to support rapid decision-making. Producing these summaries is cognitively demanding and contributes to documentation burden. Large language models (LLMs) may assist by synthesizing longitudinal electronic health record (EHR) data.

Methods

We conducted a blinded, within-subject study of 99 ED encounters from March 2022–March 2024 at the University of California, San Francisco. We used an LLM to generate one-liner summaries using a k-nearest-neighbor few-shot prompting approach and clinical notes spanning multiple prior encounters. Twenty-one emergency physicians evaluated paired LLM- and physician-authored summaries in randomized order, rating accuracy, completeness, and clinical utility on 5-point Likert scales and indicating their overall preference with optional free-text explanation. Ratings were analyzed using linear mixed-effects models with summary type as a fixed effect and reviewer as a random intercept. Secondary analyses examined the LLM’s note-selection behavior and how inclusion of specific note types affected summary quality. We used rapid content analysis to review free-text explanations, identifying recurrent themes among reviewer preferences.
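The k-nearest-neighbor few-shot prompting approach described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the toy embeddings, example encounters, and the `knn_few_shot_prompt` helper are all hypothetical stand-ins for a real text encoder and real clinical notes.

```python
import numpy as np

def knn_few_shot_prompt(query_vec, example_vecs, examples, k=2):
    """Select the k stored examples whose embeddings are most similar to
    the query encounter (cosine similarity) and assemble them into a
    few-shot prompt of (notes, one-liner) demonstrations."""
    sims = example_vecs @ query_vec / (
        np.linalg.norm(example_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(-sims)[:k]  # indices of the k most similar examples
    return "\n\n".join(
        f"Notes:\n{examples[i][0]}\nOne-liner:\n{examples[i][1]}" for i in top
    )

# Toy 2-D embeddings stand in for a real encoder's output.
examples = [
    ("chest pain visit notes", "55M w/ CAD presenting with chest pain."),
    ("ankle injury visit notes", "23F with ankle inversion injury."),
    ("dyspnea visit notes", "70M w/ COPD presenting with dyspnea."),
]
example_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.3]])
query_vec = np.array([1.0, 0.1])  # new encounter's embedding

prompt = knn_few_shot_prompt(query_vec, example_vecs, examples, k=2)
```

The retrieved demonstrations would then be prepended to the new encounter's notes before asking the LLM for a one-liner, so the model sees stylistically similar worked examples.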

Results

Across all dimensions, LLM-generated summaries received higher ratings than physician-authored summaries. Estimated marginal means (SE) for accuracy were 4.18 (0.09) vs 3.40 (0.11) (β = 0.78; 95% CI 0.50–1.07; p < .001), for completeness 3.69 (0.10) vs 3.25 (0.12) (β = 0.44; 95% CI 0.14–0.74; p = .005), and for clinical utility 3.88 (0.10) vs 3.21 (0.12) (β = 0.67; 95% CI 0.35–0.99; p < .001). LLM-generated summaries were preferred in 50.5% of encounters, physician summaries in 38.4%, and 11.1% were rated equivalent. Qualitative analysis indicated that LLM summaries were often more inclusive and neutrally phrased, whereas physician summaries exhibited greater contextual nuance but occasionally omitted key details.
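The reported β coefficients come from linear mixed-effects models with summary type as a fixed effect and reviewer as a random intercept. A minimal sketch of that analysis, using synthetic ratings rather than the study's data (the reviewer count and the 0.78 effect size are borrowed from the abstract purely to shape the simulation):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate ratings: each of 21 reviewers has their own baseline
# (random intercept), and LLM summaries carry a fixed +0.78 advantage.
rng = np.random.default_rng(0)
rows = []
for reviewer in range(21):
    baseline = rng.normal(3.3, 0.3)          # reviewer-specific intercept
    for encounter in range(5):
        for is_llm in (0, 1):                # paired summaries per encounter
            rating = baseline + 0.78 * is_llm + rng.normal(0, 0.4)
            rows.append(
                {"reviewer": reviewer, "is_llm": is_llm, "rating": rating}
            )
df = pd.DataFrame(rows)

# Summary type as a fixed effect, reviewer as a random intercept.
model = smf.mixedlm("rating ~ is_llm", df, groups=df["reviewer"])
fit = model.fit()
beta_llm = fit.fe_params["is_llm"]  # estimated fixed effect of summary type
```

With the random intercept absorbing between-reviewer baseline differences, the recovered fixed effect lands near the simulated 0.78 advantage.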

Conclusions

In this blinded evaluation of ED encounters, LLM-generated one-liner summaries outperformed physician-authored summaries on accuracy, completeness, and clinical utility. Patterns in note utilization suggest that models selectively integrate high-yield clinical sources, which may have important implications for cost and efficiency in future healthcare deployments. These findings represent an important first step toward leveraging LLMs to aid rapid synthesis of complex EHR data in high-stakes environments.