Scalable screening for emergency department missed opportunities for diagnosis using sequential eTriggers and large language models


Abstract

Importance

Missed opportunities for diagnosis (MODs), sometimes termed diagnostic errors, are a major cause of patient morbidity and mortality in the emergency department (ED). EDs have employed eTriggers, rule-based collections of cases likely to have a higher-than-average error rate (e.g., 72-hour returns with admission), but their utility is limited by low error yields. Large language models (LLMs) offer new opportunities to identify MODs and contribute to both individual- and systems-level quality improvement.
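
As a rough sketch of how such a rule-based eTrigger might be expressed (the data source, column names, and pairing logic below are illustrative assumptions, not the study's implementation):

```python
# Illustrative sketch only: a rule-based "72-hour return with admission" eTrigger.
# Column names (patient_id, ed_arrival, disposition) are hypothetical; real
# implementations would query the EHR or a clinical data warehouse.
import pandas as pd

def seventy_two_hour_return_trigger(visits: pd.DataFrame) -> pd.DataFrame:
    """Flag index ED discharges followed by a return visit with admission within 72 hours."""
    visits = visits.sort_values(["patient_id", "ed_arrival"])
    flagged = []
    for _, group in visits.groupby("patient_id"):
        rows = group.to_dict("records")
        for index_visit, return_visit in zip(rows, rows[1:]):
            within_72h = (return_visit["ed_arrival"] - index_visit["ed_arrival"]) <= pd.Timedelta(hours=72)
            if (index_visit["disposition"] == "discharged"
                    and return_visit["disposition"] == "admitted"
                    and within_72h):
                flagged.append(index_visit)
    return pd.DataFrame(flagged)
```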

Objective

To determine whether sequential screening of ED cases with eTriggers and an LLM can more efficiently identify MODs compared to eTriggers alone.

Design

Retrospective observational cohort study of ED encounters collected between March 2015 and June 2025.

Setting

10 EDs (2 academic, 8 community) in a single US health system.

Participants

Emergency physicians reviewed and adjudicated random samples of cases identified by 3 previously validated eTriggers (72-hour return with admission, 10-day return with ICU admission, and floor-to-ICU escalation within 24 hours) using the Safer Dx instrument. An ED physician also evaluated a novel hybrid eTrigger combining an LLM adjudicator with a rules engine for 9-day return admissions with emergency care-sensitive conditions (ECSCs).

Exposures

LLM MOD adjudication of ED cases with Claude Sonnet 4 using an iteratively developed, standardized prompt incorporating the Safer Dx instrument.
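
A minimal sketch of what such an LLM adjudication call might look like (the prompt wording, model identifier, and JSON output contract here are assumptions for illustration, not the study's actual prompt or pipeline):

```python
# Illustrative sketch only: adjudicating one ED chart for a possible MOD with an LLM.
# The prompt text, model identifier, and expected JSON schema are assumptions.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

ADJUDICATION_PROMPT = (
    "You are reviewing an emergency department chart for a missed opportunity for "
    "diagnosis (MOD), guided by the Safer Dx domains (history, exam, test ordering "
    "and interpretation, follow-up). Return JSON with keys: mod_present (true/false), "
    "confidence (0-1), and a brief case summary."
)

def adjudicate_chart(chart_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed identifier for Claude Sonnet 4
        max_tokens=1024,
        system=ADJUDICATION_PROMPT,
        messages=[{"role": "user", "content": chart_text}],
    )
    # Assumes the model returns valid JSON; a production pipeline would validate this.
    return json.loads(response.content[0].text)
```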

Main Outcome(s) and Measure(s)

Positive predictive value (PPV), sensitivity, specificity, negative predictive value (NPV), and number needed to screen (NNS) for MODs. Reviewer time to adjudicate cases and quality improvement stakeholder assessments of LLM case summaries were also measured.
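
For reference, these screening metrics follow from standard confusion-matrix counts; a small sketch of how they relate (TP/FP/FN/TN are generic placeholders, not study data):

```python
# Standard screening metrics from confusion-matrix counts (TP, FP, FN, TN),
# where a "positive" is a flagged case and "true" means a reviewer-adjudicated MOD.
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    sensitivity = tp / (tp + fn)   # proportion of true MODs flagged
    specificity = tn / (tn + fp)   # proportion of non-MODs correctly cleared
    ppv = tp / (tp + fp)           # probability a flagged case is a true MOD
    npv = tn / (tn + fn)           # probability a cleared case is MOD-free
    nns = 1 / ppv                  # number needed to screen per MOD found
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "nns": nns}
```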

Results

Of the 357 encounters (mean [SD] age, 65.2 [17.8] years; 47.1% female) reviewed, adjudicated MOD PPV ranged from 11.0% to 18.6% across traditional eTriggers. For 72-hour return admissions, the LLM achieved sensitivity 85.7% (95% CI, 65.4%-95.0%), specificity 56.8% (95% CI, 49.3%-64.0%), PPV 19.8%, and NPV 97.0%. For 10-day ICU returns, sensitivity was 100% (95% CI, 56.6%-100%), specificity 43.5% (95% CI, 25.6%-63.2%), PPV 27.8%, and NPV 100%. For floor-to-ICU escalations, sensitivity was 55.6% (95% CI, 33.7%-75.4%), specificity 64.6% (95% CI, 53.6%-74.2%), PPV 26.3%, and NPV 86.4%. The hybrid ECSC eTrigger identified 110 MODs (53.1% of 207 encounters), with blinded review of a stratified sample estimating a PPV of 45% and an NPV of 100%. Expert reviewers required a median of 5 minutes per case; restricting human review to LLM-positive charts reduced review time by up to 50% with no missed errors for these triggers. In stakeholder review, LLM-generated case summaries were rated highly actionable for individual clinician feedback (mean, 4.1 of 5) but less so for systems-level interventions (mean, 1.4 of 5).
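
A conceptual sketch of the sequential screening workflow this implies (the callables and chart structure are hypothetical placeholders, not the study's code; the point is that only LLM-positive charts reach the human reviewer):

```python
# Conceptual sketch of sequential screening: eTrigger rules -> LLM adjudication -> human review.
# The rule predicate and LLM adjudicator are supplied by the caller (e.g., sketches above).
from typing import Callable, Iterable, List

def sequential_screen(
    charts: Iterable[dict],
    etrigger_rule: Callable[[dict], bool],
    llm_adjudicator: Callable[[str], dict],
) -> List[dict]:
    flagged = [c for c in charts if etrigger_rule(c)]                  # step 1: rules engine
    llm_positive = [c for c in flagged
                    if llm_adjudicator(c["text"]).get("mod_present")]  # step 2: LLM screen
    return llm_positive                                                # step 3: human review queue
```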

Conclusions and Relevance

In this multisite retrospective study, LLMs demonstrated high NPVs across multiple eTrigger criteria. Sequential use of LLM and human review improved efficiency and detection compared with traditional eTriggers, and narrative case summaries offered a novel method to identify opportunities for clinician-level feedback. These findings suggest that LLM-based approaches may provide scalable diagnostic quality oversight in the ED.

Key Points

Question

Can sequential screening with eTriggers and a large language model (LLM) identify missed opportunities for diagnosis (MODs) in the emergency department and improve screening efficiency compared with traditional eTriggers?

Findings

In a multicenter retrospective cohort (10 EDs; 317 reviewed encounters), LLM adjudication showed high sensitivity and NPV across three established eTriggers (e.g., 72-hour returns: sensitivity 85.7%, NPV 97.0%; 10-day ICU returns: sensitivity 100%, NPV 100%). A sequential approach was validated on a novel eTrigger for 9-day returns with select emergency care-sensitive conditions, achieving PPV 45% and NPV 100% in a blinded stratified sample of 40 cases.

Meaning

LLM-augmented eTrigger screening offers scalable, efficient MOD detection to support diagnostic quality oversight in EDs.
