A Three-Tier Operational Benchmark for Evaluating Large Language Models on Hospital Medication Safety

Joshua Proulx
Bryce Daines
Michael Barton
Molly E. Leonard
Joseph A. Garcia
Bronson Young
Quinn Snell
Timothy W. West
Sam R. Watson
Maryam AlQaseer
Mathieu Louiset
Muhammad Bilal Maqsood
Mary J. Voutt-Goos
Caryn Douma
Nishaminy Kasbekar
Jaclyn Jeffries
Wadie Abu-Rahmeh
Karen Frush
Darshan K. Grewal
Mouna Bahsoun
Michael Leonard
Allan Frankel
David C. Classen
Stanley L. Pestotnik


Abstract

Objective

To introduce PsiBench, a clinically validated medication-safety benchmark for evaluating large language models (LLMs) against the standards used to certify hospital computerized provider order entry (CPOE) and electronic health record (EHR) systems, together with a non-overlapping three-tier evaluation framework that separates highest-stakes discrimination, the operational clinical decision support (CDS) regime, and category-correct alerting.

Materials and Methods

PsiBench comprises 492 medication-safety scenarios across 11 safety categories, created by clinical pharmacology experts whose work underpins an annualized testing procedure used by more than 2,000 U.S. hospitals. The three-tier framework partitions the scenarios without overlap: Discrimination (98 scenarios: 50 fatal vs 48 deception, a near-balanced 51%/49% split); Operational (394 scenarios: 261 serious unsafe plus 133 safe, the latter including 41 Excessive Alerts reclassified as operational negatives); and Attribution (311 alert-required scenarios). We evaluated 40 frontier LLMs from 10 providers over 3 runs per scenario at temperature 0.2 (or the provider default where temperature is not configurable), yielding 59,040 evaluations (40 models × 492 scenarios × 3 runs), conducted April 21–23, 2026.
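The counts in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the figures quoted in the abstract; the dictionary keys and function names are illustrative, not taken from the PsiBench materials. It also notes that the fatal and serious-unsafe counts sum to 311, the size of the Attribution tier; reading Attribution as the union of those two groups is an assumption, not something the abstract states.

```python
# Scenario counts as quoted in the abstract (names are illustrative).
TOTAL_SCENARIOS = 492
DISCRIMINATION = {"fatal": 50, "deception": 48}
OPERATIONAL = {"serious_unsafe": 261, "safe": 133}
ATTRIBUTION_ALERT_REQUIRED = 311

N_MODELS = 40
RUNS_PER_SCENARIO = 3


def tier_size(tier):
    """Total number of scenarios in one tier."""
    return sum(tier.values())


def total_evaluations(n_models, n_scenarios, runs):
    """Each model answers each scenario once per run."""
    return n_models * n_scenarios * runs


def summarize():
    disc = tier_size(DISCRIMINATION)
    oper = tier_size(OPERATIONAL)
    return {
        "discrimination": disc,
        "operational": oper,
        # Discrimination and Operational together should cover the benchmark.
        "partition_total": disc + oper,
        "fatal_share": round(DISCRIMINATION["fatal"] / disc, 2),
        # Assumption: alert-required = fatal + serious unsafe.
        "fatal_plus_serious": DISCRIMINATION["fatal"] + OPERATIONAL["serious_unsafe"],
        "evaluations": total_evaluations(N_MODELS, TOTAL_SCENARIOS, RUNS_PER_SCENARIO),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

Under these figures the Discrimination and Operational tiers account for all 492 scenarios, and the evaluation count matches the reported 59,040.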

Results

Headline binary performance on the full benchmark varies widely across the 40 models: F1 78.5%–92.3%, accuracy 65.4%–89.8%, sensitivity 81.4%–100.0%, specificity 6.1%–81.8%. The leading models by F1 (o4-mini 92.3%; o3 92.2%) pair high sensitivity with meaningful specificity; three models saturate sensitivity at 100% but fall below 25% specificity, making them indistinguishable from a naive always-alert classifier. The wide spread on a single headline metric motivates tier-specific analyses, developed in a separate clinical paper.
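To make the always-alert comparison concrete, the following minimal sketch computes the four headline metrics from a confusion matrix and applies them to a classifier that alerts on every scenario. It uses the standard definitions of sensitivity, specificity, accuracy, and F1. The class split is an assumption: the 311 alert-required scenarios are treated as positives and the remaining 181 (492 minus 311) as negatives, which the abstract does not state explicitly. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # alert raised, alert required
    fp: int  # alert raised, no alert required
    tn: int  # no alert, no alert required
    fn: int  # no alert, alert required


def _ratio(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(c):
    """Standard binary-classification metrics from a confusion matrix."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }


def always_alert(n_positive, n_negative):
    """Confusion matrix of a classifier that alerts on every scenario."""
    return Confusion(tp=n_positive, fp=n_negative, tn=0, fn=0)


# Assumed class split: 311 alert-required positives, 492 - 311 = 181 negatives.
baseline = binary_metrics(always_alert(311, 181))

if __name__ == "__main__":
    for name, value in baseline.items():
        print(f"{name}: {value:.1%}")
```

Under the assumed split, the always-alert baseline has 100% sensitivity, 0% specificity, about 63.2% accuracy, and about 77.5% F1. This illustrates why a high F1 or sensitivity alone says little about whether a model discriminates safe from unsafe orders.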

Discussion and Conclusion

PsiBench and the three-tier framework operationalize a rigorous evaluation rubric for LLM medication safety, grounded in two decades of national hospital audit experience. The framework generalizes to any binary medication-safety classifier (rule-based, conventional ML, or LLM-driven), supporting tier-aware model selection and post-deployment surveillance.

Version published to 10.64898/2026.06.05.26354271 on medRxiv
Jun 10, 2026

Related articles

Performance evaluation and benchmarking across 16 large language models on a comprehensive real-world emergency department triage data set

This article has 10 authors:
1. Leo Benning
2. Anja Hirsch
3. Matthias Groeschel
4. Tobias Roeschl
5. Martin Spott
6. Felix Patricius Hans
7. Tim Urban
8. Hans-Joerg Busch
9. Alexander Meyer
10. Julian Madrid
This article has no evaluations
Latest version Jun 5, 2026

Audited large language model triage for systematic review screening in national clinical guideline production: validation and prospective deployment

This article has 9 authors:
1. Petter Fagerberg
2. Oscar Sallander
3. Kim Vikhe Patil
4. Charlotta Thunborg
5. Lina Lundström
6. Anders Berg
7. Anastasia Nyman
8. Natalia Borg
9. Thomas Lindén
This article has no evaluations
Latest version Jun 3, 2026

MedMisBench: Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

This article has 22 authors:
1. Hongjian Zhou
2. Xinyu Zou
3. Jinge Wu
4. Sean Wu
5. Junchi Yu
6. Bradley Max Segal
7. Tobias Erich Niebuhr
8. Sara Amro
9. Michael Petrus
10. Sheikh Momin
11. Alexandra Cardoso Pinto
12. Rachel Niesen
13. Laura Sophie Wegner
14. Dhruv Darji
15. Jung Moses Koo
16. Joshua Fieggen
17. Kapil Narain
18. Mingde Zeng
19. Lei Clifton
20. Linda Shapiro
21. Fenglin Liu
22. David A. Clifton
This article has no evaluations
Latest version May 28, 2026
