Large Language Model Recommendations for Empiric Antibiotics Versus Clinician Prescribing: A Non-Interventional Paired Retrospective Antimicrobial Stewardship Analysis

Abstract

Background/Objectives: Antimicrobial resistance (AMR) remains a major global health threat, strengthening the case for antimicrobial stewardship that limits unnecessary broad-spectrum empiric therapy while preserving timely coverage in severe infection. Large language models (LLMs) are being explored for decision support but require rigorous offline evaluation before any clinical implementation. Methods: Single-center retrospective paired evaluation at the Clinical Emergency Hospital of Bucharest (Internal Medicine, 2020–2024). The unit of analysis was the admission (N = 493), with paired 24 h empiric regimens (clinician-prescribed vs post hoc LLM-recommended via the OpenAI API; not visible to clinicians; no influence on care). Local laboratory-derived epidemiology was precomputed from microbiology exports and provided as structured prompt context to approximate information parity with clinicians’ implicit knowledge of local ecology. Primary (prespecified) endpoint: any contextual guardrail violation (unjustified carbapenem, antipseudomonal, or anti-MRSA use under prespecified structured severity/MDR-risk rules), assessed with the exact McNemar test. Key secondary (prespecified) endpoint: Δ contextual guardrail penalty (LLM − Clin), assessed with the sign test and the Wilcoxon signed-rank test (ties reported). Ethics committee approval was obtained. Results: Guardrail violations occurred in 17.0% of clinician regimens vs 4.9% of LLM regimens (paired RD −12.2 percentage points; matched OR 0.216, 95% CI 0.127–0.367; McNemar exact p = 1.60 × 10⁻¹⁰). The Δ penalty had a median of 0 with 398/493 ties; among non-ties, improvements (Δ < 0) exceeded adverse shifts (79 vs 16; sign-test p = 3.47 × 10⁻¹¹). Conclusions: In this offline, non-interventional paired evaluation, LLM regimens were associated with fewer prespecified contextual guardrail violations than clinician empiric regimens under a rule-based stewardship benchmarking framework.
These endpoints strictly quantify concordance with stewardship constraints rather than patient outcomes, necessitating cautious interpretation of secondary and subset analyses. Ultimately, reproducible guardrail-based benchmarking may support subsequent prospective, safety-governed evaluations.
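The exact McNemar test used for the primary endpoint depends only on the discordant pairs (admissions where exactly one of the two regimens violated a guardrail); concordant pairs drop out. A minimal stdlib-only sketch of the two-sided exact computation is below. The discordant cell counts passed in the example are illustrative assumptions, since the abstract reports only the aggregate rates, the matched OR, and the p-value, not the 2×2 cells.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant-pair counts.

    b: pairs where only the clinician regimen violated a guardrail
    c: pairs where only the LLM regimen violated a guardrail
    Under H0 the discordant pairs split 50/50, so the p-value is a
    two-sided exact binomial tail at p = 0.5.
    """
    n = b + c
    tail = sum(comb(n, i) for i in range(min(b, c) + 1))
    return min(2 * tail / 2 ** n, 1.0)

# Illustrative (hypothetical) discordant counts, not the paper's cells:
print(mcnemar_exact(77, 17))
```

The sign test reported for the secondary Δ-penalty endpoint uses the same exact-binomial machinery, applied to the 95 non-tied pairs (79 improvements vs 16 adverse shifts) after discarding the 398 ties.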
