Large Language Model Recommendations for Empiric Antibiotics Versus Clinician Prescribing: A Non-Interventional Paired Retrospective Antimicrobial Stewardship Analysis

Abstract

Background/Objectives: Antimicrobial resistance (AMR) remains a major global health threat, strengthening the case for antimicrobial stewardship that limits unnecessary broad-spectrum empiric therapy while preserving timely coverage in severe infection. Large language models (LLMs) are being explored for decision support but require rigorous offline evaluation before any clinical implementation. Methods: Single-center retrospective paired evaluation at the Clinical Emergency Hospital of Bucharest (Internal Medicine, 2020–2024). The unit of analysis was the admission (N = 493), with paired 24 h empiric regimens (clinician-prescribed vs post hoc LLM-recommended via the OpenAI API; not visible to clinicians; no influence on care). Local laboratory-derived epidemiology was precomputed from microbiology exports and provided as structured prompt context to approximate information parity with clinicians’ implicit knowledge of local ecology. Primary (prespecified) endpoint: any contextual guardrail violation (unjustified carbapenem, antipseudomonal, or anti-MRSA use under prespecified structured severity/MDR-risk rules), assessed with the exact McNemar test. Key secondary (prespecified) endpoint: Δ contextual guardrail penalty (LLM − Clin), assessed with the sign test and the Wilcoxon signed-rank test (ties reported). Ethics committee approval was obtained. Results: Guardrail violations occurred in 17.0% of clinician regimens vs 4.9% of LLM regimens (paired RD −12.2 percentage points; matched OR 0.216, 95% CI 0.127–0.367; McNemar exact p = 1.60 × 10⁻¹⁰). The Δ penalty had a median of 0 with 398/493 ties; among non-ties, improvements (Δ < 0) exceeded adverse shifts (79 vs 16; sign-test p = 3.47 × 10⁻¹¹). Conclusions: In this offline, non-interventional paired evaluation, LLM regimens were associated with fewer prespecified contextual guardrail violations than clinician empiric regimens under a rule-based stewardship benchmarking framework.
These endpoints strictly quantify concordance with stewardship constraints rather than patient outcomes, necessitating cautious interpretation of secondary and subset analyses. Ultimately, reproducible guardrail-based benchmarking may support subsequent prospective, safety-governed evaluations.
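The exact McNemar test used for the primary endpoint depends only on the discordant pairs (admissions where exactly one of the two regimens violated a guardrail); concordant pairs drop out. A minimal stdlib-only sketch of the two-sided exact computation is below. The discordant cell counts passed in the example are illustrative assumptions, since the abstract reports only the aggregate rates, the matched OR, and the p-value, not the 2×2 cells.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant-pair counts.

    b: pairs where only the clinician regimen violated a guardrail
    c: pairs where only the LLM regimen violated a guardrail
    Under H0 the discordant pairs split 50/50, so the p-value is a
    two-sided exact binomial tail at p = 0.5.
    """
    n = b + c
    tail = sum(comb(n, i) for i in range(min(b, c) + 1))
    return min(2 * tail / 2 ** n, 1.0)

# Illustrative (hypothetical) discordant counts, not the paper's cells:
print(mcnemar_exact(77, 17))
```

The sign test reported for the secondary Δ-penalty endpoint uses the same exact-binomial machinery, applied to the 95 non-tied pairs (79 improvements vs 16 adverse shifts) after discarding the 398 ties.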
