From Clinical Judgment to Large Language Models: Benchmarking Predictive Approaches for Unplanned Hospital Admissions

Abstract

Background

While machine learning (ML) models show strong performance for predicting unplanned hospital visits, their clinical utility relative to physician judgment remains unclear. Large language models (LLMs) offer a promising middle ground, potentially combining algorithmic accuracy with human-interpretable reasoning.

Objective

To directly compare the predictive performance of physicians, structured ML models, and LLMs for forecasting 30-day emergency department (ED) visits and unplanned hospital admissions under equivalent data conditions.

Methods

We selected 404 cases from structured EHR data and converted them into synthetic clinical vignettes using GPT-5. Thirty-five physicians and eight LLMs evaluated these vignettes, while CLMBR-T (a machine learning model trained on structured EHR data) was applied to the original structured records. We compared discriminative performance (AUROC, AUPRC), calibration (Brier score, Expected Calibration Error), and confidence-performance relationships across all methods.
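For reference, the evaluation metrics named above can be computed as in the following minimal sketch. This is not the study's code; it assumes binary outcome labels (y_true) and predicted risk probabilities (y_prob), and the function and variable names are illustrative.

```python
# Minimal sketch (not the authors' implementation) of the reported metrics,
# assuming binary labels y_true and predicted risk probabilities y_prob in [0, 1].
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: gap between mean predicted risk and observed event rate per bin, weighted by bin size."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap
    return ece

def evaluate(y_true, y_prob):
    """Return the discrimination and calibration metrics compared in the study."""
    return {
        "AUROC": roc_auc_score(y_true, y_prob),
        "AUPRC": average_precision_score(y_true, y_prob),
        "Brier": brier_score_loss(y_true, y_prob),
        "ECE": expected_calibration_error(y_true, y_prob),
    }
```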

Results

CLMBR-T achieved the highest discriminative performance (AUROC 0.79, 95% CI: 0.75-0.83; AUPRC 0.78, 95% CI: 0.72-0.83), followed by large LLMs (DeepSeek V3, Claude 4.1 Opus, GPT-5; AUROC 0.74). Pooled physician judgment ranked lowest (AUROC 0.65, 95% CI: 0.59-0.70; AUPRC 0.61, 95% CI: 0.54-0.68). However, LLMs showed stronger alignment with physician reasoning (correlation r=0.51-0.65) compared to CLMBR-T (r=0.37). CLMBR-T demonstrated superior confidence calibration, with a significant confidence-performance correlation (r=0.21, p<0.001), while physicians showed poor calibration (r=0.07, p=0.17). Individual physician performance varied widely (AUROC 0.55-0.83), with three out of 35 physicians exceeding the ML benchmark.
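One plausible operationalization of the confidence-performance correlation quoted above (assumed here, not taken from the study) is the Pearson correlation between each rater's per-case confidence and whether that case's binary prediction was correct; a hedged sketch follows.

```python
# Illustrative sketch only: Pearson r between stated per-case confidence and
# per-case correctness of a binary prediction. Names and operationalization
# are assumptions, not the study's definition.
import numpy as np
from scipy.stats import pearsonr

def confidence_performance_correlation(y_true, y_pred, confidence):
    """Correlate rater confidence with whether each binary prediction was correct."""
    correct = (np.asarray(y_pred) == np.asarray(y_true)).astype(float)
    r, p = pearsonr(np.asarray(confidence, dtype=float), correct)
    return r, p
```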

Conclusions

ML models trained on structured EHR data outperform both physicians and LLMs in predictive accuracy and confidence calibration, though LLMs achieved competitive zero-shot performance and better approximated human clinical reasoning. These findings suggest that hybrid approaches combining high-performance ML screening with interpretable LLM explanations may optimize both accuracy and clinical adoption. The substantial variability in physician performance highlights the limitations of benchmarking against “average” clinical judgment.