From Clinical Judgment to Large Language Models: Benchmarking Predictive Approaches for Unplanned Hospital Admissions

Abstract

Background

While machine learning (ML) models show strong performance for predicting unplanned hospital visits, their clinical utility relative to physician judgment remains unclear. Large language models (LLMs) offer a promising middle ground, potentially combining algorithmic accuracy with human-interpretable reasoning.

Objective

To directly compare the predictive performance of physicians, structured ML models, and LLMs for forecasting 30-day emergency department (ED) visits and unplanned hospital admissions under equivalent data conditions.

Methods

We selected 404 cases from structured EHR data and converted them into synthetic clinical vignettes using GPT-5. Thirty-five physicians and eight LLMs evaluated these vignettes, while CLMBR-T (a machine learning model trained on structured EHR data) was applied to the original structured records. We compared discriminative performance (AUROC, AUPRC), calibration (Brier score, Expected Calibration Error), and confidence-performance relationships across all methods.
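For reference, the evaluation metrics named above can be computed as in the following minimal sketch. This is not the study's code; it assumes binary outcome labels (y_true) and predicted risk probabilities (y_prob), and the function and variable names are illustrative.

```python
# Minimal sketch (not the authors' implementation) of the reported metrics,
# assuming binary labels y_true and predicted risk probabilities y_prob in [0, 1].
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: gap between mean predicted risk and observed event rate per bin, weighted by bin size."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap
    return ece

def evaluate(y_true, y_prob):
    """Return the discrimination and calibration metrics compared in the study."""
    return {
        "AUROC": roc_auc_score(y_true, y_prob),
        "AUPRC": average_precision_score(y_true, y_prob),
        "Brier": brier_score_loss(y_true, y_prob),
        "ECE": expected_calibration_error(y_true, y_prob),
    }
```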

Results

CLMBR-T achieved the highest discriminative performance (AUROC 0.79, 95% CI: 0.75-0.83; AUPRC 0.78, 95% CI: 0.72-0.83), followed by large LLMs (DeepSeek V3, Claude 4.1 Opus, GPT-5; AUROC 0.74). Pooled physician judgment ranked lowest (AUROC 0.65, 95% CI: 0.59-0.70; AUPRC 0.61, 95% CI: 0.54-0.68). However, LLMs showed stronger alignment with physician reasoning (correlation r=0.51-0.65) compared to CLMBR-T (r=0.37). CLMBR-T demonstrated superior confidence calibration, with a significant confidence-performance correlation (r=0.21, p<0.001), while physicians showed poor calibration (r=0.07, p=0.17). Individual physician performance varied widely (AUROC 0.55-0.83), with three out of 35 physicians exceeding the ML benchmark.
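One plausible operationalization of the confidence-performance correlation quoted above (assumed here, not taken from the study) is the Pearson correlation between each rater's per-case confidence and whether that case's binary prediction was correct; a hedged sketch follows.

```python
# Illustrative sketch only: Pearson r between stated per-case confidence and
# per-case correctness of a binary prediction. Names and operationalization
# are assumptions, not the study's definition.
import numpy as np
from scipy.stats import pearsonr

def confidence_performance_correlation(y_true, y_pred, confidence):
    """Correlate rater confidence with whether each binary prediction was correct."""
    correct = (np.asarray(y_pred) == np.asarray(y_true)).astype(float)
    r, p = pearsonr(np.asarray(confidence, dtype=float), correct)
    return r, p
```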

Conclusions

ML models trained on structured EHR data outperform both physicians and LLMs in predictive accuracy and confidence calibration, though LLMs achieved competitive zero-shot performance and better approximated human clinical reasoning. These findings suggest that hybrid approaches combining high-performance ML screening with interpretable LLM explanations may optimize both accuracy and clinical adoption. The substantial variability in physician performance highlights the limitations of benchmarking against “average” clinical judgment.