Ranking-optimized survival models can underperform fixed-horizon clinical prediction

Abstract

Machine-learning survival models are increasingly proposed for intensive-care mortality prediction and are usually judged by the concordance index, a ranking metric averaged over follow-up. Yet many bedside decisions require a probability at a specific time, such as 60- or 180-day mortality. We asked whether ranking-optimized models perform competitively at fixed clinical horizons when compared with attending-physician judgment and the original 1995 SUPPORT logistic model. Reanalyzing the SUPPORT2 cohort (9,105 critically ill adults; five United States centers; 1989-1994) with a stratified 70/15/15 split, we compared a gradient-boosted survival model, the physician’s recorded prognostic estimate, and the 1995 model at 60 and 180 days, and tested several alternative learners. The survival model achieved a competitive ranking concordance (0.705) but underperformed both comparators at fixed horizons: at 60 days its area under the ROC curve was 0.750, versus 0.808 for physicians (on the matched sample) and 0.827 for the 1995 model, a gap reproduced across eight independent splits and statistically reliable after multiplicity correction. Discrimination was equitable across sex, race, and age. Post-hoc recalibration did not change discrimination, so the deficit is not miscalibration. Replacing the ranking objective with timepoint-matched binary training recovered roughly half the gap; neural networks, a deep ranking model, and two timepoint-aware discrete-time models did not close it, indicating an objective-horizon mismatch rather than limited model capacity. Leave-one-disease-out validation revealed severe generalization failure in disease groups absent from training. The physician advantage was conditional on a physician electing to give an estimate; many gave uninformative or no estimate. 
We recommend reporting timepoint-specific discrimination alongside the concordance index, training on timepoint-matched objectives when fixed-horizon predictions drive care, validating with leave-one-subgroup-out splits, and providing distribution-free prediction intervals to support selective deployment.
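
To make the distinction between the two kinds of metric concrete, the following is a minimal sketch in Python of a ranking concordance (Harrell's C) and a fixed-horizon AUC computed from the same risk scores. It is not the authors' pipeline: the function names, the toy data, and the handling of censoring (patients censored before the horizon are simply dropped, with no inverse-probability weighting) are all assumptions made for illustration.

```python
import numpy as np


def harrell_c(time, event, risk):
    """Harrell's concordance: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when i had an observed event and j was
    still under follow-up after i's event time. Ties in risk count 0.5.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue
        for j in range(len(time)):
            if time[j] > time[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


def fixed_horizon_auc(time, event, risk, horizon):
    """AUC for 'died by `horizon`' using the same risk scores.

    Simplifying assumption: patients censored at or before the horizon
    have unknown status and are excluded rather than reweighted.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    died = event & (time <= horizon)       # positives at this horizon
    known = died | (time > horizon)        # status at the horizon is observed
    pos = risk[known & died]
    neg = risk[known & ~died]
    diff = pos[:, None] - neg[None, :]     # every positive vs every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


# Hypothetical toy cohort: follow-up in days, event flag, model risk score.
toy_time = [20, 40, 90, 150, 200, 250]
toy_event = [1, 1, 1, 1, 0, 0]
toy_risk = [0.5, 0.9, 0.8, 0.7, 0.2, 0.1]

c_index = harrell_c(toy_time, toy_event, toy_risk)
auc_60 = fixed_horizon_auc(toy_time, toy_event, toy_risk, horizon=60)
auc_180 = fixed_horizon_auc(toy_time, toy_event, toy_risk, horizon=180)
print(f"C-index={c_index:.3f}  AUC@60d={auc_60:.3f}  AUC@180d={auc_180:.3f}")
```

On this toy cohort the three numbers differ even though the risk scores are identical, which is the sense in which a single concordance value can mask horizon-specific performance. A production analysis would more likely use a censoring-adjusted time-dependent AUC rather than the simple exclusion rule assumed here.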
