Ranking-optimized survival models can underperform fixed-horizon clinical prediction
Abstract
Machine-learning survival models are increasingly proposed for intensive-care mortality prediction and are usually judged by the concordance index, a ranking metric averaged over follow-up. Yet many bedside decisions require a probability at a specific time, such as 60- or 180-day mortality. We asked whether ranking-optimized models perform competitively at fixed clinical horizons when compared with attending-physician judgment and the original 1995 SUPPORT logistic model. Reanalyzing the SUPPORT2 cohort (9,105 critically ill adults; five United States centers; 1989-1994) with a stratified 70/15/15 split, we compared a gradient-boosted survival model, the physician’s recorded prognostic estimate, and the 1995 model at 60 and 180 days, and tested several alternative learners. The survival model achieved a competitive ranking concordance (0.705) but underperformed both comparators at fixed horizons: at 60 days its area under the ROC curve was 0.750, versus 0.808 for physicians (on the matched sample) and 0.827 for the 1995 model, a gap reproduced across eight independent splits and statistically reliable after multiplicity correction. Discrimination was equitable across sex, race, and age. Post-hoc recalibration did not change discrimination, so the deficit is not miscalibration. Replacing the ranking objective with timepoint-matched binary training recovered roughly half the gap; neural networks, a deep ranking model, and two timepoint-aware discrete-time models did not close it, indicating an objective-horizon mismatch rather than limited model capacity. Leave-one-disease-out validation revealed severe generalization failure in disease groups absent from training. The physician advantage was conditional on a physician electing to give an estimate; many gave uninformative or no estimate. 
We recommend reporting timepoint-specific discrimination alongside the concordance index, timepoint-matched training when fixed-horizon predictions drive care, leave-one-subgroup validation, and distribution-free prediction intervals to support selective deployment.
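The difference between the concordance index and timepoint-specific discrimination can be made concrete with a minimal sketch. It is not the authors' pipeline: the function names (`harrell_c`, `horizon_labels`, `auc`) and the six-patient toy data are hypothetical. One simplifying assumption is that patients censored before the horizon are dropped from the fixed-horizon evaluation; inverse-probability-of-censoring weighting is a common alternative that the sketch does not implement.

```python
import numpy as np

def harrell_c(time, event, risk):
    """Harrell's concordance index: share of comparable pairs ranked correctly."""
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a pair is comparable only if the earlier time is an observed death
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable

def horizon_labels(time, event, horizon):
    """Binary death-by-horizon labels plus a mask of patients whose status is known.

    Assumption: patients censored before the horizon are excluded.
    """
    died = (time <= horizon) & event
    known = died | (time > horizon)
    return died[known].astype(int), known

def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney pairwise comparison."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

# Hypothetical toy cohort: follow-up in days, death indicator, model risk score.
time = np.array([10, 30, 50, 90, 200, 400])
event = np.array([1, 0, 1, 1, 1, 0], dtype=bool)
risk = np.array([0.9, 0.2, 0.4, 0.8, 0.7, 0.1])

c_index = harrell_c(time, event, risk)
labels_60, known_60 = horizon_labels(time, event, horizon=60)
auc_60 = auc(labels_60, risk[known_60])

print(f"concordance index: {c_index:.3f}")
print(f"60-day AUC:        {auc_60:.3f}")
```

On this toy cohort the same risk scores rank well over the whole follow-up yet separate 60-day deaths from 60-day survivors less well, which is the kind of discrepancy that reporting only the concordance index would hide.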