Ranking-optimized survival models can underperform fixed-horizon clinical prediction

Truong Quynh Hoa
Hoang Dinh Cuong
Luu Duc Trung

Abstract

Machine-learning survival models are increasingly proposed for intensive-care mortality prediction and are usually judged by the concordance index, a ranking metric averaged over follow-up. Yet many bedside decisions require a probability at a specific time, such as 60- or 180-day mortality. We asked whether ranking-optimized models perform competitively at fixed clinical horizons when compared with attending-physician judgment and the original 1995 SUPPORT logistic model. Reanalyzing the SUPPORT2 cohort (9,105 critically ill adults; five United States centers; 1989-1994) with a stratified 70/15/15 split, we compared a gradient-boosted survival model, the physician’s recorded prognostic estimate, and the 1995 model at 60 and 180 days, and tested several alternative learners. The survival model achieved a competitive ranking concordance (0.705) but underperformed both comparators at fixed horizons: at 60 days its area under the ROC curve was 0.750, versus 0.808 for physicians (on the matched sample) and 0.827 for the 1995 model, a gap reproduced across eight independent splits and statistically reliable after multiplicity correction. Discrimination was equitable across sex, race, and age. Post-hoc recalibration did not change discrimination, so the deficit is not miscalibration. Replacing the ranking objective with timepoint-matched binary training recovered roughly half the gap; neural networks, a deep ranking model, and two timepoint-aware discrete-time models did not close it, indicating an objective-horizon mismatch rather than limited model capacity. Leave-one-disease-out validation revealed severe generalization failure in disease groups absent from training. The physician advantage was conditional on a physician electing to give an estimate; many gave uninformative or no estimate. 
We recommend reporting timepoint-specific discrimination alongside the concordance index, timepoint-matched training when fixed-horizon predictions drive care, leave-one-subgroup validation, and distribution-free prediction intervals to support selective deployment.
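
The distinction the abstract draws between ranking concordance and fixed-horizon discrimination can be illustrated with a small sketch. The code below is not the authors' pipeline. It is a minimal pure-Python illustration on invented toy data, and the function names (`horizon_labels`, `auc`, `harrell_c`) are hypothetical. It assumes that patients censored before the horizon are simply excluded from the fixed-horizon AUC, which is a simplification; a fuller analysis might weight them by inverse probability of censoring.

```python
import math


def horizon_labels(times, events, horizon):
    """Binary outcome at a fixed horizon (for example 60 days).

    1    -> death observed at or before the horizon
    0    -> known to be alive past the horizon
    None -> censored before the horizon (status unknown; excluded below)
    """
    labels = []
    for t, e in zip(times, events):
        if e == 1 and t <= horizon:
            labels.append(1)
        elif t > horizon:
            labels.append(0)
        else:
            labels.append(None)
    return labels


def auc(scores, labels):
    """ROC AUC by pairwise comparison; ties count 0.5; None labels are skipped."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def harrell_c(scores, times, events):
    """Simplified Harrell concordance: a higher score should mean earlier death.

    A pair (i, j) is comparable when i died and j was followed strictly longer.
    Pairs with tied times are ignored in this sketch.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if events[i] != 1:
            continue
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                concordant += (scores[i] > scores[j]) + 0.5 * (scores[i] == scores[j])
    return concordant / comparable


# Invented toy cohort: follow-up in days, event flag (1 = died), model risk score.
times = [10, 30, 45, 90, 200, 400, 50, 300]
events = [1, 1, 1, 1, 1, 0, 0, 0]
risk = [0.9, 0.8, 0.3, 0.5, 0.4, 0.1, 0.6, 0.2]

labels_60 = horizon_labels(times, events, horizon=60)
print("C-index:", harrell_c(risk, times, events))
print("60-day AUC:", auc(risk, labels_60))

# A strictly increasing recalibration map (here a logistic curve with made-up
# coefficients) changes the predicted probabilities but not their ordering,
# so the AUC is unchanged.
recalibrated = [1 / (1 + math.exp(-(4 * r - 2))) for r in risk]
print("60-day AUC after recalibration:", auc(recalibrated, labels_60))
```

The two metrics are computed over different sets of patient pairs, so a model can rank well over the whole follow-up and still separate 60-day deaths from 60-day survivors less well. The last lines show why recalibration alone cannot repair a discrimination gap: any strictly monotone transform preserves the ranking on which AUC depends.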

Version published to 10.64898/2026.06.13.26355565 on medRxiv
Jun 16, 2026

Related articles

Early-Horizon Multimodal ICU Mortality Prediction Without Retraining

This article has 3 authors:
1. Alexander Bakumenko
2. D. Hudson Smith
3. Janine Hoelscher
This article has no evaluations. Latest version: May 21, 2026

Calibrated and Interpretable Machine Learning for ICU Mortality Prediction Using First 24-Hour Clinical Data

This article has 3 authors:
1. Abdallah Alsammani
2. Merasia Johnson
3. Jessica Elrefaei
This article has no evaluations. Latest version: Jun 2, 2026

Data-driven Prediction of Fifteen-Year All-Cause Mortality among 2.3 Million Individuals in the VA

This article has 14 authors:
1. Sayera Dhaubhadel
2. Judith D. Cohn
3. Tanmoy Bhattacharya
4. Ruy M. Ribeiro
5. Kumkum Ganguly
6. Nicolas Hengartner
7. Janet P. Tate
8. Lauren Costa
9. Yuk-Lam Ho
10. Kelly Cho
11. Jean C. Beckham
12. Nathan A. Kimbrel
13. Amy C. Justice
14. Benjamin H. McMahon
This article has no evaluations. Latest version: Jul 9, 2026
