The impact of evaluation strategy on sepsis prediction model performance metrics in intensive care data


Abstract

The prediction of the onset of sepsis, a life-threatening condition resulting from a dysregulated host response to infection, is one of the most common prediction tasks in intensive care-related machine learning research. To assess the performance of such models, different evaluation strategies (fixed horizon, peak score and continuous evaluation) are commonly employed, but there is no clear consensus on which approach yields clinically meaningful performance estimates.
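The three strategies differ in which per-timestep risk scores enter the metric. A minimal sketch of the distinction, using hypothetical hourly score arrays and deliberately simplified label and control-matching logic (not the study's implementation), might look like:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical per-stay hourly risk scores; onset_hour is None for controls.
stays = [
    {"scores": np.array([0.1, 0.2, 0.7, 0.9]), "onset_hour": 3},        # septic
    {"scores": np.array([0.2, 0.1, 0.3]), "onset_hour": None},          # control
    {"scores": np.array([0.4, 0.6, 0.8, 0.5, 0.2]), "onset_hour": 2},   # septic
    {"scores": np.array([0.1, 0.1, 0.2, 0.4]), "onset_hour": None},     # control
]
HORIZON = 2  # hours ahead (the study uses horizons up to 6 h)

def continuous_eval(stays, horizon):
    """Every timestep is a sample; positive if onset falls within the horizon."""
    y, s = [], []
    for st in stays:
        onset = st["onset_hour"]
        for t, score in enumerate(st["scores"]):
            y.append(int(onset is not None and t < onset <= t + horizon))
            s.append(score)
    return roc_auc_score(y, s)

def fixed_horizon_eval(stays, horizon):
    """One sample per stay: the score exactly `horizon` hours before onset
    (controls here naively use hour 0; real studies match timepoints)."""
    y, s = [], []
    for st in stays:
        onset = st["onset_hour"]
        t = (onset - horizon) if onset is not None else 0
        if 0 <= t < len(st["scores"]):
            y.append(int(onset is not None))
            s.append(st["scores"][t])
    return roc_auc_score(y, s)

def peak_score_eval(stays):
    """One sample per stay: the maximum score over the entire stay."""
    y = [int(st["onset_hour"] is not None) for st in stays]
    s = [st["scores"].max() for st in stays]
    return roc_auc_score(y, s)
```

Each strategy feeds a different set of (label, score) pairs into the same AUROC computation, which is why the resulting metrics can diverge on identical model outputs.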

Objective

To assess different evaluation approaches for sepsis prediction models that were trained on a public intensive care dataset and applied to German intensive care data.

Methods

In this retrospective, observational cohort study, we assessed the efficacy of machine learning models, pre-trained on the MIMIC-IV dataset, when applied to BerlinICU, a multi-site German intensive care dataset. To understand the real-world impact of implementing these models, we examined the performance variability across various evaluation strategies.

Results

The BerlinICU dataset includes 40,132 intensive care admissions spanning 10 years (2012–2021). Using the latest Sepsis-3 definition, we identified 4,134 septic admissions (prevalence 10.3%). Application of a temporal convolutional network model to BerlinICU yielded an area under the receiver operating characteristic curve (AUROC) of 0.67 (95% CI: 0.66–0.68) for continuous evaluation with a 6-hour prediction horizon, compared to 0.84 (95% CI: 0.83–0.85) on the test set of MIMIC-IV. On BerlinICU, peak score evaluation showed a similar AUROC compared to continuous evaluation, while fixed horizon evaluation showed a reduced AUROC of 0.61 (95% CI: 0.60–0.62). Onset matching had minimal impact on performance estimates using continuous evaluation or fixed horizon evaluation but increased estimates for peak score evaluation. Performance metrics improved with shorter prediction horizons across all strategies.
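Confidence intervals like those reported above are commonly obtained by bootstrap resampling; the study does not state its exact method, so the following is only an illustrative sketch of the percentile-bootstrap approach on synthetic data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, seed=0):
    """95% percentile-bootstrap CI for the AUROC.

    Resamples (label, score) pairs with replacement and keeps only draws
    that contain both classes, for which AUROC is defined.
    """
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aurocs = []
    while len(aurocs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() != y_true[idx].max():
            aurocs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aurocs, [2.5, 97.5])

# Synthetic example: scores that mildly separate the two classes.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)
s = y * 0.5 + rng.normal(size=500)
lo, hi = bootstrap_auroc_ci(y, s)
```

For continuous evaluation, resampling is often done per admission rather than per timestep to respect the correlation of scores within a stay.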

Conclusion

Our results demonstrate that the choice of evaluation strategy has a substantial impact on the performance metrics of intensive care prediction models. The same model applied to the same dataset yields markedly different performance metrics depending on the evaluation approach. Therefore, careful selection of the evaluation approach is essential to ensure that the interpretation of performance metrics aligns with clinical intentions and enables meaningful comparisons between studies. In our view, the continuous evaluation approach best reflects the continual monitoring of patients that is performed in real-world clinical practice. In contrast, fixed horizon and peak score evaluation approaches may produce skewed results when the length-of-stay distributions of sepsis cases and controls are not properly matched. Especially for peak score evaluation, longer stays tend to produce higher maximum scores because sampling from more values increases the likelihood of capturing higher values purely by chance.
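The length-of-stay effect on peak scores is a simple extreme-value phenomenon and can be demonstrated with a short simulation in which scores are pure noise, unrelated to any outcome; the only difference between the groups is how many timesteps are scored:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hourly risk scores drawn i.i.d. from Uniform(0, 1), independent of any
# outcome; one group has 24 scored hours per stay, the other 240.
short_peaks = [rng.uniform(size=24).max() for _ in range(1000)]   # 1-day stays
long_peaks = [rng.uniform(size=240).max() for _ in range(1000)]   # 10-day stays

# Longer stays yield systematically higher peak scores by chance alone
# (E[max of n uniforms] = n / (n + 1)), so peak score evaluation favours
# whichever class tends to stay longer.
print(f"mean peak, 24 h stays:  {np.mean(short_peaks):.3f}")
print(f"mean peak, 240 h stays: {np.mean(long_peaks):.3f}")
```

If septic admissions are systematically longer than controls, this inflates peak-score AUROC even for an uninformative model, which is why matched length-of-stay distributions matter for that strategy.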