“Complex models, marginal benefits--a multi-centre development and validation study of early warning scores across 2·16 million patient admissions addressing intercurrent medical interventions”

Alexandros Katsiferis
Neil Scheidwasser
Tri-Long Nguyen
Theis Lange
Mark P Khurana
Pernille B Nielsen
Kasper Karmark Iversen
Christian S Meyhoff
Eske Kvanner Aasvang
Jesper Mølgaard
Adrian G Zucco
Tibor V Varga
Samir Bhatt

Abstract

Background

The National Early Warning Score (NEWS) is a nationally recommended, clinically implemented system used to prevent patient deterioration. While numerous studies have compared predictive models for clinical deterioration, their potential clinical utility has not been established in large-scale evaluations. Here, we compared the clinical net benefit of NEWS against simplified scoring rules and modern machine learning to determine whether simpler approaches are sufficient or whether complex models provide meaningful advantages in clinical practice.

Methods

We included fifteen Danish hospitals with over 2·16 million patient admissions representing 829 610 unique patients over five years (2018 to 2023). We compared NEWS against both simpler and more complex approaches for predicting 24-hour mortality: NEWS-Light (NEWS without blood pressure and temperature), DEWS (NEWS-Light with age and sex; DEWS denotes the Demographic Early Warning Score), and a model based on eXtreme Gradient Boosting (XGB-EWS) incorporating vital signs, demographics, laboratory markers, plus medical history embeddings extracted using sentence transformers. We used propensity score weighting to mitigate intervention bias and evaluated performance using Area Under the Receiver Operating Characteristic Curve (AUC), calibration, and net benefit.
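To make the relationship between NEWS and NEWS-Light concrete, here is a minimal sketch of both scores. The abstract does not list scoring thresholds, so the bands below are an assumption taken from the published Royal College of Physicians NEWS2 chart (SpO2 scale 1); the function and table names are hypothetical. Vital signs are assumed to be recorded at chart precision (whole numbers, temperature to one decimal). DEWS is not sketched because the abstract does not say how age and sex enter the score.

```python
# Illustrative only: bands follow the published NEWS2 chart (SpO2 scale 1).
# This is an assumption; the preprint abstract does not list its thresholds.
INF = float("inf")

# Each table is a list of (inclusive upper bound, points) in ascending order.
RESP_RATE = [(8, 3), (11, 1), (20, 0), (24, 2), (INF, 3)]            # breaths/min
SPO2 = [(91, 3), (93, 2), (95, 1), (INF, 0)]                          # percent
PULSE = [(40, 3), (50, 1), (90, 0), (110, 1), (130, 2), (INF, 3)]     # beats/min
SYSTOLIC_BP = [(90, 3), (100, 2), (110, 1), (219, 0), (INF, 3)]       # mmHg
TEMPERATURE = [(35.0, 3), (36.0, 1), (38.0, 0), (39.0, 1), (INF, 2)]  # deg C


def band_points(value, bands):
    """Return the points of the first band whose upper bound covers value."""
    for upper, points in bands:
        if value <= upper:
            return points
    raise ValueError(f"cannot score value {value!r}")  # e.g. NaN


def news_light(resp_rate, spo2, on_oxygen, pulse, alert):
    """NEWS without the blood pressure and temperature components."""
    return (
        band_points(resp_rate, RESP_RATE)
        + band_points(spo2, SPO2)
        + (2 if on_oxygen else 0)   # supplemental oxygen
        + band_points(pulse, PULSE)
        + (0 if alert else 3)       # any reduced level of consciousness
    )


def news(resp_rate, spo2, on_oxygen, pulse, alert, systolic_bp, temperature):
    """Full NEWS: NEWS-Light plus blood pressure and temperature."""
    return (
        news_light(resp_rate, spo2, on_oxygen, pulse, alert)
        + band_points(systolic_bp, SYSTOLIC_BP)
        + band_points(temperature, TEMPERATURE)
    )


# Hypothetical patient: NEWS-Light drops the two measurements at the end.
light = news_light(resp_rate=22, spo2=94, on_oxygen=False, pulse=115, alert=True)
full = news(22, 94, False, 115, True, systolic_bp=95, temperature=38.5)
print(light, full)  # 5 8
```

Writing the full score as NEWS-Light plus two extra terms mirrors the definition in the text: the simplified score needs no blood pressure cuff or thermometer reading.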

Findings

XGB-EWS achieved the highest discrimination (AUC 0·932, 95% Confidence Interval [0·929-0·936]), followed by DEWS (0·908 [0·904-0·912]), NEWS (0·902 [0·898-0·906]), and NEWS-Light (0·879 [0·873-0·885]). Decision curve analysis showed maximum net benefit differences of 1·8 additional correct mortality identifications per 10 000 patients between XGB-EWS and NEWS, and 1·7 per 10 000 between NEWS and NEWS-Light, across the evaluated risk thresholds.
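The "per 10 000 patients" figures follow from the standard decision-curve definition of net benefit, NB = TP/n - (FP/n) × pt/(1 - pt), where pt is the risk threshold. The sketch below is a generic implementation of that formula, not the authors' code. The optional `weights` argument is an assumed way of plugging in propensity score weights, and the numbers in the usage lines are toy values, not study data.

```python
def net_benefit(y_true, risk, threshold, weights=None):
    """Net benefit of flagging patients whose predicted risk >= threshold.

    y_true  : 0/1 outcomes (1 = died within 24 hours)
    risk    : predicted probabilities in [0, 1]
    weights : optional per-patient weights (assumption: e.g. propensity weights)
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(y_true)
    total = sum(weights)
    # Weighted counts of flagged patients who did / did not have the outcome.
    tp = sum(w for y, r, w in zip(y_true, risk, weights) if r >= threshold and y == 1)
    fp = sum(w for y, r, w in zip(y_true, risk, weights) if r >= threshold and y == 0)
    # False positives are discounted by the odds of the threshold.
    return tp / total - (fp / total) * threshold / (1.0 - threshold)


def extra_true_positives_per_10k(nb_model, nb_reference):
    """Net benefit difference expressed as net true positives per 10 000 patients."""
    return (nb_model - nb_reference) * 10_000


# Toy data (not from the study): four patients, one death.
outcomes = [1, 0, 0, 0]
risks = [0.9, 0.6, 0.1, 0.2]
print(net_benefit(outcomes, risks, threshold=0.5))
```

A net benefit difference of 0·00018 between two models at the same threshold corresponds to 1·8 additional net true positives per 10 000 patients, which is the scale on which the findings are reported.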

Interpretation

Machine learning approaches provided marginal clinical utility improvements over traditional scoring systems, with NEWS-Light showing small performance decrements compared to full NEWS. The clinical significance of these differences must be weighed against workflow optimization benefits, suggesting healthcare systems should evaluate trade-offs between predictive performance and operational efficiency when selecting early warning approaches.

Funding

Novo Nordisk Foundation

Research in context

Evidence before this study

Early warning score systems have evolved from single vital sign monitoring to standardized multivariable scores such as the National Early Warning Score (NEWS), to even more sophisticated machine learning and deep learning frameworks utilizing electronic health records data. All these early warning scores are primarily designed to guide clinical decision-making by helping identify patients at risk of clinical deterioration within hospitals. Despite these advances, validation studies predominantly focus on statistical metrics measuring discrimination performance rather than meaningful clinical utility. On Aug 12, 2025, we searched PubMed for English-language studies published in the last 10 years, using terms including “early warning score”, “NEWS”, “NEWS2”, “machine learning”, “artificial intelligence”, “decision curve analysis”, “net benefit”, “clinical utility”, and “24-hour mortality.” While most of the published models, including the more sophisticated machine learning ones, have demonstrated better discrimination compared to traditional early warning scores, we found only one study that combined early warning score validation with clinical utility analysis for short-term clinical deterioration. Additionally, no studies have evaluated clinical utility across multiple patient subgroups while using causal inference methods to address intervention bias in a large-scale healthcare system validation.

Added value of this study

The current study represents the largest multi-center early warning score validation to date, encompassing 2·16 million patient encounters across 15 Danish hospitals and nine distinct clinical specialties over a period of five years. We used the predictimand framework with causal inference methods to address intervention bias, a critical and novel methodological advance in the assessment of early warning scores. Unlike previous research that focused primarily on discrimination metrics, we evaluated clinical utility using decision curve analysis, providing evidence that sophisticated machine learning early warning score approaches delivered only marginal overall improvements in clinical utility (up to 1·7 additional correctly identified mortality cases per 10 000 patients) despite better discrimination. We find that a simplified version of NEWS (NEWS without blood pressure and temperature components) achieves a clinical net benefit comparable to that of full NEWS. Additionally, we provide the first quantitative assessment of healthcare resource implications in this area of research, showing that simplified approaches could potentially redirect 98·1 full-time equivalent positions annually from routine vital sign collection to direct patient care, providing evidence for healthcare administrators to reallocate clinical resources toward patient interaction and care delivery.

Implications of all the available evidence

We demonstrated that the models differed by fewer than 2 correctly identified 24-hour mortality cases per 10 000 patients. Because in clinical practice only a proportion of these identifications translates into lives saved, the real-world clinical advantage is likely even more modest. These modest gains of machine learning approaches, coupled with the marginal performance decrements of NEWS-Light compared to the full NEWS, challenge the prevailing emphasis on algorithmic complexity over clinical value. NEWS-Light has the potential to enable workflow optimization, freeing approximately 3 minutes per patient encounter without compromising clinical effectiveness. Future research should include prospective cluster-randomized controlled trials comparing NEWS with NEWS-Light implementation, while the broader prediction modeling research community should adopt utility-based evaluation frameworks to ensure that algorithmic advances translate into tangible improvements in patient care rather than statistical superiority alone.

Version published to 10.1101/2025.10.12.25337794 on medRxiv
Oct 14, 2025
