Case-Control Matching Erodes Feature Discriminability for AI-driven Sepsis Prediction in ICUs: A Retrospective Cohort Study

Abstract

Background

Sepsis remains a leading cause of intensive care unit (ICU) mortality worldwide, and early detection is essential for improving survival through timely interventions. While machine learning holds promise for early sepsis detection by leveraging the multimodal time series data prevalent in ICUs, the field faces significant methodological challenges—such as class imbalance and temporal misalignment. These issues have spurred growing interest in case-control matching strategies. However, commonly used case-control matching methods have not been systematically evaluated for how they affect model performance and what biases they may introduce, a critical gap in current knowledge.

Methods

We investigated how matched training data affect both the performance of various machine learning architectures and the predictive importance of individual clinical features. Three harmonized large-scale ICU cohorts were used: the high time-resolution ICU dataset (HiRID) from Bern University Hospital, Switzerland (29,698 stays, 2008–2019), with a sepsis prevalence of 6.3%; the Medical Information Mart for Intensive Care (MIMIC-IV) database from Beth Israel Deaconess Medical Center, USA (63,425 stays, 2008–2019), with a sepsis prevalence of 5.2%; and the eICU Collaborative Research Database from 208 US hospitals (123,413 stays, 2014–2015), with a sepsis prevalence of 4.6%. Each dataset included hourly observations. We applied absolute-onset case-control matching to the training data at case-to-control ratios ranging from 1:2 to 1:10, both with and without incorporating demographic variables, and compared the results to unmatched and undersampled cohorts with equivalent ratios. To evaluate how matching strategies influenced feature relevance, we applied a Linear Mixed Effects Model (LMEM) to assess changes in the predictive significance of individual features. Finally, we trained machine learning models—including random forests and gradient-boosted trees—and evaluated their performance on the original test sets using the AUROC and normalized AUPRC metrics.
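To make the matching procedure concrete, the sketch below shows one plausible implementation of absolute-onset case-control matching at a fixed case-to-control ratio, with optional demographic criteria. The column names (stay_id, is_case, onset_hour, los_hours, age, sex), the age tolerance, and the sampling details are illustrative assumptions, not the study's actual code.

```python
import numpy as np
import pandas as pd

def match_controls(stays: pd.DataFrame, ratio: int = 2,
                   demographic: bool = False, age_tol: float = 5.0,
                   seed: int = 0) -> pd.DataFrame:
    """Absolute-onset case-control matching (illustrative sketch).

    `stays` holds one row per ICU stay with assumed columns:
    stay_id, is_case, onset_hour (sepsis onset, cases only),
    los_hours (length of stay), age, sex.
    """
    rng = np.random.default_rng(seed)
    cases = stays[stays["is_case"]]
    pool = stays[~stays["is_case"]].copy()
    matched = [cases]
    for _, case in cases.iterrows():
        # A control must still be in the ICU at the case's onset hour,
        # so that a pseudo-onset at the same absolute time is valid.
        eligible = pool[pool["los_hours"] >= case["onset_hour"]]
        if demographic:  # optional demographic matching criteria
            eligible = eligible[(eligible["sex"] == case["sex"]) &
                                ((eligible["age"] - case["age"]).abs() <= age_tol)]
        n = min(ratio, len(eligible))
        idx = rng.choice(eligible.index, size=n, replace=False)
        # Assign the case's onset hour as the controls' pseudo-onset.
        matched.append(pool.loc[idx].assign(onset_hour=case["onset_hour"]))
        pool = pool.drop(idx)  # sample controls without replacement
    return pd.concat(matched, ignore_index=True)
```

Dropping the onset-eligibility and demographic filters and sampling controls uniformly at the same ratio would yield the undersampled comparator cohorts described above.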

Findings

The LMEM results demonstrated that case-control matching led to a substantial reduction in the number of features identified as significant, based on multiple testing–corrected p-values, across all three cohorts. Specifically, the number of significant features declined from 35–43 in the original datasets to 24–29 as the matching ratio increased from 1:10 to 1:2. In the machine learning experiments, models trained on undersampled data achieved strong performance, with AUROC values exceeding 0.90 and normalized AUPRC scores above 41. Models trained on the original (imbalanced) datasets likewise performed robustly (AUROC up to 0.82, normalized AUPRC up to 4.2). In contrast, those trained on matched datasets showed marked performance degradation, with AUROC values below 0.50 and normalized AUPRC scores falling below baseline. These patterns were consistent across all three cohorts and a wide range of evaluation setups.
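The abstract does not spell out how AUPRC was normalized. A common convention divides AUPRC by the positive-class prevalence, i.e. by the expected AUPRC of a random classifier, so that the chance baseline is 1.0 and "below baseline" means a ratio under 1; with hourly labels, the time-point prevalence is far below the per-stay 4.6–6.3%, which would also make large normalized scores attainable. Under that assumed definition, the reported metrics could be computed as follows:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(y_true, y_score):
    """AUROC plus prevalence-normalized AUPRC (assumed definition).

    A random classifier's AUPRC equals the positive-class prevalence,
    so dividing by it places the chance baseline at exactly 1.0.
    """
    y_true = np.asarray(y_true)
    prevalence = y_true.mean()
    return {
        "auroc": roc_auc_score(y_true, y_score),
        "norm_auprc": average_precision_score(y_true, y_score) / prevalence,
    }
```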

Interpretation

This multi-cohort analysis highlights a methodological paradox: while case-control matching can mitigate class imbalance and help disentangle sepsis-specific patterns from general ICU trajectories, overly strict matching criteria may significantly impair predictive performance. These findings underscore the need for more nuanced matching strategies that balance bias reduction with the preservation of critical clinical signals—ultimately enhancing the reliability and real-world applicability of sepsis prediction models.

Funding

This project was supported by grant #902 of the Strategic Focus Area “Personalized Health and Related Technologies (PHRT)” of the ETH Domain and by a Young Investigator Grant of the Novartis Foundation for Medical-Biological Research.

Research in Context

Evidence before this study

Case-control matching has long been used in epidemiological research to mitigate confounders and address class imbalance. In recent years, AI-driven sepsis prediction studies have adopted similar strategies to improve data set balance and reduce temporal bias. However, the direct impact of case-control matching on machine learning (ML) performance in sepsis prediction has not been systematically evaluated. Although prior work demonstrated the promise of early detection of sepsis using observational data without explicit matching strategies,1,2 studies that implemented stringent matching occasionally reported deteriorations in model performance.3,4 A recent review also advocated the use of case-control temporal alignment to avoid temporal biases.2 This discrepancy highlights a critical knowledge gap about how various matching protocols influence ML outcomes in sepsis, an inherently heterogeneous condition shaped by both clinical and temporal complexities.

Added value of this study

We conducted a comprehensive multi-cohort investigation of absolute-onset case-control matching (with and without demographic criteria) for sepsis prediction models, using three large, harmonized ICU cohorts (HiRID, MIMIC-IV, and eICU). By comparing matched, aligned, and undersampled training cohorts across multiple machine learning architectures and temporal window configurations, our experiments show how matching affects both overall model performance and the discriminability of individual features. We demonstrated that while case-control matching effectively reduces class imbalance, it can also diminish key sepsis-related signals and substantially reduce predictive accuracy below baseline. These insights shed light on the nuanced interaction between epidemiological design strategies and ML-based sepsis prediction in ICU settings.

Implications of all the available evidence

Although case-control matching is often viewed as a reliable method to minimize bias and improve data set balance, our findings indicate that it may inadvertently obscure clinically significant signals critical for accurate prediction of sepsis. In real-world clinical environments where timely and precise diagnosis is essential, researchers and clinicians should carefully consider the trade-offs imposed by strict matching protocols, especially when aligning data from large multicenter cohorts. To facilitate transparent model validation, we recommend using absolute-onset matching as a comparative benchmark to prevent artificially inflated performance metrics. Alternatively, temporal dependencies related to admission to the ICU or matched onset should be factored into the model evaluation to avoid trivial classification driven by inadequate handling of within-stay dynamics. Lastly, relying solely on AUROC and AUPRC as performance metrics risks overlooking other clinically relevant outcomes, emphasizing the need for broader evaluation frameworks that align more closely with real-world clinical decision making. This study encourages the research community to refine and reassess conventional matching methods, striving for clinically informed strategies that reduce confounding while retaining essential sepsis-specific signals for robust and generalizable predictive models.
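One way to factor within-stay temporal dependencies into model evaluation, as recommended above, is to split data by ICU stay rather than by time point, so that correlated observations from a single stay never appear on both sides of a split. A minimal sketch using scikit-learn's GroupKFold follows; the gradient-boosted model and variable names are illustrative assumptions, not the study's protocol.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

def grouped_cv_auroc(X, y, stay_ids, n_splits=5):
    """Cross-validate with whole ICU stays held out together, so that
    within-stay correlation cannot leak from training into testing."""
    scores = []
    for train_idx, test_idx in GroupKFold(n_splits).split(X, y, groups=stay_ids):
        model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
        y_prob = model.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], y_prob))
    return float(np.mean(scores))
```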
