Predictors of COVID-19 hospital outcomes: a machine learning analysis of the National COVID Cohort Collaborative
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (PREreview)
Abstract
Predicting hospital outcomes for patients with severe acute respiratory infections is critical for risk stratification and resource planning, yet heterogeneous electronic health record (EHR) data, class imbalance, and evolving clinical practice present persistent methodological challenges for machine learning (ML) approaches. We conducted a retrospective cohort study using EHR data harmonized to the OMOP common data model from the National COVID Cohort Collaborative (N3C; May 2020-June 2025), including 263,619 adults hospitalized with COVID-19 across 51 contributing sites. We developed penalized linear regression (elastic net), random forest, XGBoost, and multilayer perceptron (MLP) models to predict hospital length of stay (LOS) and mortality (in-hospital and 60-day), using demographics, comorbidities, prior healthcare utilization, COVID-19 vaccination status, and hospital site as predictors. Missing data were handled via multiple imputation by chained equations (MICE), and class imbalance was addressed using SMOTE. Model performance was evaluated using area under the ROC curve (AUROC), Brier score, calibration plots, and decision curve analysis, following the TRIPOD reporting framework. Mortality prediction achieved moderate discrimination across all models (test AUROC = 0.71-0.73 for in-hospital mortality; 0.72-0.73 for 60-day all-cause mortality). Models trained without SMOTE achieved the highest AUROCs but assigned virtually no patients to the mortality class at the default 0.5 threshold. SMOTE improved recall and F1 score at the cost of reduced AUROC and precision. LOS was poorly explained by available structured predictors (best R² = 0.059). Remdesivir-treated patients (n = 103,536; 39.3%) were older, had higher comorbidity burden, and had higher unadjusted mortality than untreated patients. Common structured EHR features offer moderate utility for mortality risk stratification in hospitalized COVID-19 patients but are insufficient for LOS prediction.
The consistent SMOTE-related tradeoff between discrimination and calibration underscores the need to report threshold-dependent metrics alongside AUROC in clinical ML studies, with implications for operational planning during future respiratory disease emergencies.
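The SMOTE tradeoff described above can be illustrated with a minimal sketch. This is not the authors' pipeline: it uses synthetic data, logistic regression, and plain random minority oversampling as a stand-in for SMOTE, with all sample sizes and parameters chosen for illustration. The point it demonstrates is the same one the abstract makes: rebalancing the training data raises recall at the default 0.5 threshold, while ranking-based discrimination (AUROC) need not improve.

```python
# Illustrative sketch only: random minority oversampling as a stand-in for
# SMOTE, on a synthetic imbalanced "mortality" outcome. Not the study's code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic cohort with roughly 5% positive (event) labels.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: fit directly on the imbalanced training data.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc_base = roc_auc_score(y_te, base.predict_proba(X_te)[:, 1])
rec_base = recall_score(y_te, base.predict(X_te))  # default 0.5 threshold

# Oversample the minority class to parity before refitting.
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size)
X_os = np.vstack([X_tr, X_tr[extra]])
y_os = np.concatenate([y_tr, y_tr[extra]])

balanced = LogisticRegression(max_iter=1000).fit(X_os, y_os)
auc_os = roc_auc_score(y_te, balanced.predict_proba(X_te)[:, 1])
rec_os = recall_score(y_te, balanced.predict(X_te))

print(f"baseline:    AUROC={auc_base:.3f}  recall@0.5={rec_base:.3f}")
print(f"oversampled: AUROC={auc_os:.3f}  recall@0.5={rec_os:.3f}")
```

Because oversampling only shifts the model's operating point (recall at a fixed threshold) rather than its ranking of patients, reporting AUROC alone would hide exactly the tradeoff the abstract flags; threshold-dependent metrics make it visible.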
Article activity feed
This Zenodo record is a permanently preserved version of a Structured PREreview. You can view the complete PREreview at https://prereview.org/reviews/19520609.
Does the introduction explain the objective of the research presented in the preprint?
Yes. The introduction clearly states the three prediction targets (LOS, in-hospital mortality, 60-day mortality), justifies the clinical need, situates the work relative to prior literature, and specifies the N3C dataset.

Are the methods well-suited for this research?
Neither appropriate nor inappropriate. The overall framework (retrospective cohort, multiple ML models, SMOTE, TRIPOD reporting) is reasonable. However, the inclusion of remdesivir as a predictor without verifying temporal ordering, the absence of threshold optimization, and the lack of temporal validation are meaningful deviations from best practices.

Are the conclusions supported by the data?
Somewhat supported. The mortality findings and the SMOTE tradeoff conclusion are well-supported. The LOS conclusion is also credible. However, the claim that these models could inform pandemic preparedness is difficult to support given the temporal pooling across five years and the lack of external validation.

Are the data presentations, including visualizations, well-suited to represent the data?
Highly appropriate and clear. The ROC curves, SHAP beeswarm plots, and tables are appropriate choices for this type of ML study and are generally readable.

How clearly do the authors discuss, explain, and interpret their findings and potential next steps for the research?
Somewhat clearly. The discussion is one of the stronger sections, as the authors engage honestly with the SMOTE tradeoff, the LOS null finding, and the remdesivir confounding issue. However, the temporal pooling limitation is acknowledged but not fully explored, and the practical translation of AUROC values into clinical terms (what 0.72 means at the bedside) is absent.

Is the preprint likely to advance academic knowledge?
Somewhat likely. The empirical SMOTE tradeoff finding and the honest documentation of the structured EHR data ceiling are genuine contributions to clinical ML methodology. The N3C cohort characterization by remdesivir exposure also adds value for future causal inference work.

Would it benefit from language editing?
No. The writing is clear, precise, and well-organized throughout.

Would you recommend this preprint to others?
Yes, but it needs to be improved. The infrastructure, transparency, and methodological honesty make it worth reading, but the remdesivir temporal issue and temporal pooling concern need to be addressed before the conclusions can be fully trusted.

Is it ready for attention from an editor, publisher or broader audience?
No, it needs a major revision. A sensitivity analysis excluding remdesivir, some form of temporal validation, and subgroup performance reporting by race/ethnicity are needed before this is ready for publication. These are addressable revisions, not fatal flaws, but they are major enough that the current version should not go forward as it is.

Competing interests
The author declares that they have no competing interests.
Use of Artificial Intelligence (AI)
The author declares that they did not use generative AI to come up with new ideas for their review.