Predicting Long COVID in the National COVID Cohort Collaborative Using Super Learner

Zachary Butzin-Dozier
Yunwen Ji
Haodong Li
Jeremy Coyle
Junming (Seraphina) Shi
Rachael V. Philips
Andrew Mertens
Romain Pirracchio
Mark J. van der Laan
Rena Patel
John M. Colford
Alan E. Hubbard

Abstract

Post-acute Sequelae of COVID-19 (PASC), also known as Long COVID, is a broad grouping of long-term symptoms that follow acute COVID-19 infection. Understanding which characteristics predict future PASC is valuable because it can inform the identification of high-risk individuals and future preventive efforts. However, current knowledge of PASC risk factors is limited. Using a sample of 55,257 participants from the National COVID Cohort Collaborative, as part of the NIH Long COVID Computational Challenge, we sought to predict individual risk of PASC diagnosis from a curated set of clinically informed covariates. We predicted individual PASC status, given covariate information, using Super Learner (an ensemble machine learning algorithm also known as stacking) to learn the optimal, AUC-maximizing combination of gradient boosting and random forest algorithms. We were able to predict individual PASC diagnoses accurately (AUC 0.947). Temporally, we found that baseline characteristics were more predictive of future PASC diagnosis than characteristics measured immediately before, during, or after COVID-19 infection. This finding supports the hypothesis that clinicians may be able to accurately assess a patient's risk of PASC before an acute COVID-19 diagnosis, which could improve early interventions and preventive care. We found that medical utilization, demographics, anthropometry, and respiratory factors were most predictive of PASC diagnosis, which highlights the importance of respiratory characteristics in PASC risk assessment. The methods outlined here provide an open-source, applied example of using Super Learner to predict PASC status from electronic health record data, which can be replicated across a variety of settings.
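The stacking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes scikit-learn, uses synthetic data from `make_classification` as a stand-in for the N3C cohort, and replaces the Super Learner metalearner with a simple grid search over one convex weight that maximizes out-of-fold AUC. The sample size, hyperparameters, and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split

# Synthetic stand-in data (assumption): NOT the N3C cohort or its covariates.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base learners named in the abstract; hyperparameters are illustrative.
learners = {
    "gbm": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Out-of-fold predicted probabilities: each training row is scored by a
# model that never saw that row, which is what the metalearner trains on.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = {
    name: cross_val_predict(model, X_train, y_train, cv=cv,
                            method="predict_proba")[:, 1]
    for name, model in learners.items()
}

# Simplified metalearner (assumption): pick the convex weight w in [0, 1]
# that maximizes cross-validated AUC of w * gbm + (1 - w) * rf.
weights = np.linspace(0.0, 1.0, 21)
cv_auc = [
    roc_auc_score(y_train, w * oof["gbm"] + (1 - w) * oof["rf"])
    for w in weights
]
best_w = float(weights[int(np.argmax(cv_auc))])

# Refit each base learner on the full training set, then combine their
# held-out test predictions with the selected weight.
for model in learners.values():
    model.fit(X_train, y_train)
test_pred = (
    best_w * learners["gbm"].predict_proba(X_test)[:, 1]
    + (1 - best_w) * learners["rf"].predict_proba(X_test)[:, 1]
)
test_auc = roc_auc_score(y_test, test_pred)
print(f"selected weight on gbm: {best_w:.2f}, held-out AUC: {test_auc:.3f}")
```

With only two base learners the weight search is one-dimensional, so a grid is enough here; a larger library of learners would need a proper constrained optimizer for the metalearner step.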

Version published to 10.1101/2023.07.27.23293272 on medRxiv
Aug 4, 2023

Related articles

Pre-pandemic blood profiles predict COVID-19 hospitalization and death a decade later

This article has 1 author:
1. Laurence A. Jacobs
This article has no evaluations. Latest version May 29, 2026

Sex-Specific Signatures of Circulating Protein and Cellular Host Responses Predicting COVID-19 Severity

This article has 9 authors:
1. Milica Radisavljević
2. Zorica Stojić-Vukanić
3. Tijana Kosanović
4. Miodrag Lalošević
5. Iva Perović Blagojević
6. Jovana Milijić Jovanović
7. Aleksa Petković
8. Jelena Marjanović
9. Gordana Leposavić
This article has no evaluations. Latest version May 31, 2026

T-cell repertoire response in individuals with post-acute sequelae of COVID-19

This article has 6 authors:
1. Zachary Montague
2. Rhea M Grover
3. Andrew Baumgartner
4. Assya Trofimov
5. Jennifer Hadlock
6. Armita Nourmohammad
This article has no evaluations. Latest version Apr 29, 2026
