Who has long-COVID? A big data approach

Emily R. Pfaff
Andrew T Girvin
Tellen D. Bennett
Abhishek Bhatia
Ian M. Brooks
Rachel R Deer
Jonathan P Dekermanjian
Sarah Elizabeth Jolley
Michael G. Kahn
Kristin Kostka
Julie A McMurry
Richard Moffitt
Anita Walden
Christopher G Chute
Melissa A Haendel


Abstract

Background

Post-acute sequelae of SARS-CoV-2 infection (PASC), otherwise known as long-COVID, have severely impacted recovery from the pandemic for patients and society alike. This new disease is characterized by evolving, heterogeneous symptoms, making it challenging to derive an unambiguous long-COVID definition. Electronic health record (EHR) studies are a critical element of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, which is addressing the urgent need to understand PASC, accurately identify who has PASC, and identify treatments.

Methods

Using the National COVID Cohort Collaborative’s (N3C) EHR repository, we developed XGBoost machine learning (ML) models to identify potential long-COVID patients. We examined demographics, healthcare utilization, diagnoses, and medications for 97,995 adult COVID-19 patients. We used these features and 597 long-COVID clinic patients to train three ML models to identify potential long-COVID patients among (1) all COVID-19 patients, (2) patients hospitalized with COVID-19, and (3) patients who had COVID-19 but were not hospitalized.
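The three training populations described above can be sketched as a simple partition of patient records. This is only an illustration under stated assumptions: the `Patient` record, its field names, and the `build_cohorts` helper are hypothetical and do not come from the N3C data model or the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All fields are hypothetical stand-ins for EHR-derived values.
    patient_id: str
    hospitalized: bool       # hospitalized with COVID-19
    long_covid_clinic: bool  # seen at a long-COVID clinic (training label)


def build_cohorts(patients):
    """Split patients into the three populations, one per model."""
    return {
        "all": list(patients),
        "hospitalized": [p for p in patients if p.hospitalized],
        "non_hospitalized": [p for p in patients if not p.hospitalized],
    }


patients = [
    Patient("a", hospitalized=True, long_covid_clinic=True),
    Patient("b", hospitalized=False, long_covid_clinic=False),
    Patient("c", hospitalized=False, long_covid_clinic=True),
    Patient("d", hospitalized=True, long_covid_clinic=False),
    Patient("e", hospitalized=False, long_covid_clinic=False),
]

cohorts = build_cohorts(patients)
for name, members in cohorts.items():
    positives = sum(p.long_covid_clinic for p in members)
    print(name, len(members), positives)
```

Each cohort would then feed its own classifier, so the hospitalized and non-hospitalized models see only the patients they are meant to score.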

Findings

Our models identified potential long-COVID patients with high accuracy, achieving areas under the receiver operating characteristic curve of 0.91 (all patients), 0.90 (hospitalized), and 0.85 (non-hospitalized). Important features include rate of healthcare utilization, patient age, dyspnea, and other diagnosis and medication information available within the EHR. Applying the “all patients” model to the larger N3C cohort identified 100,263 potential long-COVID patients.
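For readers unfamiliar with the metric reported above, the area under the receiver operating characteristic curve equals the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as half. The following is a minimal sketch of that calculation in plain Python. The `auroc` function and the toy labels and scores are illustrative assumptions, not the authors' evaluation code or data.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 values (1 = positive case)
    scores: iterable of model scores, higher = more likely positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(round(auroc(labels, scores), 3))
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration. For a cohort of the size described here, a rank-based implementation from an established library would be the practical choice.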

Interpretation

Patients flagged by our models can be interpreted as “patients likely to be referred to or seek care at a long-COVID specialty clinic,” an essential proxy for long-COVID diagnosis in the current absence of a definition. We also achieve the urgent goal of identifying potential long-COVID patients for clinical trials. As more data sources are identified, the models can be retrained and tuned based on study needs.

Funding

This study was funded by NCATS and NIH through the RECOVER Initiative.

ScreenIT
Oct 25, 2021
SciScore for 10.1101/2021.10.18.21265168:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Software and Algorithms
Sentences	Resources
The python package XGBoost was used to construct the models, using 924 features in total.	python suggested: (IPython, RRID:SCR_001658)
Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.
About SciScore
SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.
Version published to 10.1101/2021.10.18.21265168 on medRxiv
Oct 22, 2021

