Machine Learning Analysis of Post-Acute COVID Symptoms Identifies Distinct Clusters and Severity Groups

Beverly Peng
Yun Zhang
Thomas Dalhuisen
Aidan Rogers
Jesus Estevez
Helen E. Davies
Samantha A. Jones
Violeta Capric
Natalie S. Haddad
Kelly L. Miners
Kristin Ladell
Jeffrey N. Martin
J. Daniel Kelly
Steven G. Deeks
Michael J. Peluso
David Putrino
David A. Price
Christopher L. Dupont
Marcelo Freire
Michael B. VanElzakker
Amy Proal
Richard H. Scheuermann
F. Eun-Hyung Lee
Gene S. Tan
Yu Qian

Abstract

Questionnaires that capture patient-reported symptomatology provide low-cost but potentially high-value data for the de novo discovery of groupings by disease phenotype, severity, and responsiveness to intervention within an umbrella condition. The availability of comprehensive electronic health records (EHRs) has nonetheless overshadowed the use of questionnaire data for symptom analysis in the context of COVID-19. We analyzed de-identified questionnaires from post-acute COVID-19 cohorts at the University of California, San Francisco (UCSF, n = 669), Icahn School of Medicine at Mount Sinai (ISMMS, n = 615), Emory University (Emory, n = 60), and the University Hospital of Wales (Cardiff, n = 317). Using topic modeling followed by unsupervised clustering, we identified distinct symptom clusters and their corresponding symptom signatures. Mapping these signatures to organ systems revealed nine to twelve endotypes per cohort, capturing the heterogeneity of post-COVID-19 symptoms. Some clusters were associated with pre-existing conditions, including a female-predominant severity cluster with neurological and hormonal symptoms. Longitudinal analysis distinguished three symptom trajectories: acute then resolving, persistent but attenuated, and progressive disease. Across all cohorts, three severity levels (mild, moderate, and severe) were evident from symptoms alone. Symptom-based severity scores correlated with patient-reported health status (EQ-5D) and SARS-CoV-2-specific antibody responses in plasmablasts, validating the prediction. Cluster-level analyses further stratified patients into recovered and non-recovered subgroups, identifying endotypes associated with different recovery trajectories. Finally, meta-analysis integrating cohort-specific clusters defined ten global endotypes and a unified map of severity scores, highlighting cohort-specific patterns, sex differences, and relationships among organ systems.
These findings demonstrate that machine learning-assisted screening of questionnaire data can robustly identify symptom clusters, endotypes, and severity groups, providing a framework for stratifying long COVID patients for precision medicine trial design.
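
To make the idea of deriving symptom clusters and signatures from questionnaire responses concrete, the following is a minimal sketch of the unsupervised clustering step only. It is not the authors' pipeline: the topic-modeling stage is omitted, plain k-means with a deterministic farthest-point initialization is swapped in as a stand-in, and the symptom names, toy responses, cluster count, and the 0.5 signature threshold are all hypothetical choices made for illustration.

```python
# Toy illustration only: hypothetical symptoms and made-up binary responses
# (1 = symptom reported, 0 = not reported). Not data from any cohort.
SYMPTOMS = ["fatigue", "brain_fog", "headache",
            "cough", "shortness_of_breath", "chest_pain"]
RESPONSES = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(rows, k):
    """Farthest-point initialization: start at the first row, then repeatedly
    add the row farthest from the centroids chosen so far."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(
            rows,
            key=lambda r: min(squared_distance(r, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(rows, k, iterations=20):
    """Plain k-means: assign each row to its nearest centroid, then move each
    centroid to the mean of its members, for a fixed number of iterations."""
    centroids = init_centroids(rows, k)
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def signature(centroid, names, threshold=0.5):
    """Symptoms reported by at least `threshold` of a cluster's members."""
    return [name for name, freq in zip(names, centroid) if freq >= threshold]


labels, centroids = kmeans(RESPONSES, k=2)
for c, centroid in enumerate(centroids):
    print(c, signature(centroid, SYMPTOMS))
```

In this sketch a cluster's "signature" is simply the set of symptoms that at least half of its members reported. A real analysis would need a principled choice of cluster count and validation against external measures, such as the EQ-5D scores mentioned in the abstract.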

Version published to 10.1101/2025.11.16.25340350 on medRxiv
Nov 17, 2025

Related articles

Prediction of Long COVID and Mortality among Patients with Substance Use Disorder

This article has 3 authors:
1. Jiawei Wu
2. K M Sajjadul Islam
3. Praveen Madiraju
This article has no evaluations. Latest version: Nov 20, 2025

Long COVID Longitudinal Symptoms Burden Clusters Within A National Community-Based Cohort

This article has 6 authors:
1. Yanhan Shen
2. Zach Shahn
3. McKaylee M. Robertson
4. Kelly Gebo
5. Denis Nash
6. the CHASING COVID Cohort Study Team
This article has no evaluations. Latest version: Nov 4, 2025

Analysis of Potential Subgroups in Vaes ME/CFS Patient Clusters

This article has 1 author:
1. Erik Squires
This article has no evaluations. Latest version: Sep 30, 2025