Multi-scale Data Improves Performance of Machine Learning Model for Long COVID Prediction
Abstract
Long COVID affects a substantial proportion of the more than 778 million individuals infected with SARS-CoV-2, yet predictive models remain limited in scope. While existing efforts, such as the National COVID Cohort Collaborative (N3C), have leveraged electronic health record (EHR) data for risk prediction, accumulating evidence points to additional contributions from social, behavioral, and genetic factors. Using a diverse cohort of SARS-CoV-2-infected individuals (n > 17,200) from the NIH All of Us Research Program, we investigated whether integrating EHR data with survey-based and genomic information improves model performance. Our multi-scale approach achieved an AUROC of 0.748 (95% CI: 0.741, 0.755), outperforming EHR-only models (original AUROC 0.736; 95% CI: 0.730, 0.741). Active-duty service status, self-reported fatigue, and the genetic variant chr19:4719431:G:A_A were among the most informative survey and genetic features. These findings highlight the importance of incorporating multi-scale data to improve risk stratification and inform personalized interventions for long COVID.
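To make the reported comparison concrete, the sketch below shows one common way to compute an AUROC with a 95% confidence interval for two sets of risk scores. It is illustrative only: the abstract does not state how its intervals were derived, so the percentile bootstrap used here is an assumption, and the function names (`auroc`, `bootstrap_ci`), the synthetic labels, and the two score vectors are hypothetical stand-ins rather than All of Us data or the authors' models.

```python
import random


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC requires both positive and negative labels")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic stand-in data: 1 = long COVID, 0 = no long COVID (hypothetical).
rng = random.Random(42)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(500)]
# Hypothetical risk scores from an EHR-only model.
ehr_scores = [0.8 * y + rng.gauss(0, 1) for y in labels]
# Hypothetical scores from a model that also sees survey and genomic features.
multi_scores = [s + 0.3 * y + rng.gauss(0, 0.3) for s, y in zip(ehr_scores, labels)]

for name, scores in [("EHR-only", ehr_scores), ("multi-scale", multi_scores)]:
    point = auroc(labels, scores)
    lower, upper = bootstrap_ci(labels, scores, n_boot=200)
    print(f"{name}: AUROC {point:.3f} (95% CI: {lower:.3f}, {upper:.3f})")
```

Because both score vectors are evaluated on the same individuals, a paired bootstrap of the AUROC difference would be a more direct test of improvement than comparing two separate intervals; that extension is omitted here for brevity.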