Machine Learning for Identifying Data-Driven Subphenotypes of Incident Post-Acute SARS-CoV-2 Infection Conditions with Large Scale Electronic Health Records: Findings from the RECOVER Initiative
Abstract
Post-acute sequelae of SARS-CoV-2 infection (PASC) refers to a broad spectrum of symptoms and signs that are persistent, exacerbated, or newly incident in the post-acute period of SARS-CoV-2 infection in COVID-19 patients. Most studies have examined these conditions individually without providing conclusive evidence on co-occurring conditions. To address this gap, this study leveraged electronic health records (EHRs) from two large clinical research networks in the national Patient-Centered Clinical Research Network (PCORnet) and investigated patients’ newly incident diagnoses that appeared within 30 to 180 days after a documented SARS-CoV-2 infection. Through machine learning, we identified four reproducible subphenotypes of PASC dominated by blood and circulatory system, respiratory, musculoskeletal and nervous system, and digestive system problems, respectively. We also demonstrated that these subphenotypes were associated with distinct patterns of patient demographics, underlying conditions present prior to SARS-CoV-2 infection, acute infection phase severity, and use of new medications in the post-acute period. Our study provides novel insights into the heterogeneity of PASC and can inform stratified decision-making in the treatment of COVID-19 patients with PASC conditions.
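The abstract's definition of a newly incident diagnosis (first appearing 30 to 180 days after a documented infection) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name, the record layout, the inclusive window bounds, and the rule that any earlier record disqualifies a condition are all assumptions made for illustration.

```python
from datetime import date

# Assumed window bounds (inclusive), in days after the documented infection.
WINDOW_START_DAYS = 30
WINDOW_END_DAYS = 180


def incident_post_acute_conditions(index_date, diagnoses):
    """Return condition codes first recorded 30-180 days after index_date.

    diagnoses: iterable of (condition_code, diagnosis_date) tuples.
    Assumption: a condition with any record before the window opens
    (baseline or acute phase) is treated as pre-existing and excluded.
    """
    seen_before = set()
    in_window = set()
    for code, dx_date in diagnoses:
        offset = (dx_date - index_date).days
        if offset < WINDOW_START_DAYS:
            seen_before.add(code)
        elif offset <= WINDOW_END_DAYS:
            in_window.add(code)
        # Records after day 180 fall outside the window and are ignored.
    return sorted(in_window - seen_before)


# Hypothetical patient: infection documented on 2020-06-01.
index = date(2020, 6, 1)
records = [
    ("anemia", date(2020, 7, 15)),          # day 44: inside the window
    ("dyspnea", date(2020, 5, 1)),          # before infection: pre-existing
    ("dyspnea", date(2020, 8, 1)),          # inside the window, but pre-existing
    ("abdominal_pain", date(2021, 1, 15)),  # day 228: after the window closes
]
incident = incident_post_acute_conditions(index, records)
print(incident)
```

For the hypothetical patient above, only `anemia` qualifies as newly incident under these assumptions.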
Article activity feed
SciScore for 10.1101/2022.05.21.22275412:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Software and Algorithms
Sentence: With these considerations, we set the final number of topics as 10 for both INSIGHT and OneFlorida+ cohorts as it achieved the best topic coherence and reasonable data likelihood (we do not want the data likelihood to be too perfect as that may suggest overfitting).
Resource: OneFlorida+ (suggested: None)
Sentence: For determining the optimal number of clusters (subphenotypes), we applied NbClust R package [25], which includes 21 cluster indices to evaluate the quality of clusters.
Resource: NbClust (suggested: None)
Sentence: On both the INSIGHT and OneFlorida cohorts, we found that all confounders on all subphenotypes were balanced.
Resource: OneFlorida (suggested: None)
Sentence: We used the python package GENSIM (https://radimrehurek.com/gensim/) to calculate the topic coherence.
Resource: python (suggested: IPython, RRID:SCR_001658)
Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).
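The topic-coherence criterion quoted in the Resources table can be sketched as follows. The quoted sentences say coherence was computed with GENSIM but do not say which coherence measure was used, so this example hand-rolls the UMass measure in plain Python as an assumption; it is not the authors' code and does not call gensim. The function name, the toy patient records, and the candidate topic lists are hypothetical, and the data-likelihood check mentioned in the quoted sentence is not covered.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    top_words: topic words ordered by descending probability.
    documents: list of sets of words (here, condition codes per patient).
    Sums log((D(w_m, w_l) + 1) / D(w_l)) over ordered word pairs, where
    D counts the documents containing all of the given words.
    """
    def doc_freq(*words):
        return sum(1 for doc in documents if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denom)
    return score


# Hypothetical corpus: each patient is a set of incident condition codes.
patients = [
    {"anemia", "heart_failure", "arrhythmia"},
    {"anemia", "heart_failure"},
    {"cough", "dyspnea"},
    {"cough", "dyspnea", "chest_pain"},
]

# Hypothetical top words of topic models fitted with K = 2 and K = 3 topics.
candidate_topics = {
    2: [["anemia", "heart_failure"], ["cough", "dyspnea"]],
    3: [["anemia", "cough"], ["heart_failure", "dyspnea"],
        ["arrhythmia", "chest_pain"]],
}

# Average the per-topic coherence for each K and keep the best-scoring K.
mean_coherence = {
    k: sum(umass_coherence(t, patients) for t in topics) / len(topics)
    for k, topics in candidate_topics.items()
}
best_k = max(mean_coherence, key=mean_coherence.get)
print(best_k, mean_coherence)
```

In this toy corpus the two-topic model groups conditions that actually co-occur in the same patients, so it scores higher than the three-topic model, which mixes unrelated conditions.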
Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
Our study is not without limitations. First, our analysis is based on longitudinal observational patient data, which cannot explain the biological mechanisms behind PASC directly. Second, the PASC diagnoses we investigated were encoded as CCSR categories, which may not reflect the co-incidence patterns of fine-grained diagnosis conditions in the context of PASC. Third, we focused on new incidences of conditions in the post-acute infection period for COVID-19 patients and did not consider pre-existing conditions that are persistent or exacerbated due to the acute SARS-CoV-2 infection. Finally, our study period did not represent the recent wave dominated by the Omicron variants of SARS-CoV-2. To summarize, our study dissects the complexity and heterogeneity of newly incident conditions in 30-180 days after SARS-CoV-2 infection confirmation into four reproducible subphenotypes based on the EHR repositories from two large CRNs using machine learning. These four subphenotypes included a severe one involving problems with the blood and circulatory system and associated with high baseline comorbidity burden and disease severity in its acute phase, a milder one in younger people mainly with respiratory problems, and two pain-dominated ones (musculoskeletal/nervous system pain and abdominal pain respectively). Overall, patients in each subphenotype tend to have higher rates of related conditions in the baseline period. Our study provides the first systematic study on the co-incidence ...
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a protocol registration statement.
Results from scite Reference Check: We found no unreliable references.