Machine Learning for Identifying Data-Driven Subphenotypes of Incident Post-Acute SARS-CoV-2 Infection Conditions with Large Scale Electronic Health Records: Findings from the RECOVER Initiative

Abstract

The post-acute sequelae of SARS-CoV-2 infection (PASC) refers to a broad spectrum of symptoms and signs that are persistent, exacerbated, or newly incident in the post-acute SARS-CoV-2 infection period of COVID-19 patients. Most studies have examined these conditions individually without providing conclusive evidence on co-occurring conditions. To address this gap, this study leveraged electronic health records (EHRs) from two large clinical research networks in the national Patient-Centered Clinical Research Network (PCORnet) and investigated patients’ newly incident diagnoses that appeared within 30 to 180 days after a documented SARS-CoV-2 infection. Through machine learning, we identified four reproducible subphenotypes of PASC dominated by blood and circulatory system, respiratory, musculoskeletal and nervous system, and digestive system problems, respectively. We also demonstrated that these subphenotypes were associated with distinct patterns of patient demographics, underlying conditions present prior to SARS-CoV-2 infection, acute infection phase severity, and use of new medications in the post-acute period. Our study provides novel insights into the heterogeneity of PASC and can inform stratified decision-making in the treatment of COVID-19 patients with PASC conditions.
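The 30 to 180 day window for newly incident diagnoses described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name, the record layout, and the rule that a category counts as "newly incident" only if its first record for the patient falls inside the window (inclusive bounds) are all assumptions made for illustration.

```python
from datetime import date


def incident_post_acute(index_date, diagnoses, start=30, end=180):
    """Return categories whose FIRST record falls 30-180 days after index_date.

    diagnoses: iterable of (category, date) pairs for one patient.
    Assumption: any earlier record (baseline or acute phase) makes the
    category pre-existing rather than newly incident.
    """
    first_seen = {}
    for category, when in diagnoses:
        if category not in first_seen or when < first_seen[category]:
            first_seen[category] = when
    return sorted(
        category
        for category, when in first_seen.items()
        if start <= (when - index_date).days <= end
    )


# Hypothetical patient with infection documented on 2021-01-01.
records = [
    ("dyspnea", date(2021, 3, 1)),     # day 59: inside the window
    ("anemia", date(2020, 6, 1)),      # recorded before infection
    ("anemia", date(2021, 2, 15)),     # repeat of a pre-existing category
    ("headache", date(2021, 1, 10)),   # day 9: acute phase
    ("chest pain", date(2021, 8, 1)),  # day 212: after the window
]
incident = incident_post_acute(date(2021, 1, 1), records)
print(incident)  # ['dyspnea']
```

Keeping only the earliest record per category before applying the window is what separates "newly incident" from "persistent or exacerbated" conditions in this sketch; a real study would also need an explicit baseline look-back period.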

Article activity feed

  1. SciScore for 10.1101/2022.05.21.22275412:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources

    Sentence: With these considerations, we set the final number of topics as 10 for both INSIGHT and OneFlorida+ cohorts as it achieved the best topic coherence and reasonable data likelihood (we do not want the data likelihood to be too perfect as that may suggest overfitting).
    Resource: OneFlorida+
    Suggested: None

    Sentence: For determining the optimal number of clusters (subphenotypes), we applied the NbClust R package [25], which includes 21 cluster indices to evaluate the quality of clusters.
    Resource: NbClust
    Suggested: None

    Sentence: In both the INSIGHT and OneFlorida cohorts, we found that all confounders on all subphenotypes were balanced.
    Resource: OneFlorida
    Suggested: None

    Sentence: We used the Python package Gensim (https://radimrehurek.com/gensim/) to calculate the topic coherence.
    Resource: python
    Suggested: (IPython, RRID:SCR_001658)
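The quoted sentences mention choosing the number of topics by topic coherence, computed with Gensim. As a rough illustration of what such a score measures, the sketch below implements one common coherence measure (UMass) in plain Python rather than calling Gensim. The quoted text does not say which coherence measure was used, so the choice of UMass, the function name, and the toy patient records are all assumptions.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents, eps=1.0):
    """Average UMass coherence of one topic's ranked top words.

    For each pair (higher-ranked word, lower-ranked word) the score is
    log((co-document count + eps) / document count of the higher-ranked word).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        higher, lower = top_words[i], top_words[j]
        denom = doc_freq(higher)
        if denom == 0:
            continue  # word never observed; skip to avoid division by zero
        scores.append(math.log((doc_freq(higher, lower) + eps) / denom))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical "documents": each patient's set of diagnosis categories.
patients = [
    ["dyspnea", "cough", "chest pain"],
    ["dyspnea", "cough"],
    ["abdominal pain", "nausea"],
    ["abdominal pain", "nausea", "cough"],
]
coherent = umass_coherence(["dyspnea", "cough"], patients)
mixed = umass_coherence(["dyspnea", "nausea"], patients)
print(coherent > mixed)  # True
```

A topic whose top categories tend to appear in the same patients scores higher than one that mixes unrelated categories. Averaging this score over all topics for each candidate topic count gives one way to compare candidates.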

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Our study is not without limitations. First, our analysis is based on longitudinal observational patient data, which cannot explain the biological mechanisms behind PASC directly. Second, the PASC diagnoses we investigated were encoded as CCSR categories, which may not reflect the co-incidence patterns of fine-grained diagnosis conditions in the context of PASC. Third, we focused on new incidences of conditions in the post-acute infection period for COVID-19 patients and did not consider pre-existing conditions that are persistent or exacerbated due to the acute SARS-CoV-2 infection. Finally, our study period did not represent the recent wave dominated by the Omicron variants of SARS-CoV-2. To summarize, our study dissects the complexity and heterogeneity of newly incident conditions in 30-180 days after SARS-CoV-2 infection confirmation into four reproducible subphenotypes based on the EHR repositories from two large CRNs using machine learning. These four subphenotypes included a severe one involving problems with the blood and circulatory system and associated with high baseline comorbidity burden and disease severity in its acute phase, a milder one in younger people mainly with respiratory problems, and two pain-dominated ones (musculoskeletal/nervous system pain and abdominal pain respectively). Overall, patients in each subphenotype tend to have higher rates of related conditions in the baseline period. Our study provides the first systematic study on the co-incidence ...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a protocol registration statement.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.