Clinical subphenotypes in COVID-19: derivation, validation, prediction, temporal patterns, and interaction with social determinants of health

Abstract

Coronavirus disease 2019 (COVID-19) is heterogeneous, and our understanding of the biological mechanisms of the host response to the viral infection remains limited. Identification of meaningful clinical subphenotypes may benefit pathophysiological study, clinical practice, and clinical trials. Here, our aim was to derive and validate COVID-19 subphenotypes using machine learning and routinely collected clinical data, assess temporal patterns of these subphenotypes during the pandemic course, and examine their interaction with social determinants of health (SDoH). We retrospectively analyzed 14418 COVID-19 patients in five major medical centers in New York City (NYC) between March 1 and June 12, 2020. Using clustering analysis, 4 biologically distinct subphenotypes were derived in the development cohort (N = 8199). Importantly, the identified subphenotypes were highly predictive of clinical outcomes (especially 60-day mortality). Sensitivity analyses in the development cohort, and rederivation and prediction in the internal (N = 3519) and external (N = 3519) validation cohorts, confirmed the reproducibility and usability of the subphenotypes. Further analyses showed varying subphenotype prevalence across the peak of the outbreak in NYC. We also found that SDoH specifically influenced the mortality outcome in Subphenotype IV, which is associated with older age, worse clinical manifestation, and high comorbidity burden. Our findings may lead to a better understanding of how COVID-19 causes disease in different populations and potentially benefit clinical trial development. The temporal patterns and SDoH implications of the subphenotypes may add insights to health policy to reduce social disparity in the pandemic.

Article activity feed

  1. SciScore for 10.1101/2021.02.28.21252645:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: IRB: Ethical approval and patient consent: The Institutional Review Board of the Weill Cornell Medicine approved this study (Protocol number: 20-04021948).
    Randomization: Considering the population diversity of the five medical centers (see eTable 1 in Supplement), we combined patients of NYU, NYP-WCMC, MSHS, and MMC and randomly divided them into the development cohort (70%) and internal validation cohort (30%).
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: not detected.
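
The randomization entry describes pooling patients from four centers and dividing them at random into a 70% development cohort and a 30% internal validation cohort. A minimal sketch of such a split follows. The patient identifiers, the seed, and the helper name `split_cohort` are hypothetical, and the excerpt does not say how the authors implemented the split.

```python
import random


def split_cohort(patient_ids, dev_fraction=0.7, seed=0):
    """Shuffle IDs and split them into (development, internal validation)."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed (assumption) so the split is reproducible
    rng.shuffle(ids)
    n_dev = round(len(ids) * dev_fraction)
    return ids[:n_dev], ids[n_dev:]


# Hypothetical pooled identifiers standing in for patients from the four centers.
pooled = [f"patient_{i}" for i in range(1000)]
development, internal_validation = split_cohort(pooled)
print(len(development), len(internal_validation))
```

Splitting on patient identifiers rather than on encounters keeps each patient in exactly one cohort, which is usually what a validation design of this kind requires.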

    Table 2: Resources

    Software and Algorithms
    Sentence: More specifically, clustering models were implemented based on Python packages ‘scikit-learn 0.23.2’ (https://scikit-learn.org/stable/) and ‘scipy 1.5.3’ (https://www.scipy.org).
    Resource: https://www.scipy.org (suggested: SciPy, RRID:SCR_008058)
    Sentence: Data dimension reduction and visualization were performed based on Python package ‘UMAP-learn 0.3.9’ (https://umap-learn.readthedocs.io/en/latest/).
    Resource: Python (suggested: IPython, RRID:SCR_001658)
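
The resources table names scikit-learn and SciPy as the basis of the clustering models but does not state which algorithm was used. As an illustration only, the sketch below standardizes synthetic data with scikit-learn and applies SciPy's agglomerative clustering. The Ward linkage, the four synthetic groups, and the feature values are all assumptions made for the example, not the authors' pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for clinical features: 4 well-separated groups of 50 rows,
# each group centered on a different axis of a 4-dimensional feature space.
centers = 20.0 * np.eye(4)
X = np.vstack([c + rng.normal(size=(50, 4)) for c in centers])

# Put features on a common scale before computing distances.
X_scaled = StandardScaler().fit_transform(X)

# Agglomerative clustering with Ward linkage (an assumption), cut into 4 clusters.
Z = linkage(X_scaled, method="ward")
labels = fcluster(Z, t=4, criterion="maxclust")

print(np.unique(labels, return_counts=True))
```

With real clinical data the number of clusters would have to be chosen and justified separately. Here it is fixed at 4 only to mirror the four subphenotypes reported in the abstract.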

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Limitations: While this study presents a new contribution in the efforts to parse the biological heterogeneity of COVID-19, there remain several limitations. First of all, our data-driven approach relied on the availability of patient data. In this study, we identified subphenotypes using the routinely collected clinical variables that were correlated with COVID-19 [35] and available in the INSIGHT database [36]. We were not able to extract presenting symptoms and vital data while the incorporation of such data would add in new insights. Second, in our study, the analyzed data were collected at ED or hospital presentation, so the time between COVID-19 symptom onset to ED or hospital presentation could be a covariate of disease severity and clinical outcomes. However, such data was not available in the INSIGHT database. Third, missing values may affect the robustness of the identified subphenotypes. In order to address this issue, we excluded variables with high missingness. For the remaining variables, we used the state-of-the-art K-nearest neighbors imputation algorithm [37]. Even so, we still missed these real values hence may incorporate bias. Fourth, our study was based on presenting clinical data, such that each patient was characterized in a snapshot. The full use of longitudinal data of patients may allow us to capture the complexity of the disease arc to identify interesting subphenotypes. Previous studies tried to derive COVID-19 subphenotypes based on longitudinal informatio...
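
The limitations quoted above mention K-nearest neighbors imputation for variables with missing values. A minimal sketch of that idea using scikit-learn's `KNNImputer` is shown below. The toy matrix and the choice of `n_neighbors=2` are assumptions for illustration, and the excerpt does not say which implementation or settings the authors used.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix (hypothetical values); row 0 is missing its third feature.
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [8.0, 9.0, 10.0],
])

# Each missing entry is replaced by the mean of that feature over the
# n_neighbors most similar rows, with similarity measured on observed features.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Rows 1 and 2 are the nearest neighbors of row 0, so the gap is filled
# with the mean of their third-feature values (3.0 and 5.0).
print(X_filled[0])
```

As the quoted passage notes, imputed values are estimates rather than observations, so downstream clusters can still carry bias from the missingness pattern.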

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.