Clinical subphenotypes in COVID-19: derivation, validation, prediction, temporal patterns, and interaction with social determinants of health

Abstract

Coronavirus disease 2019 (COVID-19) is heterogeneous, and our understanding of the biological mechanisms of the host response to the viral infection remains limited. Identification of meaningful clinical subphenotypes may benefit pathophysiological study, clinical practice, and clinical trials. Here, our aim was to derive and validate COVID-19 subphenotypes using machine learning and routinely collected clinical data, assess temporal patterns of these subphenotypes during the pandemic course, and examine their interaction with social determinants of health (SDoH). We retrospectively analyzed 14418 COVID-19 patients in five major medical centers in New York City (NYC) between March 1 and June 12, 2020. Using clustering analysis, four biologically distinct subphenotypes were derived in the development cohort (N = 8199). Importantly, the identified subphenotypes were highly predictive of clinical outcomes (especially 60-day mortality). Sensitivity analyses in the development cohort, and rederivation and prediction in the internal (N = 3519) and external (N = 3519) validation cohorts, confirmed the reproducibility and usability of the subphenotypes. Further analyses showed varying subphenotype prevalence across the peak of the outbreak in NYC. We also found that SDoH specifically influenced mortality outcome in Subphenotype IV, which is associated with older age, worse clinical manifestation, and high comorbidity burden. Our findings may lead to a better understanding of how COVID-19 causes disease in different populations and potentially benefit clinical trial development. The temporal patterns and SDoH implications of the subphenotypes may add insights to health policy to reduce social disparity in the pandemic.

SciScore for 10.1101/2021.02.28.21252645:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

Institutional Review Board Statement	IRB: Ethical approval and patient consent: The Institutional Review Board of the Weill Cornell Medicine approved this study (Protocol number: 20-04021948).
Randomization	Considering the population diversity of the five medical centers (see eTable 1 in Supplement), we combined patients of NYU, NYP-WCMC, MSHS, and MMC and randomly divided them into the development cohort (70%) and internal validation cohort (30%).
Blinding	not detected.
Power Analysis	not detected.
Sex as a biological variable	not detected.
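
The random 70%/30% division quoted in the Randomization row can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `split_cohort`, the seed, and the placeholder patient IDs are assumptions made for illustration, and nothing here reproduces the authors' actual procedure or cohort sizes.

```python
import random


def split_cohort(patient_ids, dev_fraction=0.7, seed=0):
    """Shuffle patient IDs and split them into development and internal validation sets."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is reproducible
    rng.shuffle(ids)
    n_dev = round(len(ids) * dev_fraction)
    return ids[:n_dev], ids[n_dev:]


# Hypothetical pooled IDs standing in for patients from the combined centers.
pooled = [f"patient_{i}" for i in range(1000)]
development, internal_validation = split_cohort(pooled)
print(len(development), len(internal_validation))
```

Splitting on patient IDs rather than on individual records keeps each patient in exactly one cohort, which is the property an internal validation set needs.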

Table 2: Resources

Software and Algorithms
Sentences	Resources
More specifically, clustering models were implemented based on Python packages ‘scikit-learn 0.23.2’ (https://scikit-learn.org/stable/) and ‘scipy 1.5.3’ (https://www.scipy.org).	https://www.scipy.org suggested: (SciPy, RRID:SCR_008058)
Data dimension reduction and visualization were performed based on Python package ‘UMAP-learn 0.3.9’ (https://umap-learn.readthedocs.io/en/latest/).	Python suggested: (IPython, RRID:SCR_001658)

Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

Limitations: While this study presents a new contribution in the efforts to parse the biological heterogeneity of COVID-19, there remain several limitations. First of all, our data-driven approach relied on the availability of patient data. In this study, we identified subphenotypes using the routinely collected clinical variables that were correlated with COVID-19 [35] and available in the INSIGHT database [36]. We were not able to extract presenting symptoms and vital data while the incorporation of such data would add in new insights. Second, in our study, the analyzed data were collected at ED or hospital presentation, so the time between COVID-19 symptom onset to ED or hospital presentation could be a covariate of disease severity and clinical outcomes. However, such data was not available in the INSIGHT database. Third, missing values may affect the robustness of the identified subphenotypes. In order to address this issue, we excluded variables with high missingness. For the remaining variables, we used the state-of-the-art K-nearest neighbors imputation algorithm [37]. Even so, we still missed these real values hence may incorporate bias. Fourth, our study was based on presenting clinical data, such that each patient was characterized in a snapshot. The full use of longitudinal data of patients may allow us to capture the complexity of the disease arc to identify interesting subphenotypes. Previous studies tried to derive COVID-19 subphenotypes based on longitudinal informatio...
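
The K-nearest neighbors imputation mentioned in the third limitation can be sketched in a few lines. This is a simplified, standard-library-only illustration of the general idea, not the authors' implementation: the function `knn_impute`, the choice of k, and the example values are all hypothetical. scikit-learn provides a `KNNImputer` class, but the excerpt does not say which implementation the study used.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance is Euclidean over the columns both rows have observed, rescaled by
    the share of columns compared, so rows with few shared columns are not favored.
    """
    n_cols = len(rows[0])

    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no overlapping columns: treat as maximally distant
        squared = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(squared * n_cols / len(shared))

    imputed = [list(r) for r in rows]  # copy so the input is left unchanged
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observed column j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical values (rows = patients, columns = variables); None marks a missing entry.
labs = [
    [1.0, 10.0, 100.0],
    [1.2, 11.0, None],
    [0.9, 9.5, 98.0],
    [5.0, 40.0, 300.0],
]
filled = knn_impute(labs, k=2)
print(filled[1])
```

In this toy example the missing third value of the second patient is filled from the two most similar patients and ignores the outlying fourth row. As the limitation itself notes, an imputed value is still an estimate and may introduce bias.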

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.
