Identifying COVID-19 phenotypes using cluster analysis and assessing their clinical outcomes

This article has been Reviewed by the following groups

Read the full article

Abstract

Multiple clinical phenotypes have been proposed for COVID-19, but few have stemmed from data-driven methods. We aimed to identify distinct phenotypes in patients admitted with COVID-19 using cluster analysis, and compare their respective characteristics and clinical outcomes.

We analyzed the data from 547 patients hospitalized with COVID-19 in a Canadian academic hospital from January 1, 2020, to January 30, 2021. We compared four clustering algorithms: K-means, PAM (partition around medoids), divisive and agglomerative hierarchical clustering. We used imaging data and 34 clinical variables collected within the first 24 hours of admission to train our algorithm. We then conducted survival analysis to compare clinical outcomes across phenotypes and trained a classification and regression tree (CART) to facilitate phenotype interpretation and phenotype assignment.

We identified three clinical phenotypes, with 61 patients (17%) in Cluster 1, 221 patients (40%) in Cluster 2 and 235 (43%) in Cluster 3. Cluster 2 and Cluster 3 were both characterized by a low-risk respiratory and inflammatory profile, but differed in terms of demographics. Compared with Cluster 3, Cluster 2 comprised older patients with more comorbidities. Cluster 1 represented the group with the most severe clinical presentation, as inferred by the highest rate of hypoxemia and the highest radiological burden. Mortality, mechanical ventilation and ICU admission risk were all significantly different across phenotypes.

We conducted a phenotypic analysis of adult inpatients with COVID-19 and identified three distinct phenotypes associated with different clinical outcomes. Further research is needed to determine how to properly incorporate those phenotypes in the management of patients with COVID-19.

Article activity feed

  1. SciScore for 10.1101/2022.05.27.22275708: (What is this?)

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    EthicsIRB: The Institutional Review Board of the CHUM (Centre Hospitalier de l’Université de Montréal) approved the study and informed consent was waived because of its low risk and retrospective nature.
    Consent: The Institutional Review Board of the CHUM (Centre Hospitalier de l’Université de Montréal) approved the study and informed consent was waived because of its low risk and retrospective nature.
    Sex as a biological variablenot detected.
    Randomizationnot detected.
    Blindingnot detected.
    Power Analysisnot detected.

    Table 2: Resources

    Software and Algorithms
    SentencesResources
    The raw data was managed using SQLite 3, and further data processing was conducted using Python version 3.7 and R version 4.0.3.
    Python
    suggested: (IPython, RRID:SCR_001658)

    Results from OddPub: Thank you for sharing your code and data.


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Our study presents some limitations. Multiple variables could not be included because they were either not captured in our electronic health record (e.g., time from onset of symptoms, mechanical ventilation parameters, in-hospital complications) or excluded from our study because of missingness. However, missing values are common in clinical practice and investigating risk stratification while considering the inherent characteristics real-world data is of importance at the bedside (54). In addition, this enhances the applicability of our phenotypes, as they are only based on the most common variables available for patients admitted with COVID-19 (55). This differs from studies that have included flux cytometry and CD4+/CD8+ count in their algorithm (34). Besides, those omitted variables do not seem to have had significant impact on our results as the three clusters obtained were consistent in numbers with previous work (33–39). Additionally, our study included patients admitted between January 1, 2020, and January 31, 2021, being before the approval of the majority of targeted therapies against COVID-19 or vaccination. We therefore did not assess the effect of vaccination, treatments and the type of variant on phenotypes. Accordingly, this put our algorithm at risk for temporal dataset shift (56) and calibrating our clustering algorithm will be necessary before exploiting it in the clinical setting. Finally, because race-based data is not recorded in the Quebec healthcare sys...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.