Prospective Predictive Performance Comparison between Clinical Gestalt and Validated COVID-19 Mortality Scores

This article has been reviewed by the following groups


Abstract

Most COVID-19 mortality scores were developed at the beginning of the pandemic, and clinicians now have more experience and evidence-based interventions. We therefore hypothesized that the predictive performance of COVID-19 mortality scores is now lower than originally reported. We aimed to prospectively evaluate the current predictive accuracy of six COVID-19 scores and to compare it with the accuracy of clinical gestalt predictions. A total of 200 patients with COVID-19 were enrolled at a tertiary hospital in Mexico City between September and December 2020. The area under the curve (AUC) of the LOW-HARM, qSOFA, MSL-COVID-19, NUTRI-CoV, and NEWS2 scores and the AUC of clinical gestalt predictions of death (as a percentage) were determined. In total, 166 patients (106 men and 60 women aged 56±9 years) with confirmed COVID-19 were included in the analysis. The AUC of all scores was significantly lower than originally reported: LOW-HARM 0.76 (95% CI 0.69 to 0.84) vs 0.96 (95% CI 0.94 to 0.98), qSOFA 0.61 (95% CI 0.53 to 0.69) vs 0.74 (95% CI 0.65 to 0.81), MSL-COVID-19 0.64 (95% CI 0.55 to 0.73) vs 0.72 (95% CI 0.69 to 0.75), NUTRI-CoV 0.60 (95% CI 0.51 to 0.69) vs 0.79 (95% CI 0.76 to 0.82), NEWS2 0.65 (95% CI 0.56 to 0.75) vs 0.84 (95% CI 0.79 to 0.90), and neutrophil-to-lymphocyte ratio 0.65 (95% CI 0.57 to 0.73) vs 0.74 (95% CI 0.62 to 0.85). Clinical gestalt predictions were non-inferior to mortality scores, with an AUC of 0.68 (95% CI 0.59 to 0.77). Adjusting scores with locally derived likelihood ratios did not improve their performance; however, some scores outperformed clinical gestalt predictions when clinicians’ confidence in their prediction was <80%. Despite its subjective nature, clinical gestalt has relevant advantages in predicting COVID-19 clinical outcomes. The performance and continued usefulness of most COVID-19 mortality scores should be re-evaluated regularly.

Article activity feed

  1. SciScore for 10.1101/2021.04.16.21255647:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: This study was approved by the Ethics Committee for Research on Humans of the National Institute of Medical Sciences and Nutrition Salvador Zubirán on August 25, 2020 (Reg. No. DMC-3369-20-20-1-1a).
    Randomization: not detected.
    Blinding: not detected.
    Power Analysis: Sample size rationale: Using “easyROC” (20), an open R-based web tool for estimating sample sizes for direct and non-inferiority AUC comparisons with Obuchowski’s method (21), we calculated that 159 patients would be needed to detect non-inferiority with a maximal AUC difference of 0.05 relative to the reported LOW-HARM AUC (0.96, 95% CI 0.94 to 0.98), assuming a case allocation ratio of 0.7 (because mortality in our centre is ∼0.3), a power of 0.8, and a significance level of 0.05.
    Sex as a biological variable: not detected.
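
    For intuition about the numbers in the sample size rationale above, the standard error of an AUC at a given sample size can be approximated in a few lines. This is an illustrative sketch only: it uses the Hanley–McNeil variance approximation rather than the Obuchowski method that easyROC implements, so it will not reproduce easyROC’s output exactly; the split of 159 patients into 48 deaths and 111 survivors is inferred from the stated ∼0.3 mortality and 0.7 allocation ratio.

    ```python
    import math

    def hanley_mcneil_se(auc, n_pos, n_neg):
        """Approximate standard error of an empirical AUC (Hanley & McNeil, 1982)."""
        q1 = auc / (2 - auc)          # prob. two random positives both outrank a negative
        q2 = 2 * auc**2 / (1 + auc)   # prob. a positive outranks two random negatives
        var = (auc * (1 - auc)
               + (n_pos - 1) * (q1 - auc**2)
               + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
        return math.sqrt(var)

    # 159 patients with ~30% mortality: roughly 48 deaths and 111 survivors
    se = hanley_mcneil_se(auc=0.96, n_pos=48, n_neg=111)
    print(round(se, 3))  # small SE, i.e. a narrow CI around an AUC of 0.96
    ```

    A high baseline AUC such as 0.96 yields a small standard error, which is why a comparatively modest sample suffices for a 0.05 non-inferiority margin.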

    Table 2: Resources

    Software and Algorithms
    Sentences — Resources
    The AUC differences were analysed using DeLong’s method with the STATA function “roccomp” (22).
    STATA — suggested: (Stata, RRID:SCR_012763)
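
    As a rough illustration of what “roccomp” compares (not the authors’ code): DeLong’s method tests differences between empirical AUCs, and each empirical AUC is simply the Mann–Whitney probability that a randomly chosen death receives a higher score than a randomly chosen survivor. The data below are hypothetical.

    ```python
    # Illustrative sketch: the empirical AUC as the Mann-Whitney statistic,
    # the quantity whose difference DeLong's method tests.
    # All scores and labels below are hypothetical, not from the study.

    def auc(scores, labels):
        """Empirical AUC: P(score_pos > score_neg) + 0.5 * P(tie)."""
        pos = [s for s, y in zip(scores, labels) if y == 1]
        neg = [s for s, y in zip(scores, labels) if y == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    # Hypothetical predictions for 8 patients (1 = died, 0 = survived)
    labels  = [1, 1, 1, 0, 0, 0, 0, 1]
    score_a = [0.9, 0.8, 0.4, 0.3, 0.2, 0.5, 0.1, 0.7]  # e.g. a mortality score
    score_b = [0.6, 0.9, 0.3, 0.4, 0.4, 0.5, 0.2, 0.8]  # e.g. clinical gestalt

    print(auc(score_a, labels), auc(score_b, labels))  # -> 0.9375 0.8125
    ```

    DeLong’s contribution is the covariance-aware variance estimate that makes the two correlated AUCs (computed on the same patients) formally comparable; the statistic being compared is the one computed here.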

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    This work highlights the inherent limitations of statistically derived scores and some of the advantages of Clinical Gestalt predictions. In other scenarios where predictive scores are frequently used, more experienced clinicians can always weigh in with their sometimes subjective yet quite valuable insight. With the COVID-19 pandemic, however, clinicians at all levels of training started their learning curve at the same time. In this study, we had the unique opportunity of re-evaluating more than one score (two of them in the same setting and for the same purpose for which they were designed), while testing the accuracy of Clinical Gestalt in a group of clinicians who started their learning curve for managing a disease at the same time (experience and training within healthcare teams are usually mixed for other diseases). Additionally, we explored the accuracy of Clinical Gestalt across different degrees of prediction confidence. To our knowledge, this is the first time this type of analysis has been done for subjective clinical predictions, and it proved to be quite insightful. The fact that Clinical Gestalt’s accuracy correlates with confidence in prediction suggests that while there is value in subjective predictions, it is also important to ask ourselves how confident we are about our predictions. Interestingly, our results suggest Clinical Gestalt predictions are particularly prone to be positively biased: clinicians were more likely to correctly predict which patients would surv...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: Please consider improving the rainbow (“jet”) colormap(s) used on page 27. At least one figure is not accessible to readers with colorblindness and/or is not true to the data, i.e. not perceptually uniform.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.