Routine Laboratory Blood Tests Predict SARS-CoV-2 Infection Using Machine Learning

This article has been Reviewed by the following groups


Abstract

Background

Accurate diagnostic strategies that rapidly identify SARS-CoV-2 positive individuals are urgently needed for the management of patient care and the protection of health care personnel. The predominant diagnostic test is viral RNA detection by RT-PCR from nasopharyngeal swab specimens; however, the results are not promptly obtainable in all patient care locations. Routine laboratory testing, in contrast, is readily available, with a turn-around time (TAT) usually within 1-2 hours.

Method

We developed a machine learning model incorporating patient demographic features (age, sex, race) with 27 routine laboratory tests to predict an individual’s SARS-CoV-2 infection status. Laboratory testing results obtained within 2 days before the release of SARS-CoV-2 RT-PCR result were used to train a gradient boosting decision tree (GBDT) model from 3,356 SARS-CoV-2 RT-PCR tested patients (1,402 positive and 1,954 negative) evaluated at a metropolitan hospital.

Results

The model achieved an area under the receiver operating characteristic curve (AUC) of 0.854 (95% CI: 0.829-0.878). Application of this model to an independent patient dataset from a separate hospital resulted in a comparable AUC (0.838), supporting the generalizability of the model. Moreover, our model predicted initial SARS-CoV-2 RT-PCR positivity in 66% of individuals whose RT-PCR result changed from negative to positive within 2 days.
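For readers unfamiliar with the metric, the following is a minimal sketch of how an AUC and a 95% confidence interval of the kind reported above could be computed. It is not the authors' code: the abstract does not state how the interval was obtained, so the percentile bootstrap used here is an assumption, and the function names `roc_auc` and `bootstrap_auc_ci` are hypothetical.

```python
import random


def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation; tied scores share an average rank."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method, not the paper's)."""
    n = len(labels)
    if not 0 < sum(labels) < n:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy data for illustration only; these are not values from the study.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_ci = bootstrap_auc_ci(toy_labels, toy_scores, n_boot=200)
```

Here `labels` are 1 for RT-PCR positive and 0 for negative, and `scores` are the model's predicted probabilities. The rank-sum form is used because it handles tied scores without building the full ROC curve.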

Conclusion

This model employing routine laboratory test results offers opportunities for early and rapid identification of high-risk SARS-CoV-2 infected patients before their RT-PCR results are available. It may play an important role in assisting the identification of SARS-CoV-2 infected patients in areas where RT-PCR testing is not accessible due to financial or supply constraints.

Article activity feed

  1. SciScore for 10.1101/2020.06.17.20133892:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: IRB: This study was approved by the Institutional Review Board (#20-03021671) of Weill Cornell Medicine.
    Randomization: The first setting was a 5-fold cross validation with the NYPH/WCM data, where all RT-PCR tests were randomly partitioned into 5 equal buckets with the same positive/negative ratio in each bucket as the ratio over all tests.
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: not detected.
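The randomization entry in Table 1 describes a stratified 5-fold partition. The sketch below illustrates that idea in plain Python under stated assumptions: the function name `stratified_folds` is hypothetical, the round-robin assignment is one of several ways to preserve the class ratio, and the label counts in the usage lines are the patient counts from the abstract, used only as an illustration (the partition described above was over RT-PCR tests). scikit-learn provides `StratifiedKFold` in `sklearn.model_selection` for the same purpose; whether the authors used it is not stated here.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Split indices into n_folds buckets that each keep roughly the overall class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)  # random assignment within each class
        for pos, i in enumerate(idx):
            # Deal indices round-robin; continuing from the previous class's
            # offset keeps the total fold sizes within one sample of each other.
            folds[(offset + pos) % n_folds].append(i)
        offset += len(idx)
    return folds


# Illustration with the class counts from the abstract (1,402 positive, 1,954 negative).
example_labels = [1] * 1402 + [0] * 1954
example_folds = stratified_folds(example_labels)
positives_per_fold = [sum(example_labels[i] for i in fold) for fold in example_folds]
```

In a cross-validation loop, each bucket would serve once as the held-out set while the remaining four are used for training.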

    Table 2: Resources

    Software and Algorithms
    Sentence: At NYPH/LMH, Routine chemistry testing including procalcitonin was performed on Abbott ARCHITECT® c SYSTEM ci 4100 and ci 8200 analyzers.
    Resource: Abbott
    Suggested: (Abbott, RRID:SCR_010477)

    Sentence: The implementation was based on scikit-learn package 0.23.1(22) with the sklearn.model_selection.
    Resource: scikit-learn
    Suggested: (scikit-learn, RRID:SCR_002577)

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    There are three potential limitations to the use of this model. First, the model was trained on a dataset generated from a patient cohort who were in the hospital for moderate to life-threatening presentations of COVID-19. Thus, this model may not be applicable to mild COVID-19 cases. Second, the model was developed with a “control group” of ill patients in a metropolitan hospital for other causes. Thus, the model may need further refinement with different populations such as patients seen in a primary care office. Third, clinical application of the proposed model may require modification of laboratory testing practice to include tests that are not currently part of the institutional COVID-like illness (CLI) laboratory test panel. Generally speaking, an ideal training set for a learning-based approach should cover the variability of samples across different demographic and geographic distributions, as well as comorbidities, facilities (e.g. ED, inpatients, out-patient clinics) and to follow their changes over time. In practice, any training set collected within a fixed time period cannot satisfy all these wishes. The deployment of software in medical scenarios cannot be achieved by one stop. It is a continuous learning process that involves model monitoring, updating and customization. The US Food and Drug Administration (FDA) published a white paper (33) last year particularly discussing how to properly regulate the adaptations/modifications of AI/machine learning models as ...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.