Alignment-free machine learning approaches for the lethality prediction of potential novel human-adapted coronavirus using genomic nucleotide

This article has been reviewed by the following groups

Abstract

A novel coronavirus emerged and spread rapidly worldwide, and the World Health Organization declared a pandemic on March 11, 2020. The roles and characteristics of coronaviruses have attracted much attention because they can cause a wide range of infectious diseases in humans, from mild to severe. Determining the lethality of a human coronavirus is key to estimating its viral toxicity and informing treatment. We developed alignment-free machine learning approaches for ultra-fast and highly accurate prediction of the lethality of potential human-adapted coronaviruses using genomic nucleotide sequences. We performed extensive experiments using six different feature transformation and machine learning algorithms, in combination with digital signal processing, to infer the lethality of possible future novel coronaviruses from previously existing strains. Tests on SARS-CoV, MERS-CoV and SARS-CoV-2 datasets show an average prediction accuracy of 96.7%. We also provide a preliminary analysis validating the effectiveness of our models on other human coronaviruses. Our study achieves high prediction performance based on raw RNA sequences alone, without genome annotations or specialized biological knowledge. The results demonstrate that, for any novel human coronavirus strain, this alignment-free machine learning-based approach can offer a reliable real-time estimate of its viral lethality.
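
To make the idea of an alignment-free, signal-processing representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple integer encoding of nucleotides (the hypothetical `NUCLEOTIDE_VALUES` table) and uses the first few discrete Fourier transform magnitudes as a fixed-length feature vector. The encoding, the function names, and the truncation to `size` coefficients are all illustrative assumptions; the abstract does not specify the six transformations that were actually used.

```python
import cmath
import math

# Hypothetical numeric encoding; the paper's actual encodings are not given here.
NUCLEOTIDE_VALUES = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def encode(sequence):
    """Map a nucleotide string to numbers, treating U as T and skipping ambiguous bases."""
    seq = sequence.upper().replace("U", "T")
    return [NUCLEOTIDE_VALUES[base] for base in seq if base in NUCLEOTIDE_VALUES]


def magnitude_spectrum(signal):
    """Naive O(n^2) discrete Fourier transform, returning one magnitude per frequency."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def features(sequence, size=8):
    """Fixed-length vector: the first `size` DFT magnitudes, normalized by sequence length."""
    signal = encode(sequence)
    if not signal:
        return [0.0] * size
    spectrum = magnitude_spectrum(signal)
    padded = (spectrum + [0.0] * size)[:size]  # zero-pad short sequences, truncate long ones
    return [magnitude / len(signal) for magnitude in padded]


if __name__ == "__main__":
    # Toy sequences only; no alignment between them is needed to compare the vectors.
    for toy in ("ACGTACGTAC", "AAAACCCCGGGGTTTT"):
        print(toy, [round(value, 3) for value in features(toy, size=4)])
```

Because every sequence maps to a vector of the same length regardless of genome size, such vectors could be passed directly to a standard classifier without any sequence alignment. For full-length genomes (tens of thousands of nucleotides) the naive transform above would be slow, and an FFT routine from a numerical library would be the usual substitute.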

Article activity feed

  1. SciScore for 10.1101/2020.07.15.176933:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Experimental Models: Cell Lines
    Sentences | Resources
    Some SARS-CoV strains from the laboratory are included that are cultivated in Vero cell cultures to enrich the training samples. | Vero (suggested: CLS Cat# 605372/p622_VERO, RRID:CVCL_0059)
    Software and Algorithms
    Sentences | Resources
    The CNN models contain AlexNet [48], VGG [49] and ResNet [50]. | ResNet (suggested: RESNET, RRID:SCR_002121)
    Implementation and evaluation: We implement all the models by Scikit-learn [51] and PyTorch [52]. | Scikit-learn (suggested: scikit-learn, RRID:SCR_002577)

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    This study is subject to a variety of limitations. The classification of the degree of coronavirus lethality is based mainly on the mortality rate. We assume that the higher the mortality, the more lethal the virus, and thus define three categories of lethality level for all viruses, each with a different threshold. However, our estimates for these values lie within the range of fatality rates reported in the literature, as we do not have sufficient data to parameterize a case-structured model, especially for viruses with few samples. We also do not build a benchmark for deaths caused directly by human coronaviruses, as the criteria used by institutions and countries could differ. Besides, the limited number of data points for human coronaviruses tempers the high predictive accuracy, as most machine learning algorithms possess a superb generalization ability to discover inherent patterns in training samples, particularly in small datasets. But like typical machine learning approaches, our models cannot provide a direct and accessible explanation of why a certain coronavirus strain is more lethal to humans. Rule-based methods or clinical studies might provide a better rationale for their results.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.