Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

This article has been Reviewed by the following groups

Read the full article See related articles

Abstract

No abstract available

Article activity feed

  1. SciScore for 10.1101/2020.02.03.932350: (What is this?)

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    SentencesResources
    The Wuhan seafood market pneumonia virus (COVID-19 virus) isolate Wuhan-Hu-1 complete reference genome of 29903 bp was downloaded from the NCBI database on January 23, 2020.
    NCBI
    suggested: (NCBI, RRID:SCR_006472)
    Virus-Host DB covers the sequences from the NCBI RefSeq (release 96, September 9, 2019), and GenBank (release 233.0, August 15, 2019).
    RefSeq
    suggested: (RefSeq, RRID:SCR_003496)

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Alignment-free methods have been used successfully in the past to address the limitations of the alignment-based methods [49–52]. The alignment-free approach is quick and can handle a large number of sequences. Moreover, even the sequences coming from different regions with different compositions can be easily compared quantitatively, with equally meaningful results as when comparing homologous/similar sequences. We use MLDSP-GUI (a variant of MLDSP with additional features), a machine learning-based alignment-free method successfully used in the past for sequence comparisons and analyses [51]. The main advantage alignment-free methodology offers is the ability to analyze large datasets rapidly. In this study we confirm the taxonomy of COVID-19 and, more generally, propose a method to efficiently analyze and classify a novel unclassified DNA sequence against the background of a large dataset. We namely use a “decision tree” approach (paralleling taxonomic ranks), and start with the highest taxonomic level, train the classification models on the available complete genomes, test the novel unknown sequences to predict the label among the labels of the training dataset, move to the next taxonomic level, and repeat the whole process down to the lowest taxonomic label. Test-1 starts at the highest available level and classifies the viral sequences to the 11 families and Riboviria realm (Table 1). There is only one realm available in the viral taxonomy, so all of the families that b...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • No conflict of interest statement was detected. If there are no conflicts, we encourage authors to explicit state so.
    • No funding statement was detected.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.