Identifying and prioritizing potential human-infecting viruses from their genome sequences



Abstract

Determining which animal viruses may be capable of infecting humans is currently intractable at the time of their discovery, precluding prioritization of high-risk viruses for early investigation and outbreak preparedness. Given the increasing use of genomics in virus discovery and the otherwise sparse knowledge of the biology of newly discovered viruses, we developed machine learning models that identify candidate zoonoses solely using signatures of host range encoded in viral genomes. Within a dataset of 861 viral species with known zoonotic status, our approach outperformed models based on the phylogenetic relatedness of viruses to known human-infecting viruses (area under the receiver operating characteristic curve [AUC] = 0.773), distinguishing high-risk viruses within families that contain a minority of human-infecting species and identifying putatively undetected or so far unrealized zoonoses. Analyses of the underpinnings of model predictions suggested the existence of generalizable features of viral genomes that are independent of virus taxonomic relationships and that may preadapt viruses to infect humans. Our model reduced a second set of 645 animal-associated viruses that were excluded from training to 272 high and 41 very high-risk candidate zoonoses and showed significantly elevated predicted zoonotic risk in viruses from nonhuman primates, but not other mammalian or avian host groups. A second application showed that our models could have identified Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) as a relatively high-risk coronavirus strain and that this prediction required no prior knowledge of zoonotic Severe Acute Respiratory Syndrome (SARS)-related coronaviruses. Genome-based zoonotic risk assessment provides a rapid, low-cost approach to enable evidence-driven virus surveillance and increases the feasibility of downstream biological and ecological characterization of viruses.
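The AUC values quoted above can be read as the probability that a randomly chosen human-infecting virus receives a higher predicted score than a randomly chosen virus not known to infect humans. The sketch below illustrates that general definition only; it is not the authors' evaluation code, and the function name `roc_auc` and the toy labels and scores are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive example has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data (hypothetical): 1 = known to infect humans, 0 = not reported to.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.6, 0.1]
    print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the 0.773 reported in the abstract should be read.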

Article activity feed

  1. SciScore for 10.1101/2020.11.12.379917:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentence: A representative genome was selected for each virus species, giving preference to sequences from the RefSeq database wherever
    Resource: RefSeq
    Suggested: (RefSeq, RRID:SCR_003496)

    Sentence: All BLAST matches with e-value ≤ 0.001 were retained and used to calculate the proportion of human-infecting viruses in the phylogenetic neighbourhood of each virus (excluding the current species).
    Resource: BLAST
    Suggested: (BLASTX, RRID:SCR_001653)

    Sentence: For each gene, the sequence of the canonical transcript was obtained from version 96 of Ensembl (21).
    Resource: Ensembl
    Suggested: (Ensembl, RRID:SCR_002344)
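The "phylogenetic neighbourhood" calculation quoted in the BLAST entry above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `BlastHit` record, the function name, and the choice to count each neighbouring species once (rather than weighting by number of hits) are all assumptions. Only the e-value cutoff of 0.001 and the exclusion of the current species come from the quoted sentence.

```python
from typing import NamedTuple, Optional


class BlastHit(NamedTuple):
    """Hypothetical, simplified record of one BLAST match."""
    query_species: str
    subject_species: str
    evalue: float


def neighbourhood_proportion(hits, human_infecting, query,
                             max_evalue=1e-3) -> Optional[float]:
    """Proportion of human-infecting species among the retained matches
    of `query`, excluding the query species itself.

    Assumption: each neighbouring species is counted once.
    Returns None when no match passes the e-value filter.
    """
    neighbours = {
        h.subject_species
        for h in hits
        if h.query_species == query
        and h.evalue <= max_evalue
        and h.subject_species != query
    }
    if not neighbours:
        return None
    return sum(1 for s in neighbours if s in human_infecting) / len(neighbours)


if __name__ == "__main__":
    # Toy data with placeholder species names.
    hits = [
        BlastHit("A", "A", 0.0),    # self-match, excluded
        BlastHit("A", "B", 1e-10),  # retained
        BlastHit("A", "C", 1e-5),   # retained
        BlastHit("A", "D", 0.5),    # fails the e-value cutoff
    ]
    print(neighbourhood_proportion(hits, {"B", "D"}, "A"))
```

Excluding the query species keeps a virus's own label from leaking into its neighbourhood score.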

    Results from OddPub: Thank you for sharing your code and data.


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Importantly, given diagnostic limitations and the likelihood that not all viruses capable of human infection have had opportunities to emerge, viruses not reported to infect humans may represent unrealized or undocumented zoonoses or genuinely non-zoonotic species. Identifying these potential zoonoses was an a priori goal of our analysis. We first evaluated whether evolutionary proximity to human-infecting viruses predictably elevates zoonotic risk. Gradient boosted machine (GBM) classifiers trained on virus taxonomy or the frequency of human-infecting viruses among close relatives (“phylogenetic neighbourhood” (14)) outperformed chance (median area under the receiver-operating characteristic curve [AUCm] = 0.604 and 0.558, respectively), but were no better than simply ranking novel viruses by the proportion of human-infecting viruses in each family (“taxonomy-based heuristic”, AUCm = 0.596, fig. 1A), indicating the inability of these relatedness-based models to distinguish risk at scales below the viral family level. We next quantified the performance of GBMs trained on genome composition (i.e., codon usage biases, amino acid biases and dinucleotide biases), calculated either directly from viral genomes (“viral genomic features”) or based on similarity to three alternative sets of human gene transcripts (“human similarity features”): interferon-stimulated genes (ISGs), housekeeping genes, and all other genes. We hypothesized that viruses might optimally resemble ISGs since b...
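As an illustration of one of the genome composition features named in the excerpt above, dinucleotide bias is commonly defined as the observed frequency of a dinucleotide divided by the frequency expected from its two constituent bases. The sketch below uses that common definition; whether the authors computed it exactly this way is not stated in the excerpt, and the function name and handling of ambiguous bases are assumptions.

```python
from collections import Counter
from itertools import product


def dinucleotide_bias(seq):
    """Observed/expected ratio for each of the 16 dinucleotides.

    Expected frequency of XY is f(X) * f(Y), from single-base frequencies.
    Assumption: positions with bases other than A, C, G, T are skipped.
    The ratio is NaN when the expected frequency is zero.
    """
    seq = seq.upper()
    bases = "ACGT"
    mono = Counter(b for b in seq if b in bases)
    pairs = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in bases and seq[i + 1] in bases
    )
    n_mono = sum(mono.values())
    n_pairs = sum(pairs.values())
    result = {}
    for x, y in product(bases, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = pairs[x + y] / n_pairs if n_pairs else 0.0
        result[x + y] = observed / expected if expected > 0 else float("nan")
    return result


if __name__ == "__main__":
    bias = dinucleotide_bias("ATGCGCGATTACGCGTTA")  # toy sequence
    print(round(bias["CG"], 3), round(bias["TA"], 3))
```

Ratios well below 1 indicate a dinucleotide that is under-represented relative to base composition, and ratios above 1 indicate over-representation.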

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy-to-digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.