Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning
This article has been reviewed by the following groups
Listed in
- Evaluated articles (ScreenIT)
Abstract
The COVID-19 pandemic has demonstrated the serious potential for novel zoonotic coronaviruses to emerge and cause major outbreaks. The immediate animal origin of the causative virus, SARS-CoV-2, remains unknown; identifying such origins is a notoriously challenging task for emerging disease investigations. Coevolution with hosts leads to specific evolutionary signatures within viral genomes that can inform likely animal origins. We obtained a set of 650 spike protein and 511 whole genome nucleotide sequences from 222 and 185 viruses, respectively, belonging to the family Coronaviridae. We then trained random forest models independently on genome composition biases of spike protein and whole genome sequences, including dinucleotide and codon usage biases, in order to predict animal host (of nine possible categories, including human). In hold-one-out cross-validation, predictive accuracy on unseen coronaviruses consistently reached ~73%, indicating that the evolutionary signal in spike proteins is just as informative as that in whole genome sequences. However, different composition biases were informative in each case. Applying optimised random forest models to classify human sequences of MERS-CoV and SARS-CoV revealed evolutionary signatures consistent with their recognised intermediate hosts (camelids, carnivores), while human sequences of SARS-CoV-2 were predicted as having bat hosts (suborder Yinpterochiroptera), supporting bats as the suspected origin of the current pandemic. In addition to phylogeny, variation in genome composition can act as an informative approach to predict emerging virus traits as soon as sequences are available. More widely, this work demonstrates the potential in combining genetic resources with machine learning algorithms to address long-standing challenges in emerging infectious diseases.
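To make the feature construction in the abstract more concrete, the following is a minimal sketch of one common way to compute dinucleotide bias (observed frequency divided by the product of the two single-base frequencies), together with a simple hold-one-out split. The abstract does not give the exact formulas or the unit that is held out, so the odds-ratio definition, the choice of a virus as the held-out unit, and the names `dinucleotide_bias` and `hold_one_virus_out` are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_bias(seq):
    """Return an observed/expected ratio for each of the 16 dinucleotides.

    Assumption: expected frequency is f(X) * f(Y), i.e. independent bases.
    This is one common definition; the source does not state its formula.
    """
    seq = seq.upper()
    # Count single bases and overlapping pairs, skipping ambiguous symbols.
    base_counts = Counter(b for b in seq if b in BASES)
    pair_counts = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_bases = sum(base_counts.values())
    n_pairs = sum(pair_counts.values())

    bias = {}
    for x, y in product(BASES, repeat=2):
        if n_bases == 0 or n_pairs == 0:
            bias[x + y] = None
            continue
        expected = (base_counts[x] / n_bases) * (base_counts[y] / n_bases)
        observed = pair_counts[x + y] / n_pairs
        # The ratio is undefined when either base is absent from the sequence.
        bias[x + y] = observed / expected if expected > 0 else None
    return bias


def hold_one_virus_out(virus_labels):
    """Yield (held_out_virus, train_indices, test_indices) for each virus.

    Assumption: all sequences of one virus are held out together, so the
    model is always evaluated on a virus it has not seen.
    """
    for virus in sorted(set(virus_labels)):
        test = [i for i, v in enumerate(virus_labels) if v == virus]
        train = [i for i, v in enumerate(virus_labels) if v != virus]
        yield virus, train, test
```

In a pipeline of this kind, each sequence's 16 ratios (plus codon usage features, not shown here) would form one row of the feature matrix passed to a random forest classifier, and accuracy would be averaged over the held-out groups.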
Article activity feed
SciScore for 10.1101/2020.11.02.350439:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
- Institutional Review Board Statement: not detected.
- Randomization: not detected.
- Blinding: not detected.
- Power Analysis: not detected.
- Sex as a biological variable: not detected.
Table 2: Resources
Software and Algorithms
- Sentence: (spike[Title] OR “S gene”[Title] OR “S protein”[Title] OR “S glycoprotein”[Title] OR “S1 gene”[Title] OR “S1 protein”[Title] OR “S1 glycoprotein”[Title] OR peplomer[Title] OR peplomeric[Title] OR peplomers[Title] OR “complete genome”[Title]) NOT (patent[Title] OR vaccine OR artificial OR construct OR recombinant[Title])’ where successive searches were conducted replacing ### with taxonomic identifiers for each species and unranked sub-species belonging to the family Coronaviridae within the NCBI taxonomy database (Federhen, 2012) (n = 1585 taxonomic ids total).
- Resource: NCBI, suggested: (NCBI, RRID:SCR_006472)
Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- No funding statement was detected.
- No protocol registration statement was detected.