A Genotype-to-Phenotype Modeling Framework to Predict Human Pathogenicity of Novel Coronaviruses
This article has been reviewed by the following groups
Listed in
- Evaluated articles (ScreenIT)
Abstract
Leveraging prior viral genome sequencing data to predict whether an unknown, emergent virus harbors a ‘phenotype-of-concern’ has been a long-sought goal of genomic epidemiology. A predictive phenotype model built from nucleotide-level information alone has previously been considered untenable for RNA viruses because of the ultra-high intra-sequence variance of their genomes, even within closely related clades. Building on our prior work developing a degenerate k-mer method to accommodate this high intra-sequence variation of RNA virus genomes for modeling frameworks, and leveraging a taxonomic ‘group-shuffle-split’ paradigm on complete coronavirus assemblies from prior to October 2018, we trained multiple regularized logistic regression classifiers at the nucleotide k-mer level capable of accurately predicting withheld SARS-CoV-2 genome sequences as human pathogens and accurately predicting withheld Swine Acute Diarrhea Syndrome coronavirus (SADS-CoV) genome sequences as non-human pathogens. LASSO feature selection identified several degenerate nucleotide predictor motifs with high model coefficients for the human pathogen class that were present across widely disparate classes of coronaviruses. However, these motifs differed in which genes they were present in, what specific codons were used to encode them, and what the translated amino acid motif was. This emphasizes the importance of a phenetic view of emerging pathogenic RNA viruses, as opposed to the canonical phylogenetic interpretations most commonly used to track and manage viral zoonoses. Applying our model to more recent Orthocoronavirinae genomes deposited since October 2018 yields a novel contextual view of pathogen potential across bat-related, canine-related, porcine-related, and rodent-related coronaviruses and of critical adaptations which may have contributed to the emergence of the pandemic SARS-CoV-2 virus.
Finally, we discuss the utility of these predictive models (and their associated predictor motifs) to novel biosurveillance protocols that substantially increase the ‘pound-for-pound’ information content of field-collected sequencing data and make a strong argument for the necessity of routine collection and sequencing of zoonotic viruses.
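The two ideas named in the abstract, degenerate k-mer features and a taxonomic group-shuffle-split, can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes that a degenerate motif is written with standard IUPAC ambiguity codes and counted by sliding-window matching, and that each genome carries a taxonomic group label. The function names (`count_motif`, `featurize`, `group_shuffle_split`) and the toy sequences are hypothetical. A regularized (for example, L1-penalized) logistic regression would then be fit on the resulting count vectors; that step is omitted here.

```python
import random
from collections import defaultdict

# Standard IUPAC nucleotide ambiguity codes (cDNA alphabet, T rather than U).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def motif_matches(motif, kmer):
    """True if every base of kmer is allowed by the degenerate motif."""
    return len(motif) == len(kmer) and all(
        base in IUPAC[code] for code, base in zip(motif, kmer)
    )

def count_motif(motif, sequence):
    """Count sliding-window occurrences of a degenerate motif in a sequence."""
    k = len(motif)
    return sum(
        motif_matches(motif, sequence[i:i + k])
        for i in range(len(sequence) - k + 1)
    )

def featurize(sequence, motifs):
    """One count per motif: a feature vector for a downstream classifier."""
    return [count_motif(m, sequence) for m in motifs]

def group_shuffle_split(groups, test_fraction=0.25, seed=0):
    """Hold out whole groups, so no taxon appears in both train and test.

    Assumption: `groups` holds one taxonomic label per genome.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    names = sorted(by_group)
    random.Random(seed).shuffle(names)
    n_test = max(1, round(test_fraction * len(names)))
    test_groups = set(names[:n_test])
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test

# Toy usage with made-up sequences and group labels.
motifs = ["ARG", "NNN"]
features = featurize("AAGAGGACG", motifs)
train_idx, test_idx = group_shuffle_split(["a", "a", "b", "b", "c", "d"])
```

Splitting by group rather than by individual genome is what makes the held-out evaluation meaningful in this setting: closely related genomes never straddle the train/test boundary, so the classifier is scored on taxa it has not seen.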
Article activity feed
SciScore for 10.1101/2021.09.18.460926: (What is this?)
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Software and Algorithms
Sentences: The test set was completed by adding the RefSeq SARS-CoV-2 reference sequence as well as WA1, to provide representative diversity of sequences across the duration of the COVID-19 pandemic.
Resources: RefSeq; suggested: (RefSeq, RRID:SCR_003496)
Results from OddPub: Thank you for sharing your code and data.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- No funding statement was detected.
- No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.