Metaviromic identification of genetic hotspots of coronavirus pathogenicity using machine learning
This article has been reviewed by the following groups
Listed in
- Evaluated articles (ScreenIT)
Abstract
The COVID-19 pandemic caused by SARS-CoV-2 has become a major threat across the globe. Here, we developed machine learning approaches to identify key pathogenic regions in coronavirus genomes. We trained and evaluated 7,562,625 models on 3,665 genomes including SARS-CoV-2, MERS-CoV, SARS-CoV and other coronaviruses of human and animal origins to return quantitative and biologically interpretable signatures at nucleotide and amino acid resolutions. We identified hotspots across the SARS-CoV-2 genome including previously unappreciated features in spike, RdRp and other proteins. Finally, we integrated pathogenicity genomic profiles with B cell and T cell epitope predictions for enrichment of sequence targets to help guide vaccine development. These results provide a systematic map of predicted pathogenicity in SARS-CoV-2 that incorporates sequence, structural and immunological features, providing an unbiased collection of genetic elements for functional studies. This metavirome-based framework can also be applied for rapid characterization of new coronavirus strains or emerging pathogenic viruses.
Article activity feed
SciScore for 10.1101/2020.08.13.248575:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Software and Algorithms
Sentences | Resources
Sequence data collection: A total of 3,665 complete nucleotide genomes of the “Coronaviridae” family were downloaded from the Virus Pathogen Database and Analysis Resource (ViPR) database [5] to be used for machine learning algorithm training. | ViPR, suggested: (vipR, RRID:SCR_010685)
FASTA sequences for S protein (YP_009724390), E protein (YP_009724392), M protein (YP_009724393), N protein (YP_009724397), NSP3 (YP_009742610), NSP5 (YP_009742612), NSP8 (YP_009742615), NSP9 (YP_009742616), and NSP12 (YP_009725307) were obtained from the NCBI Protein database and used for downstream evolutionary and immune epitope analyses. | NCBI Protein, suggested: (NCBI Protein, RRID:SCR_003257)
Genetic features including nucleotides and gaps for a given window were converted to binary vector representations using LabelEncoder and OneHotEncoder from the Python scikit-learn library [31]. | Python, suggested: (IPython, RRID:SCR_001658)
Additional Python libraries used include BioPython [32], NumPy [33], and pandas [34]. | BioPython, suggested: (Biopython, RRID:SCR_007173); NumPy, suggested: (NumPy, RRID:SCR_008633)
Five supervised learning classifiers from scikit-learn were used for training and evaluation, with seeds set at 17 for algorithms that use a random number generator. | scikit-learn, suggested: (scikit-learn, RRID:SCR_002577)
Evolutionary analyses: Protein sequences used for evolutionary analyses were aligned using MAFFT version 7 with the “L-INS-i” strategy [30]. | MAFFT, suggested: (MAFFT, RRID:SCR_011811)
Alignments were visualized using Jalview 2.11.1.0 [35]. | Jalview, suggested: (Jalview, RRID:SCR_006459)
Phylogenic analyses were performed using MEGA10.1.8 software [36]. | MEGA10.1.8, suggested: None
Results from OddPub: Thank you for sharing your data.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- No funding statement was detected.
- No protocol registration statement was detected.
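The encoding step quoted in the resources table (nucleotides and gaps in a given window converted to binary vectors with LabelEncoder and OneHotEncoder) can be illustrated with a small sketch. This is not the authors' code: it is a dependency-free stand-in for the two scikit-learn encoders, and the alphabet `ACGT-`, the function names, and the toy sequences are all assumptions made for illustration.

```python
# Hypothetical sketch: one-hot encode a window of aligned nucleotide sequences.
# The alphabet, function names, and toy sequences are assumptions, not from the paper.

ALPHABET = "ACGT-"  # four nucleotides plus the alignment gap character


def label_encode(window, alphabet=ALPHABET):
    """Map each character to an integer label (the role LabelEncoder plays)."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    return [index[symbol] for symbol in window.upper()]


def one_hot_encode(window, alphabet=ALPHABET):
    """Flatten a window into a binary vector of len(window) * len(alphabet) entries."""
    vector = []
    for label in label_encode(window, alphabet):
        bits = [0] * len(alphabet)
        bits[label] = 1  # exactly one position is set per character
        vector.extend(bits)
    return vector


def encode_alignment_window(sequences, start, width):
    """Encode the same window [start, start + width) from every aligned sequence."""
    return [one_hot_encode(seq[start:start + width]) for seq in sequences]


# Toy aligned sequences (made up, not taken from ViPR).
aligned = ["ACG-T", "ACGTT", "A-GTA"]
features = encode_alignment_window(aligned, start=1, width=3)
for row in features:
    print(row)
```

Each row is a feature vector for one genome at one window, which is the kind of input a supervised classifier could be trained on. Characters outside the assumed alphabet (for example the ambiguity code `N`) raise a `KeyError` in this sketch; how the original pipeline handled them is not stated in the quoted text.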