Semi-supervised identification of SARS-CoV-2 molecular targets
This article has been reviewed by the following groups
Listed in
- Evaluated articles (ScreenIT)
Abstract
SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. In this work, we analyzed a corpus of 66,000 SARS-CoV-2 genome sequences. We developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on a single reference genome and by overcoming atypical genome traits. Using this method, we identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction compared to proteome references, including Replicase polyprotein 1ab (with its transcriptional slippage site). Compared to other published tools such as Prokka (base) and VAPiD, we obtained a 6.4- and 1.8-fold increase in protein annotations, respectively. Our method generated 13,000,000 molecular target sequences, some conserved across time and geography while others represent emerging variants. We observed 3,362 non-redundant sequences per protein on average within this corpus and describe the key D614G and N501Y variants spatiotemporally. For spike glycoprotein domains, we achieved greater than 97.9% sequence identity to references and characterized Receptor Binding Domain variants. Here, we comprehensively present the molecular targets to refine biomedical interventions for SARS-CoV-2 with a scalable, high-accuracy method to analyze newly sequenced infections.
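The abstract quotes a set membership accuracy and a length prediction accuracy but this page does not define either metric. The sketch below shows one plausible way such figures could be computed for a single genome. It assumes that set membership accuracy is the fraction of expected reference proteins found among the predictions, and that length accuracy is one minus the relative length error. Both definitions, the function names, and the toy predictions are assumptions for illustration, not the authors' implementation.

```python
from statistics import mean


def set_membership_accuracy(predicted, expected):
    """Fraction of expected protein names present among the predictions (assumed definition)."""
    expected = set(expected)
    if not expected:
        raise ValueError("expected protein set is empty")
    return len(expected & set(predicted)) / len(expected)


def length_accuracy(predicted_len, reference_len):
    """One minus the relative length error, floored at zero (assumed definition)."""
    if reference_len <= 0:
        raise ValueError("reference length must be positive")
    return max(0.0, 1.0 - abs(predicted_len - reference_len) / reference_len)


# Hypothetical toy input: protein name -> length in amino acids.
reference_lengths = {"S": 1273, "N": 419, "M": 222, "E": 75}
predicted_lengths = {"S": 1273, "N": 419, "M": 220}  # E missing, M slightly short

membership = set_membership_accuracy(predicted_lengths, reference_lengths)

# Length accuracy is only defined for proteins that were actually predicted.
length_scores = [
    length_accuracy(predicted_lengths[name], reference_lengths[name])
    for name in predicted_lengths
    if name in reference_lengths
]
mean_length = mean(length_scores)

print(f"set membership accuracy: {membership:.3f}")
print(f"mean length accuracy:    {mean_length:.3f}")
```

Averaging these per-genome values over a corpus would give summary figures of the kind quoted above, though the authors may weight or aggregate differently.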
Article activity feed
SciScore for 10.1101/2021.05.03.440524:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Software and Algorithms (sentence from the manuscript, followed by the detected resource)
- Sentence: In the sub-sections below we describe, in detail, the key modifications of Prokka v1.14.5 (8) for improved unsupervised annotation of SARS-CoV-2 genomes (Section 4.2.1) and the addition of custom-built supervised algorithms to improve identification of specific proteins that were unable to be detected using the base implementation (Section 4.2.2).
  Resource: Prokka (suggested: Prokka, RRID:SCR_014732)
- Sentence: This version of InterProScan contains a number of InterPro, Gene Ontology and Pathway codes specific to the SARS-CoV-2 proteome and reference data.
  Resource: InterProScan (suggested: InterProScan, RRID:SCR_005829)
- Sentence: 4.4 Comparative analysis: To compare our method against other published viral genome annotation tools, VAPiD (v1.2 with Python3) was run on a set of 100 randomly selected SARS-CoV-2 genomes above quality control thresholds previously defined in Section 4.1 using the following parameters: reference (--r) NC_045512.2
  Resource: Python3 (suggested: None)
- Sentence: Protein names and sequences were extracted from VAPiD output files using BioPython’s parser.
  Resource: BioPython’s (suggested: None)
- Sentence: Protein annotations were evaluated against the SARS-CoV-2 proteome reference sequences indicated in ViralZone, SIB Swiss Institute of Bioinformatics (22) for complete protein set membership per genome, sequence length, and sequence similarity to known references indicated in NCBI UniProt (21).
  Resource: ViralZone (suggested: ViralZone, RRID:SCR_006563)
- Sentence: For domain accuracy comparative analysis, our predicted domains identified in spike glycoprotein (S protein) were analyzed for set membership completeness against the expected InterPro domain architecture for UniProt reference sequence P0DTC2 (https://www.ebi.ac.uk/interpro/protein/reviewed/P0DTC2/).
  Resource: InterPro (suggested: InterPro, RRID:SCR_006695)
- Sentence: e Python SDK, and Docker container) or web interface, which can be accessed by requesting credentials at the link above.
  Resource: Python (suggested: IPython, RRID:SCR_001658)
Results from OddPub: Thank you for sharing your code.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We found bar graphs of continuous data. We recommend replacing bar graphs with more informative graphics, as many different datasets can lead to the same bar graph. The actual data may suggest different conclusions from the summary statistics. For more information, please see Weissgerber et al (2015).
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- No funding statement was detected.
- No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.