In silico approach toward the identification of unique peptides from viral protein infection: Application to COVID-19

Abstract

We describe a method for rapid in silico selection of diagnostic peptides from newly described viral pathogens and apply this approach to SARS-CoV-2/COVID-19. The approach is multi-tiered, beginning with compiling the theoretical protein sequences from genome-derived data. In the case of SARS-CoV-2, we begin with 496 peptides that would be produced by proteolytic digestion of the viral proteins. To eliminate peptides that would cause cross-reactivity and false positives, we remove peptides that share sequence homology or similar chemical characteristics with entries in a progressively larger database of background peptides. Using this pipeline, we remove 47 peptides from consideration as diagnostic because matching peptides are derived from the human proteome. To address the complexity of the human microbiome, we describe a method to create a database of all proteins of relevant abundance in the saliva microbiome. Taking a protein-based approach to the microbiome lets us more accurately identify peptides that will be problematic in COVID-19 studies, and removes 12 peptides from consideration. To identify diagnostic peptides, another 7 peptides are flagged for removal following comparison to the proteome backgrounds of viral and bacterial pathogens of similar clinical presentation. By aligning the protein sequences of SARS-CoV-2 field isolates deposited to date, we identify peptides for removal because they lie in highly variable regions that may lead to false negatives as the pathogen evolves. We provide maps of these regions and highlight 3 peptides that should be avoided as potential diagnostic or vaccine targets. Finally, we leverage publicly deposited proteomics data from human cells infected with SARS-CoV-2, as well as a second study with the closely related MERS-CoV, to identify the two proteins of highest abundance in human infections.
The resulting final list contains the 24 peptides most unique and diagnostic of SARS-CoV-2 infections. These peptides represent the best targets for the development of antibodies or clinical diagnostics. To demonstrate one application, we model peptide fragmentation using a deep learning tool to rapidly generate targeted LC-MS assays and data processing methods for detecting COVID-19-infected patient samples.
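
The first two tiers of the pipeline (theoretical digestion, then removal of peptides shared with a background proteome) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the protease or the matching rules, so the sketch assumes trypsin (cleavage after K or R unless the next residue is P), treats isoleucine and leucine as equivalent because they have the same mass, and uses exact matching in place of a full homology search. The function names, length limits, and toy sequences are hypothetical.

```python
def tryptic_digest(protein, min_len=6, max_len=30):
    """Cleave after K or R unless the next residue is P (assumed trypsin rule).

    Returns fully cleaved peptides whose length is within [min_len, max_len].
    The length limits are illustrative defaults, not values from the article.
    """
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return [p for p in peptides if min_len <= len(p) <= max_len]


def normalize(peptide):
    """Collapse I and L, which have identical mass (assumed matching rule)."""
    return peptide.replace("I", "L")


def unique_peptides(target_proteins, background_proteins, **digest_args):
    """Keep target peptides with no I/L-equivalent match in the background."""
    background = {
        normalize(pep)
        for protein in background_proteins
        for pep in tryptic_digest(protein, **digest_args)
    }
    seen, kept = set(), []
    for protein in target_proteins:
        for pep in tryptic_digest(protein, **digest_args):
            if pep not in seen and normalize(pep) not in background:
                seen.add(pep)
                kept.append(pep)
    return kept


# Toy, made-up sequences (not real viral or human proteins).
viral = ["MKTAYIAKQRPEPTIDEKSAMPLER"]
human = ["AAKTAYLAKGGR"]
candidates = unique_peptides(viral, human)
print(candidates)
```

In this toy case the viral peptide TAYIAK is dropped because the background contains TAYLAK, which differs only by an I/L swap. Repeating the same filter with larger backgrounds (microbiome, then pathogens of similar clinical presentation) mirrors the progressive narrowing described above.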

Article activity feed

  1. SciScore for 10.1101/2020.03.08.980383:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences    Resources
    All sequences in this study were obtained from NCBI accession: txid2697049 (https://www.ncbi.nlm.nih.gov/protein/?term=txid2697049).
    NCBI
    suggested: (NCBI, RRID:SCR_006472)
    Using Proteome Discoverer 2.4 (Thermo), the protein sequences were combined into a single protein FASTA database (2019-nCOVpFASTA1; Supplemental Information), and added to human proteome sequences (UniProt SwissProt Human database; downloaded 2/15/2020) to produce a database including both human and COVID-19 protein sequences (Human_plus_2019-nCOVpFASTA2; Supplemental Information).
    Proteome Discoverer
    suggested: (Proteome Discoverer, RRID:SCR_014477)
    Thermo
    suggested: (Thermo Xcalibur, RRID:SCR_014593)
    SARS-CoV-2 labeled proteomics data was made available ahead of publication [33] as ProteomeXchange PXD017710.
    ProteomeXchange
    suggested: (ProteomeXchange, RRID:SCR_004055)
    Peptide settings and transitions were optimized within Skyline to reflect the vendor optimization requirements.
    Skyline
    suggested: (Skyline, RRID:SCR_014080)
    For Agilent systems, the default dwell time of 20 ms was selected for the transition settings.
    Agilent systems
    suggested: None
    Evaluation of genomic stability in SARS-CoV-2: To ensure that our assay targets did not lie in a mutable portion of the genome, all available SARS-CoV-2 genome constructs (as of March 21st, 2020) were downloaded from NCBI, aligned using MAFFT (sorted FASTA output and automatic input detection; ver 7.453), and visualized with AliView (ver 1.26).
    MAFFT
    suggested: (MAFFT, RRID:SCR_011811)
    AliView
    suggested: (AliView, RRID:SCR_002780)
    FASTA databases of pathogens with similar clinical presentation: Protein FASTA databases were downloaded from UniProt 2020_01 to create databases of proteins indicative of pathogens with similar clinical presentation using the following search terms: Coronavirus reviewed [no], Influenza, Middle East Respiratory, Pneumoniae, Respiratory Syncytial Virus, Rhinovirus, Staphylococcus aureus, Streptococcus reviewed [yes].
    FASTA
    suggested: (FASTA, RRID:SCR_011819)
    Protein FASTA
    suggested: None
    The UniProt 2020_01 release of the SARS-CoV-2 FASTA was placed into the same local file with the FASTA databases from Human and the FASTA databases of pathogens with similar clinical presentation.
    UniProt
    suggested: (UniProtKB, RRID:SCR_004426)

    Results from OddPub: Thank you for sharing your data.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We found bar graphs of continuous data. We recommend replacing bar graphs with more informative graphics, as many different datasets can lead to the same bar graph. The actual data may suggest different conclusions from the summary statistics. For more information, please see Weissgerber et al (2015).


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • No funding statement was detected.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.