An interactome landscape of SARS-CoV-2 virus-human protein-protein interactions by protein sequence-based multi-label classifiers

Abstract

The new coronavirus species, SARS-CoV-2, has caused an unprecedented global pandemic of COVID-19 since late December 2019. A comprehensive characterization of protein-protein interactions (PPIs) between SARS-CoV-2 and human cells is key to understanding the infection and preventing the disease. Here we present a novel approach that predicts virus-host PPIs with multi-label machine learning classifiers (random forests and XGBoost) trained on amino acid composition profiles of virus and human proteins. Our models draw on the large-scale Viruses.STRING database, which contains >80,000 virus-host PPIs with evidence scores, to predict multiple evidence levels rather than the binary interactions predicted in previous studies. The classifiers are based on 5 evidence levels binned from the evidence scores. Our best model, XGBoost, achieves on average 74% AUC and 68% accuracy in 10-fold cross-validation. The most important amino acids are cysteine and histidine. In addition, our model predicts experimental PPIs with 4% higher accuracy than text mining-based PPIs, even though the experimental data set is more than 6-fold smaller. We then predict evidence levels of ∼2,000 SARS-CoV-2 virus-human PPIs from public experimental proteomics data. Interactions with SARS-CoV-2 Nsp7b show high evidence. We also predict evidence levels for all ∼550,000 pairwise PPIs between the SARS-CoV-2 and human proteomes, providing a comprehensive and unbiased in silico draft of the virus-host interactome landscape for SARS-CoV-2 infection in humans. Most human proteins among the 140 highest-evidence predictions interact with SARS-CoV-2 Nsp7, Nsp1, and ORF14, and they are significantly enriched in two top pathways: vascular smooth muscle contraction (CALD1, NPR2, CALML3) and Myc targets (CBX3, PES1). Our prediction also suggests that histone H2A components are targeted by multiple SARS-CoV-2 proteins.
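
The feature construction and label binning described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the 40-dimensional feature vector (20 composition fractions per protein), and the equal-width bin edges on an assumed 0-1000 score scale are all assumptions made for this example. The actual feature set and thresholds used in the study are not stated here.

```python
from bisect import bisect_right
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical bin edges on an assumed 0-1000 evidence-score scale.
# The study bins scores into 5 evidence levels; its real thresholds may differ.
HYPOTHETICAL_EDGES = (200, 400, 600, 800)


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No recognizable residues: return an all-zero profile.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pair_features(virus_seq, human_seq):
    """Concatenate the composition profiles of a virus and a human protein."""
    return aa_composition(virus_seq) + aa_composition(human_seq)


def evidence_level(score, edges=HYPOTHETICAL_EDGES):
    """Map an evidence score to a level from 1 to len(edges) + 1."""
    return bisect_right(edges, score) + 1
```

Feature vectors from `pair_features` and labels from `evidence_level` could then be passed to any multi-class classifier, such as a random forest or XGBoost model, and evaluated with 10-fold cross-validation as the abstract describes.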

Article activity feed

  1. SciScore for 10.1101/2021.11.07.467640:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms

    Sentence: Test data: Our test data is 1,998 SARS-CoV-2 virus-human PPIs from the IntAct database as of July 17, 2020.
    Resource: IntAct
    Suggested: (IntAct, RRID:SCR_006944)

    Sentence: All computations were done in Python.
    Resource: Python
    Suggested: (IPython, RRID:SCR_001658)

    Sentence: We also performed protein-protein association analyses from the STRING database (STRING v11.5) (Szklarczyk et al., 2021).
    Resource: STRING
    Suggested: (STRING, RRID:SCR_005223)

    Sentence: Network visualization was done using Cytoscape (Shannon et al., 2003).
    Resource: Cytoscape
    Suggested: (Cytoscape, RRID:SCR_003032)

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    A limitation of our approach is the lack of confidence in prediction of the low evidence classes, EC1 and EC2. They do not possess any intrinsic unique properties associated with each evidence level, unlike EC3 for experiments-based or physical PPIs. Further investigation is needed to characterize and interpret each evidence class and identify important features for each class. Alternatively, multi-class labeling might be done in different ways with different thresholds for individual evidence classes. On the other hand, one could use our tool as a binary classifier for physical PPIs as we demonstrated with EC >= 3 vs. EC < 3, or build binary classifiers based on a single threshold for combined scores. Comparative analysis with binary classifiers is beyond the scope of this study. Another limitation of this work is the subjective choice of the 72 edge features as model features. A significant model improvement might be achieved by better feature engineering for both nodes and edges. In conclusion, our protein sequence-based multi-label classifiers are useful tools to provide different evidence or confidence levels for virus-human PPIs and are applicable to virus-human interactomes for new virus species such as SARS-CoV-2.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.