Genome-wide bioinformatic analyses predict key host and viral factors in SARS-CoV-2 pathogenesis

Abstract

The novel betacoronavirus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) caused a worldwide pandemic (COVID-19) after emerging in Wuhan, China. Here we analyzed public host and viral RNA sequencing data to better understand how SARS-CoV-2 interacts with human respiratory cells. We identified genes, isoforms and transposable element families that are specifically altered in SARS-CoV-2-infected respiratory cells. Well-known immunoregulatory genes including CSF2, IL32, IL-6 and SERPINA3 were differentially expressed, while immunoregulatory transposable element families were upregulated. We predicted conserved interactions between the SARS-CoV-2 genome and human RNA-binding proteins such as the heterogeneous nuclear ribonucleoprotein A1 (hnRNPA1) and eukaryotic initiation factor 4 (eIF4b). We also identified a viral sequence variant with a statistically significant skew associated with age of infection that may contribute to intracellular host–pathogen interactions. These findings can help identify host mechanisms that can be targeted by prophylactics and/or therapeutics to reduce the severity of COVID-19.

Article activity feed

  1. SciScore for 10.1101/2020.07.28.225581: (What is this?)

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Experimental Models: Cell Lines
    Sentences | Resources
    The first dataset, GSE147507 [10], includes gene expression measurements from three cell lines derived from the human respiratory system (NHBE, A549, Calu-3) infected either with SARS-CoV-2, influenza A virus (IAV), respiratory syncytial virus (RSV), or human parainfluenza virus 3 (HPIV3).
    A549
    suggested: NCI-DTP Cat# A549, RRID:CVCL_0023
    Software and Algorithms
    Sentences | Resources
    Datasets: Two datasets were downloaded from the Gene Expression Omnibus (GEO) database, hosted at the National Center for Biotechnology Information (NCBI).
    Gene Expression Omnibus
    suggested: (Gene Expression Omnibus (GEO), RRID:SCR_005012)
    FastQC (v0.11.9; https://github.com/s-andrews/FastQC) and MultiQC (v1.9) [20] were employed to assess the quality of the data used and the need to trim reads and/or remove adapters.
    FastQC
    suggested: (FastQC, RRID:SCR_014583)
    MultiQC
    suggested: (MultiQC, RRID:SCR_014982)
    Selected datasets were mapped to the human reference genome (GENCODE Release 19, GRCh37.p13) utilizing STAR (v2.7.3a) [17].
    STAR
    suggested: (STAR, RRID:SCR_015899)
    Resulting SAM files were converted to BAM files employing samtools (v1.9) [43].
    samtools
    suggested: (SAMTOOLS, RRID:SCR_002105)
    Next, read quantification was performed using StringTie (v2.1.1) [60] and the output data was postprocessed with an auxiliary Python script provided by the same developers to produce files ready for subsequent downstream analyses.
    Python
    suggested: (IPython, RRID:SCR_001658)
    Finally, an exploratory data analysis was carried out based on the transformed values obtained after applying the variance stabilizing transformation [3] implemented in the vst() function of DESeq2 [48].
    DESeq2
    suggested: (DESeq, RRID:SCR_000154)
    GO terms with a significant adjusted p-value of less than 0.05 were reduced to representative non-redundant terms with the use of REVIGO [73].
    REVIGO
    suggested: (REViGO, RRID:SCR_005825)
    The significant results for all comparisons from publicly available data from KEGG, Reactome, Panther, BioCarta, and NCI were then compiled to facilitate downstream comparison.
    KEGG
    suggested: (KEGG, RRID:SCR_012773)
    Panther
    suggested: (PANTHER, RRID:SCR_004869)
    BioCarta
    suggested: (BioCarta Pathways, RRID:SCR_006917)
    Hypergeometric pathway enrichments were performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID, v6.8) [30].
    DAVID
    suggested: (DAVID, RRID:SCR_001881)
    Isoform Analysis: Using transcript quantification data from StringTie as input, we identified isoform switching events and their predicted functional consequences with the IsoformSwitchAnalyzeR R package (v1.11.3) [81].
    StringTie
    suggested: (StringTie, RRID:SCR_016323)
    Following filtering for significant isoforms, we externally predicted their coding capabilities, protein structure stability, peptide signaling, and shifts in protein domain usage using The Coding-Potential Assessment Tool (CPAT) [82], IUPred2 [18], SignalP [2] and Pfam tools respectively [19].
    SignalP
    suggested: (SignalP, RRID:SCR_015644)
    Pfam
    suggested: (Pfam, RRID:SCR_004726)
    Viral genotype-phenotype correlation: All complete SARS-CoV-2 genomes from GISAID, together with the GenBank reference sequence, were aligned with MAFFT (v7.464) within a high-performance computing environment using 1 thread and the --nomemsave parameter [55]
    MAFFT
    suggested: (MAFFT, RRID:SCR_011811)
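
The table above quotes a sentence about hypergeometric pathway enrichment. As a rough illustration of what such a test computes, here is a minimal Python sketch of the upper-tail hypergeometric probability. It is a generic textbook formulation, not DAVID's implementation, and the function name, parameter names, and numbers in the usage note are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population: int, pathway_size: int,
                                sample_size: int, overlap: int) -> float:
    """Return P(X >= overlap) for a hypergeometric draw.

    population   -- number of background genes (assumed annotation universe)
    pathway_size -- background genes annotated to the pathway
    sample_size  -- genes in the tested list (e.g. differentially expressed)
    overlap      -- tested genes that fall in the pathway
    """
    if not (0 <= pathway_size <= population and 0 <= sample_size <= population):
        raise ValueError("pathway_size and sample_size must lie in [0, population]")
    max_overlap = min(pathway_size, sample_size)
    total = comb(population, sample_size)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(max(overlap, 0), max_overlap + 1)
    )
    return tail / total
```

For example, `hypergeom_enrichment_pvalue(20000, 150, 400, 12)` would give the chance of seeing 12 or more pathway genes among 400 selected genes under random sampling. The resulting p-values would still need multiple-testing correction before applying a significance cutoff.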

    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.
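
To make the idea of checking for the presence of RRIDs concrete, the following is a small Python sketch that pulls identifiers out of report lines like the ones above. It is not SciScore's method. The pattern is an assumption that only covers the two prefixes seen in this report (`SCR_` for software and `CVCL_` for cell lines), and `extract_rrids` is a hypothetical helper name.

```python
import re
from typing import List

# Assumed pattern: "RRID:" followed by an SCR_ or CVCL_ identifier.
# Real RRIDs use more prefixes than these two.
RRID_PATTERN = re.compile(r"RRID:\s?((?:SCR|CVCL)_[A-Za-z0-9]+)")


def extract_rrids(text: str) -> List[str]:
    """Return every RRID-style identifier found in text, in order of appearance."""
    return RRID_PATTERN.findall(text)
```

Applied to a line such as `suggested: (FastQC, RRID:SCR_014583)`, the helper returns a one-element list holding the identifier. A line with no RRID yields an empty list, which is the case a presence check would flag.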

  2. SciScore for 10.1101/2020.07.28.225581: (What is this?)

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Experimental Models: Cell Lines
    Sentences | Resources
    From this list, we excluded all TE families detected in A549 cells infected with the other viruses.
    A549
    suggested: NCI-DTP Cat# A549, RRID:CVCL_0023
    Software and Algorithms
    Sentences | Resources
    Materials and Methods. Datasets: Two datasets were downloaded from the Gene Expression Omnibus (GEO) database, hosted at the National Center for Biotechnology Information (NCBI).
    Gene Expression Omnibus
    suggested: (Gene Expression Omnibus (GEO), RRID:SCR_005012)
    FastQC (v0.11.9; https://github.com/s-andrews/FastQC) and MultiQC (v1.9) [20] were employed to assess the quality of the data and the need to trim reads and/or remove adapters.
    FastQC
    suggested: (FastQC, RRID:SCR_014583)
    MultiQC
    suggested: (MultiQC, RRID:SCR_014982)
    Selected datasets were mapped to the human reference genome (GENCODE Release 19, GRCh37.p13) utilizing STAR (v2.7.3a) [17].
    STAR
    suggested: (STAR, RRID:SCR_015899)
    Resulting SAM files were converted to BAM files employing samtools (v1.9) [43].
    samtools
    suggested: (Samtools, RRID:SCR_002105)
    Next, read quantification was performed using StringTie (v2.1.1) [60] and the output data was postprocessed with an auxiliary Python script provided by the same developers to produce files ready for subsequent downstream analyses.
    StringTie
    suggested: (StringTie, RRID:SCR_016323)
    Python
    suggested: (IPython, RRID:SCR_001658)
    GO terms with a significant adjusted p-value of less than 0.05 were reduced to representative non-redundant terms with the use of REVIGO [73].
    REVIGO
    suggested: (REViGO, RRID:SCR_005825)
    The significant results for all comparisons from publicly available data from KEGG, Reactome, Panther, BioCarta, and NCI were then compiled to facilitate downstream comparison.
    KEGG
    suggested: (KEGG, RRID:SCR_012773)
    Panther
    suggested: (PANTHER, RRID:SCR_004869)
    BioCarta
    suggested: (BioCarta Pathways, RRID:SCR_006917)
    Differentially expressed TEs (DETEs) in infected vs mock conditions were detected using DESeq2 with a matrix of counts for genes and TE families as input.
    DESeq2
    suggested: (DESeq2, RRID:SCR_015687)
    Viral genotype-phenotype correlation: All complete SARS-CoV-2 genomes from GISAID, together with the GenBank reference sequence, were aligned with MAFFT (v7.464) within a high-performance computing environment using 1 thread and the --nomemsave parameter [55]
    MAFFT
    suggested: (MAFFT, RRID:SCR_011811)
    Pathway enrichment for each dataset (SPIA and DAVID merged into one file).
    DAVID
    suggested: (DAVID, RRID:SCR_001881)

    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore is not a substitute for expert review. SciScore checks for the presence and correctness of RRIDs (research resource identifiers) in the manuscript, and detects sentences that appear to be missing RRIDs. SciScore also checks to make sure that rigor criteria are addressed by authors. It does this by detecting sentences that discuss criteria such as blinding or power analysis. SciScore does not guarantee that the rigor criteria that it detects are appropriate for the particular study. Instead it assists authors, editors, and reviewers by drawing attention to sections of the manuscript that contain or should contain various rigor criteria and key resources. For details on the results shown here, including references cited, please follow this link.

  3. SciScore for 10.1101/2020.07.28.225581: (What is this?)

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Experimental Models: Cell Lines
    Sentences | Resources
    From this list, we excluded all TE families detected in A549 cells infected with the other viruses.
    A549
    suggested: NCI-DTP Cat# A549, RRID:CVCL_0023
    Software and Algorithms
    Sentences | Resources
    Materials and Methods. Datasets: Two datasets were downloaded from the Gene Expression Omnibus (GEO) database, hosted at the National Center for Biotechnology Information (NCBI).
    Gene Expression Omnibus
    suggested: (Gene Expression Omnibus (GEO), RRID:SCR_005012)
    FastQC (v0.11.9; https://github.com/s-andrews/FastQC) and MultiQC (v1.9) [20] were employed to assess the quality of the data and the need to trim reads and/or remove adapters.
    FastQC
    suggested: (FastQC, RRID:SCR_014583)
    MultiQC
    suggested: (MultiQC, RRID:SCR_014982)
    Selected datasets were mapped to the human reference genome (GENCODE Release 19, GRCh37.p13) utilizing STAR (v2.7.3a) [17].
    STAR
    suggested: (STAR, RRID:SCR_015899)
    Resulting SAM files were converted to BAM files employing samtools (v1.9) [43].
    samtools
    suggested: (Samtools, RRID:SCR_002105)
    Next, read quantification was performed using StringTie (v2.1.1) [60] and the output data was postprocessed with an auxiliary Python script provided by the same developers to produce files ready for subsequent downstream analyses.
    StringTie
    suggested: (StringTie, RRID:SCR_016323)
    Python
    suggested: (IPython, RRID:SCR_001658)
    DESeq2 (v1.26.0) [47] was used in both cases to identify differentially expressed genes (DEGs).
    DESeq2
    suggested: (DESeq, RRID:SCR_000154)
    GO terms with a significant adjusted p-value of less than 0.05 were reduced to representative non-redundant terms with the use of REVIGO [73].
    REVIGO
    suggested: (REViGO, RRID:SCR_005825)
    The significant results for all comparisons from publicly available data from KEGG, Reactome, Panther, BioCarta, and NCI were then compiled to facilitate downstream comparison.
    KEGG
    suggested: (KEGG, RRID:SCR_012773)
    Panther
    suggested: (PANTHER, RRID:SCR_004869)
    BioCarta
    suggested: (BioCarta Pathways, RRID:SCR_006917)
    Hypergeometric pathway enrichments were performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID, v6.8) [30].
    DAVID
    suggested: (DAVID, RRID:SCR_001881)
    Viral genotype-phenotype correlation: All complete SARS-CoV-2 genomes from GISAID, together with the GenBank reference sequence, were aligned with MAFFT (v7.464) within a high-performance computing environment using 1 thread and the --nomemsave parameter [55]
    MAFFT
    suggested: (MAFFT, RRID:SCR_011811)
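
One quoted sentence in the table above filters GO terms at an adjusted p-value below 0.05. The source does not name the adjustment method, so the Python sketch below assumes the Benjamini-Hochberg procedure purely for illustration. The function name and the example values in the usage note are hypothetical.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is
    # never larger than the adjusted value of the next-ranked p-value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A filter matching the quoted cutoff could then read `[t for t, q in zip(terms, benjamini_hochberg(pvals)) if q < 0.05]`, where `terms` and `pvals` are parallel lists supplied by the caller.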

    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.



  4. SciScore for 10.1101/2020.07.28.225581: (What is this?)

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Experimental Models: Cell Lines
    Sentences | Resources
    Here we report a subset of non-redundant reduced terms consistently [...] more than one SARS-COV-2 cell line which were not detected in the other viruses’ datasets.
    SARS-COV-2
    suggested: None
    NHBE cells expressed 4 known IL-6 isoforms, while A549 cells expressed 1 unknown and 6 known isoforms.
    A549
    suggested: NCI-DTP Cat# A549, CVCL_0023
    This allowed us to identify 16 families that were specifically upregulated in Calu-3 and A549 cells infected with SARS-CoV-2 and not in the other viral infections.
    Calu-3
    suggested: BCRJ Cat# 0264, CVCL_0609
    Software and Algorithms
    Sentences | Resources
    Materials and Methods. Datasets: Two datasets were downloaded from the Gene Expression Omnibus (GEO) database, hosted at the National Center for Biotechnology Information (NCBI).
    Gene Expression Omnibus
    suggested: (Gene Expression Omnibus (GEO), SCR_005012)
    FastQC (v0.11.9; https://github.com/s-andrews/FastQC) and MultiQC (v1.9) [20] were employed to assess the quality of the data and the need to trim reads and/or remove adapters.
    FastQC
    suggested: (FastQC, SCR_014583)
    MultiQC
    suggested: (MultiQC, SCR_014982)
    Selected datasets were mapped to the human reference genome (GENCODE Release 19, GRCh37.p13) utilizing STAR (v2.7.3a) [17].
    STAR
    suggested: (STAR, SCR_015899)
    Resulting SAM files were converted to BAM files employing samtools (v1.9) [43].
    samtools
    suggested: (Samtools, SCR_002105)
    Next, read quantification was performed using StringTie (v2.1.1) [60] and the output data was postprocessed with an auxiliary Python script provided by the same developers to produce files ready for subsequent downstream analyses.
    Python
    suggested: (IPython, SCR_001658)
    Finally, an exploratory data analysis was carried out based on the transformed values obtained after applying the variance stabilizing transformation [3] implemented in the vst() function of DESeq2 [48].
    DESeq2
    suggested: (DESeq, SCR_000154)
    GO terms with a significant adjusted p-value of less than 0.05 were reduced to representative non-redundant terms with the use of REVIGO [73].
    REVIGO
    suggested: (REViGO, SCR_005825)
    The significant results for all comparisons from publicly available data from KEGG, Reactome, Panther, BioCarta, and NCI were then compiled to facilitate downstream comparison.
    Panther
    suggested: (PANTHER, SCR_004869)
    BioCarta
    suggested: (BioCarta Pathways, SCR_006917)
    Hypergeometric pathway enrichments were performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID, v6.8) [30].
    DAVID
    suggested: (DAVID, SCR_001881)
    Isoform Analysis: Using transcript quantification data from StringTie as input, we identified isoform switching events and their predicted functional consequences with the IsoformSwitchAnalyzeR R package (v1.11.3) [79].
    StringTie
    suggested: (StringTie, SCR_016323)
    Following filtering for significant isoforms, we externally predicted their coding capabilities, protein structure stability, peptide signaling, and shifts in protein domain usage using The Coding-Potential Assessment Tool (CPAT) [80], IUPred2 [18], SignalP [2] and Pfam tools respectively [19].
    SignalP
    suggested: (SignalP, SCR_015644)
    Pfam
    suggested: (Pfam, SCR_004726)
    Viral genotype-phenotype correlation: All complete SARS-CoV-2 genomes from GISAID, together with the GenBank reference sequence, were aligned with MAFFT (v7.464) within a high-performance computing environment using 1 thread and the --nomemsave parameter [55]
    MAFFT
    suggested: (MAFFT, SCR_011811)
    Interestingly, we were able to detect enriched KEGG pathways common to at least two SARS-CoV-2 infected cell types and absent from the other virus-infected datasets (Figure 2, Supplementary Table 2B).
    KEGG
    suggested: (KEGG, SCR_012773)
    

    Data from additional tools added to each annotation on a weekly basis.
