Dynamically evolving novel overlapping gene as a factor in the SARS-CoV-2 pandemic

Abstract

Understanding the emergence of novel viruses requires an accurate and comprehensive annotation of their genomes. Overlapping genes (OLGs) are common in viruses and have been associated with pandemics, but are still widely overlooked. We identify and characterize ORF3d, a novel OLG in SARS-CoV-2 that is also present in Guangxi pangolin-CoVs but not other closely related pangolin-CoVs or bat-CoVs. We then document evidence of ORF3d translation, characterize its protein sequence, and conduct an evolutionary analysis at three levels: between taxa (21 members of Severe acute respiratory syndrome-related coronavirus), between human hosts (3978 SARS-CoV-2 consensus sequences), and within human hosts (401 deeply sequenced SARS-CoV-2 samples). ORF3d has been independently identified and shown to elicit a strong antibody response in COVID-19 patients. However, it has been misclassified as the unrelated gene ORF3b, leading to confusion. Our results liken ORF3d to other accessory genes in emerging viruses and highlight the importance of OLGs.
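
As a rough illustration of what an overlapping gene is, the sketch below translates one nucleotide string in two reading frames and shows that a single substitution can alter both encoded peptides. The nine-base sequence and the function name `translate` are made up for illustration; they are not ORF3d or any real SARS-CoV-2 sequence, and this is not the authors' code.

```python
# Standard genetic code, built from the canonical T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame=0):
    """Translate seq from offset `frame` (0, 1, or 2); '*' marks a stop codon."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical toy sequence (an assumption for illustration only).
reference = "ATGGCTTGA"
mutant = "ATGGATTGA"  # one substitution: C -> A at 0-based position 4

for name, seq in (("reference", reference), ("mutant", mutant)):
    print(name, "frame 0:", translate(seq, 0), "| frame 1:", translate(seq, 1))
```

Because the two frames share the same nucleotides, the single change here alters an amino acid in each frame, which is why selection on one gene of an overlapping pair constrains the other.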

Article activity feed

  1. SciScore for 10.1101/2020.05.21.109280:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Experimental Models: Cell Lines
    Sentences | Resources
    Ribosome profiling analysis: The 16 ribosome profiling (Ribo-seq) datasets of Finkel et al. (2020) using SARS-CoV-2 infected Vero E6 cells were downloaded from the Sequence Read Archive (accession numbers SRR11713354 to SRR11713369).
    Vero E6
    suggested: None
    Software and Algorithms
    Sentences | Resources
    Supplementary data not included in main figure source data are freely available on Zenodo at https://zenodo.org/record/4052729.
    Zenodo
    suggested: (ZENODO, RRID:SCR_004129)
    To produce whole-genome alignments, we first aligned all genome sequences using MAFFT.
    MAFFT
    suggested: (MAFFT, RRID:SCR_011811)
    Then, coding regions were identified using exact or partial sequence identity to SARS-CoV-2 or SARS-CoV annotations, translated, and individually aligned at the amino acid level using ProbCons v1.12 (Do et al. 2005).
    ProbCons
    suggested: (ProbCons, RRID:SCR_011813)
    Finally, whole genomes were manually shifted to match the individual codon alignments in AliView.
    AliView
    suggested: (AliView, RRID:SCR_002780)
    Phylogenetic relationships among isolates were explored using maximum likelihood phylogenetic inference, as implemented in IQ-tree (Nguyen et al. 2015), using the generalized time-reversible (GTR; Tavaré 1986) substitution model combined with the FreeRate model (Soubrier et al. 2012) to account for among-site rate heterogeneity.
    IQ-tree
    suggested: (IQ-TREE, RRID:SCR_017254)
    Proteomics analysis: We used MaxQuant (Tyanova et al. 2016) to re-analyze five publicly available SARS-CoV-2 mass spectrometry (MS) datasets: Bezstarosti et al. 2020 (PRIDE accession PXD018760); Bojkova et al. 2020 (PXD017710); Davidson et al. 2020 (PXD018241);
    PRIDE
    suggested: (Pride-asap, RRID:SCR_012052)
    The FASTQ format reads were mapped to the Wuhan-Hu-1 reference genome using Bowtie2 local alignment (Langmead et al. 2019), with a seed length of 20 and up to one mismatch allowed, after substituting the isolate’s mutations, as listed in Finkel et al. (2020)
    Bowtie2
    suggested: (Bowtie 2, RRID:SCR_016368)
    libraries: boot, feather, ggrepel, patchwork, RColorBrewer, scales, tidyverse), Python (BioPython, pandas) (McKinney 2010)
    Python
    suggested: (IPython, RRID:SCR_001658)
    BioPython
    suggested: (Biopython, RRID:SCR_007173)
    , Microsoft Excel, Google Sheets, and PowerPoint.
    Microsoft Excel
    suggested: (Microsoft Excel, RRID:SCR_016137)
    All dN/dS or πN/πS ratios were estimated for non-OLG regions using SNPGenie scripts snpgenie.pl or snpgenie_within_group.pl (Nelson et al. 2015; https://github.com/chasewnelson/SNPGenie), and for OLG regions using OLGenie script OLGenie.pl (Nelson et al. 2020; https://github.com/chasewnelson/OLGenie).
    SNPGenie
    suggested: None
    Within-host diversity: For within-host analyses, we obtained n=401 high-depth (at least 50-fold mean coverage) human SARS-CoV-2 samples from the Sequence Read Archive (listed in Supplementary Table 17).
    Sequence Read Archive
    suggested: (DDBJ Sequence Read Archive, RRID:SCR_001370)
    Reads were trimmed with BBDUK (Bushnell B. 2017. BBTools. https://jgi.doe.gov/data-and-tools/bbtools/) and mapped against the Wuhan-Hu-1 reference sequence using Bowtie2 (Langmead and Salzberg 2012) with local alignment, seed length 20, and up to 1 mismatch.
    https://jgi.doe.gov/data-and-tools/bbtools/
    suggested: (Bestus Bioinformaticus Tools, RRID:SCR_016968)
    SNPs were called from mapped reads using the LoFreq (Wilm et al. 2012) variant caller with sequencing quality and MAPQ both at least 30.
    LoFreq
    suggested: (LoFreq, RRID:SCR_013054)
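
The thresholds quoted in the sentences above (at least 50-fold mean coverage per sample; sequencing quality and MAPQ both at least 30 for variant calls) can be expressed as two simple filters. This is a minimal sketch under stated assumptions: the record layout, the field names `base_qual` and `mapq`, and both function names are hypothetical, and the authors applied these thresholds through their own pipeline and LoFreq rather than through code like this.

```python
from statistics import mean

MIN_MEAN_DEPTH = 50  # "at least 50-fold mean coverage"
MIN_QUALITY = 30     # threshold for both sequencing quality and MAPQ


def passes_depth(per_site_depth):
    """Return True if a sample's mean depth across sites meets the cutoff."""
    return mean(per_site_depth) >= MIN_MEAN_DEPTH


def keep_variant(variant):
    """Return True if both quality values of a (hypothetical) variant record pass."""
    return variant["base_qual"] >= MIN_QUALITY and variant["mapq"] >= MIN_QUALITY


# Made-up example values, for illustration only.
print(passes_depth([40, 60, 80]))                    # mean depth is 60
print(keep_variant({"base_qual": 35, "mapq": 30}))   # both values reach 30
print(keep_variant({"base_qual": 29, "mapq": 60}))   # base quality below 30
```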

    Results from OddPub: Thank you for sharing your code and data.


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Our study has several limitations. First, we were not able to confirm the translation of ORF3d using mass spectrometry (MS). This may be due to several reasons: (1) ORF3d is short, and tryptic digestion of ORF3d generates only two peptides potentially detectable by MS; (2) ORF3d may be expressed at low levels; (3) the tryptic peptides derived from ORF3d may not be amenable to detection by MS even under the best possible conditions, as suggested by its relatively low MS intensity even in an overexpression experiment (‘ORF3b’ in Gordon et al. 2020); and (4) hitherto unknown post-translational modifications of ORF3d could also prohibit detection. Other possibilities for validation of ORF3d include MS with other virus samples; affinity purification MS; fluorescent tagging and cell imaging; Western blotting; and the sequencing of additional genomes in this viral species, which would potentiate more powerful tests of purifying selection and a better understanding of the history and origin of ORF3d. With respect to between-host diversity, we focused on consensus-level sequence data; however, this approach can miss important variation (Holmes 2009), stressing the importance of deeply sequenced within-host samples using technology appropriate for calling within-host variants (Grubaugh et al. 2019). As we use Wuhan-Hu-1 for reference-based read mapping and remove duplicate reads as possible PCR artifacts, reference bias (Degner et al. 2009) or bias against natural duplicates at high re...
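
The first limitation above turns on how few peptides a short protein yields after tryptic digestion. A minimal in silico sketch of that reasoning follows, assuming the common simplified rule that trypsin cleaves after K or R unless the next residue is P, and assuming an arbitrary 7 to 30 residue window for peptides amenable to MS. The toy protein, the length window, and both function names are illustrative assumptions, not the ORF3d sequence or the settings used in the study.

```python
def tryptic_digest(protein):
    """Split a protein after each K or R that is not followed by P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal remainder when the protein does not end in K or R.
        peptides.append(protein[start:])
    return peptides


def detectable(peptides, min_len=7, max_len=30):
    """Keep peptides inside an assumed length window for MS detection."""
    return [p for p in peptides if min_len <= len(p) <= max_len]


# Hypothetical toy protein, for illustration only.
toy = "MKAARPLLKGGR"
fragments = tryptic_digest(toy)
print(fragments)
print(detectable(fragments))
```

With a short protein, most fragments fall outside any reasonable length window, leaving very few candidate peptides, which is the situation the authors describe for ORF3d.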

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: Please consider improving the rainbow (“jet”) colormap(s) used on page 54. At least one figure is not accessible to readers with colorblindness and/or is not true to the data, i.e. not perceptually uniform.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.