A new SARS-CoV-2 lineage that shares mutations with known Variants of Concern is rejected by automated sequence repository quality control

Abstract

We report a SARS-CoV-2 lineage that shares N501Y, P681H, and other mutations with known variants of concern, such as B.1.1.7. This lineage, which we refer to as B.1.x (COG-UK sometimes references similar samples as B.1.324.1), is present in at least 20 states across the USA and in at least six countries. However, a large deletion causes the sequence to be automatically rejected from repositories, suggesting that the frequency of this new lineage is underestimated using public data. Recent dynamics based on 339 samples obtained in Santa Cruz County, CA, USA suggest that B.1.x may be increasing in frequency at a rate similar to that of B.1.1.7 in Southern California. At present the functional differences between this variant B.1.x and other circulating SARS-CoV-2 variants are unknown, and further studies on secondary attack rates, viral loads, immune evasion and/or disease severity are needed to determine if it poses a public health concern. Nonetheless, given what is known from well-studied circulating variants of concern, it seems unlikely that the lineage could pose larger concerns for human health than many already globally distributed lineages. Our work highlights a need for rapid turnaround time from sequence generation to submission and improved sequence quality control that removes submission bias. We identify promising paths toward this goal.

Article activity feed

  1. Our take

    This study, available as a preprint and thus not yet peer-reviewed, describes the identification of a new SARS-CoV-2 lineage (B.1.x/B.1.324.1) in central California in early 2021. The B.1.x lineage carries mutations found in other variants of concern (VOC), which may contribute to increased transmission or evasion of host immune responses. The B.1.x lineage does not appear to pose a greater health risk than other VOC, but some B.1.x sequences in the UK have additionally acquired the E484K mutation. If these B.1.x sublineages become widespread, they may affect the efficacy of vaccines or antibody-based treatments for COVID-19. The B.1.x lineage’s 35 base pair deletion in ORF8 leads to a frameshift and premature stop codon, making submission of these sequences to standard databases (e.g. GenBank, GISAID) problematic. This illustrates a potential limitation of using curated sequence data to monitor the spread of B.1.x and similar lineages, and the authors suggest mechanisms to limit future submission bias, which would improve genomic surveillance results.

    Study design

    Retrospective cohort

    Study population and setting

    This study describes the identification of a novel SARS-CoV-2 lineage, B.1.x (sometimes referred to as B.1.324.1), in Santa Cruz County, CA, USA, in early 2021. Phylogenetic analysis was performed using consensus sequences from SARS-CoV-2-positive residual samples (n=339) and randomly selected global background sequences (n=1,000). Similar sequences were retrieved from GenBank and GISAID for comparison. The growth rate of the B.1.x lineage was estimated using a simple logistic regression model.
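
    To make the growth-rate step concrete, the following is a minimal sketch of a simple logistic regression of lineage membership on collection time. It is not the authors' code: the function names, the Newton-Raphson fitting routine, and the weekly counts are all assumptions made for illustration, and the counts are made up rather than taken from the study.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def fit_logistic(times, labels, iterations=50, tol=1e-10):
    """Fit P(lineage | t) = sigmoid(a + b*t) by Newton-Raphson.

    times  : collection time of each sample (here, week index)
    labels : 1 if the sample belongs to the lineage, else 0
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for t, y in zip(times, labels):
            p = sigmoid(a + b * t)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * t      # score for the slope
            h00 += w               # entries of X^T W X
            h01 += w * t
            h11 += w * t * t
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            break
        # Newton step: solve the 2x2 system (X^T W X) * delta = score
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b

# Made-up weekly counts for illustration only (NOT the study's data):
# 30 sequenced samples per week, of which `hits` belong to the lineage.
weekly_hits = [0, 0, 0, 1, 0, 1, 1, 2, 2, 3, 3, 4]
times, labels = [], []
for week, hits in enumerate(weekly_hits):
    times += [week] * 30
    labels += [1] * hits + [0] * (30 - hits)

a, b = fit_logistic(times, labels)
print(f"slope per week: {b:.3f}")
print(f"odds doubling time: {math.log(2) / b:.1f} weeks")
```

    In this model the slope b is the weekly change in the log-odds of a sample belonging to the lineage, so ln(2)/b is the time for those odds to double. A model this simple ignores sampling structure and phylogenetic dependence, which is the caveat raised under Limitations.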

    Summary of main findings

    More than half of the sequences identified in this dataset were from the B.1.427 and B.1.429 lineages, which were first identified in California. Two B.1.1.7 sequences were also found, but no other CDC-designated variants of concern (VOC) were identified. However, eight samples (2.4%), collected in February and March 2021, appeared to represent a new lineage within B.1, which the authors temporarily refer to as B.1.x, awaiting more refined classification. Prevalence of B.1.x increased over time, from 1% in January 2021 to 10% in March 2021. Additional sequences similar to B.1.x were identified in over 20 US states and 6 countries. (Of note, some UK sequences had been submitted under lineage B.1.324.1.) Lineage-defining point mutations for B.1.x include several in the spike protein (S494P, N501Y, D614G, P681H, K854N, and E1111K) and N:M234I. While several of these mutations are shared with other VOC, it appears unlikely that B.1.x is the result of a recombination event. B.1.x sequences also contain a large 35 base pair deletion in ORF8, which results in a premature stop codon. The biological significance of ORF8 inactivation, which is also present in B.1.1.7, is still unknown. However, because the deletion in B.1.x sequences leads to a frameshift, their submission to database repositories is automatically rejected. Successful submission of these sequences requires additional, lengthy steps in the manual curation process that many labs elect not to complete, instead choosing to abandon submission or to modify the sequences (e.g. adding N’s in place of deleted residues) in order to bypass quality control mechanisms. This means that B.1.x and other lineages with frameshift mutations may be underrepresented in sequence databases, limiting the ability to accurately estimate their impact on the pandemic. The authors suggest adding rapid phylogenetic analysis as a step in the submission process, in order to allow closely related novel sequences to validate each other at the time of submission.
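
    The quality-control issue can be sketched in a few lines. A deletion shifts the reading frame whenever its length is not a multiple of three, which is the case for a 35 base pair deletion. The triage rule below is a deliberately simplified stand-in for the authors' suggestion: instead of real phylogenetic placement, it treats independent submissions reporting the identical deletion as corroborating each other. The function names, the `min_support` threshold, and the sample IDs and coordinates are hypothetical and do not reflect any repository's actual pipeline.

```python
from collections import Counter

def causes_frameshift(deletion_length):
    """A deletion shifts the reading frame unless its length is a multiple of 3."""
    return deletion_length % 3 != 0

def triage(submissions, min_support=2):
    """Decide what to do with each submission.

    submissions : list of (sample_id, deletion_start, deletion_length)
    min_support : hypothetical number of submissions that must share a
                  frameshifting deletion before it is accepted automatically
    """
    support = Counter((start, length) for _, start, length in submissions)
    decisions = {}
    for sample_id, start, length in submissions:
        if not causes_frameshift(length):
            decisions[sample_id] = "accept"
        elif support[(start, length)] >= min_support:
            decisions[sample_id] = "accept (frameshift corroborated)"
        else:
            decisions[sample_id] = "manual review"
    return decisions

# Hypothetical sample IDs and coordinates, for illustration only.
batch = [
    ("sample_A", 100, 35),  # 35 bp deletion, seen twice in this batch
    ("sample_B", 100, 35),
    ("sample_C", 200, 36),  # in-frame deletion (multiple of 3)
    ("sample_D", 300, 1),   # isolated single-base deletion
]
for sample_id, decision in triage(batch).items():
    print(sample_id, "->", decision)
```

    Under this rule an isolated frameshift, which is more likely to be a sequencing or assembly error, still goes to manual review, while a frameshift reported repeatedly is not penalized for being real.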

    Study strengths

    Routine genomic surveillance with whole genome sequencing was used to identify a new SARS-CoV-2 lineage harboring several mutations found in other VOC.

    Limitations

    The sample size for B.1.x sequences in this dataset is quite small (n=8), and none were detected at the last two study timepoints, both of which limit the accuracy of growth estimates. Additionally, the samples do not represent a randomized sample from the region. The growth rate of the B.1.x lineage was estimated using only a simple logistic regression model, as samples were anonymized and lacked covariate data. Functional relevance of the combination of mutations found in B.1.x was not assessed.
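
    As a rough illustration of how much uncertainty a count of 8 out of 339 carries, the sketch below computes a Wilson score interval for the overall B.1.x proportion. The choice of the Wilson method and the 95% level (z = 1.96) are assumptions made here, not the authors' analysis, and the interval treats the samples as a simple random sample, which the study notes they are not.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width

# Counts reported in the summary: 8 B.1.x samples among 339 sequenced.
low, high = wilson_interval(8, 339)
print(f"observed: {8 / 339:.1%}, approximate 95% interval: {low:.1%} to {high:.1%}")
```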

    Value added

    This study describes the identification of a new SARS-CoV-2 lineage (B.1.x) by genomic surveillance in early 2021. B.1.x contains a large deletion (and consequent frameshift mutation that inactivates ORF8), which may lead to underrepresentation of this lineage in sequence databases, as initial submissions of sequences containing frameshift deletions are automatically rejected. This illustrates a limitation in our ability to accurately monitor the spread of some SARS-CoV-2 lineages and VOC.

  2. SciScore for 10.1101/2021.04.05.438352

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: not detected.
    Randomization: The system was also used to extract 1000 random genomes from the tree to visualize newly sequenced genomes on the background of SARS-CoV-2 genomic variation.
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: not detected.

    Table 2: Resources

    Software and Algorithms
    Sentence: Additionally, we use the “view in Genome Browser” option to manually scan sequences for mutations from known VOCs and for other mutations in similar genomic positions within the Spike protein within the SARS-CoV-2 Genome Browser (Fernandes et al. 2020).
    Resource: SARS-CoV-2 Genome Browser (suggested: None)

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    We caution that there are important limitations to this study. (1) Our relatively small sample size imposes substantial uncertainty in our estimated frequencies, (2) we did not detect B.1.x in the most recent 16 samples from late March (which represent only a fraction of positive cases, see Table S3), and (3) our samples were taken opportunistically from positive tests and are not a random sample of infections or cases. The non-random sampling and underlying phylogeny impose complex dependencies on the data that are not part of the simple logistic regression analysis (see similar caveats for early investigations of other variants of SARS-CoV-2, e.g. Volz, Hill, et al. 2021; Volz, Mishra, et al. 2021). Comparison against publicly available samples from across the United States indicates the B.1.x lineage is present in several US states including New York, Florida, Georgia, and Indiana. Additionally, phylogenetic evidence suggests successful establishment and ongoing transmission in each of those states as well as others (Figure 4). B.1.x does not appear to be a recombinant: Despite the fact that this lineage shares several mutations with established VOCs, homoplastic substitution and not recombination is the most likely explanation. Recent work suggests that recombination has occurred in SARS-CoV-2 (VanInsberghe et al. 2021; Varabyou et al. 2020). Nonetheless, recombination should produce an extended stretch of sequence similarity between the donor and recipient lineage, and v...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.