A new SARS-CoV-2 lineage that shares mutations with known Variants of Concern is rejected by automated sequence repository quality control

Bryan Thornlow
Angie S. Hinrichs
Miten Jain
Namrita Dhillon
Scott La
Joshua D. Kapp
Ikenna Anigbogu
Molly Cassatt-Johnstone
Jakob McBroome
Maximilian Haeussler
Yatish Turakhia
Terren Chang
Hugh E Olsen
Jeremy Sanford
Michael Stone
Olena Vaske
Isabel Bjork
Mark Akeson
Beth Shapiro
David Haussler
A. Marm Kilpatrick
Russell Corbett-Detig

Listed in

Evaluated articles (ScreenIT)
Evaluated articles (NCRC)

Abstract

We report a SARS-CoV-2 lineage that shares N501Y, P681H, and other mutations with known variants of concern, such as B.1.1.7. This lineage, which we refer to as B.1.x (COG-UK sometimes references similar samples as B.1.324.1), is present in at least 20 states across the USA and in at least six countries. However, a large deletion causes the sequence to be automatically rejected from repositories, suggesting that the frequency of this new lineage is underestimated using public data. Recent dynamics based on 339 samples obtained in Santa Cruz County, CA, USA suggest that B.1.x may be increasing in frequency at a rate similar to that of B.1.1.7 in Southern California. At present the functional differences between this variant B.1.x and other circulating SARS-CoV-2 variants are unknown, and further studies on secondary attack rates, viral loads, immune evasion and/or disease severity are needed to determine if it poses a public health concern. Nonetheless, given what is known from well-studied circulating variants of concern, it seems unlikely that the lineage could pose larger concerns for human health than many already globally distributed lineages. Our work highlights a need for rapid turnaround time from sequence generation to submission and improved sequence quality control that removes submission bias. We identify promising paths toward this goal.

NCRC
May 14, 2021

Our take

This study, available as a preprint and thus not yet peer-reviewed, describes the identification of a new SARS-CoV-2 lineage (B.1.x/B.1.324.1) in central California in early 2021. The B.1.x lineage carries mutations found in other VOCs, which may contribute to increased transmission or evasion of host immune responses. The B.1.x lineage does not appear to pose a greater health risk than other VOCs, but some B.1.x sequences in the UK have additionally acquired the E484K mutation. If these B.1.x sublineages become widespread, they may reduce the efficacy of vaccines or antibody-based treatments for COVID-19. The B.1.x lineage's 35 base pair deletion in ORF8 leads to a frameshift and premature stop codon, making submission of these sequences to standard databases (e.g., GenBank, GISAID) problematic. This illustrates a potential limitation of using curated sequence data to monitor the spread of B.1.x and similar lineages. The authors suggest mechanisms to limit future submission bias, which would improve genomic surveillance results.

Study design

Retrospective cohort

Study population and setting

This study describes the identification of a novel SARS-CoV-2 lineage, B.1.x (sometimes referred to as B.1.324.1), in Santa Cruz County, CA, USA, in early 2021. Phylogenetic analysis was performed using consensus sequences from SARS-CoV-2-positive residual samples (n=339) and randomly selected global background sequences (n=1,000). Similar sequences were retrieved from GenBank and GISAID for comparison. The growth rate of the B.1.x lineage was estimated using a simple logistic regression model.
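To make the growth-rate estimate concrete, here is a minimal sketch of a simple logistic regression of lineage frequency on time, fitted by Newton-Raphson using only the Python standard library. It is not the authors' code. The function name, the weekly binning, and the counts are all hypothetical and invented for illustration; they are not the study's data.

```python
import math

def fit_logistic_growth(times, totals, hits, iterations=50):
    """Fit logit(p) = a + b * time by Newton-Raphson on a binomial likelihood.

    times  : sampling time of each bin (e.g. week number)
    totals : number of sequenced samples in each bin
    hits   : number of those samples assigned to the lineage of interest
    Returns (a, b); b is the growth rate of the log-odds per unit time.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for t, n, k in zip(times, totals, hits):
            p = 1.0 / (1.0 + math.exp(-(a + b * t)))
            w = n * p * (1.0 - p)      # binomial variance weight
            g_a += k - n * p           # gradient w.r.t. intercept
            g_b += (k - n * p) * t     # gradient w.r.t. slope
            h_aa += w
            h_ab += w * t
            h_bb += w * t * t
        det = h_aa * h_bb - h_ab * h_ab
        if det == 0.0:
            break
        # Newton step: solve the 2x2 system H * delta = gradient.
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a += da
        b += db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b

# Hypothetical weekly counts, invented for illustration only.
weeks = [0, 2, 4, 6, 8, 10]
sequenced = [50, 50, 50, 50, 50, 50]
lineage = [0, 1, 1, 2, 4, 5]

a, b = fit_logistic_growth(weeks, sequenced, lineage)
print(f"log-odds growth per week: {b:.3f}")
if b > 0:
    print(f"odds doubling time: {math.log(2) / b:.1f} weeks")
```

A positive slope means the odds of a sample belonging to the lineage rise over time. A model this simple ignores sampling structure and phylogenetic dependence between samples.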

Summary of main findings

More than half of the sequences identified in this dataset were from the B.1.427 and B.1.429 lineages, which were first identified in California. Two B.1.1.7 sequences were also found, but no other CDC-designated variants of concern (VOC) were identified. However, eight samples (2.4%), collected in February and March 2021, appeared to represent a new lineage within B.1, which the authors temporarily refer to as B.1.x, awaiting more refined classification. Prevalence of B.1.x increased over time, from 1% in January 2021 to 10% in March 2021. Additional sequences similar to B.1.x were identified in at least 20 US states and at least six countries. (Of note, some UK sequences had been submitted under lineage B.1.324.1.) Lineage-defining point mutations for B.1.x include several in the spike protein (S494P, N501Y, D614G, P681H, K854N, and E1111K) and N:M234I. While several of these mutations are shared with other VOCs, it appears unlikely that B.1.x is the result of a recombination event. B.1.x sequences also contain a large 35 base pair deletion in ORF8, which results in a premature stop codon. The biological significance of ORF8 inactivation, which is also present in B.1.1.7, is still unknown. However, because the deletion in B.1.x sequences leads to a frameshift, their submission to sequence repositories is automatically rejected. Successful submission of these sequences requires additional, lengthy steps in the manual curation process that many labs elect not to complete, instead choosing to abandon submission or to modify the sequences (e.g., adding N's in place of deleted residues) in order to bypass quality control mechanisms. This means that B.1.x and other lineages with frameshift mutations may be underrepresented in sequence databases, limiting the ability to accurately estimate their impact on the pandemic.
The authors suggest adding rapid phylogenetic analysis as a step in the submission process, so that closely related novel sequences can validate each other at the time of submission.
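The quality control issue can be illustrated with a short sketch. A deletion whose length is not a multiple of three shifts the reading frame, which is why a 35 base pair deletion trips an automated frameshift check. The functions below are hypothetical and do not reproduce any repository's actual pipeline. In particular, the `min_support` rule is only an assumed stand-in for the authors' idea of letting closely related sequences validate each other.

```python
def classify_deletion(length_bp):
    """Return 'frameshift' if the deletion length breaks the codon frame."""
    if length_bp <= 0:
        raise ValueError("deletion length must be positive")
    return "in-frame" if length_bp % 3 == 0 else "frameshift"

def mask_deletion_with_n(aligned_seq, gap_char="-"):
    """Replace alignment gaps with N, the workaround described above.

    This hides the deletion from a frameshift check, but it also erases
    the signal that would identify the lineage.
    """
    return aligned_seq.replace(gap_char, "N")

def qc_decision(deletion_bp, supporting_neighbors, min_support=1):
    """Hypothetical gate: accept a frameshift only if near-identical,
    independently submitted sequences carry the same deletion."""
    if classify_deletion(deletion_bp) == "in-frame":
        return "accept"
    if supporting_neighbors >= min_support:
        return "accept (validated by related sequences)"
    return "hold for manual curation"

print(classify_deletion(35))                  # 35 % 3 == 2, so the frame shifts
print(mask_deletion_with_n("ATG---TTAA"))
print(qc_decision(35, supporting_neighbors=0))
print(qc_decision(35, supporting_neighbors=3))
```

Holding a sequence rather than rejecting it outright keeps a genuine deletion visible to curators, whereas masking with N's discards exactly the information surveillance needs.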

Study strengths

Routine genomic surveillance with whole genome sequencing was used to identify a new SARS-CoV-2 lineage harboring several mutations found in other VOC.

Limitations

The number of B.1.x sequences in this dataset is quite small (n=8), and none were detected at the last two study timepoints; both factors limit the accuracy of growth estimates. Additionally, the samples are not a random sample from the region. The growth rate of the B.1.x lineage was estimated using only a simple logistic regression model, as samples were anonymized and lacked covariate data. The functional relevance of the combination of mutations found in B.1.x was not assessed.
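The uncertainty implied by eight detections among 339 samples can be quantified with a binomial confidence interval. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not a method reported in the study. It also treats the samples as independent and random, which the limitations above say they are not.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

low, high = wilson_interval(8, 339)
print(f"observed {8 / 339:.1%}, interval {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 1% to 5%, so the point estimate of 2.4% is compatible with a severalfold range of true frequencies.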

Value added

This study describes the identification of a new SARS-CoV-2 lineage (B.1.x) by genetic surveillance in early 2021. B.1.x contains a large deletion (and consequent frameshift mutation that inactivates ORF8) which may lead to underrepresentation of this lineage in sequence databases, as initial submissions of sequences containing frameshift deletions are automatically rejected. This illustrates a limitation in our ability to accurately monitor the spread of some SARS-CoV-2 lineages and VOC.

SciScore for 10.1101/2021.04.05.438352:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

Institutional Review Board Statement	not detected.
Randomization	The system was also used to extract 1000 random genomes from the tree to visualize newly sequenced genomes on the background of SARS-CoV-2 genomic variation.
Blinding	not detected.
Power Analysis	not detected.
Sex as a biological variable	not detected.

Table 2: Resources

Software and Algorithms
Sentences	Resources
Additionally, we use the “view in Genome Browser” option to manually scan sequences for mutations from known VOCs and for other mutations in similar genomic positions within the Spike protein within the SARS-CoV-2 Genome Browser (Fernandes et al. 2020).	SARS-CoV-2 Genome Browser suggested: None

Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

We caution that there are important limitations to this study. (1) Our relatively small sample size imposes substantial uncertainty in our estimated frequencies, (2) we did not detect B.1.x in the most recent 16 samples from late March (which represent only a fraction of positive cases, see Table S3), and (3) our samples were taken opportunistically from positive tests and are not a random sample of infections or cases. The non-random sampling and underlying phylogeny impose complex dependencies on the data that are not part of the simple logistic regression analysis (see similar caveats for early investigations of other variants of SARS-CoV-2, e.g. Volz, Hill, et al. 2021; Volz, Mishra, et al. 2021). Comparison against publicly available samples from across the United States indicates the B.1.x lineage is present in several US states including New York, Florida, Georgia, and Indiana. Additionally, phylogenetic evidence suggests successful establishment and ongoing transmission in each of those states as well as others (Figure 4). B.1.x does not appear to be a recombinant: Despite the fact that this lineage shares several mutations with established VOCs, homoplastic substitution and not recombination is the most likely explanation. Recent work suggests that recombination has occurred in SARS-CoV-2 (VanInsberghe et al. 2021; Varabyou et al. 2020). Nonetheless, recombination should produce an extended stretch of sequence similarity between the donor and recipient lineage, and v...

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Version published to 10.1101/2021.04.05.438352 on bioRxiv
Apr 6, 2021

