Stability of SARS-CoV-2 phylogenies

Yatish Turakhia
Nicola De Maio
Bryan Thornlow
Landen Gozashti
Robert Lanfear
Conor R. Walker
Angie S. Hinrichs
Jason D. Fernandes
Rui Borges
Greg Slodkowicz
Lukas Weilguny
David Haussler
Nick Goldman
Russell Corbett-Detig

Abstract

The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab- or protocol-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites, and are more likely to affect the protein-coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and can make it appear as though there has been an excess of recurrent mutation or recombination among viral lineages. We suggest how samples can be screened and problematic variants removed, and we plan to regularly inform the scientific community with our updated results as more SARS-CoV-2 genome sequences are shared ( https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 and https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480 ). We also develop tools for comparing and visualizing differences among very large phylogenies, and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes.
Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors, and establish a widely shared, stable clade structure for more accurate scientific inference and discourse.
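
The screening idea in the abstract (recurrent mutations observed predominantly or exclusively by a single lab) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name, the input format of (mutation, lab) pairs, and the thresholds `min_count` and `min_share` are all hypothetical choices made for illustration, and the toy data is invented.

```python
from collections import Counter, defaultdict


def flag_lab_associated_mutations(observations, min_count=3, min_share=0.8):
    """Flag recurrent mutations dominated by a single submitting lab.

    observations: iterable of (mutation, lab) pairs, one pair per
        independent occurrence of the mutation (hypothetical input format).
    min_count: minimum number of occurrences for a mutation to count as
        recurrent (illustrative threshold, not taken from the paper).
    min_share: minimum fraction of occurrences that must come from one
        lab for the mutation to be flagged (illustrative threshold).

    Returns a dict mapping each flagged mutation to (top_lab, share).
    """
    labs_per_mutation = defaultdict(Counter)
    for mutation, lab in observations:
        labs_per_mutation[mutation][lab] += 1

    flagged = {}
    for mutation, labs in labs_per_mutation.items():
        total = sum(labs.values())
        if total < min_count:
            continue  # too few occurrences to call the mutation recurrent
        top_lab, top_count = labs.most_common(1)[0]
        share = top_count / total
        if share >= min_share:
            flagged[mutation] = (top_lab, share)
    return flagged


# Invented toy data: C100T recurs mostly in lab_A, G200A is spread evenly
# across labs, and A300G is seen only once.
toy = (
    [("C100T", "lab_A")] * 5
    + [("C100T", "lab_B")]
    + [("G200A", "lab_A"), ("G200A", "lab_B"), ("G200A", "lab_C")]
    + [("A300G", "lab_C")]
)
flagged = flag_lab_associated_mutations(toy)
```

A flagged site is only a candidate for closer inspection or masking; lab association alone does not prove a sequencing error, since a genuine variant circulating locally could also be sampled mostly by one lab.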

Version published to 10.1371/journal.pgen.1009175
Nov 18, 2020

SciScore for 10.1101/2020.06.08.141127:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
The likelihood of a tree given the alignment from which it was constructed was automatically calculated by the IQ-TREE command used above (iqtree -s -m GTR+G).	IQ-TREE suggested: (IQ-TREE, RRID:SCR_017254)

Results from OddPub: Thank you for sharing your code and data.

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We found bar graphs of continuous data. We recommend replacing bar graphs with more informative graphics, as many different datasets can lead to the same bar graph. The actual data may suggest different conclusions from the summary statistics. For more information, please see Weissgerber et al (2015).

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
No funding statement was detected.
No protocol registration statement was detected.

Version published to 10.1101/2020.06.08.141127 on bioRxiv
Jun 9, 2020
