Maximum likelihood pandemic-scale phylogenetics

Abstract

Phylogenetics has a crucial role in genomic epidemiology. Enabled by unparalleled volumes of genome sequence data generated to study and help contain the COVID-19 pandemic, phylogenetic analyses of SARS-CoV-2 genomes have shed light on the virus’s origins, spread, and the emergence and reproductive success of new variants. However, most phylogenetic approaches, including maximum likelihood and Bayesian methods, cannot scale to the size of the datasets from the current pandemic. We present ‘MAximum Parsimonious Likelihood Estimation’ (MAPLE), an approach for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. MAPLE infers SARS-CoV-2 phylogenies more accurately than existing maximum likelihood approaches while running up to thousands of times faster, and requiring at least 100 times less memory on large datasets. This extends the reach of genomic epidemiology, allowing the continued use of accurate phylogenetic, phylogeographic and phylodynamic analyses on datasets of millions of genomes.

SciScore for 10.1101/2022.03.22.485312:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Experimental Models: Organisms/Strains
Sentences	Resources
For simplicity, we assume that e1 = [τ1, i1, c1, v1] and e2 = [τ2, i2, c2, v2], that i = max(i1, i2), and that the intersection fragment between e1 and e2 consists of λ nucleotides, that is, λ = min(q1, q2) + 1 − i; when τ1 = O, and in other similar cases, we necessarily have λ = 1.	c1 suggested: None; e1 suggested: None
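
The intersection length in the sentence above is ordinary interval arithmetic. The following is a minimal Python sketch, assuming (the excerpt does not define them) that q1 and q2 are the last genome positions covered by e1 and e2, and that positions are 1-based and inclusive. The function name is hypothetical and is not taken from MAPLE.

```python
def intersection_length(i1: int, q1: int, i2: int, q2: int) -> int:
    """Number of positions shared by the inclusive ranges [i1, q1] and [i2, q2].

    Assumption: q1 and q2 are the last positions covered by each entry;
    the quoted sentence does not define them.
    """
    i = max(i1, i2)            # first shared position
    lam = min(q1, q2) + 1 - i  # lambda = min(q1, q2) + 1 - i
    if lam <= 0:
        raise ValueError("entries do not overlap")
    return lam


# Entries covering positions 10..50 and 30..80 share positions 30..50.
print(intersection_length(10, 50, 30, 80))  # 21
```

When one of the two entries covers a single position, the same formula gives λ = 1, which matches the special case mentioned in the sentence.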
Software and Algorithms
Sentences	Resources
Similar to VCF and CRAM [23] files, we express each genome sequence in terms of its differences (substitutions and deletions) with respect to the reference, representing only the differences of each genome compared to the reference.	CRAM suggested: (CRAM, RRID:SCR_012975)
The underlying principle is similar to the one in the previous section and in UShER [56]: we want to represent sequence information concisely as a set of differences with respect to the reference.	UShER suggested: None
Here we describe our efficient implementation of maximum likelihood phylogenetic placement within MAPLE using the likelihood genome lists presented in Sections 5.2 and 5.4.	MAPLE suggested: (Maple, RRID:SCR_014449)
Note that some of our heuristics for SPR search are similar to some that have been developed for other phylogenetic packages, and in particular RAxML.	RAxML suggested: (RAxML, RRID:SCR_006086)
5.7 Software implementation: We implemented our methods in a Python3 script available from https://github.com/NicolaDM/MAPLE.	Python3 suggested: None
FastTree 2 was executed with options “-quiet” to limit screen output, “-nosupport” to skip support value computations, and “-nocat” to ignore rate variation.	FastTree suggested: (FastTree, RRID:SCR_015501)
RAxML-NG was run with the option “--threads 1” to use only one core per replicate on our cluster.	RAxML-NG suggested: None
In order to speed up execution of MAPLE, we use PyPy (v7.3.5 with GCC 7.3.1 20180303 for Python 3.7.10; see https://www.pypy.org/#!).	Python suggested: (IPython, RRID:SCR_001658)
We used phastSim v0.0.3 [15] to simulate sequence evolution along this tree according to the non-reversible non-stationary neutral mutation rates estimated in [14] and using the SARS-CoV-2 Wuhan-Hu-1 genome [60] as root sequence.	phastSim suggested: None
In simulations with rate variation we evaluate topology likelihoods in IQ-TREE 2 using a GTR+G model with four categories (which is slightly different from the rate variation model used in simulations, but which is available in IQ-TREE 2), while in all other cases we use a GTR model without rate variation.	IQ-TREE suggested: (IQ-TREE, RRID:SCR_017254)
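
The difference-based representation quoted in the table (each genome stored only as its substitutions and deletions relative to the reference) can be illustrated with a short Python sketch. It assumes the genome is already aligned to the reference and has the same length, with "-" marking deleted positions. The function names and the tuple layout are hypothetical and do not reproduce MAPLE's actual input format.

```python
from typing import List, Tuple

# (1-based position, character observed in the sample); "-" marks a deletion.
Diff = Tuple[int, str]


def encode_diffs(reference: str, aligned: str) -> List[Diff]:
    """Keep only the positions where the aligned genome differs from the reference."""
    if len(reference) != len(aligned):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, obs)
        for pos, (ref, obs) in enumerate(zip(reference, aligned), start=1)
        if obs != ref
    ]


def decode_diffs(reference: str, diffs: List[Diff]) -> str:
    """Rebuild the aligned genome from the reference and its list of differences."""
    seq = list(reference)
    for pos, obs in diffs:
        seq[pos - 1] = obs
    return "".join(seq)


reference = "ACGTACGTAC"
sample = "ACGAAC-TAC"
diffs = encode_diffs(reference, sample)
print(diffs)  # [(4, 'A'), (7, '-')]
assert decode_diffs(reference, diffs) == sample
```

For closely related genomes the list of differences is far shorter than the sequence itself, which is the intuition behind storing only the differences.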

Results from OddPub: Thank you for sharing your code.

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.
