Maximum likelihood pandemic-scale phylogenetics
Abstract
Phylogenetics has a crucial role in genomic epidemiology. Enabled by unparalleled volumes of genome sequence data generated to study and help contain the COVID-19 pandemic, phylogenetic analyses of SARS-CoV-2 genomes have shed light on the virus’s origins, spread, and the emergence and reproductive success of new variants. However, most phylogenetic approaches, including maximum likelihood and Bayesian methods, cannot scale to the size of the datasets from the current pandemic. We present ‘MAximum Parsimonious Likelihood Estimation’ (MAPLE), an approach for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. MAPLE infers SARS-CoV-2 phylogenies more accurately than existing maximum likelihood approaches while running up to thousands of times faster, and requiring at least 100 times less memory on large datasets. This extends the reach of genomic epidemiology, allowing the continued use of accurate phylogenetic, phylogeographic and phylodynamic analyses on datasets of millions of genomes.
SciScore for 10.1101/2022.03.22.485312:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Experimental Models: Organisms/Strains
- Sentence: For simplicity, we assume that e1 = [τ1, i1, c1, v1] and e2 = [τ2, i2, c2, v2], that i = max(i1, i2), and that the intersection fragment between e1 and e2 consists of λ nucleotides, that is λ = min(q1, q2) + 1 − i; in case τ1 = O and other similar cases then we have necessarily λ = 1.
  Resources: c1 (suggested: None); e1 (suggested: None)
Software and Algorithms
- Sentence: Similar to VCF and CRAM [23] files, we express each genome sequence in terms of its differences (substitutions and deletions) with respect to the reference, representing only the differences of each genome compared to the reference.
  Resources: CRAM (suggested: CRAM, RRID:SCR_012975)
- Sentence: The underlying principle is similar to the one in the previous section and in UShER [56]: we want to represent sequence information concisely as a set of differences with respect to the reference.
  Resources: UShER (suggested: None)
- Sentence: Here we describe our efficient implementation of maximum likelihood phylogenetic placement within MAPLE using the likelihood genome lists presented in Sections 5.2 and 5.4.
  Resources: MAPLE (suggested: Maple, RRID:SCR_014449)
- Sentence: Note that some of our heuristics for SPR search are similar to some that have been developed for other phylogenetic packages, and in particular RAxML.
  Resources: RAxML (suggested: RAxML, RRID:SCR_006086)
- Sentence: 5.7 Software implementation: We implemented our methods in a Python3 script available from https://github.com/NicolaDM/MAPLE.
  Resources: Python3 (suggested: None)
- Sentence: FastTree 2 was executed with options “-quiet” to limit screen output, “-nosupport” to skip support value computations, and “-nocat” to ignore rate variation.
  Resources: FastTree (suggested: FastTree, RRID:SCR_015501)
- Sentence: RAxML-NG was run with options “--threads 1” to use only one core per replicate on our cluster.
  Resources: RAxML-NG (suggested: None)
- Sentence: In order to speed up execution of MAPLE, we use PyPy (v7.3.5 with GCC 7.3.1 20180303 for Python 3.7.10; see https://www.pypy.org/#!).
  Resources: Python (suggested: IPython, RRID:SCR_001658)
- Sentence: We used phastSim v0.0.3 [15] to simulate sequence evolution along this tree according to the non-reversible non-stationary neutral mutation rates estimated in [14] and using the SARS-CoV-2 Wuhan-Hu-1 genome [60] as root sequence.
  Resources: phastSim (suggested: None)
- Sentence: In simulations with rate variation we evaluate topology likelihoods in IQ-TREE 2 using a GTR+G model with four categories (which is slightly different from the rate variation model used in simulations, but which is available in IQ-TREE 2), while in all other cases we use a GTR model without rate variation.
  Resources: IQ-TREE (suggested: IQ-TREE, RRID:SCR_017254)
Results from OddPub: Thank you for sharing your code.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
- No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.
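Illustrative note on the representation quoted in Table 2: the sentences there describe storing each genome only as its differences (substitutions and deletions) from the reference, and give the length of the overlap between two entries as λ = min(q1, q2) + 1 − i with i = max(i1, i2). The following is a minimal Python sketch of those two ideas, not MAPLE's actual code. The `Diff` tuple layout, the function names, the use of '-' for a deleted position, and the reading of q1 and q2 as the last positions covered by each entry are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical entry type for this sketch: (kind, position, value).
# kind is "sub" (substitution) or "del" (deletion); position is 1-based.
Diff = Tuple[str, int, str]


def encode_differences(reference: str, genome: str) -> List[Diff]:
    """List the positions where an aligned genome differs from the reference.

    Assumes both strings have the same length (already aligned) and that
    '-' marks a deleted position in the genome.
    """
    if len(reference) != len(genome):
        raise ValueError("sequences must be aligned to the same length")
    diffs: List[Diff] = []
    for pos, (ref_base, base) in enumerate(zip(reference, genome), start=1):
        if base == ref_base:
            continue  # identical to the reference: nothing is stored
        if base == "-":
            diffs.append(("del", pos, "-"))
        else:
            diffs.append(("sub", pos, base))
    return diffs


def decode_differences(reference: str, diffs: List[Diff]) -> str:
    """Rebuild the aligned genome from the reference plus its differences."""
    bases = list(reference)
    for _kind, pos, value in diffs:
        bases[pos - 1] = value
    return "".join(bases)


def overlap_length(start1: int, end1: int, start2: int, end2: int) -> int:
    """Number of positions shared by two inclusive intervals.

    Mirrors the quoted formula min(q1, q2) + 1 - max(i1, i2), assuming q1
    and q2 are the last positions of each entry. The result is only
    meaningful when the intervals actually overlap.
    """
    return min(end1, end2) + 1 - max(start1, start2)
```

For a genome that is nearly identical to the reference, the encoded list stays short regardless of genome length, which is the point of the difference-based representation. The round trip through `decode_differences` is included only to show that no information is lost under the stated assumptions.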
