Maximum likelihood pandemic-scale phylogenetics

Abstract

Phylogenetics has a crucial role in genomic epidemiology. Enabled by unparalleled volumes of genome sequence data generated to study and help contain the COVID-19 pandemic, phylogenetic analyses of SARS-CoV-2 genomes have shed light on the virus’s origins, spread, and the emergence and reproductive success of new variants. However, most phylogenetic approaches, including maximum likelihood and Bayesian methods, cannot scale to the size of the datasets from the current pandemic. We present ‘MAximum Parsimonious Likelihood Estimation’ (MAPLE), an approach for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. MAPLE infers SARS-CoV-2 phylogenies more accurately than existing maximum likelihood approaches while running up to thousands of times faster, and requiring at least 100 times less memory on large datasets. This extends the reach of genomic epidemiology, allowing the continued use of accurate phylogenetic, phylogeographic and phylodynamic analyses on datasets of millions of genomes.

Article activity feed

  1. SciScore for 10.1101/2022.03.22.485312:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Experimental Models: Organisms/Strains
    Sentences | Resources
    For simplicity, we assume that e1 = [τ1, i1, c1, v1] and e2 = [τ2, i2, c2, v2], that i = max(i1, i2), and that the intersection fragment between e1 and e2 consists of λ nucleotides, that is λ = min(q1, q2) + 1 – i; in case τ1 = O and other similar cases then we have necessarily λ = 1.
    c1
    suggested: None
    e1
    suggested: None
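
The interval arithmetic in the sentence quoted above can be illustrated with a short Python sketch. It is an illustration only, not code from MAPLE: the excerpt does not define q1 and q2, so the sketch assumes they are the last genome positions (1-based, inclusive) covered by e1 and e2, and the function name `intersection_length` is hypothetical.

```python
def intersection_length(i1, q1, i2, q2):
    """Return (i, lam) for two entries covering positions i1..q1 and i2..q2.

    Assumption: positions are 1-based and inclusive, and q1, q2 are the last
    positions covered by each entry (not stated in the quoted sentence).
    """
    i = max(i1, i2)            # first position shared by both entries
    lam = min(q1, q2) + 1 - i  # number of shared nucleotides
    if lam <= 0:
        raise ValueError("the two entries do not overlap")
    return i, lam


# An entry covering 10..20 and one covering 15..30 share positions 15..20.
print(intersection_length(10, 20, 15, 30))
```

When either entry covers a single position, the formula gives λ = 1, which matches the special case mentioned in the quoted sentence.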
    Software and Algorithms
    Sentences | Resources
    Similar to VCF and CRAM [23] files, we express each genome sequence in terms of its differences (substitutions and deletions) with respect to the reference, representing only the differences of each genome compared to the reference.
    CRAM
    suggested: (CRAM, RRID:SCR_012975)
    The underlying principle is similar to the one in the previous section and in UShER [56]: we want to represent sequence information concisely as a set of differences with respect to the reference.
    UShER
    suggested: None
    Here we describe our efficient implementation of maximum likelihood phylogenetic placement within MAPLE using the likelihood genome lists presented in Sections 5.2 and 5.4.
    MAPLE
    suggested: (Maple, RRID:SCR_014449)
    Note that some of our heuristics for SPR search are similar to some that have been developed for other phylogenetic packages, and in particular RAxML.
    RAxML
    suggested: (RAxML, RRID:SCR_006086)
    5.7 Software implementation: We implemented our methods in a Python3 script available from https://github.com/NicolaDM/MAPLE.
    Python3
    suggested: None
    FastTree 2 was executed with options “-quiet” to limit screen output, “-nosupport” to skip support value computations, and “-nocat” to ignore rate variation.
    FastTree
    suggested: (FastTree, RRID:SCR_015501)
    RAxML-NG was run with options “--threads 1” to use only one core per replicate on our cluster.
    RAxML-NG
    suggested: None
    In order to speed up execution of MAPLE, we use PyPy (v7.3.5 with GCC 7.3.1 20180303 for Python 3.7.10; see https://www.pypy.org/#!).
    Python
    suggested: (IPython, RRID:SCR_001658)
    We used phastSim v0.0.3 [15] to simulate sequence evolution along this tree according to the non-reversible non-stationary neutral mutation rates estimated in [14] and using the SARS-CoV-2 Wuhan-Hu-1 genome [60] as root sequence.
    phastSim
    suggested: None
    In simulations with rate variation we evaluate topology likelihoods in IQ-TREE 2 using a GTR+G model with four categories (which is slightly different from the rate variation model used in simulations, but which is available in IQ-TREE 2), while in all other cases we use a GTR model without rate variation.
    IQ-TREE
    suggested: (IQ-TREE, RRID:SCR_017254)
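
Several of the sentences quoted above describe representing each genome only by its differences (substitutions and deletions) from the reference. A minimal Python sketch of that idea follows. It assumes the genome is already aligned to the reference with "-" marking deleted positions, and the `(position, symbol, length)` tuple layout and both function names are hypothetical choices made for this illustration; it does not reproduce MAPLE's actual input format or data structures.

```python
def encode_differences(reference, genome):
    """List the differences between an aligned genome and the reference.

    Each entry is (position, symbol, length) with 1-based positions.
    Substitutions have length 1; consecutive '-' characters are merged
    into a single deletion entry.
    """
    if len(reference) != len(genome):
        raise ValueError("genome must be aligned to the reference")
    diffs = []
    pos, n = 0, len(reference)
    while pos < n:
        base = genome[pos]
        if base == "-":
            start = pos
            while pos < n and genome[pos] == "-":
                pos += 1
            diffs.append((start + 1, "-", pos - start))
        else:
            if base != reference[pos]:
                diffs.append((pos + 1, base, 1))
            pos += 1
    return diffs


def decode_differences(reference, diffs):
    """Rebuild the aligned genome from the reference and its difference list."""
    seq = list(reference)
    for position, symbol, length in diffs:
        for offset in range(length):
            seq[position - 1 + offset] = symbol
    return "".join(seq)


reference = "ACGTACGT"
genome = "ACTTA--T"
diffs = encode_differences(reference, genome)
print(diffs)
print(decode_differences(reference, diffs) == genome)
```

For closely related genomes the difference list is far shorter than the full sequence, which is the source of the memory savings the quoted sentences allude to.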

    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.