Robust expansion of phylogeny for fast-growing genome sequence data

Abstract

Massive sequencing of SARS-CoV-2 genomes has created an urgent need for methods that add new samples to existing phylogenies efficiently instead of inferring trees de novo. TIPars was developed to meet this challenge by integrating parsimony analysis with pre-computed ancestral sequences. It took about 21 seconds to insert 100 SARS-CoV-2 genomes into a 100k-taxa reference tree using 1.4 gigabytes of memory. In benchmarks on four datasets, TIPars achieved the highest accuracy for phylogenies of moderately similar sequences. For highly similar and for divergent sequences, fully parsimony-based and likelihood-based phylogenetic placement methods performed best, respectively, with TIPars second best. TIPars thus expands phylogenies of both similar and divergent sequences efficiently and accurately, which should have broad biological applications beyond SARS-CoV-2. TIPars is accessible at https://tipars.hku.hk/ and its source code is available at https://github.com/id-bioinfo/TIPars .
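To make the idea of parsimony-based placement with pre-computed ancestral sequences concrete, here is a minimal sketch in Python. It is not the TIPars algorithm: the function names, the branch dictionary, and the scoring rule (count the aligned sites where the query matches neither end of a branch) are assumptions chosen for illustration, and ties are broken arbitrarily by branch name.

```python
def added_substitutions(query, parent_seq, child_seq, missing="N-?"):
    """Count sites where the query matches neither end of a branch.

    Hypothetical local score: attaching the query at a new node on the
    branch costs one extra substitution at each such site. Sites where
    the query itself is missing or ambiguous are skipped.
    """
    if not (len(query) == len(parent_seq) == len(child_seq)):
        raise ValueError("sequences must be aligned to the same length")
    cost = 0
    for q, a, b in zip(query, parent_seq, child_seq):
        if q in missing:
            continue
        if q != a and q != b:
            cost += 1
    return cost


def best_branch(query, branches):
    """Pick the branch with the lowest score.

    `branches` maps a branch name to a (parent_seq, child_seq) pair,
    where internal-node sequences are assumed to be pre-computed
    ancestral reconstructions.
    """
    scores = {
        name: added_substitutions(query, parent, child)
        for name, (parent, child) in branches.items()
    }
    # Break ties by branch name so the result is deterministic.
    best = min(scores, key=lambda name: (scores[name], name))
    return best, scores


if __name__ == "__main__":
    branches = {
        "b1": ("ACGT", "ACGA"),
        "b2": ("ACGT", "TCGT"),
    }
    print(best_branch("ACGA", branches))
```

Because each branch is scored only from its two endpoint sequences, the cost of one insertion grows linearly with the number of branches and the alignment length, which is the kind of property that makes placement cheaper than de novo inference.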

Article activity feed

  1. SciScore for 10.1101/2021.12.30.474610:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    We implemented TIPars using Java with BEAST library (Suchard et al., 2018).
    BEAST
    suggested: (BEAST, RRID:SCR_010228)
    To convert a FASTA file to VCF file with all sequence mutations, i.e. insertion, deletion and substitution, we used a Python package PoMo/FastaToVCF.py (Schrempf, Minh, De Maio, von Haeseler, & Kosiol, 2016)
    Python
    suggested: (IPython, RRID:SCR_001658)
    Alignments were constructed using MUSCLE (Edgar, 2004).
    MUSCLE
    suggested: (MUSCLE, RRID:SCR_011812)
    Reference trees of these datasets were built using RAxML (Stamatakis, 2014) standard hill-climbing heuristic search with 100 multiple inferences and GTRGAMMA model.
    RAxML
    suggested: (RAxML, RRID:SCR_006086)
    When adding unaligned query samples, it is suggested to align them to the MSA of taxa and ancestral sequences in the reference tree using MAFFT (‘--add’ option) (Katoh & Standley, 2013).
    MAFFT
    suggested: (MAFFT, RRID:SCR_011811)
    We applied two methods to compute log-likelihoods including FastTree2
    FastTree2
    suggested: None
    Statistics: 99% t-test confidence intervals and 99% paired t-test p-values (right tail) for the results of TIPars against other programs were computed in Matlab R2013b.
    Matlab
    suggested: (MATLAB, RRID:SCR_001622)
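One of the sentences in the table above describes converting aligned FASTA sequences into a list of insertions, deletions and substitutions. The sketch below illustrates that idea only; it is not the PoMo/FastaToVCF.py script and does not write real VCF. The function name, the tuple layout, and the convention of reporting an insertion at the preceding reference position are assumptions.

```python
def list_mutations(ref_aln, sample_aln):
    """List differences between two aligned sequences of equal length.

    Returns tuples (ref_position, ref_base, sample_base, kind), with
    1-based positions counted on the ungapped reference. An insertion
    is reported at the position of the preceding reference base
    (0 if it occurs before the first reference base).
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have the same length")
    mutations = []
    ref_pos = 0
    for r, s in zip(ref_aln.upper(), sample_aln.upper()):
        if r != "-":
            ref_pos += 1  # advance only on real reference bases
        if r == s:
            continue  # identical base, or a gap in both sequences
        if r == "-":
            mutations.append((ref_pos, "-", s, "insertion"))
        elif s == "-":
            mutations.append((ref_pos, r, "-", "deletion"))
        else:
            mutations.append((ref_pos, r, s, "substitution"))
    return mutations


if __name__ == "__main__":
    for mutation in list_mutations("AC-GT", "ATAG-"):
        print(mutation)
```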

    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Although we showed that TIPars produced trees with higher log-likelihoods than other programs, a general limitation of phylogenetic placement methods is that errors from incorrect placements accumulate as multiple sequences are inserted sequentially. To minimize the error due to large numbers of sequence insertions, it is suggested to refine not only branch lengths but also tree topology using techniques such as nearest-neighbor interchanges (NNIs) and subtree-pruning-regrafting (SPRs) (Price et al., 2010). Furthermore, starting such an optimization from an initial tree of higher log-likelihood may yield a final tree with better log-likelihood within a given amount of time (Price et al., 2010). As demonstrated in table S7, among the resulting trees with equal RF distance for TIPars and UShER (n=28), the branch-length-optimized TIPars trees had higher (n=14) or equal (n=12) tree log-likelihoods compared with those from UShER. TIPars could facilitate the future development of sequence analysis methods that make use of phylogenetic placement information. For instance, genome assembly of NGS read data from a metagenome can use the phylogenetic positions of short-read sequences to distinguish between related microbial strains or lineages. With the aid of TIPars, NGS sequences could be inserted into the branches of specific strains or lineages in a reference phylogeny. This can be used in calculating the proportion o...
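The passage above compares trees by RF (Robinson-Foulds) distance, which counts the non-trivial leaf bipartitions found in one tree but not the other. The following is a minimal sketch of that calculation, assuming trees are given as nested Python tuples with string leaf names; this representation and the function names are illustrative assumptions, not the tooling used in the study.

```python
def leaf_sets(tree, out):
    """Return the leaves under `tree`; append each internal node's leaf set to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Return (all leaves, set of non-trivial splits) for a nested-tuple tree.

    Each split is stored as the side that does NOT contain a fixed
    anchor leaf, so the same split always has the same representation
    regardless of where the tree is rooted.
    """
    clades = []
    all_leaves = leaf_sets(tree, clades)
    anchor = min(all_leaves)
    splits = set()
    for clade in clades:
        side = clade if anchor not in clade else all_leaves - clade
        # Skip trivial splits (a single leaf versus everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
    return all_leaves, splits


def rf_distance(tree_a, tree_b):
    """Unrooted Robinson-Foulds distance: size of the symmetric difference of splits."""
    leaves_a, splits_a = bipartitions(tree_a)
    leaves_b, splits_b = bipartitions(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must have the same leaf set")
    return len(splits_a ^ splits_b)


if __name__ == "__main__":
    t1 = ((("A", "B"), "C"), ("D", "E"))
    t2 = ((("A", "C"), "B"), ("D", "E"))
    print(rf_distance(t1, t2))
```

Two trees at equal RF distance from a reference can still differ in branch lengths, which is why the passage then compares them by log-likelihood.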

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.