Robust expansion of phylogeny for fast-growing genome sequence data
This article has been reviewed by the following groups
Listed in
- Evaluated articles (ScreenIT)
Abstract
Massive sequencing of SARS-CoV-2 genomes has created an urgent need for methods that add new samples to existing phylogenies efficiently instead of inferring trees de novo. TIPars was developed to meet this challenge by integrating parsimony analysis with pre-computed ancestral sequences. It took about 21 seconds to insert 100 SARS-CoV-2 genomes into a 100k-taxa reference tree, using 1.4 gigabytes of memory. In benchmarks on four datasets, TIPars achieved the highest accuracy for phylogenies of moderately similar sequences. For highly similar and for divergent sequences, fully parsimony-based and likelihood-based phylogenetic placement methods performed best, respectively, with TIPars second best in both scenarios. TIPars thus accomplishes efficient and accurate expansion of phylogenies of both similar and divergent sequences, which should have broad biological applications beyond SARS-CoV-2. TIPars is accessible from https://tipars.hku.hk/ and the source code is available at https://github.com/id-bioinfo/TIPars .
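The general idea of parsimony-based placement against pre-computed ancestral sequences can be illustrated with a toy sketch. This is not the TIPars implementation: the function names, the toy tree, and the scoring rule (count the sites where the query differs from both ends of a branch) are assumptions chosen for illustration only.

```python
from typing import Dict, List, Tuple

Branch = Tuple[str, str]  # (parent node name, child node name)


def hamming(a: str, b: str) -> int:
    """Number of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def added_cost(parent: str, child: str, query: str) -> int:
    """Extra substitutions needed to attach `query` on the branch parent-child.

    Per site, joining three states through one new internal node costs
    (number of distinct states - 1). Subtracting the substitutions already
    on the branch leaves the sites where the query differs from both ends.
    """
    triplet = sum(len({p, c, q}) - 1 for p, c, q in zip(parent, child, query))
    return triplet - hamming(parent, child)


def best_branch(
    branches: List[Branch], seqs: Dict[str, str], query: str
) -> Tuple[int, Branch]:
    """Return (added cost, branch) with the lowest added cost.

    Ties are broken by the branch names, purely to keep the toy deterministic.
    """
    return min((added_cost(seqs[p], seqs[c], query), (p, c)) for p, c in branches)


# Hypothetical aligned sequences: taxa t1..t3 plus pre-computed ancestors root, n1.
seqs = {"root": "AAAA", "n1": "AAGA", "t1": "AAGT", "t2": "CAGA", "t3": "AATA"}
branches = [("root", "n1"), ("n1", "t1"), ("n1", "t2"), ("root", "t3")]
query = "CCGA"

result = best_branch(branches, seqs, query)
print(result)  # (1, ('n1', 't2'))
```

Because the ancestral sequences are stored with the reference tree, each branch can be scored independently of the others, which is one reason this style of insertion scales well to very large trees.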
Article activity feed
SciScore for 10.1101/2021.12.30.474610:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Software and Algorithms
- Sentence: We implemented TIPars using Java with BEAST library (Suchard et al., 2018).
  Resource: BEAST, suggested: (BEAST, RRID:SCR_010228)
- Sentence: To convert a FASTA file to VCF file with all sequence mutations, i.e. insertion, deletion and substitution, we used a Python package PoMo/FastaToVCF.py (Schrempf, Minh, De Maio, von Haeseler, & Kosiol, 2016)
  Resource: Python, suggested: (IPython, RRID:SCR_001658)
- Sentence: Alignments were constructed using MUSCLE (Edgar, 2004).
  Resource: MUSCLE, suggested: (MUSCLE, RRID:SCR_011812)
- Sentence: Reference trees of these datasets were built using RAxML (Stamatakis, 2014) standard hill-climbing heuristic search with 100 multiple inferences and GTRGAMMA model.
  Resource: RAxML, suggested: (RAxML, RRID:SCR_006086)
- Sentence: When adding unaligned query samples, it is suggested to align them to the MSA of taxa and ancestral sequences in the reference tree using MAFFT (‘--add’ option) (Katoh & Standley, 2013).
  Resource: MAFFT, suggested: (MAFFT, RRID:SCR_011811)
- Sentence: We applied two methods to compute log-likelihoods including FastTree2
  Resource: FastTree2, suggested: None
- Sentence: Statistics: 99% t-test confident intervals and 99% paired t-test p-value (right tail) for the results of TIPars against other programs were computed by Matlab R2013b.
  Resource: Matlab, suggested: (MATLAB, RRID:SCR_001622)
Results from OddPub: Thank you for sharing your code.
Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
Although we showed that TIPars resulting trees with higher tree log-likelihood compared to other programs, a general limitation of the phylogenetic placement method is that errors from incorrect placements accumulate as multiple sequences are inserted sequentially. In order to minimize the error due to large numbers of sequence insertions, it is suggested to conduct tree refinements on not only branch length but also tree topology using different techniques such as nearest-neighbor interchanges (NNIs) and subtree-pruning-regrafting (SPRs) (Price et al., 2010). Furthermore, starting such optimization process with an initial tree of higher log-likelihood may achieve a final tree with better log-likelihood using certain of time (Price et al., 2010). As demonstrated in table S7, for the resulting trees of equal RF distance from both TIPars and UShER (n=28), the branch length optimized trees for TIPars had higher (n=14) or equal (n=12) tree log-likelihoods than the ones resulted from UShER. TIPars could facilitate the future development of sequence analysis methods that make use of the phylogenetic placement information. For instance, genome assembly of NGS read data from the metagenome can use phylogenetic positions of the short-read sequences to distinguish between related microbial strains or lineages. With the aid of TIPars, NGS sequences could be inserted to the branches of specific strains or lineages in a reference phylogeny. This can be used in calculating the proportion o...
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
- No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.