Robust expansion of phylogeny for fast-growing genome sequence data

Abstract

Massive sequencing of SARS-CoV-2 genomes has created an urgent need for methods that add new samples to existing phylogenies efficiently instead of inferring trees de novo. TIPars was developed to meet this challenge by integrating parsimony analysis with pre-computed ancestral sequences. It took about 21 seconds and 1.4 gigabytes of memory to insert 100 SARS-CoV-2 genomes into a 100k-taxa reference tree. In benchmarks on four datasets, TIPars achieved the highest accuracy for phylogenies of moderately similar sequences. For highly similar and for divergent sequences, fully parsimony-based and likelihood-based phylogenetic placement methods performed best, respectively, with TIPars second best in both cases. TIPars thus expands phylogenies of both similar and divergent sequences efficiently and accurately, which should have broad biological applications beyond SARS-CoV-2. TIPars is accessible at https://tipars.hku.hk/ and the source code is available at https://github.com/id-bioinfo/TIPars.
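
To make the idea of parsimony-based placement with pre-computed ancestral sequences more concrete, here is a minimal sketch. It is not the TIPars algorithm itself. It assumes every branch of the reference tree is described by the aligned sequences at its two ends (the parent's ancestral sequence and the child's sequence), and it scores a query by the number of extra substitutions needed to attach it to that branch. The function names `insertion_cost` and `best_branch`, the branch labels, and the toy sequences are hypothetical, and gaps and ambiguity codes are treated as ordinary characters.

```python
from typing import Dict, Tuple


def insertion_cost(parent: str, child: str, query: str) -> int:
    """Extra substitutions needed to attach `query` to the branch parent-child.

    Under Fitch parsimony a site with states (a, b, q) costs one less than its
    number of distinct states, while the branch alone costs 1 if a != b.
    The difference is 1 exactly when the query state matches neither end.
    """
    if not (len(parent) == len(child) == len(query)):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b, q in zip(parent, child, query) if q != a and q != b)


def best_branch(branches: Dict[str, Tuple[str, str]], query: str) -> Tuple[str, int]:
    """Return (branch name, cost) with the lowest insertion cost.

    Ties are broken by branch name so the result is deterministic.
    """
    costs = {name: insertion_cost(p, c, query) for name, (p, c) in branches.items()}
    best = min(costs, key=lambda name: (costs[name], name))
    return best, costs[best]


if __name__ == "__main__":
    # Toy reference tree with two branches (hypothetical data).
    toy_branches = {
        "b1": ("ACGT", "ACGA"),
        "b2": ("ACGT", "TCGT"),
    }
    print(best_branch(toy_branches, "ACGA"))
```

Because each branch is scored independently from sequences that are already stored, the cost of placing one query grows linearly with the number of branches and the alignment length, which is one plausible reason why placement is much cheaper than rebuilding the tree.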

SciScore for 10.1101/2021.12.30.474610

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
We implemented TIPars using Java with BEAST library (Suchard et al., 2018).	BEAST suggested: (BEAST, RRID:SCR_010228)
To convert a FASTA file to VCF file with all sequence mutations, i.e. insertion, deletion and substitution, we used a Python package PoMo/FastaToVCF.py (Schrempf, Minh, De Maio, von Haeseler, & Kosiol, 2016)	Python suggested: (IPython, RRID:SCR_001658)
Alignments were constructed using MUSCLE (Edgar, 2004).	MUSCLE suggested: (MUSCLE, RRID:SCR_011812)
Reference trees of these datasets were built using RAxML (Stamatakis, 2014) standard hill-climbing heuristic search with 100 multiple inferences and GTRGAMMA model.	RAxML suggested: (RAxML, RRID:SCR_006086)
When adding unaligned query samples, it is suggested to align them to the MSA of taxa and ancestral sequences in the reference tree using MAFFT (‘--add’ option) (Katoh & Standley, 2013).	MAFFT suggested: (MAFFT, RRID:SCR_011811)
We applied two methods to compute log-likelihoods including FastTree2	FastTree2 suggested: None
Statistics: 99% t-test confidence intervals and 99% paired t-test p-values (right tail) for the results of TIPars against other programs were computed with Matlab R2013b.	Matlab suggested: (MATLAB, RRID:SCR_001622)
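
As an illustration of the right-tailed paired t-test named in the Statistics row, the sketch below computes the test statistic and its one-sided p-value using only the Python standard library. It is not the authors' Matlab script. The function names and the example scores are hypothetical, and the p-value is obtained by numerically integrating the Student t density with Simpson's rule rather than by calling a statistics package.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for the paired differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


def t_pdf(x: float, df: int) -> float:
    """Density of the Student t distribution with `df` degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def right_tail_p(t: float, df: int, steps: int = 2000) -> float:
    """P(T >= t), using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 - area if t >= 0 else 0.5 + area


if __name__ == "__main__":
    # Hypothetical accuracy scores for two programs on the same three datasets.
    t, df = paired_t_statistic([3.0, 5.0, 7.0], [2.0, 3.0, 4.0])
    print(t, df, right_tail_p(t, df))
```

A right-tailed test asks whether the first method scores higher than the second on the same inputs. At the 99% level used in the report, the difference would be called significant when the p-value is below 0.01.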

Results from OddPub: Thank you for sharing your code.

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

Although we showed that TIPars produced trees with higher log-likelihoods than other programs, a general limitation of phylogenetic placement methods is that errors from incorrect placements accumulate as multiple sequences are inserted sequentially. To minimize the error introduced by large numbers of sequence insertions, tree refinement is suggested not only for branch lengths but also for tree topology, using techniques such as nearest-neighbor interchanges (NNIs) and subtree-pruning-regrafting (SPRs) (Price et al., 2010). Furthermore, starting such an optimization from an initial tree of higher log-likelihood may yield a final tree with a better log-likelihood within a given amount of time (Price et al., 2010). As demonstrated in Table S7, among the resulting trees with equal RF distance for TIPars and UShER (n=28), the branch-length-optimized TIPars trees had higher (n=14) or equal (n=12) log-likelihoods compared with those from UShER. TIPars could facilitate the future development of sequence analysis methods that make use of phylogenetic placement information. For instance, genome assembly of NGS read data from a metagenome can use the phylogenetic positions of short-read sequences to distinguish between related microbial strains or lineages. With the aid of TIPars, NGS sequences could be inserted into the branches of specific strains or lineages in a reference phylogeny. This can be used in calculating the proportion o...

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.
