Read2Tree: scalable and accurate phylogenetic trees from raw reads

Abstract

The inference of phylogenetic trees is foundational to biology. However, state-of-the-art phylogenomics requires running complex pipelines, at significant computational and labour costs, with additional constraints in sequencing coverage, assembly and annotation quality. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes. In a benchmark encompassing a broad variety of datasets, our assembly-free approach was 10 to 100 times faster than conventional approaches, and in most cases more accurate; the exception was when sequencing coverage was high and reference species very distant. To illustrate the broad applicability of the tool, we reconstructed a yeast tree of life of 435 species spanning 590 million years of evolution. Applied to Coronaviridae samples, Read2Tree accurately classified highly diverse animal samples and near-identical SARS-CoV-2 sequences on a single tree, thereby exhibiting remarkable breadth and depth. The speed, accuracy, and versatility of Read2Tree enable comparative genomics at scale.

SciScore for 10.1101/2022.04.18.488678

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
Method development: Read2Tree was developed in python and a detailed description of its function is available in the supplementary methods.	python suggested: None
Subsequently SOAPdenovo56 (version 2.04-r241) for scaffolding: First, SOAPdenovo-fusion -D -K 41 -c megahit.contigs.fa -g scaffold_prefix -p 20 followed by SOAPdenovo-63mer map and scaff with recommended parameters over the config file.	SOAPdenovo56 suggested: None
Lastly for PacBio CLR data we also used Canu (v2.0) with similar parameters, but specifying the -pacbio-raw parameter.	Canu suggested: (Canu, RRID:SCR_015880)
These execute Trimmomatic automatically and follow the recommendations from trinity.	Trimmomatic suggested: None
OGs were individually aligned using mafft v7.310 (--maxiter 1000 --local), concatenated and trees were inferred with iqtree v1.6.9 (-m LG -nt 4 -mem 4G -seed 12345 -bb 1000).	mafft suggested: (MAFFT, RRID:SCR_011811)
Then we applied trimAl v1.4.rev15 (-gappyout).	trimAl suggested: (trimAl, RRID:SCR_017334)
Uninformative columns and rows were filtered from the final multiple sequence alignment and the tree was inferred using FastTree with default parameters.	FastTree suggested: (FastTree, RRID:SCR_015501)
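
The sentences above mention concatenating per-OG alignments and filtering uninformative columns before tree inference. As a rough illustration only, the following minimal Python sketch shows one way such a step could look. It is not Read2Tree's code: the function names, the gap-padding of missing taxa, and the definition of "uninformative" (a column that is all gaps or carries a single residue across its non-gap rows) are assumptions made for this example.

```python
from typing import Dict, List


def concatenate_alignments(alignments: List[Dict[str, str]], gap: str = "-") -> Dict[str, str]:
    """Concatenate per-OG alignments into one supermatrix (hypothetical helper).

    Each alignment maps a taxon name to its aligned sequence. Taxa missing
    from an alignment are padded with gap characters of that alignment's width.
    """
    taxa = sorted({name for aln in alignments for name in aln})
    parts: Dict[str, List[str]] = {name: [] for name in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for name in taxa:
            parts[name].append(aln.get(name, gap * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


def drop_uninformative_columns(matrix: Dict[str, str], gap: str = "-") -> Dict[str, str]:
    """Drop columns with at most one distinct non-gap residue (assumed definition)."""
    names = list(matrix)
    columns = zip(*(matrix[name] for name in names))
    kept = [col for col in columns if len(set(col) - {gap}) > 1]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Toy input: two orthologous groups; taxon "B" is absent from the second one.
og1 = {"A": "MK-L", "B": "MKAL", "C": "MRAL"}
og2 = {"A": "GG", "C": "GA"}

supermatrix = concatenate_alignments([og1, og2])
filtered = drop_uninformative_columns(supermatrix)
```

On the toy input, only the two columns in which the taxa actually differ survive the filter. In the pipeline quoted above, the alignment, trimming and tree-inference steps themselves are done by mafft, trimAl, iqtree and FastTree.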

Results from OddPub: Thank you for sharing your code.

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

Using Read2Tree these limitations can be overcome even with low-coverage, cost-effective Illumina data. Indeed, we showed that Read2Tree enables accurate analysis across all three sequencing technologies (Illumina, ONT, PacBio). All this can be achieved in a fraction of time and computational resources, thereby contributing to bringing large-scale phylogenomics within the reach of individual laboratories. One major advantage is that despite side-stepping de novo assembly, Read2Tree can operate in the absence of close reference genomes; indeed we demonstrated accurate tree reconstruction involving sequencing reads from species separated by hundreds of millions of years of divergence. Though we also reached some limits to this robustness, when subjecting Read2Tree to both very high divergence and low sequencing coverage, it should be noted that evolutionary distances will tend to diminish as ever more species get sequenced across the tree of life. Furthermore, while most authors of genome resources deposit annotation sets alongside the assembled sequences, not all of them do. The ability to process genomes directly from raw reads not only circumvents this limitation; it can reduce the biases arising from overreliance on specific reference genomes, typically model organisms for which genomic resources tend to be more developed. There have been some initial efforts to “dehumanise” non-human great ape genomes (ref. 40), but many other clades still suffer from analogous biases, which can b...

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.
